2019 - Computer Proteomics With Jupyter and Python
Abstract
Proteomics based on mass spectrometry produces complex data in large quantities. The need for flexible
computational pipelines, in the context of big data, in proteomics and other areas of science, has prompted
the development of computational platforms and libraries that facilitate data analysis and data process-
ing. In this respect, Python appears to be one of the winners among programming languages in terms
of popularity and development. This chapter shows how to perform basic tasks using Python and dedi-
cated libraries in a Jupyter framework: from basic search result summarizations to the creation of MS1
chromatograms.
1 Introduction
Caroline A. Evans et al. (eds.), Mass Spectrometry of Proteins: Methods and Protocols, Methods in Molecular Biology, vol. 1977,
https://doi.org/10.1007/978-1-4939-9232-4_15, © Springer Science+Business Media, LLC, part of Springer Nature 2019
2 Materials
3 Methods
3.1 JupyterHub as a Docker Container

Docker containers are to software what shipping containers are to goods: a number of different computers and operating systems can run various tools packaged in a standardized "box." Developers can package several different software packages into a Docker container, and these can then be executed on computers of different sizes and host operating systems. The technologies on which this chapter depends were developed with servers in mind, which is great, as servers are larger and hence can handle the workload better. However, the overhead of setting up and running servers is too high for somebody who just wants to try something out or run a tutorial such as this one. The second advantage of Docker containers is that they are versioned, which means you can go back to a previous version if needed. The only drawback is that installing the Docker software that runs Docker containers requires a bit of effort, especially if you are using any version of the Windows operating system other than Windows 10 Professional (although Microsoft might have mitigated this shortcoming by the time this gets to press). See the links in the Materials section for excellent tutorials on how to install Docker on the most common operating systems. Docker also has an excellent community that can support you with any issue you encounter. For the example provided here, configure Docker to use 6 GB of RAM or more.
The concept of Docker is shown in Fig. 1. A developer builds a container (orange box) on his or her computer, shown in red. That container is then published on, for example, Docker Hub, where each version of that container is available for download. Users on different operating systems, for example Windows (blue), Mac (dark green), or Linux (light green), can pull the container from Docker Hub.
1. Download the container from Docker Hub on the command
line using
docker pull malmstroem/pyproteomics
2. Start the container, listening on port 8000, and run the command jupyterhub inside it (make sure you replace the string [host folder] with a folder on your host computer)

docker run --name jhub -p 8000:8000 -v [host folder]:/host malmstroem/pyproteomics:latest jupyterhub
The -v flag makes a disk or part of a disk available inside the container. The -p flag makes it possible for a so-called port of the container to be linked to a port of the host. Here, the container port 8000 is mapped to the host port 8000, and this allows direct interaction with the program inside the container that is configured to communicate over port 8000 (see Note 2).
Fig. 1 The Docker concept: a developer builds a container and pushes it (docker push) to Docker Hub, from which users on different operating systems pull it (docker pull) to their own computers
3.2 Getting Started with Jupyter Notebooks

Jupyter is a technology that enables users to execute code (such as Python or R) on a remote server, using a rich web application. Jupyter is organized around documents called "notebooks"; the notebooks in the current working directory are displayed in a table format on the main JupyterHub page (see Fig. 2a). There are two important functions available as buttons in the right part of the interface: "Upload" and "New." "Upload" allows the user to upload any file from their computer to the server, and "New" displays a menu giving the user the opportunity to create a new notebook (several programming languages, referred to as "kernels," are generally available), to create a text file, to create a new folder, or to launch a shell (see Note 3).
1. Create a new Python 3 notebook by clicking “New” and by
selecting “Python 3.” This creates a Python 3 notebook in a
new tab of the web browser; by default, the notebook is named
“Untitled.”
2. Rename the notebook by clicking on “Untitled.”
3. Click on the first cell and type
a = 5
b = 3
Fig. 2 (a) The main page of the Jupyter interface lists notebooks, files, and folders in the main panel. Above the
main panel are options to see running notebooks and terminals as well as buttons to upload files, and create
new notebooks, terminals, folders, and files. (b) The notebook consists of cells, marked with In[] and a number
in the bracket that corresponds to how many cells have been executed in the notebook. The output of the cell
is found below, marked with Out[]. (c) In the “Running” tab, terminals and notebooks that are running on the
server are listed. They can be shut down to free up resources on the server
6. Then hit the SHIFT and ENTER keys to evaluate the cell. The
cell gets labeled “In [3]” and produces an output that is labeled
“Out [3]” (Fig. 2b).
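For a cell to produce an Out[] label, its last line must be a bare expression; assignments alone print nothing. A minimal sketch (in a notebook you would write just `a + b` as the last line; here its value is captured in a variable for illustration):

```python
a = 5
b = 3
# In a notebook, a bare `a + b` as the cell's last line is echoed
# back as the Out[] value of the cell.
result = a + b  # 8
```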
7. Go back to the previous tab in the web browser; the notebook now appears in the list of files, and Jupyter shows that the Python 3 kernel is running. Alternatively, click on the "Running" tab in the Jupyter page to check what's going on (Fig. 2c). You can kill the kernel by clicking on "Shutdown," although it's generally best to halt the kernel directly from the notebook by selecting "Close and Halt" in the "File" menu.
3.3 Getting Started with Python Programming

Python is a programming language that is both powerful and simple to use; it has gained popularity in the bioinformatics, proteomics, and machine learning communities. It is beyond the scope of this chapter to talk in depth about Python and the large number of software packages that are installed. I'll illustrate the use of a standard library. The library "pandas" allows Python to read and manipulate data in a similar way to how Excel manipulates data in a spreadsheet. I'll illustrate here how pandas can be used to manipulate the peptide-spectrum matches (PSMs) contained in the "protein_peptide.xlsx" file.
1. Create a new Python 3 notebook by clicking “New” and by
selecting “Python 3.” This creates a Python 3 notebook in a
new tab of the web browser; by default, the notebook is named
“Untitled.” Rename the notebook by clicking on “Untitled.”
2. In a Python cell, import the library "pandas" (conventionally, for legibility purposes, the library is imported under the "pd" name)
import pandas as pd
3. Import the PSMs
df = pd.read_excel("protein_peptide.xlsx", 0)
This assigns the data as a data frame to the variable "df." A data frame is an array whose columns have names.
4. You can get an idea of the contents of the data by looking at a
few entries
df.head()
5. Alternatively, you can focus on one specific column using
df.peptide.head()
Here “df.peptide” refers to the column “peptide” of the
data frame “df.”
6. You can summarize the number of peptides per protein by applying the methods groupby() and count() to the data frame to obtain the final pivot table

df.groupby("protein").count()
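To see what groupby()/count() produces without the Excel file, here is a self-contained sketch of the same pattern on a small made-up PSM table (the protein and peptide names are illustrative only):

```python
import pandas as pd

# A made-up PSM table with the same two columns used in the text:
# one row per peptide-spectrum match, with the protein it maps to.
df = pd.DataFrame({
    'protein': ['ALBU_MOUSE', 'ALBU_MOUSE', 'TRFE_MOUSE'],
    'peptide': ['RHPDYSVSLLLR', 'LVNELTEFAK', 'SVIPSDGPSVACVK'],
})

# Group the rows by protein and count the entries in each group.
counts = df.groupby('protein').count()
```

The result has one row per protein, and the "peptide" column holds the number of PSMs mapped to that protein (here 2 for the first protein and 1 for the second).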
3.4 Searching a Database

In this section, I intend to demonstrate a small subset of all the powerful Python packages that exist. This is by no means comprehensive, but I hope to show how to process, access, and visualize data and results from proteomics workflows. The data was produced using a Thermo QExactive+ to measure mouse blood plasma [1]. For convenience, all commands mentioned below are provided in the "demo" notebook in the Docker container.
1. In a new notebook, import libraries [2]
import ursgal
import pandas as pd
2. Create a UController object and perform a search using the database "mouse_panther12.fasta" and the spectra contained in "Erik_S1407_239.mzML" [3, 4]
uc = ursgal.UController(
    params = {'database': 'mouse_panther12.fasta'},
    profile = 'QExactive+')
result = uc.search(input_file = 'Erik_S1407_239.mzML',
    engine = 'xtandem_vengeance')
This will produce a CSV file containing the PSMs.
3. The CSV file can be parsed and analyzed using pandas
unif = pd.read_csv('Erik_S1407_239_xtandem_vengeance_pmap_unified.csv')

For instance, we can get the dimensions of the data using

unif.shape

The columns present in the file (Spectrum ID, Spectrum Title, Expected m/z, etc.) can be listed using

unif.columns

A given column can be accessed as a separate entity using something like

unif['Mass Difference']
4. The demo notebook also includes the code that allows one to
plot the histogram of X!Tandem e-values contained in the
“X!Tandem:expect” column of the data frame “unif.” High-
confidence PSMs can thus be selected from the dataset using
hi_conf = unif[unif['X!Tandem:expect'] < 2.382e-15]

The innermost unif expression creates a Series of booleans that is used by the outermost unif expression to select the search results in the data frame such that the e-value is less than 2.382 × 10⁻¹⁵.
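To make the masking mechanics concrete, here is a self-contained sketch on a made-up result table (the sequences and e-values below are illustrative, not real search output):

```python
import pandas as pd

# A made-up search-result table; only the e-value column matters here.
unif = pd.DataFrame({
    'Sequence': ['RHPDYSVSLLLR', 'PEPTIDEK', 'LVNELTEFAK'],
    'X!Tandem:expect': [1.0e-20, 0.5, 3.0e-16],
})

# The comparison yields a boolean Series (one True/False per row)...
mask = unif['X!Tandem:expect'] < 2.382e-15
# ...and indexing the data frame with it keeps only the True rows.
hi_conf = unif[mask]
```

Here the first and third rows survive the filter, so hi_conf has two rows.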
3.5 Spectrum Visualization

Using pandas, one can easily find, among the high-confidence identifications, the most common peptide, RHPDYSVSLLLR from Mouse Serum Albumin. That step is described in the demo notebook. Here, we set about visualizing and visually validating one spectrum assigned to that peptide.
1. First let’s assign the peptide of interest’s sequence and the
charge state we’re interested in
peptide = 'RHPDYSVSLLLR'
charge = 3
2. Import a few libraries
from pyteomics import mgf, mass
import pylab
3. We define a function whose purpose is to compute all theoretical fragments produced by the fragmentation of a peptide (see Note 4)

def fragments(peptide):
    """This function calculates b- and y-ion fragments
    for a given peptide"""
    for i in range(1, len(peptide)-1):
        for ion_type in ('b', 'y'):
            if ion_type[0] in 'abc':
                yield mass.fast_mass(peptide[:i], ion_type=ion_type, charge=1)
            else:
                yield mass.fast_mass(peptide[i:], ion_type=ion_type, charge=1)
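For readers without pyteomics at hand, the same b/y-ion arithmetic can be sketched in plain Python. The hardcoded monoisotopic residue masses below (rounded, and covering only the residues of the example peptide) are an assumption of this sketch, not part of the chapter's code:

```python
# Monoisotopic residue masses (Da), rounded, covering only the residues
# of RHPDYSVSLLLR; a real implementation would use a full table.
RESIDUE = {'R': 156.10111, 'H': 137.05891, 'P': 97.05276, 'D': 115.02694,
           'Y': 163.06333, 'S': 87.03203, 'V': 99.06841, 'L': 113.08406}
PROTON = 1.00728  # charge carrier mass
WATER = 18.01056  # H2O, retained by y ions

def fragments_plain(peptide):
    """Yield singly charged b- and y-ion m/z values:
    b_i = sum of the first i residues + proton;
    y_i = sum of the last (n - i) residues + water + proton."""
    for i in range(1, len(peptide)):
        yield sum(RESIDUE[aa] for aa in peptide[:i]) + PROTON          # b ion
        yield sum(RESIDUE[aa] for aa in peptide[i:]) + WATER + PROTON  # y ion

mz_values = sorted(fragments_plain('RHPDYSVSLLLR'))
```

The smallest value is the b1 ion (arginine plus a proton, about 157.108 Da); for a 12-residue peptide the generator yields 22 fragment masses.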
4. Let's read the first and only spectrum from the file "Erik_S1407_239.25966.25966.3.mgf" (MGF stands for Mascot Generic File)

with mgf.read('Erik_S1407_239.25966.25966.3.mgf') as mgf_file:
    spectrum = next(mgf_file)
The variable "spectrum" is what is technically known as a dictionary with four keys: "m/z array," "intensity array," "charge array," and "params." We'll use those when producing the final plot; "spectrum['params']" is itself a dictionary that gives access to the charge state (key "charge"), the precursor mass ("pepmass"), the retention time ("rtinseconds"), and the so-called spectrum title ("title"), among other things (see Note 5).
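The structure described above can be pictured as follows; all numeric values here are invented for illustration (the real values come from the MGF file):

```python
# A spectrum as pyteomics.mgf returns it: parallel arrays plus a nested
# 'params' dictionary. All values below are made up for illustration.
spectrum = {
    'm/z array': [129.10, 175.12, 288.20],
    'intensity array': [1200.0, 5300.0, 900.0],
    'charge array': [1, 1, 1],
    'params': {
        'charge': [3],
        'pepmass': (480.94, None),
        'rtinseconds': 2841.6,
        'title': 'Erik_S1407_239.25966.25966.3',
    },
}

# The precursor m/z is the first element of the 'pepmass' entry.
precursor_mz = spectrum['params']['pepmass'][0]
```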
5. Let’s now plot the spectrum. A wider format is used for the
spectrum representation:
pylab.figure(figsize=(15,5))
3.6 MS1 Chromatogram

Here we take a closer look at the MS1 chromatogram for the same peptide RHPDYSVSLLLR. To do this, we can explicitly write some code to create the chromatogram with the parameters we want. There are of course software packages, such as OpenSWATH, that can accomplish this much more efficiently [5]. For this, I am here using PyOpenMS [6], the Python binding for the large, fast, and versatile C++ software suite called OpenMS [7], something that can be useful in case NumPress [8] was used to reduce the size of the mzML file. PyOpenMS is only available in Python 2, but as stated above, we can simply execute this in a Python 2 kernel by including "%%python2" at the top of the cell. (Another approach is also illustrated in the demo notebook.)
1. In a Python 2 cell, load the necessary libraries
%%python2
import pyopenms
import pandas as pd
Import the data
msExp = pyopenms.MSExperiment()
inputfile = 'Erik_S1407_239.mzML'
pyopenms.FileHandler().loadExperiment(inputfile, msExp)
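Building the chromatogram itself is then a matter of walking over the MS1 spectra and, for each retention time, summing the intensities of peaks inside a narrow m/z window around the precursor. A pure-Python sketch of that logic on made-up spectra (in practice the loop would iterate over the spectra in msExp; the tolerance and all values below are assumptions of this sketch):

```python
def xic(spectra, target_mz, tol=0.05):
    """Extracted ion chromatogram: for each (retention time, peak list)
    pair, sum the intensities of peaks within +/- tol of target_mz."""
    return [(rt, sum(i for mz, i in peaks if abs(mz - target_mz) <= tol))
            for rt, peaks in spectra]

# Made-up MS1 spectra: (retention time in s, [(m/z, intensity), ...]).
spectra = [
    (100.0, [(480.93, 500.0), (600.10, 80.0)]),
    (101.0, [(480.94, 1500.0), (480.95, 300.0)]),
    (102.0, [(481.30, 90.0)]),
]
trace = xic(spectra, target_mz=480.94)
```

Each element of trace pairs a retention time with the summed intensity in the window, ready to be plotted as an intensity-over-time curve.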
Fig. 3 MS2 spectrum overlaid with the masses of the theoretical fragments of peptide RHPDYSVSLLLR for visual inspection
4 Notes
References
1. Malmström E, Kilsgård O, Hauri S et al (2016) Large-scale inference of protein tissue origin in gram-positive sepsis plasma using quantitative targeted proteomics. Nat Commun 7:10261
2. Kremer LPM, Leufken J, Oyunchimeg P et al (2016) Ursgal, universal Python module combining common bottom-up proteomics tools for large-scale analysis. J Proteome Res 15:788–794
3. Mi H, Huang X, Muruganujan A et al (2017) PANTHER version 11: expanded annotation data from Gene Ontology and Reactome pathways, and data analysis tool enhancements. Nucleic Acids Res 45:D183–D189
4. Craig R, Beavis RC (2004) TANDEM: matching proteins with tandem mass spectra. Bioinformatics 20:1466–1467
5. Röst HL, Rosenberger G, Navarro P et al (2014) OpenSWATH enables automated, targeted analysis of data-independent acquisition MS data. Nat Biotechnol 32:219–223
6. Röst HL, Schmitt U, Aebersold R, Malmström L (2014) pyOpenMS: a Python-based interface to the OpenMS mass-spectrometry algorithm library. Proteomics 14:74–77
7. Röst HL, Sachsenberg T, Aiche S et al (2016) OpenMS: a flexible open-source software platform for mass spectrometry data analysis. Nat Methods 13:741–748
8. Teleman J, Dowsey AW, Gonzalez-Galarza FF et al (2014) Numerical compression schemes for proteomics mass spectrometry data. Mol Cell Proteomics 13:1537–1542