
Chapter 15

Computational Proteomics with Jupyter and Python


Lars Malmström

Abstract
Proteomics based on mass spectrometry produces complex data in large quantities. The need for flexible computational pipelines, in the context of big data, in proteomics and other areas of science, has prompted the development of computational platforms and libraries that facilitate data analysis and data processing. In this respect, Python appears to be one of the winners among programming languages in terms of popularity and development. This chapter shows how to perform basic tasks using Python and dedicated libraries in a Jupyter framework: from basic search result summarizations to the creation of MS1 chromatograms.

Key words Proteomics, Python, R, Jupyter, JupyterHub, Reproducible research

1  Introduction

Mass spectrometry-based proteomics is a powerful technology. Three aspects that set it apart are speed, flexibility, and its applicability to arbitrarily complex samples. Ion optics and high-resolution mass analyzers enable the user to measure anything that alters the chemical composition of the molecules of interest, at least in principle. Scientists use this ability to identify and quantify many aspects of their samples that are otherwise difficult to study, such as post-translational modifications, or they use chemistry to identify amino acids that are surface exposed or to measure protein-protein interactions. As a result, the list of protocols and data acquisition methods is long; while this is good, it also adds complexity to the software used to analyze the data that results from these experiments. On the one hand, we now have powerful software suites that are desktop-based and, to some degree, relatively easy to use; on the other hand, software suites exist that were designed bottom-up to deal with common protocols or very large experiments, where a single computer simply is not powerful enough to process the data in a reasonable time. Many use cases fall in between these two extremes, where researchers want to prototype a workflow to analyze a new type of data or to study an interesting or unusual aspect of a given dataset; the standardized solutions are simply not flexible enough, and writing software from scratch is not an option due to time constraints or skill gaps. In this book chapter, I will show a few examples of how mass spectrometrists with some basic programming skills can take advantage of the many powerful Python libraries to explore their data. To allow readers to get some practical experience and to run some of these examples themselves, with a minimal amount of time spent on installing software and downloading data, I provide a Docker container with the needed software installed. All examples are demonstrated using Jupyter Notebooks, which also allow for relatively simple integration of JavaScript, making it straightforward to integrate tools such as Bokeh for interactive visualization. This chapter has two main sections: the first is an introduction to Docker containers and Jupyter, with specific instructions on how to get the Docker container accompanying this chapter running; the second walks through a few software packages and examples of how to extract and visualize data using Python. Please feel free to skip ahead to the Python examples if you are not interested in the technical aspects.

2  Materials

2.1  Instructions to Install Docker Containers

1. Mac: https://docs.docker.com/docker-for-mac/install/.
2. Windows 10 Pro: https://docs.docker.com/docker-for-windows/install/.
3. Older versions of Windows: https://docs.docker.com/toolbox/toolbox_install_windows/.
4. Linux servers: https://docs.docker.com/engine/installation/.
5. Python: https://www.python.org/ (see Note 1).
6. The pandas library: https://pandas.pydata.org/ (see Note 1).
7. The NumPy library: http://www.numpy.org/ (see Note 1).
8. The Ursgal library: https://github.com/ursgal/ursgal.
9. The Bokeh library: https://bokeh.pydata.org/.
10. The PyLab library: https://scipy.github.io/old-wiki/pages/PyLab.
11. The Pyteomics library: https://pypi.org/project/pyteomics/.
12. The PyOpenMS library: https://github.com/OpenMS/OpenMS/wiki/pyOpenMS.
13. Jupyter: http://jupyter.org/ (see Note 1).

3  Methods

3.1  JupyterHub as a Docker Container

Docker containers are to software what shipping containers are to goods: a number of different computers/operating systems can run various tools packaged in a standardized "box." Developers can package several different software packages into a Docker container, and these can then be executed on computers of different sizes and host operating systems. The technologies on which this chapter depends were developed with servers in mind, and this is, of course, great as servers are larger and hence can handle the workload better. However, the overhead of setting up and running servers is too high for somebody who just wants to try something out or to run a tutorial such as this one. The second advantage of Docker containers is that they can be versioned, which means you can go back to a previous version if needed. The only drawback is that installing the Docker software that runs Docker containers requires a bit of effort, especially if you are using any version of Windows other than Windows 10 Professional (although Microsoft might have mitigated this shortcoming by the time this gets to press). See the links in the Materials section for excellent tutorials on how to install Docker on the most common operating systems. Docker also has an excellent community that can support you with any issue you encounter. For the example provided here, configure Docker to use 6 GB of RAM or more.
The concept of Docker is shown in Fig. 1. A developer builds a container (orange box) on his or her computer, shown in red. That container is then published on, for example, Docker Hub, where each version of that container is available for download. Users on different operating systems, for example Windows (blue), Mac (dark green), or Linux (light green), can pull the container from Docker Hub.
1. Download the container from Docker Hub on the command line using
docker pull malmstroem/pyproteomics
2. Start the container, listen on port 8000, and run the command jupyterhub, all with the following command (make sure you replace the string [host folder] with a folder on your host computer):
docker run --name jhub -p 8000:8000 -v [host folder]:/host malmstroem/pyproteomics:latest jupyterhub
The -v flag makes a disk or part of a disk available inside the container. The -p flag makes it possible for a so-called port of the container to be linked to a port of the host. Here, the container port 8000 is mapped to the host port 8000, which allows direct interaction with the program inside the container that is configured to communicate over port 8000 (see Note 2).

Fig. 1 General schematic illustrating the principle of Docker containers: a developer pushes versioned containers to Docker Hub, and users on Windows, Mac, or Linux pull them from there

3. Load the page http://127.0.0.1:8000 in your browser and log in using "user" and "password" as the user name and password.
4. Click on the demo notebook to get started.

3.2  Getting Started with Jupyter Notebooks

Jupyter is a technology that enables users to execute code (such as Python or R) on a remote server, using a rich web application. Jupyter is organized around documents called "notebooks," and the notebooks in the current working directory are displayed in a table format on the main JupyterHub page; see Fig. 2a. There are two important functions available as buttons in the right part of the interface: "Upload" and "New." "Upload" allows the user to upload any file from their computer to the server, and "New" displays a menu giving the user the opportunity to create a new notebook (several programming languages, referred to as "kernels," are generally available), to create a text file, to create a new folder, or to launch a shell (see Note 3).
1. Create a new Python 3 notebook by clicking “New” and by
selecting “Python 3.” This creates a Python 3 notebook in a
new tab of the web browser; by default, the notebook is named
“Untitled.”
2. Rename the notebook by clicking on “Untitled.”
3. Click on the first cell and type
a = 5
b = 3

Fig. 2 (a) The main page of the Jupyter interface lists notebooks, files, and folders in the main panel. Above the
main panel are options to see running notebooks and terminals as well as buttons to upload files, and create
new notebooks, terminals, folders, and files. (b) The notebook consists of cells, marked with In[] and a number
in the bracket that corresponds to how many cells have been executed in the notebook. The output of the cell
is found below, marked with Out[]. (c) In the “Running” tab, terminals and notebooks that are running on the
server are listed. They can be shut down to free up resources on the server

Then simultaneously hit SHIFT and ENTER to evaluate the cell; the cell should get a label "In [1]" to indicate that it has been sent to the Python kernel for processing. As these are two variable assignments, no output is produced.
4. The cursor should now be in a new cell waiting for more input;
type
a + b
then hit the SHIFT and ENTER keys to evaluate the cell.
The cell gets labeled “In [2]” and produces an output that is
labeled “Out [2].”
5. The cursor should now be in a new cell waiting for more input;
type
a * b

6. Then hit the SHIFT and ENTER keys to evaluate the cell. The
cell gets labeled “In [3]” and produces an output that is labeled
“Out [3]” (Fig. 2b).
7. Go back to the previous tab in the web browser; the notebook now appears in the list of files, and Jupyter shows that the Python 3 kernel is running. Alternatively, click on the "Running" tab in the Jupyter page to check what's going on (Fig. 2c). You can kill the kernel by clicking on "Shutdown," although it's generally best to halt the kernel directly from the notebook by selecting "Close and Halt" in the "File" menu.
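One detail worth knowing before moving on: the kernel keeps state across cells, so "a" and "b" from the cells above remain defined, and IPython-based kernels expose the most recent output as the underscore variable. A small sketch of this standard IPython behavior (each line evaluated in its own cell):
a + b    # shown as Out: 8
_ + 1    # the previous output plus one; Out: 9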

3.3  Getting Started with Python Programming

Python is a programming language that is both powerful and simple to use; it has gained popularity in the bioinformatics, proteomics, and machine learning communities. It is beyond the scope of this chapter to talk in depth about Python or the large number of software packages that are installed. Instead, I will illustrate the use of a standard library. The library pandas allows Python to read and manipulate data in ways similar to how Excel manipulates data in a spreadsheet. I will illustrate here how pandas can be used to manipulate the peptide-spectrum matches (PSMs) contained in the "protein_peptide.xlsx" file.
1. Create a new Python 3 notebook by clicking “New” and by
selecting “Python 3.” This creates a Python 3 notebook in a
new tab of the web browser; by default, the notebook is named
“Untitled.” Rename the notebook by clicking on “Untitled.”
2. In a Python cell, import the library pandas (conventionally, for legibility, the library is imported under the "pd" name):
import pandas as pd
3. Import the PSMs:
df = pd.read_excel("protein_peptide.xlsx", 0)
This assigns the data as a data frame to the variable "df." A data frame is an array whose columns have names.
4. You can get an idea of the contents of the data by looking at a
few entries
df.head()
5. Alternatively, you can focus on one specific column using
df.peptide.head()
Here “df.peptide” refers to the column “peptide” of the
data frame “df.”
6. You can summarize the number of peptides per protein by applying the methods groupby() and count() to the data frame, obtaining a pivot-style summary table (see also the sketch after these steps):
df.groupby("protein").count()
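The groupby machinery from step 6 extends naturally to slightly richer questions. The following is a minimal sketch that counts distinct peptide sequences per protein and lists the ten proteins with the most peptides; it assumes the same "protein" and "peptide" columns used above:
import pandas as pd

df = pd.read_excel("protein_peptide.xlsx", 0)

# Number of distinct peptide sequences observed per protein
peptides_per_protein = df.groupby("protein")["peptide"].nunique()

# The ten proteins with the most distinct peptides
print(peptides_per_protein.sort_values(ascending=False).head(10))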

3.4  Searching a Database

In this section, I intend to demonstrate a small subset of all the powerful Python packages that exist. This is by no means comprehensive, but I hope to show how to process, access, and visualize data and results from proteomics workflows. The data was produced using a Thermo QExactive+ to measure mouse blood plasma [1]. For convenience, all commands mentioned below are provided in the "demo" notebook in the Docker container.
1. In a new notebook, import the libraries [2]:
import ursgal
import pandas as pd
2. Create a UController instance and perform a search using the database "mouse_panther12.fasta" and the spectra contained in "Erik_S1407_239.mzML" [3, 4]:
uc = ursgal.UController(
    params = {'database': 'mouse_panther12.fasta'},
    profile = 'QExactive+')
result = uc.search(input_file = 'Erik_S1407_239.mzML',
    engine = 'xtandem_vengeance')
This will produce a CSV file containing the PSMs.
3. The CSV file can be parsed and analyzed using pandas:
unif = pd.read_csv('Erik_S1407_239_xtandem_vengeance_pmap_unified.csv')
For instance, we can get the dimensions of the data using
unif.shape
The columns present in the file (Spectrum ID, Spectrum Title, Expected m/z, etc.) can be obtained with
unif.columns
A given column can be accessed as a separate entity using something like
unif['Mass Difference']
4. The demo notebook also includes code to plot a histogram of the X!Tandem e-values contained in the "X!Tandem:expect" column of the data frame "unif" (a sketch is also given after these steps). High-confidence PSMs can thus be selected from the dataset using
hi_conf = unif[unif['X\!Tandem:expect'] < 2.382e-15]
The inner unif expression creates a Boolean mask, which the outer indexing uses to select the rows of the data frame whose e-value is less than 2.382 × 10⁻¹⁵.
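The histogram mentioned in step 4 can be reproduced in a few lines. This is a minimal sketch, assuming the data frame "unif" from step 3 and the column name used above; because e-values span many orders of magnitude, a log10 transform makes the distribution readable:
import numpy as np
import pylab

# e-values of all PSMs, as floats
evalues = unif['X\!Tandem:expect'].astype(float)

pylab.figure(figsize=(10, 4))
pylab.hist(np.log10(evalues), bins=50)   # smaller values mean higher confidence
pylab.xlabel('log10(X!Tandem e-value)')
pylab.ylabel('Number of PSMs')
pylab.show()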

3.5  Spectrum Visualization

Using pandas, one can easily find, among the high-confidence identifications, the most common peptide, RHPDYSVSLLLR from Mouse Serum Albumin. That step is described in the demo notebook. Here, we set about visualizing and visually validating one spectrum assigned to that peptide.
1. First, let's assign the sequence of the peptide of interest and the charge state we're interested in:
peptide = 'RHPDYSVSLLLR'
charge = 3
2. Import a few libraries
from pyteomics import mgf, mass
import pylab
3. We define a function whose purpose is to compute all theoretical fragments produced by the fragmentation of a peptide (see Note 4):
def fragments(peptide):  # simplified from types=('b', 'y'), maxcharge=1
    """This function calculates b and y fragments for a given peptide"""
    for i in range(1, len(peptide) - 1):
        for ion_type in ('b', 'y'):
            if ion_type[0] in 'abc':
                # N-terminal ion series: prefix of the peptide
                yield mass.fast_mass(peptide[:i], ion_type=ion_type, charge=1)
            else:
                # C-terminal ion series: suffix of the peptide
                yield mass.fast_mass(peptide[i:], ion_type=ion_type, charge=1)
4. Let's read the first and only spectrum from the file "Erik_S1407_239.25966.25966.3.mgf" (MGF stands for Mascot Generic Format):
with mgf.read('Erik_S1407_239.25966.25966.3.mgf') as mgf_file:
    spectrum = next(mgf_file)
The variable "spectrum" is what is technically known as a dictionary, with four keys: "m/z array," "intensity array," "charge array," and "params." We'll use those when producing the final plot; "spectrum['params']" is itself a dictionary that gives access to the charge state (key "charge"), the precursor mass ("pepmass"), the retention time ("rtinseconds"), and the so-called spectrum title ("title"), among other things (see Note 5 and the sketch after these steps).
5. Let's now plot the spectrum. A wider format is used for the spectrum representation:
pylab.figure(figsize=(15, 5))
The plot is given a title:
pylab.title('Spectrum Erik_S1407_239.25966.25966.3 with theoretical fragments of ' + peptide)
Axes are labeled:
pylab.xlabel('m/z')
pylab.ylabel('Intensity')
We compute the theoretical fragments of the peptide:
theor_spectrum = list(fragments(peptide))
First, red bars are placed where the theoretical fragments are expected:
pylab.bar(theor_spectrum,
    [spectrum['intensity array'].max()] * len(theor_spectrum),
    width=0.1, edgecolor='red', alpha=0.7)
Then the observed spectrum is added as black bars:
pylab.bar(spectrum['m/z array'], spectrum['intensity array'],
    width=0.1, linewidth=2,
    edgecolor='black')
The final plot is shown (Fig. 3) using
pylab.show()

Fig. 3 MS2 spectrum overlaid with the masses of the theoretical fragments of peptide RHPDYSVSLLLR, for visual inspection
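Before trusting the plot, one can sanity-check steps 3 and 4. The following sketch assumes the fragments() function and the "spectrum" variable defined above; for a 12-residue peptide, the generator yields 2 × (12 − 2) = 20 b/y fragment masses, and the metadata keys are the ones listed in step 4:
theor_spectrum = sorted(fragments(peptide))
print(len(theor_spectrum))                # 20 fragment m/z values for a 12-mer

# Metadata stored under the 'params' key of the MGF spectrum
print(spectrum['params']['title'])        # the spectrum title
print(spectrum['params']['pepmass'])      # precursor m/z (and intensity, if present)
print(spectrum['params']['rtinseconds'])  # retention time in seconds
print(spectrum['params']['charge'])       # charge state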

3.6  MS1 Chromatogram

Here we take a closer look at the MS1 chromatogram for the same peptide RHPDYSVSLLLR. To do this, we can explicitly write some code to create the chromatogram with the parameters we want. There are of course software packages, such as OpenSWATH, that can accomplish this much more efficiently [5]. Here I am using PyOpenMS [6], the Python binding for the large, fast, and versatile C++ software suite called OpenMS [7], something that can be useful in case NumPress [8] was used to reduce the size of the mzML file. PyOpenMS is only available in Python 2, but as stated above, we can simply execute this in a Python 2 kernel by including "%%python2" at the top of the cell. (Another approach is also illustrated in the demo notebook.)
1. In a Python 2 cell, load the necessary libraries:
%%python2
import pyopenms
import pandas as pd
Import the data:
msExp = pyopenms.MSExperiment()
inputfile = 'Erik_S1407_239.mzML'
pyopenms.FileHandler().loadExperiment(inputfile, msExp)
Collect all peaks in MS1 spectra whose m/z lies within 0.01 of 485.94045, the mono-isotopic m/z of the peptide of interest when it has a 3+ charge (see Note 6):
entries = []
for spectrum in msExp:
    if not spectrum.getMSLevel() == 1:
        continue
    for peak in spectrum:
        if abs(peak.getMZ() - 485.94045) > 0.01:
            continue
        entries.append({'rt': spectrum.getRT(), 'mz': peak.getIntensity()})  # 'mz' holds the intensity
The resulting data are saved as a tab-separated file:
chromtable = pd.DataFrame(entries)
chromtable.to_csv('ms1chromatogram.tsv', index=False, sep="\t")
2. The chromatogram needs to be processed to be of any use. One way to do this is to bin the peaks by retention time and to keep only the maximum intensity for each bin. In a new (Python 3) cell, we can write the following.
Import the NumPy and pandas libraries:
import numpy as np
import pandas as pd
Import the data as a data frame:
ms1 = pd.read_table('ms1chromatogram.tsv')
The "delta" variable is the size of the retention-time bin, in seconds:
delta = 25

Initialize the chromatogram:
imax = int(ms1['rt'].max() // delta)
chromato = np.array([[rt * delta, 0.] for rt in range(imax + 1)])
Read the data and update the chromatogram, keeping the maximum intensity per bin (a more compact pandas alternative is sketched after step 3):
for index, row in ms1.iterrows():
    rt = row['rt']
    mz = row['mz']  # this column holds the intensity (see step 1)
    i = int(rt // delta)
    chromato[i][1] = max(chromato[i][1], mz)
3. The visualization can be achieved using PyLab (Fig. 4):
import pylab
pylab.plot(chromato[:, 0], chromato[:, 1])
pylab.xlabel("Retention time")
pylab.ylabel("Maximum intensity in 25 s bins")
pylab.show()

Fig. 4 MS1 chromatogram for the same peptide RHPDYSVSLLLR
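The explicit loop in step 2 is easy to follow, but the same binning can be expressed more compactly with pandas. This is an equivalent sketch, not the code used above; it assumes the "ms1" data frame and "delta" from step 2:
# Maximum intensity per retention-time bin; the result is a Series
# indexed by the start time of each 25 s bin
binned = ms1.groupby((ms1['rt'] // delta) * delta)['mz'].max()
print(binned.head())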

3.7  Stopping the Docker Container

1. Stop the container with the following command:
docker stop jhub
2. Delete the container and all its files with:
docker rm jhub

4  Notes

1. Conda is an open-source package management system for Windows, macOS, and Linux, which allows the user to quickly install and update packages for Python, among other things. It includes Jupyter, but unfortunately the Ursgal library is not included; URL: https://conda.io/docs/index.html.
2. Another way to interact with the Docker container is to issue a
“docker exec” command in the shell to execute a command
inside the Docker container.
3. Useful tips are available from https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/.
4. The example in this chapter specifically looks at the b and y ion series; for other series, the reader can have a look at the Pyteomics documentation: https://pyteomics.readthedocs.io/en/latest/examples/example_msms.html.
5. If one wanted to iterate through all spectra of an MGF file, one could do as follows:
for spectrum in mgf.read(mgf_name):
    # process each spectrum here
6. One can, for instance, calculate this mass using the "Fragment Ion Calculator": http://db.systemsbiology.net:8080/proteomicsToolkit/FragIonServlet.html. A Pyteomics-based cross-check is sketched below.
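As an alternative to the web calculator in Note 6, the precursor m/z can be recomputed with Pyteomics. A sketch, assuming Pyteomics is installed as listed in the Materials; calculate_mass() with a charge argument returns the m/z of the protonated ion:
from pyteomics import mass

# m/z of [M + 3H]3+ for the peptide used in Subheadings 3.5 and 3.6
print(mass.calculate_mass(sequence='RHPDYSVSLLLR', charge=3))  # approx. 485.940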

References

1. Malmström E, Kilsgård O, Hauri S et al (2016) Large-scale inference of protein tissue origin in gram-positive sepsis plasma using quantitative targeted proteomics. Nat Commun 7:10261
2. Kremer LPM, Leufken J, Oyunchimeg P et al (2016) Ursgal, universal Python module combining common bottom-up proteomics tools for large-scale analysis. J Proteome Res 15:788–794
3. Mi H, Huang X, Muruganujan A et al (2017) PANTHER version 11: expanded annotation data from Gene Ontology and Reactome pathways, and data analysis tool enhancements. Nucleic Acids Res 45:D183–D189
4. Craig R, Beavis RC (2004) TANDEM: matching proteins with tandem mass spectra. Bioinformatics 20:1466–1467
5. Röst HL, Rosenberger G, Navarro P et al (2014) OpenSWATH enables automated, targeted analysis of data-independent acquisition MS data. Nat Biotechnol 32:219–223
6. Röst HL, Schmitt U, Aebersold R, Malmström L (2014) pyOpenMS: a Python-based interface to the OpenMS mass-spectrometry algorithm library. Proteomics 14:74–77
7. Röst HL, Sachsenberg T, Aiche S et al (2016) OpenMS: a flexible open-source software platform for mass spectrometry data analysis. Nat Methods 13:741–748
8. Teleman J, Dowsey AW, Gonzalez-Galarza FF et al (2014) Numerical compression schemes for proteomics mass spectrometry data. Mol Cell Proteomics 13:1537–1542
