CS-E4860 Projects 2017


Project topics for the course

Special Course in Bioinformatics II:


Machine Learning in Bioinformatics
Eric Bach, Céline Brouard, Anna Cichonska, Markus Heinonen, Huibin Shen,
Juho Rousu

March 27, 2017

1 Retention time prediction using kernel methods


Instructor: Eric Bach ([email protected])

Background: In untargeted metabolomics studies, complex biological samples containing
possibly thousands of molecules are encountered. Tandem mass spectrometry (MS/MS) is a widely
used technique to extract patterns from biological samples in order to identify the molecules in them.
However, the sensitivity of a mass spectrometer depends on the ability to reduce the complexity
of the biological sample, e.g. to prevent MS/MS spectra from representing more than
one molecule. Liquid chromatography (LC) is one technique for such complexity reduction.
If a properly prepared biological sample is passed through an LC column, the molecules in the
sample interact differently with the column's stationary phase. As a result, the molecules
separate over time according to their molecular properties: some molecules
pass through the column faster than others. The time at which a molecule leaves
the column is called its retention time.
Retention time can serve as orthogonal information for metabolite identification:
e.g. it can exclude molecular candidates that are expected to have a different retention time
[Aic+15] or make it possible to distinguish diastereoisomers [SNV15]. Unfortunately, retention
time measurements are available for only a small number of molecules and are not comparable
between different chromatographic systems. In contrast, the set
of molecular candidates for the identification of a single molecule (given its MS/MS spectrum)
can contain thousands of molecules. Therefore, machine learning algorithms have
been applied to predict retention times from the structure of a molecule [Aic+15; Fal+16].

Goal: In this project the student will implement and apply two different kernelized regres-
sion approaches to predict the retention time of molecules given their structure.

Methods and materials: For the project the student will be provided with a data set
containing the retention time measurements for 596 molecules. The molecular descrip-
tors and fingerprints will be given to the student. The student will implement the Kernel
Ridge Regression (KRR) and the Magnitude-preserving kernel regression (MPKR) [CMR07].
The student will apply both approaches to predict the retention times for the molecular structures
in the data set. The student will compare the performance of KRR and MPKR and
investigate whether the magnitude-preserving error term leads to better retention time
prediction.
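
As an illustration of the KRR part only, the closed-form solution alpha = (K + lambda I)^(-1) y can be sketched as follows. This is a minimal, dependency-free sketch, not the course's reference implementation: the Gaussian kernel, the parameter names (gamma, lam) and the toy data are assumptions, and in practice a linear algebra library would replace the hand-written solver. MPKR is not shown.

```python
import math

def gaussian_kernel(x, z, gamma=0.5):
    """k(x, z) = exp(-gamma * ||x - z||^2); the kernel choice is an assumption."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def krr_fit(X, y, lam=1e-3, gamma=0.5):
    """Return dual coefficients alpha solving (K + lam * I) alpha = y."""
    n = len(X)
    K = [[gaussian_kernel(X[i], X[j], gamma) for j in range(n)] for i in range(n)]
    for i in range(n):
        K[i][i] += lam
    return solve(K, y)

def krr_predict(X_train, alpha, x_new, gamma=0.5):
    """Prediction f(x) = sum_i alpha_i * k(x_i, x)."""
    return sum(a * gaussian_kernel(x, x_new, gamma) for a, x in zip(alpha, X_train))

# Hypothetical toy data: one descriptor per molecule, retention time as target.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0.0, 1.0, 4.0, 9.0]
alpha = krr_fit(X, y, lam=1e-6)
pred = krr_predict(X, alpha, [1.0])  # close to the training target 1.0
```

With a very small lambda the model nearly interpolates the training data; the regularisation parameter and kernel width would normally be chosen by cross-validation.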

Prerequisite: Basic knowledge of machine learning (especially kernel methods) and parameter
estimation (e.g. cross-validation), linear algebra, and programming skills in R, MATLAB
or Python. Some basic knowledge of molecular biology and chemoinformatics is beneficial.

References
[Aic+15] Fabian Aicheler et al. "Retention Time Prediction Improves Identification in Nontargeted
Lipidomics Approaches". In: Analytical Chemistry 87.15 (2015), pp. 7698–7704.
[CMR07] Corinna Cortes et al. "Magnitude-preserving Ranking Algorithms". In: Proceedings
of the 24th International Conference on Machine Learning. ICML '07. ACM, 2007.
url: http://doi.acm.org/10.1145/1273496.1273518.
[Fal+16] Federico Falchi et al. "Kernel-Based, Partial Least Squares Quantitative Structure-
Retention Relationship Model for UPLC Retention Time Prediction: A Useful Tool
for Metabolite Identification". In: Analytical Chemistry (2016).
[SNV15] Jan Stanstrup et al. "PredRet: Prediction of Retention Time by Direct Mapping between
Multiple Chromatographic Systems". In: Analytical Chemistry 87.18 (2015). PMID:
26289378, pp. 9421–9428. url: http://dx.doi.org/10.1021/acs.analchem.5b02287.

2 Metabolite identification from tandem mass spectra
Instructor: Céline Brouard ([email protected])

Background: Metabolites are small molecules involved in the biological processes of organ-
isms. Metabolite identification is an important problem in molecular biology. This problem
consists in identifying the molecular structures of the unknown metabolites that are present
in a biological sample. Information on these unknown metabolites can be obtained using
tandem mass spectrometry. Recent progress in metabolite identification has been obtained
using machine learning-based methods.

Goal: The goal of this project is to implement the CSI:FingerID method described
in the lecture. This method will be applied to the dataset used in the last CASMI¹ (Critical
Assessment of Small Molecule Identification) contest. The idea of this contest is to evaluate
different metabolite identification methods on a common dataset. A set of training examples
is provided, and for each given tandem mass spectrum the correct molecular structure
has to be determined among a set of potential molecular candidates.

Materials and Methods: In this project, the student will implement the CSI:FingerID
method. During the learning phase, the training MS/MS spectra are used to train a set
of Support Vector Machine classifiers to predict molecular properties. The parameter C
in SVM will be tuned using k-fold cross-validation on the training set, independently for each
molecular property. In the prediction phase, the fingerprints of the unknown metabolites
are predicted from their MS/MS spectra. The predicted fingerprints are then compared
to fingerprints of candidate molecular structures for a best match.
The training dataset contains 234 tandem mass spectra and the challenge dataset consists
of 127 tandem mass spectra. A list of candidates is provided for each challenge spectrum.
For each molecule, fingerprints have been retrieved from PubChem and OpenBabel. Kernels
on tandem mass spectra will be provided as input.
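
The final matching step described above can be sketched as follows. This is a simplified illustration that assumes the per-property SVM outputs have already been thresholded into a binary fingerprint, and it scores candidates by the number of agreeing bits; the scoring used in the published CSI:FingerID method is more refined. The function names and the toy fingerprints are hypothetical.

```python
def match_score(predicted, candidate):
    """Number of fingerprint bits on which the two binary vectors agree."""
    return sum(p == c for p, c in zip(predicted, candidate))

def rank_candidates(predicted, candidates):
    """Return candidate identifiers sorted from best to worst match."""
    return sorted(
        candidates,
        key=lambda cid: match_score(predicted, candidates[cid]),
        reverse=True,
    )

# Hypothetical predicted fingerprint for one challenge spectrum.
predicted = [1, 0, 1, 1, 0]

# Hypothetical candidate list with fingerprints of the same length.
candidates = {
    "cand_A": [1, 0, 1, 1, 0],  # agrees on 5 bits
    "cand_B": [1, 1, 1, 0, 0],  # agrees on 3 bits
    "cand_C": [0, 1, 0, 0, 1],  # agrees on 0 bits
}

ranking = rank_candidates(predicted, candidates)
# ranking == ["cand_A", "cand_B", "cand_C"]
```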

Required background knowledge/skills: Programming skills (preferably MATLAB
or R), basic knowledge of machine learning, understanding the basic principles of support
vector machines. Some knowledge of molecular biology will be beneficial.

References
[1] Heinonen, M., Shen, H., Zamboni, N., and Rousu, J. (2012). Metabolite identification
and molecular fingerprint prediction through machine learning. Bioinformatics, 28(18):2333-2341.
[2] Shen, H., Dührkop, K., Böcker, S., and Rousu, J. (2014). Metabolite identification through
multiple kernel learning on fragmentation trees. Bioinformatics, 30(12):i157-i164.
[3] Dührkop, K., Shen, H., Meusel, M., Rousu, J., and Böcker, S. (2015). Searching molecular
structure databases with tandem mass spectra using CSI:FingerID. Proceedings of the
National Academy of Sciences, 112(41):12580-12585.

¹ http://www.casmi-contest.org

3 Multiple kernel learning for drug-protein interaction prediction
Instructor: Anna Cichonska ([email protected])

Background: Drug-like chemical compounds execute their actions mainly by modulating
cellular targets, such as proteins. Experimental determination of interactions between
chemical compounds and protein targets is time-consuming and expensive. Therefore,
in recent years, a lot of effort has been placed on the development of computational
methods that could provide fast, large-scale and systematic pre-screening of chemical probes.
In particular, a lot of work has been devoted to compound-based interaction prediction
methods, including quantitative structure-activity relationship (QSAR) models, which aim
to relate structural properties of the chemical molecules to their bioactivity profiles.
Another class of computational methods, so-called target-based methods, focuses on evaluating
similarities between amino acid sequences or three-dimensional structures of protein targets.
In these supervised learning approaches, models are trained using available bioactivity data,
together with either compound or protein information, which then allows predicting either
new targets of a given drug or new drugs targeting a given protein.
As a more recent class of computational modelling approaches, systems-based frameworks
take advantage of the information available on both compounds and proteins. A key assumption
is that similar drug compounds interact with similar proteins, and therefore a proper
representation and use of similarities, equivalent to a kernel choice, is a first critical prerequisite
for achieving high-quality drug-protein interaction (DPI) predictions. Classical
kernel-based methods rely on a single kernel. However, such approaches are unlikely to be
optimal when a growing variety of biological and molecular data sources becomes available
simultaneously. Multiple kernel learning (MKL) methods search for an optimal combination
of several kernels, which enables the use of different information sources simultaneously and
learns their importance for the prediction task; they are therefore receiving increasing attention.
Typically, a binary-valued DPI prediction setup is employed. However, molecular interactions
are not simple on-off relationships, and predicting real-valued binding affinities is more
appealing.

Goal: The goal of the project is to compute several protein kernels as well as drug kernels,
and then use them in an MKL regression framework to predict drug-protein binding affinities.

Materials and Methods: The data set consists of 50 drug compounds and 50 protein
targets, a subset of the data from the experimental study of Metz et al. (2011). DPIs
are represented as real values reflecting how tightly a compound binds to a protein. The
student will calculate Tanimoto kernels for drug compounds based on several fingerprints
implemented in the ChemmineR R package. For proteins, Smith-Waterman amino acid sequence
alignment as well as the Generic String kernel will be adopted. The student can also choose
to compute other molecular descriptors. Then, pairwise kernels that directly relate drug-protein
pairs will be constructed by taking the Kronecker product of each pair of drug kernel and
protein kernel. The student will use the pairwise kernels with the two-stage MKL algorithm ALIGNF.
In the first stage, kernel mixture weights are determined by maximising the centred
alignment, i.e. a matrix similarity measure, between the combined kernel and the ideal,
so-called target kernel derived from the label values. In the second stage, the combined kernel is used
with Kernel Ridge Regression (KRR) as the prediction model. The student will be provided with
a script for calculating kernel mixture weights (first stage) but should implement KRR
(second stage). The UNIMKL algorithm will form a baseline model, where all kernel mixture weights
are equal to 1/P, P being the number of input kernels. The student will implement nested
cross-validation to tune the regularisation parameter λ of KRR and assess the predictive
performance of the model.
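
The kernel construction described above (Tanimoto drug kernel, Kronecker-product pairwise kernel, and the uniform UNIMKL combination) can be sketched as follows. This is a dependency-free illustration under stated assumptions: fingerprints are represented as sets of "on" bit indices, the protein kernel values are made up, and all function names are hypothetical. The ALIGNF weight computation and KRR are not shown.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two binary fingerprints given as sets of 'on' bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def kernel_matrix(items, k):
    """Square kernel matrix K[i][j] = k(items[i], items[j])."""
    return [[k(x, z) for z in items] for x in items]

def kronecker(A, B):
    """Kronecker product of A and B.

    With n_b = len(B), the pair (drug i, protein k) sits at index i * n_b + k,
    and the entry for pairs (i, k) and (j, l) is A[i][j] * B[k][l].
    """
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]

def uniform_combination(kernels):
    """UNIMKL baseline: every kernel gets the same weight 1/P."""
    P = len(kernels)
    n = len(kernels[0])
    return [[sum(K[i][j] for K in kernels) / P for j in range(n)] for i in range(n)]

# Hypothetical fingerprints for two drugs.
drugs = [{1, 2, 3}, {2, 3, 4}]
K_drug = kernel_matrix(drugs, tanimoto)  # off-diagonal entry is 2/4 = 0.5

# Hypothetical protein kernels for two proteins (e.g. from two sequence kernels).
K_prot_1 = [[1.0, 0.2], [0.2, 1.0]]
K_prot_2 = [[1.0, 0.6], [0.6, 1.0]]

# One pairwise kernel per (drug kernel, protein kernel) combination.
K_pair_1 = kronecker(K_drug, K_prot_1)
K_pair_2 = kronecker(K_drug, K_prot_2)
K_uniform = uniform_combination([K_pair_1, K_pair_2])
```

For 50 drugs and 50 proteins each pairwise kernel has 2500 rows and columns, so in practice the matrices would be built with a numerical library rather than nested lists.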

Prerequisite: Programming skills (MATLAB, R, Python), basic knowledge of machine
learning. Some knowledge of chemoinformatics will be beneficial.

References
[1] Ding H, Takigawa I, Mamitsuka H, Zhu S. Similarity-based machine learning methods for
predicting drug-target interactions: a brief review. Briefings in Bioinformatics 2014; 15(5):
734–47.
[2] Cichonska A, Rousu J, Aittokallio T. Identification of drug candidates and repurposing
opportunities through compound–target interaction networks. Expert Opinion on Drug Discovery
2015; 10(12): 1333–45.
[3] Yamanishi Y, Araki M, Gutteridge A, Honda W, Kanehisa M. Prediction of drug-target
interaction networks from the integration of chemical and genomic spaces. Bioinformatics
2008; 24(13): i232–40.
[4] Giguere S, Marchand M, Laviolette F, Drouin A, Corbeil J. Learning a peptide-protein
binding affinity predictor with kernel ridge regression. BMC Bioinformatics 2013; 14(1): 82.
[5] Cortes C, Mohri M, Rostamizadeh A. Algorithms for learning kernels based on centered
alignment. Journal of Machine Learning Research 2012; 13(Mar): 795-828.
[6] Metz JT, Johnson EF, Soni NB et al. Navigating the kinome. Nature Chemical Biology
2011; 7(4): 200–2.

4 Differential gene expression analysis
Instructor: Markus Heinonen ([email protected])

Background: In differential gene expression analysis, statistical methods are applied
to find which genes are over- or under-expressed with respect to control baseline expression
levels. These results are subsequently analysed for biological significance by inspecting
the functional annotations of these genes to gain insight into cellular processes of interest.
In static differential testing, expression matrices are compared using well-defined statistics.
In dynamic differential testing, time series or interpolation models over time are compared
using frequentist or Bayesian statistics. In both cases a large-scale view of the expression
patterns of thousands of genes emerges.
The key question in differential analysis is the choice of model for the expression patterns. Genes
commonly exhibit non-stationarity, where the underlying dynamics can change abruptly
due to perturbation or regulation. The sparse and often irregularly sampled data warrant careful
modeling of the signals. Typically, the underlying model family for interpolation and
data representation is Gaussian processes. Differential expression can be tested against
a constant level, between two conditions, or between multiple conditions.

Goal: The goal of the project is to model gene expression time series with Gaussian pro-
cesses and apply differential testing to find differentially regulated genes between conditions.

Materials and methods: In this project the response of Arabidopsis plant gene expression
to Botrytis infection is analysed. The gene expression time series are modeled using
Gaussian processes, and two-sample interval testing is carried out to find which genes are
differentially expressed in the infection response, and when these genes are differentially expressed.
The analysis results in a temporal cascade of differential gene expression.
The plant gene expression measurements are large-scale and of high quality, with numerous
biological and technical replicates. The data is located in the GEO database at
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE39598. The dataset consists of 22 time
points (2, 4, ..., 48 hours) for infected and normal plant cells, with 4 biological replicates (plants)
and 3 technical replicates for almost 10,000 gene probes. The GP modeling can be performed
with any GP implementation (e.g. gpml/gpstuff on Matlab, gptk/gpfit on R, pygp/gpy on
Python). A suitable learning criterion, such as marginal likelihood or cross-validation, should
be used. An appropriate kernel prior should be chosen as well, with the Gaussian kernel being
a common choice. Two-sample testing should be implemented according to the Bayesian
EMLL framework (see slides). Finally, the differentially expressed genes can be studied
in many ways. These include visualisation over time, clustering of their expression patterns,
or considering their functional classifications (such as GO terms, KEGG pathways, InterPro
families or PANTHER functional classification), which are found in several databases,
for instance the DAVID and BioGPS web servers.
Optionally, the student can experiment with non-stationary GPs, where the observation
noise or signal variance is time-dependent. The GPstuff package contains an implementation
of non-stationary GPs. The goal is to analyse which gene expression time series warrant
a non-stationary GP, and to analyse the model improvement and runtime effects of adding
non-stationarity [see Tolvanen et al. 2014].
More detailed instructions will be available from the instructor.
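
To make the GP modelling step concrete, the sketch below computes the log marginal likelihood of a zero-mean GP with a Gaussian kernel and uses it to compare one shared GP for both conditions against two independent GPs. This is only an illustration in the spirit of a Bayes-factor two-sample test; it is not the EMLL procedure from the slides, and the hyperparameter values, function names and toy expression profiles are assumptions. Hyperparameters are fixed here, whereas the project asks for them to be learned.

```python
import math

def gaussian_kernel(t1, t2, signal_var=1.0, lengthscale=5.0):
    """Gaussian (squared exponential) covariance between two time points."""
    return signal_var * math.exp(-0.5 * (t1 - t2) ** 2 / lengthscale ** 2)

def cholesky(A):
    """Lower-triangular L with L L^T = A, for symmetric positive definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def log_marginal_likelihood(times, values, noise_var=0.1, **kern):
    """log N(values | 0, K + noise_var * I) for a zero-mean GP."""
    n = len(times)
    K = [[gaussian_kernel(times[i], times[j], **kern) + (noise_var if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    L = cholesky(K)
    z = []  # forward substitution: solve L z = values
    for i in range(n):
        z.append((values[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    quad = sum(v * v for v in z)  # values^T (K + noise)^-1 values
    logdet = 2.0 * sum(math.log(L[i][i]) for i in range(n))
    return -0.5 * quad - 0.5 * logdet - 0.5 * n * math.log(2.0 * math.pi)

def log_bayes_factor(times, control, infected, **kw):
    """Independent-GPs model versus one shared GP; larger means more differential."""
    separate = (log_marginal_likelihood(times, control, **kw)
                + log_marginal_likelihood(times, infected, **kw))
    shared = log_marginal_likelihood(times + times, control + infected, **kw)
    return separate - shared

# Hypothetical expression profiles for one gene (hours, arbitrary units).
times = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
control = [0.0, 0.1, -0.1, 0.0, 0.1, 0.0]
infected_similar = [0.0, 0.1, -0.1, 0.0, 0.1, 0.0]
infected_diverging = [0.0, 0.5, 1.5, 2.5, 3.0, 3.2]

bf_similar = log_bayes_factor(times, control, infected_similar)
bf_diverging = log_bayes_factor(times, control, infected_diverging)
```

The diverging profile yields the larger score because a single shared GP has to explain the gap between conditions as observation noise.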

Required background knowledge/skills: Programming skills (Matlab, R, Python),
basic statistics, basic Bayesian statistics and machine learning. Some knowledge of biology
will be useful.

References

• Heinonen et al (2015): Detecting time periods of differential gene expression using
Gaussian processes: An application to endothelial cells exposed to radiotherapy dose
fraction. Bioinformatics, 31:728–735.

• Rasmussen & Williams (2006): Gaussian processes for machine learning [sections 2, 4.2
and 5.4].

• Windram et al (2012): Arabidopsis Defense against Botrytis cinerea: Chronology and
Regulation Deciphered by High-Resolution Temporal Transcriptomic Analysis. The
Plant Cell, 24:3530–3557.

• Stegle et al (2010): A Robust Bayesian Two-Sample Test for Detecting Intervals of
Differential Gene Expression in Microarray Time Series. Journal of Computational
Biology, 17:355–367.

• Tolvanen et al (2014): Expectation propagation for nonstationary heteroscedastic Gaussian
process regression. In IEEE MLSP.

5 Learning molecular representation with an autoencoder
Instructor: Huibin Shen ([email protected])

Background: Current representations of molecules include binary vector representations
such as molecular fingerprints, string representations such as InChI
or SMILES, and 2D/3D graphs. Many applications related to molecules are based on some
kind of representation.
Deep learning is, at its core, about learning a better representation of the data.
The number of molecules in today's compound databases is on the scale of millions. With
deep learning approaches, it is possible to learn a compact and continuous vector representation
from such data.

Goal: In this project, we will use a variational autoencoder to learn such a representation
and test the representation in a metabolite identification pipeline. We will first test the
autoencoder on a subset of 5M molecules with a fingerprint representation or a SMILES string
representation. The code and data are already available. The student will run the code
on GPU nodes on Triton.
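
For orientation, the two ingredients that distinguish a variational autoencoder from a plain autoencoder, the reparameterised sampling of the latent code and the KL term of the loss, can be sketched as follows. This is a minimal sketch unrelated to the provided course code: the encoder and decoder networks are omitted, and the function names and toy values are hypothetical.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), one draw per latent dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))

def bernoulli_reconstruction_loss(x, x_hat, eps=1e-7):
    """Binary cross-entropy between fingerprint bits x and decoder probabilities x_hat."""
    return -sum(b * math.log(p + eps) + (1 - b) * math.log(1.0 - p + eps)
                for b, p in zip(x, x_hat))

# Hypothetical encoder output for one molecule with a 2-dimensional latent space.
mu = [1.0, 0.0]
log_var = [0.0, 0.0]
rng = random.Random(0)
z = reparameterize(mu, log_var, rng)

# Hypothetical fingerprint bits and decoder output probabilities.
x = [1, 0, 1]
x_hat = [0.9, 0.2, 0.7]

# Training minimises reconstruction loss plus KL (the negative evidence lower bound).
loss = bernoulli_reconstruction_loss(x, x_hat) + kl_to_standard_normal(mu, log_var)
```

Here the KL term equals 0.5, contributed entirely by the first latent dimension whose mean differs from zero.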

Prerequisite: Python and basic knowledge of machine learning and deep learning.

References
[1] Gómez-Bombarelli, R., Duvenaud, D., Hernández-Lobato, J. M., Aguilera-Iparraguirre, J.,
Hirzel, T. D., Adams, R. P., and Aspuru-Guzik, A. (2016). Automatic chemical design using
a data-driven continuous representation of molecules. arXiv preprint arXiv:1610.02415.
[2] Kalchbrenner, N., Grefenstette, E., and Blunsom, P. (2014). A convolutional neural
network for modelling sentences. arXiv preprint arXiv:1404.2188.
[3] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint
arXiv:1312.6114.
