CS-E4860 Projects 2017
Goal: In this project the student will implement and apply two different kernelized regres-
sion approaches to predict the retention time of molecules given their structure.
Methods and materials: For the project the student will be provided with a data set
containing the retention time measurements for 596 molecules. The molecular descrip-
tors and fingerprints will be given to the student. The student will implement the Kernel
Ridge Regression (KRR) and the Magnitude-preserving kernel regression (MPKR) [CMR07].
The student will apply both approaches to predict the retention times for the molecular
structures in the data set. The student will then compare the performance of KRR and MPKR
and investigate whether the magnitude-preserving error term leads to better retention time
prediction.
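As a rough illustration of the two regressors, the sketch below fits both on synthetic data
with NumPy. The KRR part is the standard closed form. The magnitude-preserving part is an
assumption: it minimises the squared error on pairwise differences, i.e. the sum over i, j of
((f(x_i) - f(x_j)) - (y_i - y_j))^2, plus a norm penalty, with constant factors absorbed
into lam; [CMR07] should be consulted for the exact algorithm. The function names, the
Gaussian kernel and the synthetic data are placeholders and not part of the provided material.

```python
import numpy as np


def rbf_kernel(A, B, gamma=0.5):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq_dists = (
        np.sum(A ** 2, axis=1)[:, None]
        + np.sum(B ** 2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def fit_krr(K, y, lam):
    """Kernel ridge regression: solve (K + lam * I) alpha = y."""
    n = K.shape[0]
    return np.linalg.solve(K + lam * np.eye(n), y)


def fit_mp(K, y, lam):
    """Assumed magnitude-preserving variant: solve (H K + lam * I) alpha = H y.

    H = I - 11^T / n is the centering matrix. It appears because, for the
    residual vector r = K alpha - y, sum_{i,j} (r_i - r_j)^2 = 2 n r^T H r.
    """
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.linalg.solve(H @ K + lam * np.eye(n), H @ y)


def predict(K_test_train, alpha):
    """Predictions are kernel-weighted sums of the dual coefficients."""
    return K_test_train @ alpha


# Synthetic stand-in for (descriptor vector, retention time) pairs.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=40)

K = rbf_kernel(X, X)
alpha_krr = fit_krr(K, y, lam=0.1)
alpha_mp = fit_mp(K, y, lam=0.1)
train_pred_krr = predict(K, alpha_krr)
train_pred_mp = predict(K, alpha_mp)
```

Because the pairwise loss ignores a constant shift of the targets, adding a constant to y
leaves alpha_mp unchanged. This is worth keeping in mind when the two models are compared
with an absolute error measure.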
Prerequisite: Basic knowledge of machine learning (especially kernel methods) and parameter
estimation (e.g. cross-validation), linear algebra, and programming skills in R, MATLAB
or Python. Some basic knowledge of molecular biology and chemoinformatics is beneficial.
References
[Aic+15] Fabian Aicheler et al. "Retention Time Prediction Improves Identification in
Nontargeted Lipidomics Approaches". In: Analytical Chemistry 87.15 (2015), pp. 7698–7704.
[CMR07] Corinna Cortes et al. "Magnitude-preserving Ranking Algorithms". In: Proceedings
of the 24th International Conference on Machine Learning. ICML '07. ACM, 2007.
url: http://doi.acm.org/10.1145/1273496.1273518.
[Fal+16] Federico Falchi et al. "Kernel-Based, Partial Least Squares Quantitative
Structure-Retention Relationship Model for UPLC Retention Time Prediction: A Useful Tool
for Metabolite Identification". In: Analytical Chemistry (2016).
[SNV15] Jan Stanstrup et al. "PredRet: Prediction of Retention Time by Direct Mapping
between Multiple Chromatographic Systems". In: Analytical Chemistry 87.18 (2015). PMID:
26289378, pp. 9421–9428. url: http://dx.doi.org/10.1021/acs.analchem.5b02287.
2 Metabolite identification from tandem mass spectra
Instructor: Céline Brouard ([email protected])
Background: Metabolites are small molecules involved in the biological processes of organ-
isms. Metabolite identification is an important problem in molecular biology. This problem
consists in identifying the molecular structures of the unknown metabolites that are present
in a biological sample. Information on these unknown metabolites can be obtained using
tandem mass spectrometry. Recent progress in metabolite identification has been obtained
using machine learning-based methods.
Goal: The goal of this project is to implement the CSI:FingerID method described
in the lecture. This method will be applied to the dataset used in the last CASMI¹ (Critical
Assessment of Small Molecule Identification) contest. The idea of this contest is to evaluate
different metabolite identification methods on a common dataset. A set of training examples
is provided and, for each given tandem mass spectrum, the correct molecular structure
has to be determined among a set of potential molecular candidates.
Materials and Methods: In this project, the student will implement the CSI:FingerID
method. During the learning phase, the training MS/MS spectra are used to train a set
of Support Vector Machine (SVM) classifiers, one for each molecular property in the
fingerprint. The SVM parameter C will be tuned using k-fold cross-validation on the training
set, independently for each molecular property. In the prediction phase, the fingerprints
of the unknown metabolites are predicted from their MS/MS spectra. The predicted fingerprints
are then compared to the fingerprints of the candidate molecular structures to find the best match.
The training dataset contains 234 tandem mass spectra and the challenge dataset consists
of 127 tandem mass spectra. A list of candidates is provided for each challenge spectrum.
For each molecule, fingerprints have been retrieved from PubChem and OpenBabel. Kernels
on tandem mass spectra will be provided as input.
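The final matching step of the prediction phase can be sketched as follows. This is only
an illustration: the scoring rule (fraction of agreeing fingerprint bits), the function name
and the toy fingerprints are assumptions, and the SVM training that produces the predicted
fingerprint is omitted.

```python
import numpy as np


def rank_candidates(predicted_fp, candidate_fps):
    """Return candidate indices sorted from best to worst match.

    The score is the fraction of fingerprint bits on which a candidate
    agrees with the prediction (an assumed scoring rule for illustration).
    """
    predicted_fp = np.asarray(predicted_fp)
    candidate_fps = np.asarray(candidate_fps)
    scores = np.mean(candidate_fps == predicted_fp, axis=1)
    # Stable sort on the negated scores keeps the input order among ties.
    order = np.argsort(-scores, kind="stable")
    return order, scores


# Toy predicted fingerprint and three hypothetical candidate structures.
predicted = np.array([1, 0, 1, 1, 0, 0])
candidates = np.array([
    [1, 0, 1, 0, 0, 0],   # 5 of 6 bits agree
    [0, 1, 0, 0, 1, 1],   # 0 of 6 bits agree
    [1, 0, 1, 1, 0, 0],   # exact match
])
order, scores = rank_candidates(predicted, candidates)
```

In this toy case the exact match is ranked first. A score that weights each bit by how
reliably its classifier predicts it would be a natural refinement.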
References
[1] Heinonen, M., Shen, H., Zamboni, N., and Rousu, J. (2012). Metabolite identification
and molecular fingerprint prediction through machine learning. Bioinformatics, 28(18):2333-2341.
[2] Shen, H., Dührkop, K., Böcker, S., and Rousu, J. (2014). Metabolite identification through
multiple kernel learning on fragmentation trees. Bioinformatics, 30(12):i157-i164.
[3] Dührkop, K., Shen, H., Meusel, M., Rousu, J., and Böcker, S. (2015). Searching molecular
structure databases with tandem mass spectra using CSI:FingerID. Proceedings of the
National Academy of Sciences, 112(41):12580-12585.
¹ http://www.casmi-contest.org
Goal: The goal of the project is to compute several protein kernels as well as drug kernels,
and then use them in a multiple kernel learning (MKL) regression framework to predict
drug-protein binding affinities.
Materials and Methods: The data set consists of 50 drug compounds and 50 protein
targets, which is a subset of the data from the experimental study of Metz et al. (2011).
Drug-protein interactions (DPIs) are represented as real values reflecting how tightly
a compound binds to a protein. The student will calculate Tanimoto kernels for drug compounds
based on several fingerprints implemented in the ChemmineR R package. For proteins,
the Smith-Waterman amino acid sequence alignment as well as the Generic String kernel will
be adopted. The student can also choose to compute other molecular descriptors. Then, pairwise
kernels that directly relate drug-protein pairs will be constructed by taking the Kronecker
product of each pair of drug kernel and protein kernel. The student will use the pairwise
kernels with the two-stage MKL algorithm ALIGNF. In the first stage, the kernel mixture weights
are determined by maximising the centred alignment, i.e. a matrix similarity measure, between
the combined kernel and the ideal, so-called target kernel derived from the label values.
In the second stage, the combined kernel is used with Kernel Ridge Regression (KRR) as
a prediction model. The student will be provided with a script for calculating the kernel
mixture weights (first stage) but should implement KRR (second stage). The UNIMKL algorithm
will form a baseline model, where all kernel mixture weights are equal to 1/P, P being
the number of input kernels. The student will implement nested cross-validation to tune
the regularisation parameter λ of KRR and assess the predictive performance of the model.
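The kernel construction described above can be sketched with NumPy as follows. The sketch
builds the pairwise kernels with a Kronecker product, forms the uniform (UNIMKL-style)
combination, and evaluates the centred alignment of each pairwise kernel with a target
kernel. Taking the target kernel as the outer product y y^T of the label vector is
an assumption here, as are the function names and the random toy kernels. The provided
ALIGNF script and the KRR stage are not reproduced.

```python
import numpy as np


def center_kernel(K):
    """Centre a kernel matrix in feature space: H K H with H = I - 11^T / n."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H


def centered_alignment(K1, K2):
    """Centred alignment <K1c, K2c>_F / (||K1c||_F * ||K2c||_F)."""
    K1c, K2c = center_kernel(K1), center_kernel(K2)
    return np.sum(K1c * K2c) / (np.linalg.norm(K1c) * np.linalg.norm(K2c))


def pairwise_kernels(drug_kernels, protein_kernels):
    """Kronecker product of every (drug kernel, protein kernel) pair."""
    return [np.kron(Kd, Kp) for Kd in drug_kernels for Kp in protein_kernels]


def uniform_combination(kernels):
    """UNIMKL-style baseline: every kernel gets weight 1 / P."""
    return sum(kernels) / len(kernels)


rng = np.random.default_rng(0)


def random_psd(n, d):
    """Random positive semidefinite matrix standing in for a real kernel."""
    F = rng.normal(size=(n, d))
    return F @ F.T


# Toy sizes: 4 drugs, 5 proteins, two drug kernels and one protein kernel.
drug_kernels = [random_psd(4, 3), random_psd(4, 3)]
protein_kernels = [random_psd(5, 3)]

# Affinity matrix Y[i, k] for drug i and protein k. Row-major flattening
# matches the pair ordering of np.kron(Kd, Kp): pair (i, k) -> i * 5 + k.
Y = rng.normal(size=(4, 5))
y = Y.ravel()

pair_kernels = pairwise_kernels(drug_kernels, protein_kernels)
K_uniform = uniform_combination(pair_kernels)
target = np.outer(y, y)
alignments = [centered_alignment(Kp, target) for Kp in pair_kernels]
```

With 50 drugs and 50 proteins each pairwise kernel is a 2500 x 2500 matrix, which is still
small enough to store and invert directly.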
References
[1] Ding H, Takigawa I, Mamitsuka H, Zhu S. Similarity-based machine learning methods for
predicting drug-target interactions: a brief review. Briefings in Bioinformatics 2014; 15(5):
734–47.
[2] Cichonska A, Rousu J, Aittokallio T. Identification of drug candidates and repurposing
opportunities through compound-target interaction networks. Expert Opinion on Drug
Discovery 2015; 10(12): 1333–45.
[3] Yamanishi Y, Araki M, Gutteridge A, Honda W, Kanehisa M. Prediction of drug-target
interaction networks from the integration of chemical and genomic spaces. Bioinformatics
2008; 24(13): i232–40.
[4] Giguere S, Marchand M, Laviolette F, Drouin A, Corbeil J. Learning a peptide-protein
binding affinity predictor with kernel ridge regression. BMC Bioinformatics 2013; 14(1): 82.
[5] Cortes C, Mohri M, Rostamizadeh A. Algorithms for learning kernels based on centered
alignment. Journal of Machine Learning Research 2012; 13(Mar): 795-828.
[6] Metz JT, Johnson EF, Soni NB et al. Navigating the kinome. Nature Chemical Biology
2011; 7(4): 200–2.
4 Differential gene expression analysis
Instructor: Markus Heinonen ([email protected])
Goal: The goal of the project is to model gene expression time series with Gaussian pro-
cesses and apply differential testing to find differentially regulated genes between conditions.
Materials and methods: In this project the response of Arabidopsis plant gene expression
to Botrytis infection is analysed. The gene expression time series are modeled using
Gaussian processes, and two-sample interval testing is carried out to find the genes that are
differentially expressed in the infection response, and when these genes are differentially
expressed. The analysis results in a temporal cascade of differential gene expression.
The plant gene expression measurements are large-scale and of high quality, with numerous
biological and technical replicates. The data is located in the GEO database at
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE39598. The dataset consists of 24 time
points (2, 4, ..., 48 hours) for infected and normal plant cells, with 4 biological replicates
(plants) and 3 technical replicates, for almost 10,000 gene probes. The GP modeling can be
performed with any GP implementation (e.g. gpml/gpstuff in Matlab, gptk/gpfit in R, pygp/gpy
in Python). A suitable learning criterion, such as marginal likelihood or cross-validation,
should be used. An appropriate kernel for the GP prior should be chosen as well, with
the Gaussian kernel being a common choice. Two-sample testing should be implemented according
to the Bayesian EMLL framework (see slides). Finally, the differentially expressed genes can
be studied in many ways. These include visualisation over time, clustering of their expression
patterns, and considering their functional classifications (such as GO terms, KEGG pathways,
InterPro families or PANTHER functional classification), which are found in several databases,
for instance on the DAVID and BioGPS web servers.
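To make the GP modeling step concrete, the sketch below computes the log marginal likelihood
of a zero-mean GP with a Gaussian kernel and uses it for a simple two-sample comparison: two
independent GPs versus one GP shared by both conditions. This comparison is a simplified
stand-in and not the EMLL framework from the slides. The hyperparameter values, function names
and synthetic expression profiles are assumptions, and in practice the hyperparameters would
be learned, for example by maximising the marginal likelihood.

```python
import numpy as np


def gaussian_kernel(t1, t2, signal_var, length_scale):
    """Gaussian (squared exponential) covariance between two sets of times."""
    diff = t1[:, None] - t2[None, :]
    return signal_var * np.exp(-0.5 * (diff / length_scale) ** 2)


def log_marginal_likelihood(t, y, signal_var=1.0, length_scale=8.0, noise_var=0.1):
    """Log marginal likelihood of a zero-mean GP with Gaussian noise."""
    n = len(t)
    K = gaussian_kernel(t, t, signal_var, length_scale) + noise_var * np.eye(n)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (
        -0.5 * y @ alpha
        - np.sum(np.log(np.diag(L)))   # equals 0.5 * log det K
        - 0.5 * n * np.log(2.0 * np.pi)
    )


def log_ratio(t, y_control, y_infected, **hyper):
    """Log ML of two separate GPs minus log ML of one shared GP."""
    separate = (
        log_marginal_likelihood(t, y_control, **hyper)
        + log_marginal_likelihood(t, y_infected, **hyper)
    )
    shared = log_marginal_likelihood(
        np.concatenate([t, t]), np.concatenate([y_control, y_infected]), **hyper
    )
    return separate - shared


# Synthetic profiles on a 2, 4, ..., 48 hour grid (placeholder data).
rng = np.random.default_rng(1)
t = np.arange(2.0, 50.0, 2.0)
base = np.sin(t / 8.0)
control = base + 0.1 * rng.standard_normal(t.size)
infected_same = base + 0.1 * rng.standard_normal(t.size)
# Expression that rises smoothly after roughly 24 hours in the infected plant.
infected_diff = base + 2.0 / (1.0 + np.exp(-(t - 24.0) / 3.0)) + 0.1 * rng.standard_normal(t.size)

ratio_same = log_ratio(t, control, infected_same)
ratio_diff = log_ratio(t, control, infected_diff)
```

A larger ratio favours the model with separate processes, so genes can be ranked by this
value. Replicates can be included by simply repeating time points in t.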
Optionally, the student can experiment with non-stationary GPs, where the observation
noise or signal variance is time-dependent. The GPstuff package contains an implementation
of non-stationary GPs. The goal is to analyse which gene expression time series warrant
a non-stationary GP, and to analyse the model improvement and runtime effects of adding
non-stationarity [see Tolvanen et al. 2014].
More detailed instructions will be available from the instructor.
References
• Rasmussen & Williams (2006): Gaussian processes for machine learning [sections 2, 4.2
and 5.4].
5 Learning molecular representation with an autoencoder
Instructor: Huibin Shen ([email protected])
Goal: In this project, we will use a variational autoencoder to learn a molecular
representation and test the representation in a metabolite identification pipeline. We will
first test the autoencoder on a subset of 5M molecules with a fingerprint representation
or a SMILES string representation. The code and data are already available. The student will
run the code on GPU nodes on Triton.
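For orientation, the two ingredients that distinguish a variational autoencoder from a plain
autoencoder can be sketched in a few lines of NumPy: the reparameterised sampling of the
latent code and the KL term of the loss. This is a generic illustration and not the project
code. The encoder and decoder networks are omitted, and the Bernoulli reconstruction loss
assumes binary fingerprint inputs.

```python
import numpy as np


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    return -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=-1)


def bernoulli_reconstruction_loss(x, x_prob, eps=1e-7):
    """Negative log-likelihood of binary fingerprint bits x under x_prob."""
    x_prob = np.clip(x_prob, eps, 1.0 - eps)
    return -np.sum(x * np.log(x_prob) + (1.0 - x) * np.log(1.0 - x_prob), axis=-1)


rng = np.random.default_rng(0)

# Hypothetical encoder output for a batch of 2 molecules, 3 latent dimensions.
mu = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
log_var = np.zeros((2, 3))
z = reparameterize(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)

# Hypothetical decoder output (bit probabilities) for 4-bit fingerprints.
x = np.array([[1.0, 0.0, 1.0, 0.0], [0.0, 0.0, 1.0, 1.0]])
x_prob = np.array([[0.9, 0.1, 0.8, 0.2], [0.5, 0.5, 0.5, 0.5]])
loss = bernoulli_reconstruction_loss(x, x_prob) + kl
```

The training objective is the sum of the reconstruction loss and the KL term, averaged over
the batch.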
Prerequisite: Python and basic knowledge of machine learning and deep learning.
References
[1] Gómez-Bombarelli, R., Duvenaud, D., Hernández-Lobato, J. M., Aguilera-Iparraguirre, J.,
Hirzel, T. D., Adams, R. P., and Aspuru-Guzik, A. (2016). Automatic chemical design using
a data-driven continuous representation of molecules. arXiv preprint arXiv:1610.02415.
[2] Kalchbrenner, N., Grefenstette, E., and Blunsom, P. (2014). A convolutional neural
network for modelling sentences. arXiv preprint arXiv:1404.2188.
[3] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint
arXiv:1312.6114.