
Reviews

A guide to machine learning for biologists

Joe G. Greener 1,2, Shaun M. Kandathil 1,2, Lewis Moffat 1 and David T. Jones 1 ✉

Abstract | The expanding scale and inherent complexity of biological data have encouraged
a growing use of machine learning in biology to build informative and predictive models of the
underlying biological processes. All machine learning techniques fit models to data; however,
the specific methods are quite varied and can at first glance seem bewildering. In this Review,
we aim to provide readers with a gentle introduction to a few key machine learning techniques,
including the most recently developed and widely used techniques involving deep neural
networks. We describe how different techniques may be suited to specific types of biological data,
and also discuss some best practices and points to consider when one is embarking on experiments
involving machine learning. Some emerging directions in machine learning methodology are
also discussed.

Humans make sense of the world around them by observing it, and learning to predict what might happen next. Consider a child learning to catch a ball: the child (usually) knows nothing about the physical laws that govern the motion of a thrown ball; however, by a process of observation, trial and error, the child adjusts his or her understanding of the ball’s motion, and how to move his or her body, until he or she is able to catch it reliably. In other words, the child has learned how to catch the ball by building a sufficiently accurate and useful ‘model’ of the process, by repeatedly testing this model against the data and by making corrections to the model to make it better.

‘Machine learning’ refers broadly to the process of fitting predictive models to data or of identifying informative groupings within data. The field of machine learning essentially attempts to approximate or imitate humans’ ability to recognize patterns, albeit in an objective manner, using computation. Machine learning is particularly useful when the dataset one wishes to analyse is too large (many individual data points) or too complex (contains a large number of features) for human analysis and/or when it is desired to automate the process of data analysis to establish a reproducible and time-efficient pipeline. Data from biological experiments frequently possess these properties; biological datasets have grown enormously in both size and complexity in the past few decades, and it is becoming increasingly important not only to have some practical means of making sense of this data abundance but also to have a sound understanding of the techniques that are used. Machine learning has been used in biology for a number of decades, but it has steadily grown in importance to the point where it is used in nearly every field of biology. However, only in the past few years has the field taken a more critical look at the available strategies and begun to assess which methods are most appropriate in different scenarios, or even whether they are appropriate at all.

This Review aims to inform biologists on how they can start to understand and use machine learning techniques. We do not intend to present a thorough literature review of articles using machine learning for biological problems1, or to describe the detailed mathematics of various machine learning methods2,3. Instead, we focus on linking particular techniques to different types of biological data (similar reviews are available for specific biological disciplines; see, for example, refs 4–11). We also attempt to distil some best practices of how to practically go about the process of training and improving a model. The complexity of biological data presents pitfalls as well as opportunities for their analysis using machine learning techniques. To address these, we discuss the widespread issues that affect the validity of studies, with guidance on how to avoid them. The bulk of the Review is devoted to the description of a number of machine learning techniques, and in each case we provide examples of the appropriate application of the method and how to interpret the results. The methods discussed include traditional machine learning methods, as these are still the best choices in many cases, and deep learning with artificial neural networks, which are emerging as the most effective methods for many tasks. We finish by describing what the future holds for incorporating machine learning in data analysis pipelines in biology.

There are two goals when one is using machine learning in biology. The first is to make accurate predictions

Deep learning
Machine learning methods based on neural networks. The adjective ‘deep’ refers to the use of many hidden layers in the network, two hidden layers as a minimum but usually many more than that. Deep learning is a subset of machine learning, and hence of artificial intelligence more broadly.

Artificial neural networks
A collection of connected nodes loosely representing neuron connectivity in a biological brain. Each node is part of a layer and represents a number calculated from the previous layer. The connections, or edges, allow a signal to flow from the input layer to the output layer via hidden layers.

1 Department of Computer Science, University College London, London, UK.
2 These authors contributed equally: Joe G. Greener, Shaun M. Kandathil.
✉ e-mail: [email protected]
https://doi.org/10.1038/s41580-021-00407-0

40 | January 2022 | volume 23 www.nature.com/nrm


where experimental data are lacking, and use these predictions to guide future research efforts. However, as scientists we seek to understand the world, and so the second goal is to use machine learning to further our understanding of biology. Throughout this guide we discuss how these two goals often come into conflict in machine learning, and how to extract understanding from models that are often treated as ‘black boxes’ because their inner workings are difficult to understand12.

Key concepts
We first introduce a number of key concepts in machine learning. Where possible, we illustrate these concepts with examples taken from biological literature.

General terms. A dataset comprises a number of data points or instances, each of which can be thought of as a single observation from an experiment. Each data point is described by a (usually fixed) number of features. Examples of such features include length, time, concentration and gene expression level. A machine learning task is an objective specification for what we want a machine learning model to accomplish. For example, for an experiment investigating the expression of genes over time, we might want to predict the rate of conversion of a specific metabolite into another species. In this case, the features ‘gene expression level’ and ‘time’ could be termed input features or simply inputs for the model, and ‘conversion rate’ would be the desired output of the model; that is, the quantity we are interested in predicting. A model can have any number of input and output features. Features can be either continuous (taking continuous numerical values) or categorical (taking only discrete values). Quite often, categorical features are simply binary and are either true (1) or false (0).

Supervised and unsupervised learning. ‘Supervised machine learning’ refers to the fitting of a model to data (or a subset of data) that have been labelled, where there exists some ground truth property, which is usually experimentally measured or assigned by humans. Examples include protein secondary structure prediction13 and prediction of genome accessibility to genome-regulatory factors14. In both cases, the ground truth is derived ultimately from laboratory observations, but often these raw data are preprocessed in some way. In the case of secondary structure, for example, the ground truth data are derived from analysing protein crystal structure data in the Protein Data Bank, and in the latter case, the ground truth comes from data derived from DNA-sequencing experiments. By contrast, unsupervised learning methods are able to identify patterns in unlabelled data, without the need to provide the system with the ground truth information in the form of predetermined labels, such as finding subsets of patients with similar expression levels in a gene expression study15 or predicting mutation effects from gene sequence co-variation16. Sometimes the two approaches are combined in semi-supervised learning, where small amounts of labelled data are combined with large amounts of unlabelled data. This can improve performance in cases where labelled data are costly to obtain.

Classification, regression and clustering problems. When a problem involves assigning data points to a set of discrete categories (for example, ‘cancerous’ or ‘not cancerous’), the problem is called a ‘classification problem’, and any algorithm that performs such classification can be said to be a classifier. By contrast, regression models output a continuous set of values, such as predicting the free energy change of folding after mutating a residue in a protein17. Continuous values can be thresholded or otherwise discretized, meaning that it is often possible to reformulate regression problems as classification problems. For example, the free energy change mentioned above can be binned into ranges of values that are favourable or unfavourable for protein stability. Clustering methods are used to predict groupings of similar data points in a dataset, and are usually based on some measure of similarity between data points. They are unsupervised methods that do not require that the examples in a dataset have labels. For example, in a gene expression study, clustering could find subsets of patients with similar gene expression.

Classes and labels. The discrete set of values returned by a classifier can be made to be mutually exclusive, in which case they are called ‘classes’. Where these values need not be mutually exclusive, they are termed ‘labels’. For example, a residue in a protein structure can be in only one of multiple secondary structure classes, but could simultaneously be assigned the non-exclusive labels of being α-helical and transmembrane. Classes and labels are usually represented by an encoding (for example, a one-hot encoding).

Loss or cost functions. The output or outputs of a machine learning model are never ideal and will diverge from the ground truth. The mathematical functions that measure this deviation or in more general terms that measure the amount of ‘disagreement’ between the obtained and ideal outputs are referred to as ‘loss functions’ or ‘cost functions’. In supervised learning settings, the loss function would be a measure of deviation of the output relative to the ground truth output. Examples include mean squared error loss for regression problems and binary cross entropy for classification problems.

Parameters and hyperparameters. Models are essentially mathematical functions that operate on some set of input features and produce one or more output values or features. To be able to learn on training data, models contain adjustable parameters whose values can be changed over the training process to achieve the best performance of the model (see later). In a simple regression model, for example, each feature has a parameter that is multiplied by the feature value, and these are added together to make the prediction. Hyperparameters are adjustable values that are not considered part of the model itself in that they are not updated during training, but which still have an impact on the training of the model and its

Ground truth
The true value that the output of a machine learning model is compared with to train the model and test performance. These data usually come from experimental data (for example, accessibility of a region of DNA to transcription factors) or expert human annotation (for example healthy or pathological medical image).

Encoding
Any scheme for numerically representing (often categorical) data in a form suitable for use in a machine learning model. An encoding can be a fixed numerical representation (for example, one-hot or continuous encoding) or can be defined using parameters that are trained along with the rest of a model.

One-hot encoding
An encoding scheme that represents a fixed set of n categorical inputs using n unique n-dimensional vectors, each with one element set to 1 and the rest set to 0. For example, the set of three letters (A,B,C) could be represented by the three vectors [1,0,0], [0,1,0] and [0,0,1], respectively.

Mean squared error
A loss function that calculates the average squared difference between the predicted values and the ground truth. This function heavily penalizes outliers because it increases rapidly as the difference between a predicted value and the ground truth grows.

Binary cross entropy
The most common loss function for training a binary classifier; that is, for tasks aimed at answering a question with only two choices (such as cancer versus non-cancer); sometimes called ‘log loss’.
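The one-hot encoding, mean squared error and binary cross entropy definitions above can be made concrete with a few lines of plain Python. This is a minimal sketch, not code from the Review: the function names, the toy numbers and the clipping constant `eps` are all assumptions chosen for illustration.

```python
import math

def one_hot(category, categories):
    """Return a one-hot vector for `category` within the ordered list `categories`."""
    return [1.0 if c == category else 0.0 for c in categories]

def mean_squared_error(predicted, truth):
    """Average squared difference between predictions and ground truth."""
    return sum((p - t) ** 2 for p, t in zip(predicted, truth)) / len(truth)

def binary_cross_entropy(predicted_probs, truth, eps=1e-12):
    """Average log loss for binary ground truth (0 or 1) and predicted probabilities."""
    total = 0.0
    for p, t in zip(predicted_probs, truth):
        p = min(max(p, eps), 1.0 - eps)  # clip so that log(0) is never evaluated
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(truth)

# The three letters (A, B, C) from the glossary example.
letters = ["A", "B", "C"]
encoded = [one_hot(x, letters) for x in letters]

# The same predictions scored against a small error and against an outlier.
mse_small = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])     # one error of 1
mse_outlier = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 13.0])  # one error of 10

# Confident correct predictions versus confident wrong predictions.
bce_confident_right = binary_cross_entropy([0.9, 0.1], [1, 0])
bce_confident_wrong = binary_cross_entropy([0.1, 0.9], [1, 0])
```

A tenfold larger error on a single point raises the mean squared error a hundredfold here, which is the outlier sensitivity the glossary describes. Binary cross entropy is likewise much larger when the model is confidently wrong than when it is confidently right.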

Nature Reviews | Molecular Cell Biology volume 23 | January 2022 | 41


performance. A common example of a hyperparameter is the learning rate, which controls the rate or speed with which the model’s parameters are altered during training.

Training, validation and testing. Before being used to make predictions, models require training, which involves automatically adjusting the parameters of a model to improve its performance. In a supervised learning setting, this involves modifying the parameters so the model performs well on a training dataset, by minimizing the average value of the loss or cost function (described earlier). Usually, a separate validation dataset is used to monitor but not influence the training process so as to detect potential overfitting (see the next section). In unsupervised settings, a cost function is still minimized, although it does not operate on ground truth outputs. Once a model is trained, it can be tested on data not used for training. See Box 1 for a guide to the overall process of training and how to split the data appropriately between training and testing sets. A flowchart to help the overall process is shown in Fig. 1, and some of the concepts in model training are shown in Fig. 2.

Overfitting and underfitting. The purpose of fitting a model to training data is to capture the ‘true’ relationship between the variables in the data, such that the model has predictive power on unseen (non-training) data. Models that are either overfitted or underfitted will produce poor predictions on data not in the training set (Fig. 2d). An overfitted model will produce excellent results on data in the training set (usually as a result of having too many parameters), but will produce poor results on unseen data. The overfitted model in Fig. 2d passes exactly through every training point, and so its prediction error on the training set will be zero. However, it is evident that this model has ‘memorized’ the training data and is unlikely to produce good results on unseen data. By contrast, an underfitted model fails to adequately capture the relationships between the variables in the data. This could be due to an incorrect choice of model type, incomplete or incorrect assumptions about the data, too few parameters in the model and/or an incomplete training process. The underfitted model depicted in Fig. 2d is inadequate for the data it is trying to fit; in this case it is evident that the variables have a non-linear relationship, which cannot be adequately described with a simple linear model and so a non-linear model would be more appropriate.

Inductive bias and the bias–variance trade-off. The ‘inductive bias’ of a model refers to the set of assumptions made in the learning algorithm that leads it to favour a particular solution to a learning problem over others. It can be thought of as the model’s preference for a particular type of solution to a learning problem over others. This preference is often programmed into the model using its specific mathematical form and/or by using a particular loss function. For example, the inductive bias of recurrent neural networks (RNNs; discussed later) is that there are sequential dependencies in the input data such as the concentration of a metabolite over time. This dependence is explicitly accounted for in the mathematical form of an RNN. Different inductive biases in different model types make them more suitable and usually better performing for specific types of data. Another important concept is the trade-off between bias and variance. A model with a high bias can be said to have stronger constraints on the trained model, whereas a model with low bias makes fewer assumptions about the property being modelled, and can, in theory, model a wide variety of function types. The variance of a model describes how much the trained model changes in response to training it on different training datasets. In general, we desire models with very low bias and low variance, although these objectives are often in conflict as a model with low bias will often learn different signals on different training sets. Controlling the bias–variance trade-off is key to avoiding overfitting or underfitting.

Traditional machine learning
We now discuss several key machine learning methods, with an emphasis on their particular strengths and weaknesses. A comparison of different machine learning approaches is shown in Table 1. We begin with a discussion of methods not based on neural networks,

Box 1 | Doing machine learning

Here we outline the steps that should be taken when one is training a machine learning model. There is surprisingly little guidance available on the model selection and training process146,147, with descriptions of the stepping stones and failed models rarely making it into published research articles. The first step, before touching any machine learning code, should be to fully understand the data (inputs) and prediction task (outputs) at hand. This means a biological understanding of the question, such as knowing the origin of the data and the sources of noise, and having an idea of how the output could theoretically be predicted from the input using biological principles. For example, it can be reasoned that different amino acids might have preferences for particular secondary structures in proteins, so it makes sense to predict secondary structure from amino acid frequencies at each position in a protein sequence. It is also important to know how the inputs and outputs are stored computationally. Are they normalized to prevent one feature having an unduly large influence on prediction? Are they encoded as binary variables or continuously? Are there duplicate entries? Are there missing data elements?

Next, the data should be split to allow training, validation and testing. There are a number of ways to do this, two of which are shown in Fig. 2a. The training set is used to directly update the parameters of the model being trained. The validation set, usually around 10% of the available data, is used to monitor training, select hyperparameters and prevent the model overfitting to the training data. Often k-fold cross-validation is used: the training set is split into k evenly sized partitions (for example, five or ten) to form k different training and validation sets, and the performance is compared across each partition to select the best hyperparameters. The test set, sometimes called the ‘hold-out set’, typically also around 10% of the available data, is used to assess the performance of the model on data not used for training or validation (that is, estimate its expected real-world performance). The test set should be used only once, at the very end of the study, or as infrequently as possible27,38 to avoid tuning the model to fit the test set. See the section Data leakage for issues to consider when making a fair test set.

The next step is model selection, which depends on the nature of the data and the prediction task, and is summarized in Fig. 1. The training set is used to train the model following best practices of the software framework being used. Most methods have a handful of hyperparameters that need to be tuned to achieve the best performance. This can be done using random search or grid search, and can be combined with k-fold cross-validation as outlined above27. Model ensembling should be considered, where the outputs of a number of similar models are simply averaged to give a relatively reliable way to boost overall accuracy of the modelling task. Finally, the accuracy of the model on the test set (see above) should be assessed.
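The splitting and tuning steps outlined in Box 1 can be sketched with scikit-learn, one of the packages the Review names. This is a minimal sketch under stated assumptions rather than the authors' protocol: the synthetic dataset, the choice of a support vector classifier, the grid values and the use of five folds are all illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a labelled biological dataset (an assumption for illustration).
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

# Hold out about 10% of the data as a test set that is used only once, at the end.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0
)

# Scaling inside the pipeline means the normalization statistics are computed
# from the training folds only, never from validation or test data.
model = make_pipeline(StandardScaler(), SVC())

# Grid search over hyperparameters, scored by 5-fold cross-validation
# on the training portion of the data.
param_grid = {
    "svc__kernel": ["linear", "rbf"],
    "svc__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(model, param_grid, cv=5, scoring="accuracy")
search.fit(X_trainval, y_trainval)

best_params = search.best_params_            # hyperparameters chosen by cross-validation
cv_accuracy = search.best_score_             # mean validation accuracy of the best setting
test_accuracy = search.score(X_test, y_test) # single final assessment on the hold-out set
```

Keeping the test set out of `search.fit` is the point of the design: hyperparameters are selected on validation folds, and the hold-out score is read once as an estimate of performance on unseen data.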


[Figure 1: flowchart. Top row (overall procedure): Define task, Obtain data, Form test set, Select model, Train, Tune, Test; two of these steps are marked ‘(if supervised)’. Decision tree nodes and branch labels: Sufficient data?; Connections between entities?; Spatial or image data?; Sequential data?; Small, fixed number of features or no data labels; Just visualizing; Predicting class or value?; Labelled data?. Outcomes: Get more data; Graph convolutional network; 2D/3D convolutional neural network; Recurrent neural network/1D convolutional neural network; Multilayer perceptron; Dimensionality reduction; Regression methods; Clustering; Support vector machine/random forest/gradient boosting.]

Fig. 1 | Choosing and training a machine learning method. The overall procedure for training a machine learning
method is shown along the top. A decision tree to assist researchers in selecting a model is given below. This flowchart
is intended to be used as a visual guide linking the concepts outlined in this Review. However, a simple overview such as
this cannot cover every case. For example, the number of data points required for machine learning to become applicable
depends on the number of features available for each data point, with more features requiring more data points, and
also depends on the model being used. There are also deep learning models that work on unlabelled data.

sometimes called ‘traditional machine learning’. Figure 3 shows some of the methods of traditional machine learning. Various software packages can be used to train such models, including scikit-learn in Python18, caret in R19 and MLJ in Julia20.

When one is developing machine learning methods for use with biological data, traditional machine learning should generally be seen as the first area to explore in finding the most appropriate method for a given task. Deep learning can be a powerful tool, and is undeniably trendy currently. However, it is still limited in the application areas in which it excels: when large amounts of data are available (for example, millions of data points); when each data point has many features; and when the features are highly structured (the features have clear relationships with one another, such as adjacent pixels in images)21. Data such as DNA, RNA and protein sequences22,23 and microscopy images24,25 are examples of biological data where these requirements can be met and deep learning has been successfully applied. However, the requirement for large amounts of data can make deep learning a poor choice even when the other two requirements are met.

Traditional methods, in comparison to deep learning, are much faster to develop and test on a given problem. Developing the architecture of a deep neural network and then training it can be a time-consuming and computationally expensive task to undertake26 compared with traditional models such as support vector machines (SVMs) and random forests27. Although some approaches exist, with deep neural networks it is still not trivial to estimate feature importance28 (that is, how important each feature is for contributing to the prediction) or the confidence of predictions of the model1,28,29, both of which are often essential in biological settings. Even if deep learning appears technically feasible for a particular biological prediction task, it is often still prudent to train a traditional method to compare it against a neural network-based model, if possible30.

Traditional methods typically expect that each example in the dataset has the same number of features, so this is not always possible. An obvious biological example of this is when protein, RNA or DNA sequences are being used and each example has a different length. To use traditional methods with these data, the data can be altered so they are all the same size using simple techniques such as padding and windowing. ‘Padding’ means taking each example and adding additional values containing zero until it is the same size as the largest example in the dataset. By contrast, windowing shortens individual examples to a given size (for example, using only the first 100 residues of each protein in a dataset of protein sequences with lengths ranging from 100 upwards).

Use of classification and regression models. For regression problems such as those shown in Fig. 3a, ridge regression (linear regression with a regularization term) is often a good starting point for developing a model, as it can provide a fast and well-understood benchmark for a given task. Other variants of linear regression such as LASSO regression31 and elastic net regression32 are also worth considering when there is a desire for a model to rely on a minimal number of features within the available data. Unfortunately, the relationships between features in the data are often non-linear, and so use of a model such as an SVM is often a more appropriate choice for these cases33. SVMs are a powerful type of regression and classification model that uses kernel functions to transform a non-separable problem into a separable problem that is easier to solve. SVMs can be used to perform both linear regression and non-linear regression depending on the kernel function used34–37. A good approach to developing

Linear regression
A model that assumes that the output can be calculated from a linear combination of inputs; that is, each input feature is multiplied by a single parameter and these values are added. It is easy to interpret how these models make their predictions.

Kernel functions
Transformations applied to each data point to map the original points into a space in which they become separable with respect to their class.

Non-linear regression
A model where the output is calculated from a non-linear combination of inputs; that is, the input features can be combined during prediction using operations such as multiplication. These models can describe more complex phenomena than linear regression.
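The padding and windowing described above for variable-length sequences can be sketched in plain Python. This is an illustrative sketch only: the toy sequences, the one-hot residue encoding, the helper names and the choice to pad on the right are assumptions, not details given in the Review.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one letter each

def one_hot_residue(residue):
    """One-hot vector of length 20; unrecognized residues map to all zeros."""
    return [1.0 if aa == residue else 0.0 for aa in AMINO_ACIDS]

def pad(sequence, length):
    """Encode the sequence, then add all-zero positions on the right up to `length`."""
    encoded = [one_hot_residue(r) for r in sequence]
    zeros = [0.0] * len(AMINO_ACIDS)
    return encoded + [zeros[:] for _ in range(length - len(encoded))]

def window(sequence, length):
    """Keep only the first `length` residues, then encode them."""
    return [one_hot_residue(r) for r in sequence[:length]]

def flatten(encoded):
    """Turn a list of per-position vectors into one fixed-length feature vector."""
    return [value for position in encoded for value in position]

sequences = ["MKT", "MKTAYIAK", "MKTAY"]  # hypothetical toy sequences of unequal length

longest = max(len(s) for s in sequences)
shortest = min(len(s) for s in sequences)

# Padding: every example grows to the size of the largest example.
padded = [flatten(pad(s, longest)) for s in sequences]

# Windowing: every example is cut down to a common size (here, the shortest length).
windowed = [flatten(window(s, shortest)) for s in sequences]
```

After either step each example has the same number of features, so it can be passed to a traditional method. Padding keeps all residues at the cost of many zero-valued features, whereas windowing discards everything beyond the chosen length.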


a model is to train a linear SVM and an SVM with a radial basis function kernel (a general-purpose non-linear type of SVM) to quantify what gain, if any, can be had from a non-linear model. Non-linear approaches can provide more powerful models but at the cost of easy interpretation of which features are influencing the model, a trade-off mentioned in the introduction.

Many of the models that are commonly used in regression are also used for classification. Training a linear SVM and an SVM with a radial basis function kernel is also a good default starting point for a classification task. An additional method that can be tried is k nearest neighbours classification38. Being one of the simplest classification methods, k nearest neighbours classification provides a useful baseline performance marker against which other more complex models, such as SVMs, can be compared. Another class of robust non-linear methods is ensemble-based models such as random forests39 and XGBoost40,41. Both methods are powerful non-linear models that have the added benefits of providing feature importance estimates and often requiring minimal hyperparameter tuning. Due to the assignment of feature importance values and the decision tree structure, these models are a good choice if understanding which features contributed the most to a prediction is essential for biological understanding.

For both classification and regression, the many available models tend to have a bewildering variety of flavours and variants. Trying to predict how well suited a particular method will be to a particular problem a priori can be deceptive, and instead taking an empirical, trial-and-error approach to finding the best model is generally the most prudent approach. With modern machine learning suites such as scikit-learn18, changing between these model variants often requires changing just one line of code, so a good overall strategy for selecting the best method is to train and optimize a variety of the aforementioned methods and choose the one with the best performance on the validation set before finally comparing their performance on a separate test set.

Use of clustering models. The use of clustering algorithms (Fig. 3e) is pervasive within biology42,43. k-means is a strong general purpose approach to clustering that, like many other clustering algorithms, requires the number of clusters to be set as a hyperparameter44. DBSCAN is an alternative method that does not require the number of clusters to be predefined, but has the trade-off that other hyperparameters have to be set45. Dimensionality reduction can also be performed before clustering to improve performance for datasets with a large number of features.

Dimensionality reduction. Dimensionality reduction techniques are used to transform data with a large number of attributes (or dimensions) into a lower-dimensional form while preserving the different relationships between the data points as much as possible.

k nearest neighbours
A classification approach where a data point is classified on the basis of the known (ground truth) classes of the k most similar points in the training set using a majority voting rule. k is a parameter that can be tuned. Can also be used for regression by averaging the property value over the k nearest neighbours.
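Clustering after dimensionality reduction, as described in the main text above, can be sketched with scikit-learn. This is a minimal sketch under stated assumptions: the Review does not prescribe a reduction method, so principal component analysis is used here as one common choice. The synthetic data and the hyperparameter values (`n_components`, `n_clusters`, `eps`, `min_samples`) are illustrative only.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for, say, 300 samples with 50 features each (an assumption).
X, true_groups = make_blobs(
    n_samples=300, n_features=50, centers=3, cluster_std=1.0, random_state=0
)

# Dimensionality reduction: project the 50 features onto 2 principal components.
X_reduced = PCA(n_components=2).fit_transform(X)

# k-means needs the number of clusters as a hyperparameter.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_reduced)

# DBSCAN does not need the number of clusters, but needs a neighbourhood radius
# (eps) and a minimum neighbourhood size (min_samples) instead.
dbscan_labels = DBSCAN(eps=2.0, min_samples=5).fit_predict(X_reduced)
n_dbscan_clusters = len(set(dbscan_labels) - {-1})  # the label -1 marks noise points

# Agreement with the known groups; only possible here because the data are synthetic.
kmeans_agreement = adjusted_rand_score(true_groups, kmeans_labels)
```

With real unlabelled data there is no `true_groups` to compare against, so the hyperparameters of either method have to be chosen by inspecting the clusters or by using an internal quality measure.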

[Figure 2: panels a–f. a: data split into training, validation and testing sets (training data used to train, validation and testing data used to assess model performance); k-fold cross-validation. b: one-hot encoding of categories: Helix (1.0, 0.0, 0.0), Sheet (0.0, 1.0, 0.0), Coil (0.0, 0.0, 1.0). c: continuous encoding of pixel RGB values: (0.00, 0.57, 1.00) and (0.96, 0.42, 1.00). d: data points and model for underfit, good fit and overfit cases. e: learning rate, loss against training time for too low, too high and good learning rates. f: early stopping, loss against time for the training set and the validation set.]

Fig. 2 | Training machine learning methods. a | Available data are often split into training, validation
and test sets. The training set is directly used to train the model, the validation set is used to monitor
training and the test set is used to assess the performance of the model. k-fold cross-validation with a
test set can also be used. b | One-hot encoding is a common approach for representing categorical inputs
where a single choice is permitted from a number of possibilities, in this case three possible protein
secondary structure classes. The result of the encoding is a vector with three numbers, all equal to 0
except the occupied class, which is set to 1. This vector is used by the machine learning model.
c | Continuous encoding represents numerical inputs, in this case the red, green and blue (RGB) values of
a pixel in an image. Again the result is a vector with three numbers, corresponding to the amount of red,
green and blue in the pixel. d | Failing to learn the underlying relationship between the variables is
called ‘underfitting’, whereas learning the noise in the training data is called ‘overfitting’.
Underfitting can be caused by using a model without sufficient complexity to describe the signal.
Overfitting can be caused by using a model with too many parameters or by continuing training after it has
learned the true relationship between the variables. e | The learning rate of the model determines how
quickly learned parameters are adjusted when training a neural network or some traditional methods such as
gradient boosting. A low learning rate can lead to slow training, which is time-consuming and requires
considerable computing power. By contrast, a high learning rate can lead to quick convergence on a
non-optimal solution and poor performance of the model. f | Early stopping is the process of terminating
training at the point where the loss function on the validation set starts to increase, even if the loss
function on the training set is still decreasing. Use of early stopping can prevent overfitting.
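
The encoding in panel b and the stopping rule in panel f can be sketched in a few lines of plain Python. This is only an illustrative sketch: the class ordering, the function names, the `patience` parameter and the loss values are assumptions made for the example and do not come from the figure.

```python
from typing import List, Sequence

# Hypothetical class ordering; the caption only states that there are three classes.
CLASSES = ["helix", "sheet", "coil"]


def one_hot(label: str, classes: Sequence[str] = CLASSES) -> List[float]:
    """Return a vector that is 1.0 at the position of `label` and 0.0 elsewhere."""
    if label not in classes:
        raise ValueError(f"unknown class: {label!r}")
    return [1.0 if c == label else 0.0 for c in classes]


def early_stopping_epoch(val_losses: Sequence[float], patience: int = 1) -> int:
    """Return the index of the epoch whose parameters should be kept.

    Training is treated as finished once the validation loss has failed to
    improve for `patience` consecutive epochs. The best epoch seen up to that
    point is returned; if the rule never triggers, the best epoch overall is.
    """
    best_epoch = 0
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_epoch


if __name__ == "__main__":
    print(one_hot("sheet"))
    # Made-up validation losses that fall and then rise, as in panel f.
    validation = [0.90, 0.70, 0.55, 0.50, 0.52, 0.58, 0.66]
    print(early_stopping_epoch(validation))
```

A `patience` larger than 1 tolerates short-lived rises in the validation loss, which is a common practical variation on the rule described in the caption.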

44 | January 2022 | volume 23 www.nature.com/nrm


Table 1 | Comparison of different machine learning methods

Method | Type of data | Example applications | Advantages | Disadvantages
Ridge (and LASSO/elastic) regression | Labelled; Fixed number of features | Protein-variant effect prediction122; Chemical/biochemical reaction kinetics123 | Easy to interpret; Easy to train; Good benchmark | Cannot learn complex feature relationships; Overfits with a large number of features
Support vector machine | Labelled; Fixed number of features | Protein function prediction124; Transmembrane-protein topology prediction125 | Can perform both linear and non-linear classification and regression | Scaling to large datasets is often difficult
Random forest | Labelled; Fixed number of features | Prediction of disease-associated genome mutations126; Scoring of protein–ligand interactions39 | Learns how important each feature is to the prediction; Individual decision trees are human readable, allowing interpretation of how a decision is made; Less sensitive to feature scaling and normalization so easier to train and tune | Less appropriate for regression; Many decision trees are hard to interpret
Gradient boosting (for example, XGBoost) | Labelled; Fixed number of features | Gene expression profiling127 | Learns how important each feature is to the prediction; Decision trees are human-readable, allowing interpretation of how a decision is made; Less sensitive to feature scaling and normalization so easier to train and tune | Can struggle to learn underlying signal if noise is present; Less appropriate for regression
Clustering | Unlabelled; Fixed number of features | Differential gene expression analysis15; Model selection in protein structure prediction128 | For low-dimensional data, good clustering is easily identifiable; Cluster validation metrics are available to assess performance | Scaling to large datasets is difficult for some methods; Noisy datasets sometimes yield contradictory results
Dimensionality reduction | Unlabelled; Large and fixed number of features | Single-cell transcriptomics49; Analysis of molecular-dynamics trajectories129 | Provides visual representation of data; Goodness-of-fit evaluations usually available to assess performance | Hard to preserve both global and local differences in data; Scaling to large numbers of samples is difficult for some methods
Multilayer perceptron | Labelled; Fixed number of features | Protein secondary structure prediction13; Drug toxicity prediction54 | Can fit datasets with fewer layers than architectures such as convolutional neural networks, making it easier and faster to train | Easy to overfit; Large number of parameters; Hard to interpret
Convolutional neural network | Spatial data arranged in a grid, for example, 2D image (pixels) or 3D volumes (voxels); Allows variable input size | Protein residue–residue contact and distance prediction23; Medical image recognition24 | Variable input size; Learns patterns irrespective of location in input | Receptive field, the amount of the input that is considered when predicting the output for each pixel, can be limited; Hard to train deeper architectures that use many layers to increase the receptive field and make more complex predictions
Recurrent neural network | Sequential data (for example, biological sequences or time-series data); Allows variable input size | Protein engineering68; Predicting clinical events66 | Variable input size; Sequences are found in many areas of biology | Long training times; High computing memory requirements
Graph convolutional network | Data characterized by connections between entities (spatial, interaction or association); Allows variable input size | Predicting drug properties77; Interpreting molecular structures73,74; Knowledge extraction130 | Variable graph sizes supported, which is important because most graphs in biology have variable size; Learns patterns by following graph connectivity so predictor uses most relevant associations | High computing memory requirements for large, densely connected graphs; Hard to train deeper architectures

Table 1 (cont.) | Comparison of different machine learning methods

Method | Type of data | Example applications | Advantages | Disadvantages
Autoencoders | Labelled or unlabelled data; Fixed or variable input size depending on architecture | Protein and gene engineering82; Prediction of DNA methylation81; Neural population dynamics131 | Latent space provides low-dimensional representation that can be used to visualize input data; Can generate new samples, which is useful in areas such as protein design | Latent space specific to data in training set and may not be appropriate to other datasets; Testing newly generated samples often requires wet laboratory experiments

Each method has types of data and applications to which it is best suited, along with advantages and disadvantages when compared with other methods.

For example, data points that are similar (for example, two homologous protein sequences) should also be
similar in their lower-dimensional form, whereas dissimilar data points (for example, unrelated protein
sequences) should remain dissimilar46,47. Two or three dimensions are often chosen to allow visualization
of the data on a set of axes, although larger numbers of dimensions have uses in machine learning too.
These techniques comprise both linear and non-linear transformations of the data. Examples common in
biology include principal component analysis (PCA) as shown in Fig. 3d, Uniform Manifold Approximation
and Projection (UMAP) and t-distributed stochastic neighbour embedding (t-SNE)48. The technique to use
depends on the situation: PCA retains global relationships between data points and is interpretable
because each component is a linear combination of input features, meaning it is easy to understand which
features give rise to variety in the data. t-SNE more strongly preserves local relationships between data
points and is a flexible method that can reveal structure in complex datasets. Applications include
single-cell transcriptomics for t-SNE49 and molecular dynamics trajectory analysis for principal
component analysis.

Artificial neural networks
Artificial neural network models get their name from the fact that the form of the mathematical model
that is being fit is inspired by the connectivity and behaviour of neurons in the brain and was originally
designed to learn about brain function50. However, the neural networks in common use in data science are
obsolete as brain models, and are now just machine learning models that can offer state-of-the-art
performance in certain applications. Interest in neural network models has grown in recent decades owing
to rapid advances in the architectures and training of deep neural networks26. In this section, we
describe basic neural networks, as well as varieties that are widely used in biological studies. Some of
these are shown in Fig. 4.

Basic principles of neural networks. A key property of neural networks is that they are universal
function approximators, which means that, with very few assumptions, a correctly configured neural
network can approximate any mathematical function to an arbitrary level of accuracy. In other words, if
any process (biological or otherwise) can be thought of as some function of a set of variables, then that
process can be modelled to any arbitrary degree of accuracy, governed by just the size or complexity of
the model. The above definition of universal approximation is not mathematically rigorous, but does
highlight one reason why interest in neural networks has persisted for decades. However, this guarantee
does not provide a way of finding the optimal parameters of a neural network model that will produce the
best approximation for a given dataset. There is also no guarantee that the model will provide accurate
predictions for new data51.
Artificial neurons are the building blocks of all neural network models. An artificial neuron is simply
a mathematical function that maps (converts) inputs to outputs in a specific way. A single artificial
neuron takes in any number of input values, applies a specific mathematical function to them and returns
an output value. The function used is usually represented as

$$y = \sigma\left(\sum_{i=1}^{n} w_i x_i + b\right), \qquad (1)$$

where $x_i$ represents a single input variable or feature (there are $n$ such inputs), $w_i$ represents a
learnable weight for that input, $b$ represents a learnable bias term and $\sigma$ represents a non-linear
activation function that takes a single input and returns a single output. To create a network,
artificial neurons are arranged in layers, with the output of one layer being the input to the next. The
nodes of the network can be thought of as holding the $y$ values from the above equation, which become
the $x$ values for the next layer. We describe various approaches for arranging artificial neurons in the
following subsections, which are called ‘neural network architectures’. It is also common to combine the
different architecture types; for example, in a convolutional neural network (CNN) used for
classification, fully connected layers are usually used to produce the final classification output.

Multilayer perceptrons. The most basic layout of a neural network model is that of layers of artificial
neurons arranged in a fully connected fashion, as shown in Fig. 4a. In this layout, a fixed number of
‘input neurons’ represent the input feature values calculated from the data that are fed to the network,
and each connection between a pair of neurons represents one trainable weight parameter. These weights
are the main adjustable parameters in a neural network, and optimizing these weights is what is meant by
neural network training. At the other end of the network, a number of output neurons represent the final
output values from the network. Such a network, when correctly configured, can be used to make complex,
hierarchical decisions about the input, as each neuron in a given layer receives inputs from all neurons
in the previous layer. Layers of neurons in this simple


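[Fig. 3 panels. a, Regression: weight against height, with data points and the fitted model. b, SVM:
ordered and disordered proteins plotted against features, divided by a separating hyperplane with a
margin. c, Gradient boosting: active and inactive drugs plotted by drug property 1 and drug property 2.
d, PCA: PC1 and PC2 drawn on a plot of weight against height. e, Clustering: cell types grouped by gene 1
expression and gene 2 expression.]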

Fig. 3 | Traditional machine learning methods. a | Regression finds the relationship between a dependent variable (the
observed property) and one or more independent variables (features); for example, predicting the weight of a person from
the person’s height. b | A support vector machine (SVM) transforms the original input data such that in their transformed
versions (called the ‘latent representation’) data belonging to separate categories are divided by a clear gap that is made as
wide as possible. In this case we show a prediction of whether a protein is ordered or disordered, with the axes representing
dimensions of the transformed data. c | Gradient boosting uses an ensemble of weak prediction models, typically decision
trees, to make predictions. For example, active drugs can be predicted from molecular descriptors such as molecular
weight and the presence of particular chemical groups. Individual predictors are combined in a stage-​wise manner to
make the final prediction. d | Principal component analysis (PCA) finds a series of feature combinations that best describe
the data while being orthogonal to each other. It is commonly used for dimensionality reduction. In the case of the height
and weight of a person, the first principal component (PC1), corresponding to a linear combination of height and weight,
describes the strong positive correlation, whereas PC2 might describe other variables that do not correlate strongly with
those, such as percentage body fat or muscle mass. e | Clustering uses one of various algorithms to group sets of similar
objects (for example, grouping cell types on the basis of gene expression profiles).
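
For two features, such as the height and weight example in panel d, PCA can be worked through by hand because the covariance matrix is only 2 × 2. The sketch below is a minimal illustration in plain Python, not a replacement for a library implementation: the data values are invented, the function names are hypothetical, and the features are standardized first (an assumption, made so that the different units do not dominate).

```python
import math
from typing import List, Tuple

Vector = Tuple[float, float]

# Invented (height in cm, weight in kg) pairs, used purely for illustration.
HEIGHT_WEIGHT: List[Vector] = [
    (150.0, 52.0), (160.0, 58.0), (165.0, 63.0), (170.0, 68.0),
    (175.0, 72.0), (180.0, 80.0), (185.0, 84.0), (190.0, 93.0),
]


def standardize(points: List[Vector]) -> List[Vector]:
    """Rescale each feature to zero mean and unit sample variance."""
    n = len(points)
    cols = list(zip(*points))
    means = [sum(c) / n for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / (n - 1)) for c, m in zip(cols, means)]
    return [tuple((v - m) / s for v, m, s in zip(p, means, sds)) for p in points]


def pca_2d(points: List[Vector]) -> Tuple[List[Vector], List[float]]:
    """Return the two principal axes (unit vectors) and the variance along each."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]] of the centred data.
    sxx = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix, largest first.
    half_trace = (sxx + syy) / 2.0
    spread = math.sqrt(((sxx - syy) / 2.0) ** 2 + sxy ** 2)
    variances = [half_trace + spread, half_trace - spread]
    # Angle of the first principal axis; the second axis is perpendicular to it.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    pc1 = (math.cos(theta), math.sin(theta))
    pc2 = (-math.sin(theta), math.cos(theta))
    return [pc1, pc2], variances


if __name__ == "__main__":
    axes, variances = pca_2d(standardize(HEIGHT_WEIGHT))
    explained = variances[0] / sum(variances)
    print("PC1 direction:", axes[0])
    print("Fraction of variance along PC1:", round(explained, 3))
```

Because height and weight are strongly correlated in this invented dataset, almost all of the variance lies along PC1, which is why a single component can stand in for both features when reducing dimensionality.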

arrangement are often called ‘multilayer perceptrons’ and were the first networks useful for
bioinformatics applications52,53. They are still widely used in a number of biological modelling
applications today due to their ease and speed of training13,54. In many other applications, however,
these simple architectures have been surpassed by newer model architectures discussed below, although
some of these newer architectures still often make use of fully connected layers as subcomponents.

Convolutional neural networks. CNNs are ideally suited for image-like data, where the data possess some
type of local structure, and where the recognition of such structure is a key objective of the analysis.
With the example of images, this local structure could relate to specific types of objects in a field of
view (for example, cells in a microscopy image), represented by specific local patterns of colours and/or
edges in spatially close pixels in an input image.
CNNs are composed of one or more convolutional layers (see Fig. 4b), in which the output is the result of
applying a small, one-layer fully connected neural network, called a ‘filter’ or ‘kernel’, to local groups
of features in the input. In the case of image-like inputs, this local area would be a small patch of
pixels in the image. The outputs of a convolutional layer are also image-like arrays, carrying the result
of ‘sliding’ the filter over the entire input and computing an output at each position. Crucially, the
same filter is used across all pixels, allowing the filters to learn local structure in the input data.
It is common in deeper CNNs to use skip connections that allow the input signal to bypass one or more
layers in addition to passing through the processing units in the layer. This type of network is called a
‘residual network’ and allows training to converge more quickly on accurate solutions.
CNNs can be configured to operate effectively on data of different spatial structure. For example, a 1D
CNN would have filters that slide in just one direction (say from left to right); this type of CNN would
be suitable for data that have only one spatial dimension (such as text or biological sequences). 2D CNNs
operate on data with two spatial dimensions, such as digital images. 3D CNNs operate on volumetric data,
such as magnetic resonance imaging scans.
CNNs have seen significant success in biology for a variety of data types. Recent advances in protein
structure prediction have used information on the


co-evolution of residue pairs in related protein sequences to extract information on residue pair
contacts and distances, allowing predictions of 3D protein structures to be built at unprecedented
accuracy23,55. In this case, the network learns to pick out direct coupling interactions, and accurate
predictions can be made even for sequences with few or no related sequences56. CNNs have also been
applied successfully to identify variants in genetic sequence data57, 3D genome folding58, DNA–protein
interactions22,59, cryogenic electron microscopy image analysis60,61 and image classification in
medically important contexts (such as detection of malignancy), where they often now rival expert human
performance24,62.

Recurrent neural networks. RNNs are most suited to data that are in the form of ordered sequences, such
that there exists (at least notionally) some dependence or correlation between one point in the sequence
and the next. Probably their main application outside biology is in natural language processing, where
text is treated as a sequence of words or characters. As shown in Fig. 4c, RNNs can be thought of as a
block of neural network layers that take as input the data corresponding to each entry (or time step) in
a sequence and produce an output for each entry that is dependent on entries that have previously been
processed. They can also be used to generate a representation of the whole sequence that is passed to
later layers of the network to generate the output. This is useful as sequences of any length can be
converted to a fixed-size representation and input to a multilayer perceptron. Obvious examples for the
use of RNNs in biology include analysis of gene or protein sequences, with tasks including identifying
promoter regions from gene sequences, predicting protein secondary structure or modelling gene expression
levels over time; in the last case, the value at a given time point would count as one entry in a
sequence. The more advanced long short-term memory or gated recurrent unit variants of RNNs have many
uses in biology, including protein structure prediction63,64, peptide design65 and predicting clinical
diagnosis from health records66. These more advanced methods are often used in combination with CNNs,

[Fig. 4 panels. a, Multilayer perceptron: molecular properties as input, hidden layers, and a toxicity
prediction as output. b, Convolutional neural network: a filter moving over a microscopy image (input) to
compute the next layer. c, Recurrent neural network: the DNA sequence A, C, T, C as input, a hidden state
passed between steps, and outputs of 0.4, 0.8, 0.7 and 0.3 for transcription factor binding. d, Graph
convolutional network: proteins connected by protein–protein interactions, with updated node features
passed from one layer to the next. e, Autoencoder: an encoder neural network maps the protein sequence
R, G, A, D, I to a latent representation, and a decoder neural network maps it back to the protein
sequence R, G, A, E, I.]

Fig. 4 | Neural network methods. a | A multilayer perceptron consists of nodes (shown as circles) that represent
numbers: an input value, an output value or an internal (hidden) value. Nodes are arranged in layers with connections,
indicating learned parameters, between every node of a layer and every node of the next layer. For example, molecular
properties can be used to predict drug toxicity as the prediction can be made from some complicated combination of
independent input features. b | A convolutional neural network (CNN) uses filters that move across the input layer and
are used to calculate the values in the next layer. The filters operating across the whole layer mean that parameters are
shared, allowing similar entities to be detected regardless of location. A 2D CNN is shown operating on a microscopy
image, but 1D and 3D CNNs also find applications in biology. The dimensionality in this case refers to how many spatial
dimensions there are in the data, and the connectivity within the CNNs can be configured accordingly. For example,
biological sequences can be considered 1D and magnetic resonance imaging data can be considered 3D. c | A recurrent
neural network (RNN) processes each part of a sequential input using the same learned parameters, giving an output
and an updated hidden state for every input. The hidden state is used to carry information about the preceding parts
of the sequence. In this case the probability of transcription factor binding is predicted for each base in a DNA sequence.
The RNN is expanded to show how each output is generated using the same layers; this should not be confused with
using different layers for each output. d | A graph convolutional network uses information from connected nodes in a
graph, such as a protein–protein interaction network, to update node properties in the network by combining predictions
from all neighbouring nodes. The updated node properties form the next layer in the network and predict the desired
property in the output layer. e | An autoencoder consists of an encoder neural network, which converts an input into
a lower-​dimensional latent representation, and a decoder neural network, which converts this latent representation
back to the original input form. For example, protein sequences can be encoded and the latent representation used to
generate novel protein sequences. In the example, four of the five residues are the same as the input after encoding and
decoding by the autoencoder, indicating an accuracy of 80% on this sequence.
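
The fully connected arrangement in panel a can be made concrete by applying the artificial neuron function of equation (1) layer by layer. The following is a minimal sketch in plain Python under stated assumptions: the sigmoid is one possible choice of activation, the function names are hypothetical, and the weights are hand-picked for illustration rather than learned by training.

```python
import math
from typing import List, Sequence, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (one weight row per neuron, one bias per neuron)


def sigmoid(z: float) -> float:
    """One common choice for the non-linear activation function sigma."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Equation (1): y = sigma(sum_i w_i * x_i + b)."""
    if len(inputs) != len(weights):
        raise ValueError("one weight is needed per input")
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def dense_layer(inputs: Sequence[float], weight_rows: List[List[float]],
                biases: List[float]) -> List[float]:
    """A fully connected layer: every neuron receives every input value."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]


def mlp_forward(inputs: Sequence[float], layers: List[Layer]) -> List[float]:
    """Pass values through each layer in turn; the outputs of one layer feed the next."""
    values = list(inputs)
    for weight_rows, biases in layers:
        values = dense_layer(values, weight_rows, biases)
    return values


if __name__ == "__main__":
    # Three made-up, already scaled molecular properties for one compound.
    properties = [0.2, 0.9, 0.4]
    # Untrained, hand-picked parameters: 3 inputs -> 2 hidden neurons -> 1 output.
    network: List[Layer] = [
        ([[0.5, -1.0, 0.3], [1.2, 0.4, -0.7]], [0.1, -0.2]),
        ([[1.5, -2.0]], [0.05]),
    ]
    print(mlp_forward(properties, network))
```

Training would consist of adjusting the numbers in `network` to reduce a loss function; the forward pass itself stays exactly as shown.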


which can increase accuracy67. RNNs can be very robust in analysing sequence-based data. For example,
RNNs trained on millions of protein sequences have shown an ability to capture evolutionary and
structural information, and can be applied to a variety of supervised tasks, including tasks related to
the design of novel protein sequences68.

Role of attention mechanisms and the use of transformers. A problem identified with RNNs is the
difficulty they have in examining specific parts of an input sequence, which is important in order to
generate a highly accurate output. The addition of an attention mechanism to RNNs, which allows the model
to access all parts of the input sequence when calculating each output, was introduced to alleviate this
problem. Recently it was shown that the RNN is not even required at all, and that attention alone can be
used by itself; the resulting models, called ‘transformers’, have obtained state-of-the-art results on
many natural language processing benchmarks69. Transformer models have recently shown greater accuracy
than RNNs for tasks on biological sequences, but it remains to be seen whether these methods, which are
often trained on billions of sequences using thousands of graphics processing units, will be able to
outperform existing alignment-based methods of sequence analysis in bioinformatics70. The outstanding
success of AlphaFold2 in the 14th Critical Assessment of Protein Structure Prediction (CASP14)
experiment, a blind assessment of computational approaches to predict protein structure from sequence,
suggests that models using attention also hold promise for tasks in structural biology71.

Graph convolutional networks. Graph convolutional networks are particularly suitable for data that, while
not having any obvious visible structure like an image, are nonetheless composed of entities connected by
arbitrary specified relationships, or interactions72. Examples of such data relevant to biology include
molecules (composed of atoms and bonds)73–76 and protein–protein interaction networks (composed of
proteins and interactions)77. A graph, in computational terms, is just a representation of such data,
with each graph having a set of vertices or nodes, and a set of edges that represent various types of
relationships or connections between the nodes. With use of the examples given above, representations of
atoms or proteins might be classed as node features, and bonds or interactions might be classed as edge
features. Graph convolutional networks use the structure of the resulting graph to determine the flow of
information in the neural network model. As shown in Fig. 4d, adjacent nodes are considered when the
features of each node are updated throughout the network, with the node features in the last layer being
used as the output (for example, interacting residues on a protein) or combined to form an output for the
whole graph (for example, fold type of the protein). Graphs representing different associations can
combine different sources of information when making predictions, such as combining drug–gene and
food–gene relationship graphs to predict foods for cancer prevention78. Software for training graph
convolutional networks includes PyTorch Geometric79 and Graph Nets72.

Autoencoders. As the name suggests, autoencoders are neural network architectures designed to self-encode
(autoencode) a collection of data points by representing them as points in a new space of predetermined
dimensionality, usually far fewer than the number of input dimensions. One neural network (the encoder)
is trained to convert the input into a compact internal representation, called a ‘latent vector’ or
‘latent representation’, representing a single point in the new space. The second part of an autoencoder,
called the ‘decoder’, takes the latent vector as input and is trained to produce as output the original
data with the original dimensionality (Fig. 4e). Another way of looking at this is that the encoder tries
to compress the input, and the decoder tries to decompress it. The encoder, latent representation and
decoder are trained at the same time. Although this sounds like a pointless exercise, where the output
just mimics the input, the idea is to learn a new representation of the input data that compactly encodes
desirable features, such as similarity between the data points, while still retaining the ability to
reconstruct the original data using the learned latent representation. Applications include predicting
how closely related two data points are and enforcing some structure on the latent space that is useful
for further prediction tasks. Another benefit of the encoder–decoder architecture is that, once trained,
the decoder can be used in isolation to generate predictions of new, synthetic data samples which can be
tested in the laboratory and contribute to synthetic biology efforts80. Autoencoders have been applied to
a range of biological problems, including predicting DNA methylation state81, the engineering of gene and
protein sequences82,83 and single-cell RNA-sequencing analysis84.

Training and improving neural networks. The general procedure for training machine learning models is
outlined in Box 1. However, as neural networks are structurally much more complex than the traditional
machine learning algorithms, there are some concerns that are specific to neural networks. Having picked
a neural network as an appropriate model for the intended application (Fig. 1), it is often a good idea
to train it on just a single training example (for example, a single image or gene sequence). This trained
model is not useful for making predictions, but the training is good at revealing programming errors. The
training loss function should very quickly go to zero as the network simply memorizes the input; if it
does not, there is likely an error in the code, or the algorithm is not complex enough to model the input
data. Once the network has passed this basic debugging test, training on the whole training set can
proceed, where the training loss function is minimized. This may require tuning of hyperparameters such
as the learning rate (Fig. 2e). By monitoring loss on the training and validation sets, overfitting of
the network can be detected where the training loss continues to drop lower and the loss on the
validation set starts to increase. Training is usually stopped at that point, a process known as early
stopping (Fig. 2f). Overfitting of a neural


network (or any machine learning model), visualized in Fig. 2d, means that the model is starting to
simply memorize features of the training set and thus starting to lose its ability to generalize to new
data. Early stopping is a good way of preventing this, but other techniques can be used during training,
such as regularization of the model or dropout techniques, where nodes in the network are randomly
ignored to force the network to learn a more robust prediction strategy involving multiple nodes.
Popular software packages used to train neural networks include PyTorch85 and Tensorflow86. Training
neural networks is computationally demanding, usually requiring a graphics processing unit or tensor
processing unit with sufficient memory, as these devices can provide a 10 to 100 times speedup over use
of the standard central processing unit. This speedup is required when training the larger models that
have shown success in recent years, and when training is performed on large datasets. However, running an
already trained model is usually considerably faster and is often feasible on just an ordinary central
processing unit. Cloud computing solutions from common providers exist for those without access to a
graphics processing unit for training, and it is worth noting that for small tasks, Colaboratory (Colab)
allows Python code to be tested on either graphics processing units or tensor processing units free of
charge. Using Colab is an excellent way of getting started with Python-based deep learning.

Challenges for biological applications
Perhaps the single biggest challenge of modelling biological data is the sheer variety1. Biologists work
with data such as gene and protein sequences, gene expression levels over time, evolutionary trees,
microscopy images, 3D structures and interaction networks, to name but a few. We have summarized some
best practices and important considerations for specific biological data types in Table 2. Owing to the
diversity of data types encountered, biological data often require somewhat bespoke solutions for
handling them effectively, and this makes it difficult to recommend off-the-shelf tools or even general
guidelines for the use of machine learning in these problem domains, as the choice of model, training
procedure and test data will depend heavily on the exact questions one wants to answer. Nevertheless,
there are some common issues that need to be considered for

Regularization
Restricting the values of parameters to prevent the model from overfitting to the training data. For
example, penalizing high parameter values in regression models reduces the flexibility of the model and
can stop it fitting to noise in the training data.

Cloud computing
On-demand computing services, including processing power and data storage, typically available via the
Internet. A pay-as-you-go model is usually used. Use of cloud computing minimizes up-front IT
infrastructure costs.

Hidden Markov model
A statistical model that can be used to describe the evolution of observable events that depend on
factors that are not directly observable. It has various uses in biology, including representing protein
sequence families.

these are more likely to produce robust predictions. When larger quantities are available, one can start
to consider more highly parameterized models such as deep neural networks. In supervised machine
learning, the relative proportions of each ground truth label in the dataset should also be considered,
with more data required for machine learning to work if some labels are rare87.

Data leakage. Although the scale and complexity of biological data may make them seem ideal for machine
learning, there are some important considerations that need to be borne in mind21,88,89. One key concern
is how to validate the performance of a model. The common setup of training, validation and test sets can
lead to problems such as researchers repeatedly testing on the same test set with a variety of models to
obtain the greatest accuracy, and hence risking overestimating performance on it without generalizing to
other test sets or new data. However, biological data present a further non-trivial question: in a large
dataset with related entries (for example, as a result of familial relationships, or evolutionary
relationships), how does one ensure that two closely related entries do not end up split between the
training set and the test set? If this occurs, then the ability of the model to remember specific cases is
tested, rather than its ability to predict the property in question. This is one example of an issue often
called ‘data leakage’ and leads to results that appear better than they are, which is perhaps one reason
researchers are reluctant to be rigorous about the issue. Other types of data leakage are possible (for
example, using any data or features during training that would not be available during testing). Here we
focus on the problem of having related samples in the training and testing sets.
What we mean by ‘related’ here depends on the nature of the study. It might be a case of sampling data
from the same patient or the same organism. However, the classic situation where data leakage occurs in
biology is seen in studies on protein sequences and structures. Typically, but usually not correctly,
researchers try to ensure that no protein in the training set has sequence identity above a certain
threshold to any protein in the test set, usually at a threshold of 30% or 25%. This is enough to exclude
many homologous pairs of proteins, but it has been known for decades that some homologous proteins can
have virtually no sequence similarity90,91, which would
the successful use of machine learning in biology, but also mean that simply filtering by sequence identity would be
more generally. insufficient to prevent data leakage. This is particularly
important for models that operate on sequence align-
Data availability. Biology is somewhat unique in that ments or sequence profiles as input, as although two
there exist some problem domains that have very large individual protein sequences may not share any obvi-
quantities of data publicly available, whereas other prob- ous similarity, their profiles could be virtually identical.
lem domains have very small quantities. An example is This means that for a machine learning model, these
the relative abundance of biological sequence data in two profiles would essentially be the same data point —
public databases such as GenBank and UniProt, whereas both will be describing the same protein family. For pro-
reliable data on protein interactions are much harder to tein sequences, one solution to avoid this problem is to
come by. The quantity of data available for a given prob- search the test data with a sensitive hidden Markov model
lem has a profound impact on the choice of techniques profile comparison tool such as HH-​suite, which can
that can effectively be used. As a very rough guideline, find sequences distantly related to the training data92.
when only small amounts of data (hundreds of or a few In the common case where protein structure is being
thousand examples) are available, one is essentially forced used as input or output, structural classifications such as
to use more traditional machine learning methods, as CATH93 or ECOD94 can be used to exclude similar folds

50 | January 2022 | volume 23 www.nature.com/nrm


Table 2 | Recommendations for the use of machine learning strategies for different biological data types

Input data: Gene sequence
Example prediction tasks: DNA accessibility14; 3D genome organization58; Enhancer–promoter interactions40
Recommended models: 1D CNNs; RNNs; Transformers
Challenges: Repetitive regions in genome; Sparse regions of interest; Very long sequences

Input data: Protein sequence
Example prediction tasks: Protein structure23,55; Protein function132; Protein–protein interaction133
Recommended models: 2D CNNs and residual networks using co-variation data; Multilayer perceptrons with windowing; Transformers
Challenges: Metagenome data stored in many places and therefore hard to access; Data leakage (from homology) can make validation difficult

Input data: Protein 3D structure
Example prediction tasks: Protein model refinement134; Protein model quality assessment135; Change in stability upon mutation136
Recommended models: GCNs using molecular graph; 3D CNNs using coordinates; Traditional methods using structural features; Clustering
Challenges: Lack of data, particularly on protein complexes; Lack of data on disordered proteins

Input data: Gene expression
Example prediction tasks: Intergenic interactions or co-expression137; Organization of transcription machinery138
Recommended models: Clustering; CNNs; Autoencoders
Challenges: Unclear link between co-expression and function; High dimensionality; High noise

Input data: Mass spectrometry
Example prediction tasks: Detecting peaks in spectra139; Metabolite annotation140
Recommended models: CNNs using spectral data; Traditional methods using derived features
Challenges: Lack of standardized benchmarks141; Normalization(a) required between different datasets

Input data: Images
Example prediction tasks: Medical image recognition24,62; Cryo-EM image reconstruction60,142; RNA-sequencing profiles143
Recommended models: 2D CNNs and residual networks; Autoencoders; Traditional methods using image features
Challenges: Systematic differences in data collection affect prediction; Hard to obtain large datasets of consistent data

Input data: Molecular structure
Example prediction tasks: Antibiotic activity73; Drug toxicity54; Protein-ligand docking39; Novel drug generation144
Recommended models: GCNs using molecular graph; Traditional methods or multilayer perceptrons using molecular properties; RNNs using text-based representations of molecular structure such as SMILES; Autoencoders
Challenges: Experimental data available for only a tiny fraction of possible small molecules

Input data: Protein–protein interaction network
Example prediction tasks: Polypharmacology side effects77; Protein function145
Recommended models: GCNs; Graph embedding
Challenges: Interaction networks can be incomplete; Cellular location affects whether proteins interact; High number of possible combinations

Each type of biological data has prediction tasks in which it has been used effectively, machine learning models that are appropriate and specific challenges when using machine learning. Some challenges, such as high dimensionality, affect most biological data types. CNN, convolutional neural network; cryo-EM, cryogenic electron microscopy; GCN, graph convolutional network; RNN, recurrent neural network. (a) ‘Normalization’ means rescaling or otherwise transforming variables from different datasets with the intention that their contributions should carry roughly equal weight and their ranges are comparable on a joint scale. The most common way of achieving this is by subtracting the means of each variable and dividing by their standard deviations, which can also be called ‘standardization’. This is required because different instruments, experimental protocols and so on can produce systematic differences in measurements of the same quantities, rendering it difficult or impossible to compare trends between experiments.
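The standardization described in the table footnote (subtract the mean of each variable, divide by its standard deviation) can be sketched as follows. The two sets of readings are invented for illustration, and the use of the population standard deviation rather than the sample standard deviation is an assumption; either convention may be appropriate.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale one variable to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardize a variable that never changes")
    return [(v - mu) / sigma for v in values]


# Hypothetical readings of the same four samples on two instruments whose
# scales differ by a constant factor.
instrument_a = [10.0, 12.0, 14.0, 16.0]
instrument_b = [100.0, 120.0, 140.0, 160.0]

print([round(z, 3) for z in standardize(instrument_a)])
print([round(z, 3) for z in standardize(instrument_b)])
```

After rescaling, the two series coincide, so trends can be compared on a joint scale. When a model is being evaluated, the mean and standard deviation would normally be estimated on the training data only and then reused for the test data, so that no information from the test set reaches training.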

or homologous proteins. Similar issues affect studies predicting protein–ligand binding affinity95.

To be clear, data leakage is not an intrinsic issue with any particular type of data, but rather it is a problem with how the data are used when training and evaluating machine learning models. One would certainly expect a trained model to produce very good results on data that are similar to the training set. The issue of data leakage becomes a problem when a model that appears accurate on some benchmark set performs poorly on new data that are actually different from the training set; in other words, the model does not generalize, likely because it has not modelled the true relationship between the variables, but rather remembered hidden associations present in the data.

Because of frequent complaints from reviewers, some journals are now starting to require rigorous benchmarking to be performed before a paper can be considered for publication. Without proper testing, the performance of a model will very likely not be representative of real-world performance on unseen data, which undermines user confidence in the model. Worse, authors of future studies may be misled into thinking that inadequate testing is defensible simply because it has already appeared in (possibly several) peer-reviewed articles, even though it is not. As mentioned in Box 2, authors, peer reviewers and journal editors all have a responsibility for ensuring that data leakage has been avoided. Knowingly leaving these kinds of errors in place is really little better than fabricating data at the end of the day.

Interpretability of models. It is usually the case that biologists want to know why a particular model is making a particular prediction (that is, what features of the input

Nature Reviews | Molecular Cell Biology volume 23 | January 2022 | 51


data the model is responding to and how) and why it works in some cases but not others. In other words, biologists are often interested in discovering mechanisms and the factors responsible for modelling output, rather than just accurate modelling, as mentioned previously. The ability to interpret a model depends on the machine learning method used and the input data. Interpretation is usually easier for non-neural network methods, as these have feature sets more amenable to direct meaningful interpretation and generally have fewer learnable parameters. In the case of, say, a simple linear regression model, the parameter assigned to each input feature gives a direct indication of how that feature affects the prediction.

The low cost of training non-neural network methods means that it is advisable to perform ablation studies, where the effect on performance of removing defined features of the input is measured. Ablation studies can reveal which features are most useful for the modelling task at hand, and are one way to possibly discover more robust, efficient and interpretable models.

Interpreting a neural network (particularly a deep neural network) is generally much harder due to the frequently large number of input features and parameters in the model. It is still possible to identify, for example, regions in an input image most responsible for a particular classification by building a saliency map28. Although saliency maps show which regions of an image are important, it can be more difficult to pinpoint which properties of the data in these locations were responsible for the prediction, particularly when the inputs are not easily interpretable by humans, such as images and text. Nevertheless, saliency maps and similar representations can be useful as a ‘sanity check’ to ensure that the model is indeed looking at the relevant parts of an image. This can help avoid situations where models make unintended connections, such as classifying medical images on the basis of hospital or department labels in the corner of the image rather than the medical content of the image itself96. Generating adversarial examples, synthetic inputs that cause a neural network to produce confident incorrect predictions, can also be a good way of providing information on which features are being most used for prediction97. For example, CNNs often use textures (such as stripes in animal fur) to classify objects in images, where humans would primarily use the shapes51.

Privacy-preserving machine learning. Some biological data, most notably human genomics data and commercially sensitive pharmaceutical data, have data privacy implications. There have been a number of efforts to allow sharing of data and distributed training of machine learning models in the context of data privacy. For example, modern cryptographic techniques allow training of a drug–target interaction model where the data and results are provably secure98. Simulated, synthetic participants that closely resemble real participants in a clinical trial can lead to results that are accurate for real participants without revealing identifying data99. Algorithms have been developed for efficient federated model training with data stored in different places100.

The need for interdisciplinary collaborations. Unless publicly available data are being used, it is rare that one research group will have the expertise and resources to both collect data for a machine learning study and also apply the most appropriate machine learning method effectively. It is common for experimental biologists to collaborate with computer scientists, with such collaborations often obtaining excellent results. It is, however, important in such collaborations that each side has some working knowledge of the other. In particular, computer scientists should make an effort to understand the data, such as the expected degree of noise and reproducibility, and biologists should understand the limitations of the machine learning algorithms being used. Building such

Saliency map
In the context of machine learning, an image generated to show which pixels in an input image contribute to the prediction made by a model. It is useful in interpreting models.

Box 2 | Evaluating articles that use machine learning
Here are some questions to consider when reading or reviewing articles that use machine learning on biological data. It is useful to bear these considerations in mind, even if the answers are not apparent, and these questions can be used as the basis for a discussion with collaborators with the required expertise. A surprising number of articles do not fulfil these criteria148.

Is the dataset adequately described?
Complete steps to assemble the dataset should be provided, ideally with the dataset or summary data (for example, biological database IDs) available at a persistent URL. In our experience, a thorough description of the machine learning method but with only a cursory reference to the data is a red flag. If a standard dataset or a dataset from another study is being used, then this should be adequately justified in the article.

Is the test set valid?
Based on the discussion in the section Challenges for biological applications, check that the test set is sufficient to benchmark the property under investigation. There should be no data leakage between the training set and the test set, the test set should be large enough to give reliable results and the test set should mirror the range of examples a standard user of the tool would be likely to use it on. The composition and size of the training and test sets should be discussed in detail. Authors have a responsibility to ensure that all steps have been taken to avoid data leakage, and these steps should be described in the article, along with the rationale behind them. Journal editors and peer reviewers should also ensure that these tasks have been performed to a good standard, and certainly should never just assume that they have been.

Is the model choice justified?
Reasons should be given for the choice of machine learning method. Neural networks should be used because they are appropriate for the data and question in hand, and not just because everyone else is using them. Discussion of models that were tried and did not work should be encouraged as it may help others; too often a complex model is presented without any discussion of the inevitable trial and error that will have been required to end up with that model.

Has the method been compared with other methods?
A novel method should be compared with existing methods that show good performance and are used in the community. Ideally methods using a variety of model types should be compared, which can aid in interpreting results. It is surprising how many complex models can be matched in performance by simple regression methods.

Are the results too good to be true?
Claims of greater than 99% accuracy are not uncommon in machine learning articles in biology. Usually, this is a sign of a problem with the testing rather than an amazing breakthrough. Both authors and reviewers should take note of this point.

Is the method available?
At the very least, someone who wants to use a trained model from an article should be able to run a prediction using a Web service or binary file. Ideally, at least source code and the trained model should be available at a persistent URL and under a common licence149,150. Also making the training code available is the ideal scenario, as this further increases the reproducibility of the article and allows other researchers to build on the method without essentially having to start from scratch. Journals should bear some responsibility here to ensure that this becomes the norm.
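Two points from this part of the text, reading the parameters of a simple linear regression and running an ablation study, can be tried together in a small sketch. The data are synthetic and the function names are hypothetical; the fit uses ordinary least squares through the normal equations, which is one reasonable choice rather than a prescribed method.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_linear(rows, targets):
    """Ordinary least squares with an intercept, via the normal equations."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, targets)) for i in range(k)]
    return solve(xtx, xty)  # [intercept, weight_0, weight_1, ...]


def mean_squared_error(weights, rows, targets):
    preds = [weights[0] + sum(w * v for w, v in zip(weights[1:], r)) for r in rows]
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)


def drop_feature(rows, index):
    return [[v for j, v in enumerate(r) if j != index] for r in rows]


def ablation_study(train_x, train_y, test_x, test_y):
    """Test error of the full model and of models refitted without each feature."""
    full = mean_squared_error(fit_linear(train_x, train_y), test_x, test_y)
    without = {}
    for index in range(len(train_x[0])):
        w = fit_linear(drop_feature(train_x, index), train_y)
        without[index] = mean_squared_error(w, drop_feature(test_x, index), test_y)
    return full, without


# Synthetic data (an assumption for illustration): feature 0 matters a lot,
# feature 1 is irrelevant, feature 2 matters a little.
rng = random.Random(0)
features = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(200)]
response = [3.0 * f[0] + 0.5 * f[2] + rng.gauss(0, 0.1) for f in features]
train_x, test_x = features[:150], features[150:]
train_y, test_y = response[:150], response[150:]

weights = fit_linear(train_x, train_y)
full_error, error_without = ablation_study(train_x, train_y, test_x, test_y)
print("intercept and weights:", [round(w, 2) for w in weights])
print("test error, all features:", round(full_error, 4))
for index, err in error_without.items():
    print(f"test error without feature {index}:", round(err, 4))
```

The fitted weights indicate how each feature moves the prediction, and the ablation errors show how much test performance is lost when a feature is withheld. A baseline of this kind is also a cheap comparison point for a more complex model.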
machine learning algorithms being used. Building such


understanding takes time and effort, but is important to prevent the often unintentional dissemination of poor models and misleading results.

Future directions
The increasing use of machine learning in biological studies looks set to continue for the foreseeable future. This increased uptake has been enabled by important advances in methodology, software and hardware, all of which keep on developing. A number of large technology companies are using their technical expertise and considerable resources to assist academic researchers or even perform their own research in biology with innovative machine learning strategies. To date, however, most success has come from applying algorithms developed in other fields directly to biological data. For example, CNNs and RNNs were developed for applications such as image analysis (for face recognition or in self-driving cars) and natural language processing, respectively. One of the most exciting prospects for machine learning in biology is algorithms tailored specifically to biological data and biological questions101,102. If the known structure of a biological system can be exploited and neural networks used to learn the unknown parts, then increasingly heavily parameterized models can be replaced with simpler ones that are more amenable to interpretation and more robust on new data103. Applications include biological reaction systems and pharmacokinetics, where systems of known differential equations can be used. This will also assist in the move from predictive machine learning to generative models that can create new entities, such as designing proteins with novel structures and functions104,105.

As the variety of useful architectures and input data types increases, the paradigm of differentiable programming is emerging from the field of deep learning106. Differentiable programming is the use of automatic differentiation, the central concept in training neural networks, to calculate gradients and improve parameters in any desired algorithm. This shows promise for physical models of biological systems in protein structure prediction63,107, and for learning force field parameters for molecular dynamics simulations108,109. The development of differentiable software packages such as JAX110 and packages tailored to specific areas of biology such as Selene111, Janggu112 and JAX MD113 will assist the development of such methods.

The progress in biological data analysis with machine learning has also been enabled by the deposition of trained models in publicly available repositories. In this way, researchers working on similar problems can use these models without the need for training, and a variety of different models can be used with only minimal effort required for switching between them114. The field has also seen an expansion of automated machine learning pipelines, which train and tune a variety of models without user input and return the best performing model. These may assist non-experts in training models115. However, these resources cannot replace a thorough understanding of the method being used, which is important for choosing the appropriate inductive bias and interpreting the predictions of the model. It remains to be seen whether in the future automated machine learning will be reliable and flexible enough to allow experimentalists to routinely use complex machine learning algorithms independently, or whether machine learning expertise will remain a necessity.

As has been discussed, rigorous validation of models and comparison of different models is challenging but remains necessary to identify the best performing models and inform future research directions. For the field to progress, it will be necessary to develop benchmark datasets and validation tasks116, such as ProteinNet117, ATOM3D118 and TAPE119, and for these to become widely used. Of course, overoptimizing to a particular benchmark can occur, and it is important that researchers resist the temptation to do this to make their results seem better. Blind assessments such as CASP120 and the Critical Assessment of Functional Annotation121 will continue to play an important role in assessing which models perform best.

Overall, the variety of biological data makes it hard to provide general guidelines for machine learning in biology. Hence, we have aimed here to give biologists an overview of the different methods available and to provide them with some ideas about how to conduct effective machine learning with their data. Of course, machine learning is not suited to every problem, and it is just as important to know when to avoid it: when there are not sufficient data, when understanding rather than prediction is required or when it is unclear how to assess performance in a fair way. The boundaries of when machine learning is useful in biology are still being explored and will continue to change in accordance with the nature and volume of available experimental data. Undeniably, though, machine learning has had a huge impact on biology and will continue to do so.

Published online 13 September 2021

Automatic differentiation
A set of techniques to automatically calculate the gradient of a function in a computer program. Used to train neural networks, where it is called ‘backpropagation’.

Gradients
The rate of change of one property as another property changes. In neural networks, the set of gradients of the loss function with respect to the neural network parameters, computed via a process known as backpropagation, is used to adjust the parameters and thus train the model.
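The idea of differentiable programming, using automatic differentiation to obtain gradients and improve a parameter of an ordinary algorithm, can be shown with a toy sketch. It uses forward-mode differentiation with dual numbers, which is a simpler relative of the reverse-mode backpropagation used for neural networks, and it does not reflect how packages such as JAX are implemented. The `Dual` class, the first-order decay model and all numerical values are assumptions chosen for illustration.

```python
import math


class Dual:
    """A number that carries its value and its derivative with respect to one input."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    @staticmethod
    def lift(x):
        return x if isinstance(x, Dual) else Dual(float(x))

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __sub__(self, other):
        other = Dual.lift(other)
        return Dual(self.value - other.value, self.deriv - other.deriv)

    def __rsub__(self, other):
        return Dual.lift(other) - self

    def __mul__(self, other):
        other = Dual.lift(other)
        # Product rule.
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__

    def __neg__(self):
        return Dual(-self.value, -self.deriv)


def exp(x):
    """Exponential with the chain rule applied to the derivative part."""
    e = math.exp(x.value)
    return Dual(e, e * x.deriv)


def loss(rate, times, observed, initial=10.0):
    """Squared error between observations and the model initial * exp(-rate * t)."""
    total = Dual(0.0)
    for t, obs in zip(times, observed):
        residual = initial * exp(-rate * t) - obs
        total = total + residual * residual
    return total


def fit_rate(times, observed, start=0.1, learning_rate=2e-4, steps=2000):
    """Gradient descent on the decay rate, with the gradient supplied by Dual."""
    rate = start
    for _ in range(steps):
        # Seeding deriv=1.0 makes .deriv of the result equal d(loss)/d(rate).
        gradient = loss(Dual(rate, 1.0), times, observed).deriv
        rate -= learning_rate * gradient
    return rate


times = [0.0, 1.0, 2.0, 3.0, 4.0]
observed = [10.0 * math.exp(-0.5 * t) for t in times]  # synthetic data, true rate 0.5
print("fitted rate:", round(fit_rate(times, observed), 4))
```

The model here is a known equation with one unknown parameter, in the spirit of the reaction-system and pharmacokinetic applications mentioned in the text. The loss function is written as plain arithmetic, and the derivative is obtained without deriving it by hand.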

1. Ching, T. et al. Opportunities and obstacles for deep learning in biology and medicine. J. R. Soc. Interface 15, 20170387 (2018).
   This is a thorough review of applications of deep learning to biology and medicine including many references to the literature.
2. Mitchell, T. M. Machine Learning (McGraw Hill, 1997).
3. Goodfellow, I., Bengio, Y. & Courville, A. Deep Learning (MIT Press, 2016).
4. Libbrecht, M. W. & Noble, W. S. Machine learning applications in genetics and genomics. Nat. Rev. Genet. 16, 321–332 (2015).
5. Zou, J. et al. A primer on deep learning in genomics. Nat. Genet. 51, 12–18 (2019).
6. Myszczynska, M. A. et al. Applications of machine learning to diagnosis and treatment of neurodegenerative diseases. Nat. Rev. Neurol. 16, 440–456 (2020).
7. Yang, K. K., Wu, Z. & Arnold, F. H. Machine-learning-guided directed evolution for protein engineering. Nat. Methods 16, 687–694 (2019).
8. Tarca, A. L., Carey, V. J., Chen, X.-W., Romero, R. & Drăghici, S. Machine learning and its applications to biology. PLoS Comput. Biol. 3, e116 (2007).
   This is an introduction to machine learning concepts and applications in biology with a focus on traditional machine learning methods.
9. Silva, J. C. F., Teixeira, R. M., Silva, F. F., Brommonschenkel, S. H. & Fontes, E. P. B. Machine learning approaches and their current application in plant molecular biology: a systematic review. Plant. Sci. 284, 37–47 (2019).
10. Kandoi, G., Acencio, M. L. & Lemke, N. Prediction of druggable proteins using machine learning and systems biology: a mini-review. Front. Physiol. 6, 366 (2015).
11. Marblestone, A. H., Wayne, G. & Kording, K. P. Toward an integration of deep learning and neuroscience. Front. Comput. Neurosci. 10, 94 (2016).
12. Jiménez-Luna, J., Grisoni, F. & Schneider, G. Drug discovery with explainable artificial intelligence. Nat. Mach. Intell. 2, 573–584 (2020).
13. Buchan, D. W. A. & Jones, D. T. The PSIPRED Protein Analysis Workbench: 20 years on. Nucleic Acids Res. 47, W402–W407 (2019).
14. Kelley, D. R., Snoek, J. & Rinn, J. L. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 26, 990–999 (2016).


15. Altman, N. & Krzywinski, M. Clustering. Nat. Methods 43. Steinegger, M. & Söding, J. MMseqs2 enables with sequence-​based deep representation learning.
14, 545–546 (2017). sensitive protein sequence searching for the analysis Nat. Methods 16, 1315–1322 (2019).
16. Hopf, T. A. et al. Mutation effects predicted from of massive data sets. Nat. Biotechnol. 35, 69. Vaswani, A. et al. Attention is all you need.
sequence co-​variation. Nat. Biotechnol. 35, 128–135 1026–1028 (2017). arXiv https://arxiv.org/abs/1706.03762 (2017).
(2017). 44. Jain, A. K. Data clustering: 50 years beyond K-​means. 70. Elnaggar, A. et al. ProtTrans: towards cracking the
17. Zhang, Z. et al. Predicting folding free energy changes Pattern Recognit. Lett. 31, 651–666 (2010). language of life’s code through self-supervised deep
upon single point mutations. Bioinformatics 28, 45. Ester M., Kriegel H.-P., Sander J., Xu X. A density-​based learning and high performance computing. arXiv
664–671 (2012). algorithm for discovering clusters in large spatial https://arxiv.org/abs/2007.06225 (2020).
18. Pedregosa, F. et al. Scikit-​learn: machine learning in databases with noise. KDD‘96 Proc. Second Int. Conf. 71. Jumper, J. et al. Highly accurate protein structure
python. J. Mach. Learn. Res. 12, 2825–2830 (2011). Knowl. Discov. Data Mining. 96, 226–231 (1996). prediction with AlphaFold. Nature 596, 583–589
19. Kuhn, M. Building predictive models in r using the 46. Nguyen, L. H. & Holmes, S. Ten quick tips for effective (2021).
caret package. J. Stat. Softw. 28, 1–26 (2008). dimensionality reduction. PLoS Comput. Biol. 15, 72. Battaglia, P. W. et al. Relational inductive biases, deep
20. Blaom, A. D. et al. MLJ: a Julia package for e1006907 (2019). learning, and graph networks. arXiv https://arxiv.org/
composable machine learning. J. Open Source Softw. 47. Moon, K. R. et al. Visualizing structure and transitions abs/1806.01261 (2018).
5, 2704 (2020). in high-​dimensional biological data. Nat. Biotechnol. 73. Stokes, J. M. et al. A deep learning approach to
21. Jones, D. T. Setting the standards for machine learning 37, 1482–1492 (2019). antibiotic discovery. Cell 181, 475–483 (2020).
in biology. Nat. Rev. Mol. Cell Biol. 20, 659–660 48. van der Maaten, L. & Hinton, G. Visualizing data using In this work, a deep learning model predicts
(2019). t-​SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008). antibiotic activity, with one candidate showing
22. Alipanahi, B., Delong, A., Weirauch, M. T. & 49. Kobak, D. & Berens, P. The art of using t-​SNE for broad-​spectrum antibiotic activities in mice.
Frey, B. J. Predicting the sequence specificities single-​cell transcriptomics. Nat. Commun. 10, 5416 74. Gainza, P. et al. Deciphering interaction fingerprints
of DNA- and RNA-​binding proteins by deep learning. (2019). from protein molecular surfaces using geometric
Nat. Biotechnol. 33, 831–838 (2015). This article provides a discussion and tips for using deep learning. Nat. Methods 17, 184–192 (2020).
23. Senior, A. W. et al. Improved protein structure t-​SNE as a dimensionality reduction technique on 75. Strokach, A., Becerra, D., Corbi-​Verge, C., Perez-​Riba, A.
prediction using potentials from deep learning. single-​cell transcriptomics data. & Kim, P. M. Fast and flexible protein design using deep
Nature 577, 706–710 (2020). 50. Crick, F. The recent excitement about neural networks. graph neural networks. Cell Syst. 11, 402–411.e4
Technology company DeepMind entered the Nature 337, 129–132 (1989). (2020).
CASP13 assessment in protein structure prediction 51. Geirhos, R. et al. Shortcut learning in deep neural 76. Gligorijevic, V. et al. Structure-based function
and its method using deep learning was the most networks. Nat. Mach. Intell. 2, 665–673 (2020). prediction using graph convolutional networks.
accurate of the methods entered. This article discusses a common problem in deep Nat. Commun. 12, 3168 (2021).
24. Esteva, A. et al. Dermatologist-​level classification of learning called ‘shortcut learning’, where the 77. Zitnik, M., Agrawal, M. & Leskovec, J. Modeling
skin cancer with deep neural networks. Nature 542, model uses decision rules that do not transfer polypharmacy side effects with graph convolutional
115–118 (2017). to real-​world data. networks. Bioinformatics 34, i457–i466 (2018).
25. Tegunov, D. & Cramer, P. Real-​time cryo-​electron 52. Qian, N. & Sejnowski, T. J. Predicting the secondary 78. Veselkov, K. et al. HyperFoods: machine intelligent
microscopy data preprocessing with Warp. structure of globular proteins using neural network mapping of cancer-​beating molecules in foods.
Nat. Methods 16, 1146–1152 (2019). models. J. Mol. Biol. 202, 865–884 (1988). Sci. Rep. 9, 9237 (2019).
26. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. 53. deFigueiredo, R. J. et al. Neural-​network-based 79. Fey, M. & Lenssen, J. E. Fast graph representation
Nature 521, 436–444 (2015). classification of cognitively normal, demented, learning with PyTorch geometric. arXiv https://
This is a review of deep learning by some of the Alzheimer disease and vascular dementia from single arxiv.org/abs/1903.02428 (2019).
major figures in the deep learning revolution. photon emission with computed tomography image 80. Zhavoronkov, A. et al. Deep learning enables rapid
27. Hastie T., Tibshirani R., Friedman J. The elements data from brain. Proc. Natl Acad. Sci. USA 92, identification of potent DDR1 kinase inhibitors.
of statistical learning: data mining, inference, and 5530–5534 (1995). Nat. Biotechnol. 37, 1038–1040 (2019).
prediction. 2nd Edn. (Springer Science & Business 54. Mayr, A., Klambauer, G., Unterthiner, T. & Hochreiter, S. 81. Wang, Y. et al. Predicting DNA methylation state
Media; 2009). DeepTox: toxicity prediction using deep learning. of CpG dinucleotide using genome topological
28. Adebayo, J. et al. Sanity checks for saliency maps. Front. Environ. Sci. 3, 80 (2016). features and deep networks. Sci. Rep. 6, 19598
NeurIPS https://arxiv.org/abs/1810.03292 (2018). 55. Yang, J. et al. Improved protein structure prediction (2016).
29. Gal, Y. & Ghahramani, Z. Dropout as a Bayesian using predicted interresidue orientations. Proc. Natl 82. Linder, J., Bogard, N., Rosenberg, A. B. & Seelig, G.
approximation: representing model uncertainty in Acad. Sci. USA 117, 1496–1503 (2020). A generative neural network for maximizing fitness
deep learning. ICML 48, 1050–1059 (2016). 56. Xu, J., Mcpartlon, M. & Li, J. Improved protein and diversity of synthetic DNA and protein sequences.
30. Smith, A. M. et al. Standard machine learning structure prediction by deep learning irrespective Cell Syst. 11, 49–62.e16 (2020).
approaches outperform deep representation learning of co-evolution information. Nat. Mach. Intell. 3, 83. Greener, J. G., Moffat, L. & Jones, D. T. Design
on phenotype prediction from transcriptomics data. 601–609 (2021). of metalloproteins and novel protein folds using
BMC Bioinformatics 21, 119 (2020). 57. Poplin, R. et al. A universal SNP and small-​indel variational autoencoders. Sci. Rep. 8, 16189
31. Tibshirani, R. Regression shrinkage and selection variant caller using deep neural networks. (2018).
via the lasso. J. R. Stat. Soc. Ser. B. 58, 267–288 Nat. Biotechnol. 36, 983–987 (2018). 84. Wang, J. et al. scGNN is a novel graph neural
(1996). 58. Fudenberg, G., Kelley, D. R. & Pollard, K. S. Predicting network framework for single-​cell RNA-​Seq analyses.
32. Zou, H. & Hastie, T. Regularization and variable 3D genome folding from DNA sequence with Akita. Nat. Commun. 12, 1882 (2021).
selection via the elastic net. J. R. Stat. Soc. Ser. B. 67, Nat. Methods 17, 1111–1117 (2020). 85. Paszke, A. et al. PyTorch: an imperative style,
301–320 (2005). 59. Zeng, H., Edwards, M. D., Liu, G. & Gifford, D. K. high-​performance deep learning library. Adv. Neural
33. Noble, W. S. What is a support vector machine? Convolutional neural network architectures for Inf. Process. Syst. 32, 8024–8035 (2019).
Nat. Biotechnol. 24, 1565–1567 (2006). predicting DNA-protein binding. Bioinformatics 32, 86. Abadi, M. et al. TensorFlow: a system for large-scale
34. Ben-​Hur, A. & Weston, J. A user’s guide to support i121–i127 (2016). machine learning. 12th USENIX Symposium on
vector machines. Methods Mol. Biol. 609, 223–239 60. Yao, R., Qian, J. & Huang, Q. Deep-​learning with Operating Systems Design and Implementation.
(2010). synthetic data enables automated picking of cryo-​EM 265–283 (USENIX, 2016).
35. Ben-​Hur, A., Ong, C. S., Sonnenburg, S., Schölkopf, B. particle images of biological macromolecules. 87. Wei, Q. & Dunbrack, R. L. Jr The role of balanced
& Rätsch, G. Support vector machines and kernels Bioinformatics 36, 1252–1259 (2020). training and testing data sets for binary classifiers
for computational biology. PLoS Comput. Biol. 4, 61. Si, D. et al. Deep learning to predict protein backbone in bioinformatics. PLoS ONE 8, e67863 (2013).
e1000173 (2008). structure from high-​resolution cryo-​EM density maps. 88. Walsh, I., Pollastri, G. & Tosatto, S. C. E. Correct
This is an introduction to SVMs with a focus Sci. Rep. 10, 4282 (2020). machine learning on protein sequences: a peer-reviewing
on biological data and prediction tasks. 62. Poplin, R. et al. Prediction of cardiovascular risk perspective. Brief. Bioinform. 17, 831–840
36. Kircher, M. et al. A general framework for estimating factors from retinal fundus photographs via deep (2016).
the relative pathogenicity of human genetic variants. learning. Nat. Biomed. Eng. 2, 158–164 (2018). This article discusses how peer reviewers can
Nat. Genet. 46, 310–315 (2014). 63. AlQuraishi, M. End-​to-end differentiable learning of assess machine learning methods in biology, and
37. Driscoll, M. K. et al. Robust and automated detection protein structure. Cell Syst. 8, 292–301.e3 (2019). by extension how scientists can design and conduct
of subcellular morphological motifs in 3D microscopy 64. Heffernan, R., Yang, Y., Paliwal, K. & Zhou, Y. such studies properly.
images. Nat. Methods 16, 1037–1044 (2019). Capturing non-​local interactions by long short-​term 89. Schreiber, J., Singh, R., Bilmes, J. & Noble, W. S.
38. Bzdok, D., Krzywinski, M. & Altman, N. Machine memory bidirectional recurrent neural networks for A pitfall for machine learning methods aiming to
learning: supervised methods. Nat. Methods 15, 5–6 improving prediction of protein secondary structure, predict across cell types. Genome Biol. 21, 282
(2018). backbone angles, contact numbers and solvent (2020).
39. Wang, C. & Zhang, Y. Improving scoring-​docking- accessibility. Bioinformatics 33, 2842–2849 (2017). 90. Chothia, C. & Lesk, A. M. The relation between the
screening powers of protein-​ligand scoring functions 65. Müller, A. T., Hiss, J. A. & Schneider, G. Recurrent divergence of sequence and structure in proteins.
using random forest. J. Comput. Chem. 38, 169–177 neural network model for constructive peptide design. EMBO J. 5, 823–826 (1986).
(2017). J. Chem. Inf. Model. 58, 472–479 (2018). 91. Söding, J. & Remmert, M. Protein sequence
40. Zeng, W., Wu, M. & Jiang, R. Prediction of 66. Choi, E., Bahadori, M. T., Schuetz, A., Stewart, W. F. comparison and fold recognition: progress and
enhancer-​promoter interactions via natural language & Sun, J. Doctor AI: predicting clinical events via good-​practice benchmarking. Curr. Opin. Struct. Biol.
processing. BMC Genomics 19, 84 (2018). recurrent neural networks. JMLR Workshop Conf. 21, 404–411 (2011).
41. Olson, R. S., Cava, W. L., Mustahsan, Z., Varik, A. & Proc. 56, 301–318 (2016). 92. Steinegger, M. et al. HH-​suite3 for fast remote
Moore, J. H. Data-​driven advice for applying machine 67. Quang, D. & Xie, X. DanQ: a hybrid convolutional homology detection and deep protein annotation.
learning to bioinformatics problems. Pac. Symp. and recurrent deep neural network for quantifying the BMC Bioinformatics 20, 473 (2019).
Biocomput. 23, 192–203 (2018). function of DNA sequences. Nucleic Acids Res. 44, 93. Sillitoe, I. et al. CATH: expanding the horizons of
42. Rappoport, N. & Shamir, R. Multi-​omic and multi-​view e107 (2016). structure-​based functional annotations for genome
clustering algorithms: review and cancer benchmark. 68. Alley, E. C., Khimulya, G., Biswas, S., AlQuraishi, M. sequences. Nucleic Acids Res. 47, D280–D284
Nucleic Acids Res. 47, 1044 (2019). & Church, G. M. Unified rational protein engineering (2019).



94. Cheng, H. et al. ECOD: an evolutionary classification 116. Livesey, B. J. & Marsh, J. A. Using deep mutational LC-​MS spectral peaks. Anal. Chem. 91, 12407–12413
of protein domains. PLoS Comput. Biol. 10, e1003926 scanning to benchmark variant effect predictors and (2019).
(2014). identify disease mutations. Mol. Syst. Biol. 16, e9380 140. Dührkop, K. et al. SIRIUS 4: a rapid tool for turning
95. Li, Y. & Yang, J. Structural and sequence similarity (2020). tandem mass spectra into metabolite structure
makes a significant impact on machine-​learning- 117. AlQuraishi, M. ProteinNet: a standardized data information. Nat. Methods 16, 299–302 (2019).
based scoring functions for protein-​ligand interactions. set for machine learning of protein structure. 141. Liebal, U. W., Phan, A. N. T., Sudhakar, M., Raman, K.
J. Chem. Inf. Model. 57, 1007–1012 (2017). BMC Bioinformatics 20, 311 (2019). & Blank, L. M. Machine learning applications for mass
96. Zech, J. R. et al. Variable generalization performance 118. Townshend, R. J. L. et al. ATOM3D: tasks on molecules spectrometry-​based metabolomics. Metabolites 10,
of a deep learning model to detect pneumonia in chest in three dimensions. arXiv https://arxiv.org/abs/2012.04035 243 (2020).
radiographs: a cross-sectional study. PLoS Med. 15, (2020). 142. Zhong, E. D., Bepler, T., Berger, B. & Davis, J. H.
e1002683 (2018). 119. Rao, R. et al. Evaluating protein transfer learning with CryoDRGN: reconstruction of heterogeneous cryo-​EM
97. Szegedy, C. et al. Intriguing properties of neural TAPE. Adv. Neural. Inf. Process. Syst. 32, 9689–9701 structures using neural networks. Nat. Methods 18,
networks. arXiv https://arxiv.org/abs/1312.6199 (2019). 176–185 (2021).
(2014). 120. Kryshtafovych, A., Schwede, T., Topf, M., Fidelis, K. 143. Schmauch, B. et al. A deep learning model to predict
98. Hie, B., Cho, H. & Berger, B. Realizing private and & Moult, J. Critical assessment of methods of protein RNA-​Seq expression of tumours from whole slide
practical pharmacological collaboration. Science 362, structure prediction (CASP) — round XIII. Proteins 87, images. Nat. Commun. 11, 3877 (2020).
347–350 (2018). 1011–1020 (2019). 144. Das, P. et al. Accelerated antimicrobial discovery via
99. Beaulieu-​Jones, B. K. et al. Privacy-​preserving 121. Zhou, N. et al. The CAFA challenge reports improved deep generative models and molecular dynamics
generative deep neural networks support clinical protein function prediction and new functional simulations. Nat. Biomed. Eng. 5, 613–623 (2021).
data sharing. Circ. Cardiovasc. Qual. Outcomes 12, annotations for hundreds of genes through 145. Gligorijevic, V., Barot, M. & Bonneau, R. deepNF:
e005122 (2019). experimental screens. Genome Biol. 20, 244 (2019). deep network fusion for protein function prediction.
100. Konečný, J., Brendan McMahan, H., Ramage, D. 122. Munro, D. & Singh, M. DeMaSk: a deep mutational Bioinformatics 34, 3873–3881 (2018).
& Richtárik, P. Federated optimization: distributed scanning substitution matrix and its use for variant 146. Karpathy, A. A recipe for training neural networks.
machine learning for on-device intelligence. arXiv impact prediction. Bioinformatics 36, 5322–5329 https://karpathy.github.io/2019/04/25/recipe
https://arxiv.org/abs/1610.02527 (2016). (2020). (2019).
101. Pérez, A., Martínez-​Rosell, G. & De Fabritiis, G. 123. Haario, H. & Taavitsainen, V.-M. Combining soft and 147. Bengio, Y. Practical recommendations for gradient-​
Simulations meet machine learning in structural hard modelling in chemical kinetic models. Chemom. based training of deep architectures. Lecture Notes
biology. Curr. Opin. Struct. Biol. 49, 139–144 (2018). Intell. Lab. Syst. 44, 77–98 (1998). Comput. Sci. 7700, 437–478 (2012).
102. Noé, F., Olsson, S., Köhler, J. & Wu, H. Boltzmann 124. Cozzetto, D., Minneci, F., Currant, H. & Jones, D. T. 148. Roberts, M. et al. Common pitfalls and
generators: sampling equilibrium states of many-​body FFPred 3: feature-​based function prediction for all recommendations for using machine learning to
systems with deep learning. Science 365, 6457 gene ontology domains. Sci. Rep. 6, 31865 (2016). detect and prognosticate for COVID-19 using chest
(2019). 125. Nugent, T. & Jones, D. T. Transmembrane protein radiographs and CT scans. Nat. Mach. Intell. 3,
103. Shrikumar, A., Greenside, P. & Kundaje, A. Reverse- topology prediction using support vector machines. 199–217 (2021).
complement parameter sharing improves deep BMC Bioinformatics 10, 159 (2009). This study assesses 62 machine learning studies
learning models for genomics. bioRxiv https://www.biorxiv.org/content/10.1101/103663v1 126. Bao, L., Zhou, M. & Cui, Y. nsSNPAnalyzer: identifying that analyse medical images for COVID-19 and
(2017). disease-associated nonsynonymous single nucleotide none is found to be of clinical use, indicating the
104. Lopez, R., Gayoso, A. & Yosef, N. Enhancing scientific polymorphisms. Nucleic Acids Res. 33, W480–W482 difficulties of training a useful model.
discoveries in molecular biology with deep generative (2005). 149. List, M., Ebert, P. & Albrecht, F. Ten simple rules for
models. Mol. Syst. Biol. 16, e9198 (2020). 127. Li, W., Yin, Y., Quan, X. & Zhang, H. Gene expression developing usable software in computational biology.
105. Anishchenko, I., Chidyausiku, T. M., Ovchinnikov, S., value prediction based on XGBoost algorithm. Front. PLoS Comput. Biol. 13, e1005265 (2017).
Pellock, S. J. & Baker, D. De novo protein design by Genet. 10, 1077 (2019). 150. Sonnenburg, S., Braun, M. L., Ong, C. S. & Bengio, S.
deep network hallucination. bioRxiv https://doi.org/10.1101/2020.07.22.211482 128. Zhang, Y. & Skolnick, J. SPICKER: a clustering approach The need for open source software in machine learning.
(2020). to identify near-native protein folds. J. Comput. Chem. J. Mach. Learn. Res. 8, 2443–2466 (2007).
106. Innes, M. et al. A differentiable programming system 30, 865–871 (2004).
to bridge machine learning and scientific computing. 129. Teodoro, M. L., Phillips, G. N. Jr & Kavraki, L. E. Acknowledgements
arXiv https://arxiv.org/abs/1907.07587 (2019). Understanding protein flexibility through The authors thank members of the UCL Bioinformatics Group
107. Ingraham, J., Riesselman, A. J., Sander, C. & Marks, D. S. dimensionality reduction. J. Comput. Biol. 10, for valuable discussions and comments. This work was supported
Learning protein structure with a differentiable simulator. 617–634 (2003). by the European Research Council Advanced Grant
ICLR https://openreview.net/forum?id=Byg3y3C9Km 130. Schlichtkrull, M. et al. Modeling relational data with ProCovar (project ID 695558).
(2019). graph convolutional networks. arXiv https://arxiv.org/abs/1703.06103
108. Jumper, J. M., Faruk, N. F., Freed, K. F. & Sosnick, T. R. (2019). Author contributions
Trajectory-​based training enables protein simulations 131. Pandarinath, C. et al. Inferring single-​trial neural All authors researched data for the article, contributed sub-
with accurate folding and Boltzmann ensembles in population dynamics using sequential auto-​encoders. stantially to discussion of the content, wrote the article and
cpu-​hours. PLoS Comput. Biol. 14, e1006578 (2018). Nat. Methods 15, 805–815 (2018). reviewed the manuscript before submission.
109. Wang, Y., Fass, J. & Chodera, J. D. End-to-end 132. Antczak, M., Michaelis, M. & Wass, M. N.
differentiable molecular mechanics force field Environmental conditions shape the nature of a minimal Competing interests
construction. arXiv http://arxiv.org/abs/2010.01196 bacterial genome. Nat. Commun. 10, 3100 (2019). The authors declare no competing interests.
(2020). 133. Sun, T., Zhou, B., Lai, L. & Pei, J. Sequence-​based
110. Bradbury, J. et al. JAX: composable transformations of prediction of protein protein interaction using a Peer review information
Python+NumPy programs. GitHub http://github.com/google/jax deep-learning algorithm. BMC Bioinformatics 18, Nature Reviews Molecular Cell Biology thanks S. Draghici
(2018). 277 (2017). who co-reviewed with T. Nguyen; B. Chain; S. Haider;
111. Chen, K. M., Cofer, E. M., Zhou, J. & Troyanskaya, O. G. 134. Hiranuma, N. et al. Improved protein structure F. Mahmood; and the other, anonymous, reviewer(s) for their
Selene: a PyTorch-​based deep learning library for refinement guided by deep learning based accuracy contribution to the peer review of this work.
sequence data. Nat. Methods 16, 315–318 (2019). estimation. Nat. Commun. 12, 1340 (2021).
This work provides a software library based on 135. Pagès, G., Charmettant, B. & Grudinin, S. Protein Publisher’s note
PyTorch providing functionality for biological model quality assessment using 3D oriented Springer Nature remains neutral with regard to jurisdictional
sequences. convolutional neural networks. Bioinformatics 35, claims in published maps and institutional affiliations.
112. Kopp, W., Monti, R., Tamburrini, A., Ohler, U. 3313–3319 (2019).
& Akalin, A. Deep learning for genomics using 136. Pires, D. E. V., Ascher, D. B. & Blundell, T. L. DUET: Related links
Janggu. Nat. Commun. 11, 3488 (2020). a server for predicting effects of mutations on protein Caret: https://topepo.github.io/caret
113. Schoenholz, S. S. & Cubuk, E. D. JAX, M.D.: end-to- stability using an integrated computational approach. Colaboratory: https://research.google.com/colaboratory
end differentiable, hardware accelerated, molecular Nucleic Acids Res. 42, W314–W319 (2014). Graph Nets: https://github.com/deepmind/graph_nets
dynamics in pure Python. arXiv https://arxiv.org/abs/1912.04232 137. Yuan, Y. & Bar-Joseph, Z. Deep learning for inferring MLJ: https://alan-turing-institute.github.io/MLJ.jl/stable
(2019). gene relationships from single-cell expression data. PyTorch: https://pytorch.org
114. Avsec, Ž. et al. The Kipoi repository accelerates Proc. Natl Acad. Sci. USA 116, 27151–27158 (2019). PyTorch Geometric: https://pytorch-geometric.readthedocs.io/en/latest
community exchange and reuse of predictive models 138. Chen, L., Cai, C., Chen, V. & Lu, X. Learning a
for genomics. Nat. Biotechnol. 37, 592–600 (2019). hierarchical representation of the yeast transcriptomic scikit-learn: https://scikit-learn.org/stable
115. Isensee, F., Jaeger, P. F., Kohl, S. A. A., Petersen, J. machinery using an autoencoder model. BMC TensorFlow: https://www.tensorflow.org
& Maier-​Hein, K. H. nnU-​Net: a self-​configuring Bioinformatics 17, S9 (2016).
method for deep learning-​based biomedical image 139. Kantz, E. D., Tiwari, S., Watrous, J. D., Cheng, S.
segmentation. Nat. Methods 18, 203–211 (2020). & Jain, M. Deep neural networks for classification of © Springer Nature Limited 2021

