Prediction Machines Applied Machine Learning For Therapeutic
Review
ARTICLE INFO
Article history:
Received 29 October 2020
Revised 27 November 2020
Accepted 27 November 2020
Available online 2 December 2020
Keywords:
Machine learning
Protein stability
Formulation
Viscosity
Excipient
ABSTRACT
The rapid growth in technological advances and quantity of scientific data over the past decade has led to several challenges including data storage and analysis. Accurate models of complex datasets were previously difficult to develop and interpret. However, improvements in machine learning algorithms have since enabled unparalleled classification and prediction capabilities. The application of machine learning can be seen throughout diverse industries due to their ease of use and interpretability. In this review, we describe popular machine learning algorithms and highlight their application in pharmaceutical protein development. Machine learning models have now been applied to better understand the nonlinear concentration dependent viscosity of protein solutions, predict protein oxidation and deamidation rates, classify sub-visible particles and compare the physical stability of proteins. We also applied several machine learning algorithms using previously published data and describe models with improved predictions and classification. The authors hope that this review can be used as a resource to others and encourage continued application of machine learning algorithms to problems in pharmaceutical protein development.
Published by Elsevier Inc. on behalf of the American Pharmacists Association.
https://doi.org/10.1016/j.xphs.2020.11.034
0022-3549/Published by Elsevier Inc. on behalf of the American Pharmacists Association.
666 T.J. Kamerzell, C.R. Middaugh / Journal of Pharmaceutical Sciences 110 (2021) 665-681
learning are the most well-known sub-categories with broad scientific application. Common applications of machine learning include classification and regression. Regression analysis predicts results with a continuous output by mapping input variables to a continuous function. Regression models find numerical relationships between predictor or explanatory variables and a continuous response variable or outcome. Classification problems divide the data into different classes or categories with discrete labels.
Multiple steps are required to develop a functional machine learning model. Fig. 1 highlights a typical workflow for model development. The first step is to better understand and visualize raw data. Common ways to explore data include creating visualizations (e.g., histograms, scatterplots, dendrograms), identifying associations, finding clusters, selecting features and reducing the dimensions of the data. Following initial exploration, most datasets require pre-processing using data normalization, imputation, discretization and/or smoothing. Pre-processing enables the further extraction of meaningful information and eventual transformation into numerical features usable for machine learning. Pre-processing techniques such as dimensionality reduction and feature extraction remove redundant or noisy data, often increase the accuracy of the model and improve the running time of learning algorithms.8
Dimensionality reduction techniques are used to reduce model complexity and avoid overfitting. Two popular methods to reduce the data dimensions are feature selection and feature extraction. Feature selection identifies the features which contribute most to prediction and removes the irrelevant or less important features. Several statistical correlation metrics used to select the most important features are Pearson's correlation coefficient, chi-square statistics, mutual information and Spearman's rank coefficient. Feature extraction combines variables and data into new features. Principal component analysis, linear discriminant analysis, locally linear embedding and t-distributed stochastic neighbor embedding are methods used to extract features and reduce dimensions.9 Dimensionality reduction simplifies the dataset and improves model performance.10 The most common methods, such as principal components analysis (PCA), apply linear transformations to project data onto new hyperplanes. PCA is a linear technique that describes most of the variance in the data.11 Linear discriminant analysis (LDA) is another linear method that projects data to new axes to maximize class separability. During training, LDA learns the most discriminative axes between the classes, which can subsequently be used to define a hyperplane onto which the data are projected. In contrast, nonlinear dimensionality reduction methods are used when the data does not lie on a linear space. Popular nonlinear methods include multidimensional scaling (MDS), isometric feature mapping and locally linear embedding among others.10 Both MDS and isometric feature maps use techniques for evaluating similarity or dissimilarity of data as distances in geometric space.
The two most common categories of machine learning are supervised and unsupervised learning. Supervised learning includes input data with known response variables to make predictions. In contrast, unsupervised learning identifies patterns from datasets consisting of input data without known responses. The choice of algorithm depends on user needs including the availability of responses, speed of training, memory usage, required accuracy and
ease of interpretability. Both supervised and unsupervised learning can be used for understanding complex relationships, quantitative regression analysis, classification tasks and cluster analysis.12
Classification is a learning task where the goal is to predict the categorical class of an observation, that is, to divide data into different categories (e.g., is an email spam or not). The class labels are discrete values and may be either binary (two classes) or multiclass. Classification problems can be solved with many types of algorithms depending on the data and task. Several common classification algorithms include logistic regression, support vector machines, decision trees, boosted trees, k-nearest neighbor, random forest and neural networks.12 Regression analysis is another learning method used for understanding the relationship between dependent and independent variables for the purpose of predicting a quantitative response. Regression methods are generally considered supervised learning techniques in that they try to explain a dependent variable in terms of independent variables. Many different regression algorithms are used for prediction including the well-known linear regression, multivariate regression, multiple regression, least absolute shrinkage and selection operator regression, nearest neighbor regression, Gaussian process regression and regression trees.12,13 Several different machine learning algorithms are described below and compared in Table 1.
A model intended for regression or classification in which the target value is expected to be a linear combination of the features is a linear model. Linear regression models the relationship between dependent variables and independent variables using linear predictor functions whose unknown model parameters are estimated from the data. A linear regression model assumes that the regression function $E(Y|X)$ is linear in the inputs $X_1, \ldots, X_p$ and has the form $f(X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j$.12 Here the $\beta_j$'s are unknown parameters or coefficients and the variables $X_j$ can come from different sources including quantitative inputs, transformations, basis expansions or numeric coding (dummy) of the levels of qualitative inputs. Generally, we estimate the parameters $\beta$ using a set of training data $((x_1, y_1), \ldots, (x_N, y_N))$ using a method such as least squares in which we identify the coefficients to minimize the residual sum of squares. Additional algorithms can be used including subset selection (stepwise regression), shrinkage estimates (ridge regression, lasso) and methods using derived input directions (principal components regression). Nonlinear regression models also describe the relationship between dependent and independent variables, however, the function cannot be expressed as a linear combination of the parameters or coefficients. Parametric nonlinear models represent the relationship with model parameters whereas nonparametric models do not involve assumptions as to the form or parameters. Popular nonparametric machine learning algorithms include decision trees, k-nearest neighbors and support vector machines. Several machine learning algorithms are capable of performing linear or nonlinear classification or regression.12
The next step in model development is model training, assessment and validation. The performance of a method is related to its prediction capability on independent test data. Three common methods to examine accuracy include evaluation of the resubstitution, cross-validation and out-of-bag error.12,14,15 For large datasets, some percentage of the data should be held out for validation and the remainder used for training. Cross-validation is perhaps the simplest and most widely used method for estimating prediction error. Cross-validation involves partitioning a sample of data into complementary subsets, performing analysis on one subset and validating the analysis on the other subset. If sufficient data is available, simple hold out validation can be used where training is performed on one set and evaluation on the test set. However, if little data is available, then validation and test sets may contain too few samples to be statistically representative. In that case, the data can be partitioned and multiple rounds of training and evaluation completed using different subsets. For example, k-fold validation is an approach that splits data into k partitions of equal size. For each partition i, the model is trained on the remaining k − 1 partitions and evaluated on partition i. The final score is the average of the k scores obtained. Similar cross validation strategies with subtle differences in the algorithm include repeated k-fold, leave one out, random permutations or shuffling, stratified k-fold and group k-fold. The performance of regression models is often evaluated by calculating the root mean square error (RMSE), coefficient of determination (R²), mean squared error (MSE), mean absolute error (MAE) and residual plots. The residual is the difference between the predicted and true responses, and the residuals should be distributed symmetrically around 0. A classifier may be evaluated by a confusion matrix in which the columns are the predicted class and the rows are the actual class. In the confusion matrix, TN is the number of negative examples correctly classified, FP is the number of negative examples incorrectly classified as positive, FN is the number of positive examples incorrectly classified as negative and TP is the number of positive examples correctly classified. Predictive accuracy is,
Table 1
Characteristics of Machine Learning Algorithms.
Decision Trees | Medium to High | Fast | Supervised | Easy to implement and interpret. Minimal data preparation. | Tend to overfit data with large number of features. Unstable to small variations. Bias if specific classes dominate.
SVM | Medium to High | Medium | Supervised | Optimal for complex small or medium datasets. Effective in high dimensions. Memory efficient. | Compute and storage requirements increase rapidly with number of training vectors. Choice of kernel important.
Nearest Neighbor | Low to Medium | Fast learning, Slow class | Supervised, Unsupervised | Simple to use and implement. Robust to outliers. | Memory intensive. Susceptible to overfitting. Data skewed if more data points of one class. Poorly performing for high dimensional data.
Ensemble Learning | High | Variable | Supervised, Unsupervised | Learn nonlinear relationships. Robust to outliers. High accuracy. Flexible. | Difficult to interpret. May not be suitable for small samples.
Bayesian Algorithms | Medium | Medium | Supervised, Unsupervised | Simple to use and implement. Can scale with dataset. | Tend to underperform compared to other algorithms.
Artificial Neural Networks | High | Variable | Supervised, Unsupervised | Good performance. Tolerance of correlated inputs. Identifies complex relationships. | Difficult to interpret. Susceptible to irrelevant features. Overfitting.
Hierarchical clustering | Low | Fast | Unsupervised | Do not need to specify number of clusters. Easy to interpret. Fast and scalable. | Susceptible to overfitting.
k-means clustering | Low | Fast | Unsupervised | Easy to implement. Fast and scalable. | Must specify number of clusters. Susceptible to overfitting.
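The k-fold procedure described in the text (split the data into k partitions, train on k − 1 of them, evaluate on the held-out partition and average the k scores) can be illustrated with a minimal sketch. The code below is an assumption-laden illustration and not the authors' implementation: the function names, the toy response values and the stand-in "model" that simply predicts the training mean are all hypothetical, RMSE is used as the score, and the data are not shuffled before partitioning.

```python
import math

def k_fold_indices(n_samples, k):
    """Split indices 0..n_samples-1 into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def rmse(y_true, y_pred):
    """Root mean square error between true and predicted responses."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def fit_mean_model(y_train):
    """Hypothetical stand-in model: always predicts the mean of the training responses."""
    mean = sum(y_train) / len(y_train)
    return lambda n: [mean] * n

def cross_validate(y, k):
    """Train on k-1 folds, score on the held-out fold, and average the k scores."""
    folds = k_fold_indices(len(y), k)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        model = fit_mean_model([y[j] for j in train_idx])
        y_test = [y[j] for j in test_idx]
        scores.append(rmse(y_test, model(len(y_test))))
    return sum(scores) / k

# Toy responses (hypothetical values), evaluated with 3-fold cross-validation.
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
cv_rmse = cross_validate(y, k=3)
```

In practice the data would usually be shuffled (or stratified) before the folds are formed, and the stand-in model would be replaced by the learner under evaluation.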
Accuracy = (TP + TN)/(TP + FP + TN + FN). Predictive accuracy might not be appropriate when the data is unbalanced, for example when a dataset contains many more observations from one class than from the other. The Receiver Operating Characteristic (ROC) curve may be more appropriate for describing classifier performance and the Area Under the Curve (AUC) is a widely accepted performance metric. Similarly, the F-value metric combines the trade-offs between different values of TP, FP and FN and is useful for learning from imbalanced datasets.
Supervised Learning
Supervised learning is a sub-category of machine learning that uses known inputs and outputs to build predictive models. The inputs are independent variables and are often referred to as the predictors or features. A feature is a characteristic of the data, often an explanatory variable, represented as a vector. The outputs are called the responses or dependent variables. The output data types can be quantitative or qualitative, lending themselves to different types of prediction methods. The term supervised refers to training examples where the desired outputs are already known, which allows us to make predictions about unseen or future data. Supervised learning algorithms are trained with examples to identify the patterns in data that correlate with the desired output. After training, the algorithm can then identify the correct label or response for newly presented or unseen data. Several supervised learning algorithms can be used for both classification and regression tasks.
K Nearest Neighbors
K nearest neighbors (KNN) is a supervised learning technique used for both classifying samples and performing regression based on the features of a sample.12,13 This method has been widely used in pharmaceutics for diverse applications including formulation prediction,16 counterfeit detection of tablets,17 pharmaceutical fingerprinting18 and lead discovery,19 among others. KNN is perhaps the most straightforward of all supervised learning methods although frequently not the most accurate. Using this method, the distances between a new data point and its K nearest neighbors among the already classified data are measured, and the new point is classified according to the class of the closest neighbors. The algorithm locates the k-nearest neighbors within a specified distance to query data points based on a user specified distance metric. If K = 3, the three nearest neighbors are measured. If the chosen value of K is small then the variance is generally high whereas the bias is high with a large value of K. With large datasets KNN tends to be slow, and predictions are skewed if there are many more data points of one class. However, the ease of application and use of KNN makes it a convenient baseline against which to compare the results from other techniques.
Support Vector Machines
Support vector machines (SVMs) are versatile supervised machine learning methods that can be used for linear or nonlinear classification or regression.12,13 Support vector machines have found broad application in protein structure prediction, drug design and classification and cancer diagnosis.6,20–22 Support vector machines divide data according to which side of a hyperplane in feature space a data point lies on. A hyperplane is a subspace of dimension D − 1 that preserves parallel relationships with a corresponding vector space of dimension D. The algorithms construct a hyperplane or set of hyperplanes capable of dividing D-dimensional space into halves. Support vector machines produce nonlinear boundaries by constructing a linear boundary in a large, transformed feature space. The geometry of the hyperplane generates the largest margin possible between the classes. The margin is the distance of the separating hyperplane to the closest examples in the dataset. The advantages of SVMs include versatility, memory efficiency and effectiveness in high dimensions while the primary disadvantages of SVMs are that the final models may be difficult to interpret and the algorithms are not trivial to implement without technical expertise.12,21
Decision Trees
Decision Trees (DT) are classification and regression methods that predict responses to data by following binary decisions in an “upside down tree” structure from a singular starting point down through the branches to a leaf. In one study, Zhao et al. used decision trees as a flowchart to guide small molecule pharmaceutical product development decisions from processing routes, choice of excipients and equipment sizing.23 Decision trees partition the feature space into rectangles and fit relatively simple models to each space.12,24,25 The algorithm finds the best binary partition and split points to grow the tree. The most common algorithms use features that yield the largest information gain at each node. Decision trees are capable of modelling complex datasets and are easy to understand and interpret because they can be directly visualized. The disadvantages of decision trees include ease of overfitting and instability, with small variations in the data resulting in different trees.
Ensemble Methods
Ensemble learning methods build prediction models by developing a population of “base” learners from the training data and then combining the base models to form a composite predictor.12 These methods are some of the most popular and accurate machine learning approaches used today. Examples of ensemble algorithms are bagging predictors, boosting, random forests and neural networks. Random forest methods can be used for classification or regression by constructing multiple decision trees, with the prediction of the ensemble provided as the averaged prediction of the individual trees. Each tree is built from a bootstrap sample from the training set. Random forests are one of the most popular methods previously used in pharmaceutical protein development. Bagging predictors, also called bootstrap aggregation, is a method that generates multiple versions of a predictor through bootstrapping to get an aggregate predictor.26 Bagging works well for unstable procedures and significantly reduces the variance leading to improved prediction. Nevertheless, improvements in prediction are often associated with decreased interpretability. Boosting is similar in concept to other ensemble based methods in the use of many classifiers to produce a committee. However, boosting combines multiple relatively weak and inaccurate classifiers to produce a powerful combined prediction. The predictions from all of the weak classifiers are combined through a weighted majority vote to produce the final prediction.27
Unsupervised Learning
Unsupervised learning is a type of machine learning that makes predictions or inferences from input data without known responses. Unsupervised learning is used to find patterns in data or discover groupings of similar or dissimilar data. Several different learning methods are used to find patterns in data including cluster analysis, self-organizing maps and neural networks.
Cluster Analysis
Cluster analysis aims to group objects into subsets or “clusters” based on similarity or relatedness.12 Clustering is the problem of finding homogeneous groups of data points in a given data set. Cluster analysis separates data into groups (clusters) based on shared characteristics using hierarchical and partitional
algorithms.28 The popular K-means clustering algorithm partitions data into K mutually exclusive clusters and assigns each data point to a cluster based on the distance from the data point to the mean location of the cluster. Like many clustering algorithms, K-means requires the user to specify the number of clusters. Each observation in the dataset is treated as an object that has a location in space. Data based methods are available to estimate the number of clusters using criteria such as gap values, silhouette values and various index value measures. Hierarchical clustering methods create a tree or hierarchy over a variety of scales in which the clusters at each level of the hierarchy are created by merging clusters at the next lower level. In contrast to K-means, hierarchical clustering methods do not require a pre-determined number of clusters but they do require a measure of dissimilarity between groups of observations. The methods described below are also capable of clustering.
Neural Networks
Neural networks are one of the largest and most successful classes of learning methods.29 They can be used for either supervised or unsupervised learning. Historically, neural networks have been used for modelling complex relationships between inputs and outputs. Neural networks are commonly described as multiple layers of inter-connected nodes or ‘neurons’ that receive and store information in the form of summed weights and biases (Fig. 2). The node enters the weighted sum into a transfer or activation function and yields a scalar node output. The activation function determines the behavior of the node. An example of a single input neuron is shown in Fig. 2a. The scalar input p is multiplied by the scalar weight w to form wp. The other input 1 is multiplied by a bias b. The summation output n, often referred to as the net input, goes into a transfer function f, which produces the scalar node output a. Both w and b are adjustable scalar parameters and the transfer function, which may be a linear or nonlinear function of n, is chosen by the user. Most network architectures are comprised of multiple nodes with more than one input, with individual inputs weighted by corresponding elements of a weight matrix W (Fig. 2b). A set of nodes operating in parallel is called a “layer”. A single layer of S nodes is shown in Fig. 2c. Furthermore, networks may contain several layers. Neural networks with two or three layers of connected nodes are ‘shallow’ networks while deep learning networks can have tens or hundreds of layers.
Self-Organizing Maps
Self-organizing maps (SOM) are increasingly popular unsupervised learning techniques that identify relationships between individual data points and provide easy visualization of the results.30 In the most basic form, a SOM produces a similarity graph of input data through the combination of nonlinear projection and clustering methods in an ordered vector quantization graph. SOMs are considered a class of neural network algorithms that map high dimensional data onto a two-dimensional coordinate system. However, the architecture is different in that SOMs are comprised of a single layer grid of nodes instead of a series of layers. They classify input vectors according to how they are grouped in the input space and learn both the distribution and topology of the input vectors they are trained on. The nodes are arranged in positions based on a specified topology (hexagonal, random, etc.) and the distances between the input vector and weight vector of the first nodes are calculated from their positions with a distance function. The node with the smallest distance between the input vector and its associated weight vector is identified, and its weights and those of all nodes in its neighborhood are updated. Self-organizing maps have been used in diverse applications including speech recognition, process control, radar measurements and classification of protein structure.30
Applied Machine Learning In Pharmaceutical Biotechnology
Machine learning has been used in a wide range of drug discovery and development applications including target identification, lead discovery and optimization, biomarker identification, formulation optimization and comparability, process engineering and non-clinical and clinical safety.5,31–34 The following sections
Fig. 2. (a) Single-input neuron with scalar input p, scalar weight w, bias b, summer output n and transfer function f which produces the scalar neuron output a. Both w and b are
adjustable scalar parameters of the neuron. (b) Multiple-input neuron with R inputs and (c) a single layer network of S neurons.
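The neuron arithmetic summarized in the Fig. 2 caption (net input n = wp + b, output a = f(n), and a layer of S neurons that weights R inputs with a matrix W) can be written out as a short sketch. Everything named below is illustrative: the function names, the log-sigmoid transfer function and the numerical values are assumptions chosen for the example and are not taken from the review.

```python
import math

def logsig(n):
    """Log-sigmoid transfer function (one possible nonlinear choice of f)."""
    return 1.0 / (1.0 + math.exp(-n))

def neuron(p, w, b, f):
    """Single-input neuron: net input n = w*p + b, output a = f(n)."""
    return f(w * p + b)

def layer(p, W, b, f):
    """Layer of S neurons with R inputs: a_i = f(sum_j W[i][j] * p[j] + b[i])."""
    return [f(sum(w_ij * p_j for w_ij, p_j in zip(row, p)) + b_i)
            for row, b_i in zip(W, b)]

# Single-input neuron with a linear transfer function (hypothetical values).
a_single = neuron(p=2.0, w=3.0, b=-1.5, f=lambda n: n)

# Layer of S = 2 neurons with R = 2 inputs and a log-sigmoid transfer function.
a_layer = layer(p=[1.0, 2.0], W=[[0.5, -0.25], [1.0, 1.0]], b=[0.0, -3.0], f=logsig)
```

Training a network amounts to adjusting the entries of W and b; the sketch covers only the forward computation of the node outputs.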
highlight studies that apply the use of machine learning in protein design and formulation development. In addition, we present new analysis of previously published data using comprehensive machine learning techniques and identify novel relationships important to pharmaceutical development.
Protein Design
The nexus of protein structure and function prediction is critical to the development of protein based therapeutics. Machine learning methods have been used for decades to predict the physicochemical properties and secondary structure of proteins.35,36 The cumulative number of methods has increased and accuracy improved when the amino acid sequence and/or properties are known. For example, novel identification of protein contacts using co-evolution information significantly improved predictions.37–39 Multiple groups have also used machine learning methods to design force fields for both atomistic and coarse grained models incorporating many body interactions.40–42 The AlphaFold team from DeepMind applied deep neural networks to predict properties of a protein from its genetic sequence, using the distribution of distances between pairs of amino acids and angles between bonds without using previously solved proteins as templates.43 Others are now starting to design proteins de novo when the sequence and structure are unknown.44 They are designing proteins that do not exist in nature. The Baker lab applied an integrated computational and experimental approach to design and test novel proteins with multiple functions and enhanced stability. They have designed protein switches, hyperstable peptides, novel macrocycles, self-assembling helical protein filaments, nanoparticle vaccines, nucleocapsids capable of genome packaging, and proteins that target Influenza and botulinum neurotoxin with ultrastability and high affinity.44 De novo protein design has the potential to generate targeted protein therapeutics with variable binding specificity, thermodynamic stability, solubility and manufacturability. The use of machine learning methods will prove critical to continued improvements in this exciting area of pharmaceutical biotechnology.
Protein Formulation Development
The use of automated high throughput methods in protein formulation development has increased the speed of drug development as well as the amount of available data to help guide decision making. Despite the use of increasingly advanced physicochemical techniques and high throughput methods to characterize protein structure and stability, many formulation scientists continue to rely on antiquated methods for data analysis and inadequate mathematical models to make critical decisions about optimal solution and process conditions. However, more groups are now starting to use machine learning techniques and develop new exploratory algorithms. In the following section, we highlight several applications of machine learning in pharmaceutical protein formulation development. We also apply several advanced machine learning algorithms to previously published data with improvements in prediction.
Viscosity
The nonlinear concentration dependent viscosity of protein solutions has been well studied. As early as 1914, Chick and Lubrzynska elegantly described the viscoelastic behavior of egg-albumin, serum-albumin and euglobulin as a function of concentration, salt content, temperature and pH.45,46 The solution viscosity of therapeutic proteins, notably monoclonal antibodies, has since generated significant interest over the past decade due to difficulty with both manufacturability and drug delivery at high concentrations and viscosities.47,48 Several groups have characterized the viscoelastic behavior of diverse therapeutic monoclonal antibodies as a function of protein concentration, temperature, pH, buffer species, other proteins and excipients.47–52 In general, the solution viscosity of monoclonal antibodies and other proteins is highly dependent on protein concentration and increases nonlinearly with increasing concentration. The influence of pH, ionic strength and excipient type on solution viscosity is variable and protein dependent. Although the viscoelastic properties of proteins have been thoroughly evaluated, very few studies have used machine learning to predict or classify protein viscosity.
Mostly, linear regression methods have been used to predict the viscosity behavior of monoclonal antibody solutions. Several groups combined the use of linear regression and structural descriptors to predict mAb solution viscosity and to better understand the molecular mechanisms contributing to solution behavior.53–55 Li and co-workers measured the viscosity of 11 antibodies of various isotypes using a rheometer.53 A linear model was subsequently used to identify the slope and intercept from a graph of the logarithm of viscosity and protein concentration. The slopes of the concentration dependent viscosity curves showed wide variation and were suggested to predict differences in protein-protein interactions. The best regression model included the isoelectric point (pIFV), aggregation propensity (PaggWaltzFV) and number of residues (NresFV) with an R² of 0.93 and RMSE 1.32 103. Validation included the leave one out method with R² values ranging from 0.88 to 0.96. Correlations between individual predictors and viscosity were not found to be significant other than the isoelectric point, however, the authors concluded that large negative values for net charge (ZFV) and zeta potential (ζFV) might be important determinants of increased viscosity.53
Tomar et al. also used linear regression methods to predict mAb concentration dependent viscosity curves from protein sequence and structure.54 The viscosity of 16 different antibodies was measured using dynamic light scattering at various concentrations from 20 to 200 mg/mL. Several protein descriptors were subsequently calculated from homology models and used to predict the slope. Stepwise linear regression models identified several predictors important for prediction of the slope with an adjusted R² of 0.754 and RMSE 0.0034. The authors concluded that hydrophobic surface area, electrostatics and net charge on the VH and VL domains in the Fv regions play important roles in determining concentration dependent viscosity of the antibodies studied.54
In a separate study, principal component regression was used to predict the viscosity of monoclonal antibody solutions from sequence based molecular descriptors.55 The calculated descriptors included Fv charge symmetry, Fv net charge and hydrophobicity index. The authors propose that a greater negative Fv charge symmetry is expected to lead to stronger attractive interactions through Fv and Fv interactions via dipole moments. Correlations were observed between linear plots of viscosity and the individual descriptors with reported Pearson correlation coefficients (R) of 0.8 (FvCSP and Fv charge) and 0.6 (hydrophobicity index). Principal components regression improved correlations with R values = 0.9 between measured and predicted viscosities, however, using the leave-one-out validation resulted in R values = 0.8 (R² = 0.64). Interestingly, sequence specific descriptors were also used to classify antibody plasma clearance as normal or fast clearance using a simple cutoff value. The authors report that the binary classification model employed was accurate approximately 70% of the time for normal and 77% for fast clearance.
For comparison to the previously described linear models, we constructed several machine learning models using ensemble
Table 2
Machine Learning Model Comparison for Concentration Dependent Solution Viscosity Predictions.
Support Vector Machine (SVM)ᵃ | Gaussian Process Regression (GPR)ᵃ | Ensemble Modelsᵃ | Neural Networkᵃ
Characterization of regression models developed by the authors of this review using data from references.53,55 For all models we used Bayesian optimization for hyper-parameter tuning, principal components analysis (PCA) for feature selection (FS) and 5 fold cross validation.
ᵃ Models developed by the authors TJK and CRM for this review.
methods, neural networks, support vector machines, decision trees, Gaussian process regression, self-organizing maps and PCA using published data from the studies described above (Tables 2 and 3 and Figs. 3–4).53–55 A density matrix comprised of normalized molecular descriptors (input variables) and solution viscosity (output) was used to train regression or classification models to predict data using various machine learning algorithms. Combinations of hyper-parameter values were tuned using Bayesian optimization to minimize the model mean squared error (MSE). Final optimized model hyper-parameters are shown in Tables 2 and 3. Both cross-validation and hold out validation were used. All models performed relatively well. Neural networks outperformed most models and new patterns were identified. Neural networks were able to predict protein viscosity with an R² = 0.99. Principal components analysis and self-organizing maps highlight the multivariable relationships (Figs. 3–4). All models, including linear models, predict solution viscosity relatively well; however, more advanced machine learning methods generally improve visualization and understanding of complex relationships between variables and improve predictive accuracy. Most importantly, with much larger data sets and numbers of antibodies these relationships would be exceedingly difficult to visualize using linear methods but are rather easy to identify using neural networks. Our analysis of the Li et al. data shows that high viscosity is correlated with larger values for aggregation propensity and smaller and/or negative values of net charge, zeta potential and pI. The authors of the original study show that net charge, zeta potential and pI values do not individually correlate with mAb viscosity. Similarly, using data from Sharma et al., self-organizing maps show that high viscosity is associated with a greater negative FvCSP and moderate hydrophobicity, as observed in the original study with modest correlation. We also used multiple methods to model plasma clearance in monkeys from the same study. Several other groups, including the authors of the study from which these data were obtained, were unable to identify correlations between pI or HI and clearance. Using the clearance data provided for 14 antibodies we developed a neural network model using Bayesian regularization that predicts clearance from 10 provided sequence specific predictors. We also compared several classification models to predict fast or normal plasma clearance using a cutoff similar to that of the authors of the original manuscript (Table 3 and Fig. 4). Using the 14 mAbs and predictors (HI LC1, HI LC2, HI LC3, HI HC1, HI HC2, HI HC3, HI sum, Fv charge at pH 5.5 and Fv charge at pH 7.4) we were able to classify clearance with impressive accuracy and AUC using most methods as compared to the original prediction accuracy of ~70%. We also compared several classification models using only the hydrophobic index and Fv charge of 61 mAbs from the original dataset. Four of the models showed impressive classification accuracy >90% and AUC >0.9 (Table 3). The power of clustering techniques or self-organizing maps is that these nonlinear relationships are more easily visualized. Indeed, it is more likely that unique nonlinear combinations of these physicochemical properties are correlated with concentration dependent solution behavior. Furthermore, predictive accuracy is improved using more advanced methods. Nevertheless, it should be noted that these same methods are not as useful when training samples are limited.
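
The gap between an in-sample correlation and a leave-one-out correlation (R = 0.9 versus R = 0.8, i.e. R² = 0.64, in the principal components regression discussed above) can be illustrated with a minimal sketch. The descriptor and viscosity values below are hypothetical and are not taken from the cited studies, and a single-descriptor least-squares line stands in for the published multi-descriptor models.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def leave_one_out_predictions(x, y):
    """Predict each sample from a line fitted to all of the other samples."""
    predictions = []
    for i in range(len(x)):
        slope, intercept = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        predictions.append(slope * x[i] + intercept)
    return predictions


# Hypothetical data: one charge-like descriptor and viscosity in cP.
descriptor = [-4.0, -3.0, -2.5, -1.0, 0.0, 0.5, 1.5, 2.0, 3.0, 4.5]
viscosity = [38.0, 25.0, 30.0, 18.0, 20.0, 9.0, 14.0, 6.0, 11.0, 4.0]

slope, intercept = fit_line(descriptor, viscosity)
fitted = [slope * v + intercept for v in descriptor]
loo = leave_one_out_predictions(descriptor, viscosity)

r_in_sample = pearson_r(viscosity, fitted)  # model has seen every point
r_loo = pearson_r(viscosity, loo)           # each point predicted "blind"

print(f"in-sample R = {r_in_sample:.3f}, R^2 = {r_in_sample ** 2:.3f}")
print(f"leave-one-out R = {r_loo:.3f}, R^2 = {r_loo ** 2:.3f}")
```

With only 10 to 14 samples, each held-out point can shift the fitted line noticeably, which is one reason the validated correlation is the more informative number to report.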
Table 3
Classification Models of Antibody Clearance.
Accuracy14 w/PCA FS Acc = 80% Acc = 86% Acc = 86% Acc = 79% Acc = 86% Acc = 79%
AUC = 0.85 AUC = 0.88 AUC = 0.80 AUC = 0.81 AUC = 0.74 AUC = 0.85
MCE = 0.16 MCE = 0.118 MCE = 0.118 MCE = 0.136 MCE = 0.2
5 neighbors Max # splits 4 Gentle Boost Kernel distribution quadratic kernel f(x)
City block distance Max deviance Max # splits 13 Box kernel type Box constraint 9.65
Inverse distance weight # learners 23
Accuracy61 w/PCA FS Acc = 93% Acc = 87% Acc = 92% Acc = 95% Acc = 82% Acc = 95%
AUC = 0.92 AUC = 0.84 AUC = 0.97 AUC = 0.97 AUC = 0.95 AUC = 1.0
MCE = 0.066 MCE = 0.118 MCE = 0.082 MCE = 0.178 MCE = 0.05
1 neighbor Max # splits 59 Logit Boost Gaussian distribution linear kernel f(x)
cosine distance metric Gini's diversity Max # splits 34 Box kernel type Box constraint 641
Inverse distance weight # learners 137
Characterization of various classification models developed by the authors of this review using data from reference.55 All models used Bayesian optimization for hyper-parameter tuning, principal components analysis (PCA) for feature selection (FS) and 5 fold cross validation. Accuracy14 is the accuracy using the data from 14 monoclonal antibodies while Accuracy61 utilizes the data from 61 monoclonal antibodies but a smaller number of features.
Abbreviations: Acc, accuracy; AUC, area under the curve; MCE, minimum classification error.
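
As a reading aid for the Acc and AUC entries in Table 3, the following minimal sketch computes both from class labels and classifier scores. The labels and scores are hypothetical (1 = fast clearance, 0 = normal clearance) and the 0.5 decision threshold is an assumption. MCE is not computed here because the text does not specify how it was obtained during optimization.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class matches the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """Area under the ROC curve via its rank interpretation.

    The AUC equals the probability that a randomly chosen positive sample
    receives a higher score than a randomly chosen negative one (ties = 0.5).
    """
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present to compute the AUC")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical outputs for 14 antibodies (1 = fast, 0 = normal clearance).
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.91, 0.85, 0.62, 0.40, 0.77, 0.10, 0.22, 0.35,
          0.55, 0.05, 0.18, 0.30, 0.45, 0.12]
predicted = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print(f"Acc = {accuracy(labels, predicted):.2f}")
print(f"AUC = {roc_auc(labels, scores):.2f}")
```

Accuracy depends on the chosen threshold, whereas the AUC summarizes ranking quality over all thresholds, so the two can disagree on which model looks best for a small set such as the 14 mAbs.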
Fig. 3. Concentration dependent solution viscosity predictions. The application of machine learning models to previously published data.53 (a, b) Regression plots of the neural
network outputs with respect to targets for (a) training and (b) training, validation and test sets. (c) Principal component biplot representing how each variable contributes to the
two principal components. The variables are represented by a vector shown in blue and the scores of observations shown as circles in space. (d) Self organizing maps for each
element of the model input. Maps are visualizations of the weights that connect each input to each of the neurons with the similarity of connection inputs shown as the same color.
The inputs include the net charge on Fv portion of mAb, Zeta potential, isoelectric point of Fv (pI), normalized aggregation propensity of an Fv region computed by using TANGO
(Pagg Tango), normalized aggregation propensity of an Fv region computed by using Waltz (PaggWaltz), normalized hydrophobicity (HFv), Kyte-Doolittle hydrophobicity moment
(KDhydromom) and normalized hydrophobic surface area values (HpASA).
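
The quantities drawn in a principal component biplot such as Fig. 3c (variable vectors and observation scores) can be sketched for the smallest possible case of two descriptors. This is an illustrative sketch only: the data are hypothetical, the variables are centered but not scaled, and the closed-form 2x2 eigen-decomposition stands in for the general-purpose routines normally used.

```python
from math import sqrt


def pca_two_variables(x, y):
    """PCA of two variables from the closed-form eigenvectors of their covariance.

    Returns (eigenvalues, loadings, scores): eigenvalues are the variances
    along each principal component, loadings are the unit vectors drawn as
    variable arrows in a biplot, and scores are the observation coordinates.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cx = [v - mean_x for v in x]
    cy = [v - mean_y for v in y]
    a = sum(v * v for v in cx) / (n - 1)               # var(x)
    c = sum(v * v for v in cy) / (n - 1)               # var(y)
    b = sum(u * v for u, v in zip(cx, cy)) / (n - 1)   # cov(x, y)

    mid = (a + c) / 2.0
    half = sqrt(((a - c) / 2.0) ** 2 + b ** 2)
    eigenvalues = (mid + half, mid - half)

    if abs(b) < 1e-12:
        # Uncorrelated variables: the original axes are the components.
        loadings = [(1.0, 0.0), (0.0, 1.0)] if a >= c else [(0.0, 1.0), (1.0, 0.0)]
    else:
        loadings = []
        for lam in eigenvalues:
            vx, vy = b, lam - a            # solves (a - lam) * vx + b * vy = 0
            norm = sqrt(vx * vx + vy * vy)
            loadings.append((vx / norm, vy / norm))

    scores = [tuple(px * lx + py * ly for lx, ly in loadings)
              for px, py in zip(cx, cy)]
    return eigenvalues, loadings, scores


# Hypothetical descriptors that vary together perfectly (y = 2x).
net_charge = [1.0, 2.0, 3.0, 4.0, 5.0]
zeta = [2.0, 4.0, 6.0, 8.0, 10.0]

eigenvalues, loadings, scores = pca_two_variables(net_charge, zeta)
explained = eigenvalues[0] / sum(eigenvalues)
print(f"PC1 explains {explained:.0%} of the variance; loading = {loadings[0]}")
```

In a biplot, variables whose loading vectors point in similar directions are positively correlated in the data, which is how such plots expose relationships among several descriptors at once.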
Chemical Stability
The types of chemical reactions and mechanisms contributing to protein instability have been well documented. Efforts to predict the rates of chemical degradation have primarily used protein sequence and amino acid structure factors. Another important aspect of models developed to evaluate chemical stability is the imbalance of data sets. Most of the data sets are unbalanced toward the negative case (non-lability) and therefore most studies use different performance measures or weighting schemes for further characterization. Pioneering studies by Robinson used the primary amino acid sequence and three dimensional environments of 264 Asn residues for 23 proteins and 60 hemoglobin variants to predict the relative deamidation rates of most protein Asn residues.56,57 The Robinson model used a non-parametric probability measure to test for patterns and, if present, the non-correlation index method as a binary classification system with diagnostic ability.58 The authors calculate that for a diverse group of protein types, this method is at least 94% reliable. Interestingly, several years later Jia et al. also evaluated over 50 different protein structures and 194 Asn residues, including several of the same proteins from Robinson, to predict Asn deamidation using different structural descriptors.59 Five different machine learning algorithms were applied and accuracies compared. The random forest approach outperformed support vector machines, nearest neighbor, Naïve Bayes and neural networks in predicting Asn deamidation. Several groups have also compared different machine learning algorithms to classify Asn and Asp hot spots and gain insight into the structural basis of degradation.60–63 Delmar et al. constructed both a categorical (classification) and regression model for predicting deamidation rates using random forest models of monoclonal antibody peptides.60 Multiple structural parameters were included to inform the categorical and regression models. Accelerated degradation rates of 34 mAbs were used to calculate the deamidation half-life of 608 asparagine residues and to train the regression model in conjunction with 39 additional CDR deamidation sites from separate studies. The side chain orientation, N+1 variable and backbone
Fig. 4. Concentration dependent solution viscosity and plasma clearance predictions. The application of machine learning models to previously published data.55 (a) Regression
plots of the neural network outputs with respect to targets for combined training, validation and test sets. (b) Self organizing maps for each element of the model input. Maps are
visualizations of the weights that connect each input to each of the neurons with the similarity of connection inputs shown as the same color. The inputs include the Fv hy-
drophobicity index (Fv HI), Fv charge, Fv charge symmetry parameter (Fv CSP) and viscosity. The table shows the accuracy of various classification models used for antibody plasma
clearance using the calculated sequence parameters for a training set of 14 mAbs (Accuracy14) and a set of 61 mAbs (Accuracy61).
dihedral angle were the most important predictors of deamidation rate. The classification model was able to achieve 100% accuracy and the regression model was able to predict half-life with an R² of 0.96.60
Protein oxidation is another important covalent modification capable of altering the physicochemical properties of proteins leading to changes in stability and activity. Several groups have used primary sequence and structural descriptors of polypeptides and monoclonal antibodies to build predictive models.64–67 Two studies use regression models to predict the percent change in oxidized species while most of the studies focus on classification tasks. Sankar et al. constructed both a classification and robust regression model using methionine residues in the CDRs of 122 distinct mAbs.67 Monoclonal antibodies were oxidatively stressed using AAPH and the relative percent oxidation for each methionine site was determined using peptide mapping LC/MS. Several structure and dynamic based features of the mAb Fv regions were calculated from both homology models and coarse grained representations. The correlation between the predicted and experimental relative change in oxidized species was R = 0.77 with an RMSE = 11.39. However, when the model was tested using an independent validation set across 12 mAbs the regression model gave a correlation of 0.94 and RMSE of 8.18. An implicit classification model was subsequently developed to predict residues as labile or non-labile using a 25% change in oxidized species as the cutoff. The classification model accuracy is 0.96 and specificity 0.98 with MCC 0.83. The most important descriptors were determined to be the number of contacts the residue makes with its spatial neighbors, mean square fluctuation from the coarse grained elastic network models and total and hydrophobic partition of the solvent accessible surface areas. A separate study by Veredas et al. compared multiple classification models of polypeptide oxidation using different machine learning algorithms.64,68 The initial model included 40 primary and 14 tertiary structure descriptors as independent variables and a binary output of oxidized (>20%) or not oxidized. The random forest model outperformed all other models and the most important parameters were the solvent accessible surface area and distance between methionine and the nearest aromatic residue.

Physical Stability, Biophysical Characterization and Biosimilarity
Pharmaceutical drug development requires comprehensive biophysical characterization for physical stability and comparability assessment. Several groups have used machine learning models to predict drug stability and biosimilarity during stressed conditions including extremes of temperature and pH. Perhaps the most comprehensive application of statistical learning to protein biophysical stability and comparability over the past decade has been the application of singular value decomposition and novel two dimensional visualization by the Middaugh lab.33,69–80 Multiple biophysical techniques have been used to characterize diverse biomolecules and macromolecular complexes under various conditions including pH and temperature, resulting in large complex datasets. The dimensionality of these large datasets is subsequently reduced for easy visual representation. This approach has been used to identify optimal formulation conditions of biomolecules,73 assess the biosimilarity and comparability of protein based therapeutics,69,76 characterize gene delivery systems81 and compare proteins from different manufacturers and expression systems.77
In one study, several different biophysical techniques and statistical learning methods were used to characterize and classify the heterogeneous biopolymer crofelemer.82 Five different fractions of crofelemer were subject to different temperatures for varying amounts of time and subsequently characterized using UV–visible absorption spectroscopy, Fourier Transform Infrared spectroscopy (FTIR), circular dichroism (CD) and high performance liquid chromatography (HPLC). Mutual information and principal components analysis were used to identify chemical signatures that differentiate crofelemer samples. Importantly, several classification algorithms including kNN, SVMs, decision trees and ensemble methods were applied with classification accuracies >95%, suggesting these machine learning techniques can identify fingerprints of complex drug mixtures. A similar approach was used to assess the biosimilarity of four IgG1 glycoforms.76 Twenty-four samples were described with
a total of 2066 features from four analytical methods under various pH and temperature conditions. Several classifiers were tested including k-nearest neighbor, SVM, LDA, Naïve Bayes, decision trees, random forests and AdaBoost with the task to predict 1 out of 8 class values. In this study, several classifiers achieved perfect or near-perfect accuracy without need for dimensionality reduction. The authors conclude that the support vector machine classifier was the best classifier for discriminating between the 8 IgG1 glycoform samples.76
King et al. also developed support vector machine models from experimental temperature and pH stability studies using 53 monoclonal antibodies.83 Linear regression methods identified relationships between thermal and pH unfolding transitions measured using circular dichroism, differential scanning calorimetry, and extrinsic and intrinsic fluorescence experiments. The correlation between acid and thermal stability using elementary linear regression methods was poor and the slopes below 0.04. However, they used primary sequence information combined with stability data to build a support vector machine model to predict stability based on sequence alone. Predictions of pH stability were more accurate (R 0.6) while predictions of thermal stability remained inadequate despite the more sophisticated methods.
More recently, neural networks and Partial Least Squares Regression models were used to predict monoclonal antibody melting temperature, aggregation onset and interaction parameter as a function of pH and salt concentration.84 The interaction parameter and onset of aggregation were calculated from dynamic light scattering measurements and melting temperature from differential scanning fluorimetry from 20 to 95 °C. Only the numbers of amino acid species were used as protein related input parameters (20 input values). The model performed well for Tm and Tagg prediction with R² values of 0.98 and 0.94, respectively. The prediction of Tm values was relatively robust with little difference in performance using different training sets. In contrast, the colloidal stability as measured by Tagg and kD was sensitive to the selected training set. Neural network predictions were compared to PLS regression with the conclusion that neural networks were superior in performance.
Machine learning methods have also been used to predict protein structure and thermodynamic stability for several decades. Early studies used various statistical learning methods including neural network and support vector machines to predict changes in protein thermodynamic stability upon site specific mutations or amino acid neighborhoods.85 Other groups used similar methods for classification of thermophilic, mesophilic, halophilic and acidophilic proteins and to better understand the amino acid composition of enzymes stable at extreme conditions. For example, Fang et al. developed a Random Forest predictive model for discriminating acid stable and non-acid stable proteins using amino acid composition and physicochemical features.86 The AUC and ACC of RF models based on 21 features were 0.84 and 75% while the most important features included composition of Thr, Tyr, Gln, Ile and Lys. Furthermore the volume of amino acid and aromatic and aliphatic residues were important in discriminating acid stable proteins. Several computational methods have also been developed for predicting protein aggregation propensity.87–90 Most methods fit parameters to a sequence dependent functional based on the physicochemical properties of amino acids to estimate aggregation rates. In contrast, others such as Tango use a statistical mechanics approach to calculate the probability of peptide segments populating 4 possible structural states according to a Boltzmann distribution.85 Most of the methods predict aggregation with reasonable accuracy; however, the aforementioned methods use a small number of empirically chosen properties to include in the model. Others have applied machine learning methods to identify the most important features contributing to protein aggregation. For example, Fang et al. used 9 separate classification algorithms to define a baseline for feature selection and then applied two feature selection methods based on support vector machine and Random Forest algorithms to rank 560 features.91 These classification models demonstrate superior predictive performance compared to prior aggregation propensity methods.

Sub-Visible Particle Characterization
The presence of sub-visible particles in protein formulations has received significant attention over the past decade due to quality control and patient safety concerns. Particle characterization and classification of particle type has been challenging due to the heterogeneity of particulate matter. Particle size, shape and morphology are thought to contribute to variable immunogenicity although our understanding of the relationships between particle characteristics and patient risk is incomplete. Several groups have published studies using machine learning methods to classify and predict particle types in protein formulations subject to different stressors.92–96 The methods used are unique and highly successful. Maddux et al. successfully classified protein aggregate samples according to the stress condition using a pairwise dissimilarity approach and multidimensional scaling.92 Monoclonal antibody formulations were subject to freezing and thawing, elevated temperature and combined shaking and pH stresses to facilitate particulate formation. The divergences between individual training sets and test sets calculated using aspect ratio and particle color were able to correctly determine which stress had been used to generate the particles in the test set of images. In a separate study, Daniels et al. used the divergence approach in combination with cluster analysis to distinguish images of particle populations found in peginesatide samples associated with severe adverse reactions in patients from images of particles in samples without severe adverse reactions.94 The use of deep convolutional neural network analysis has also gained significant attention over the past several years for particle analysis. Calderon et al. used neural networks to classify flow microscopy images of protein solutions subjected to different stresses and processing conditions.95 This approach was able to identify whether particles in images had been generated by exposing the protein drug to freeze-thaw stress, agitation stress or recirculation through different filling pumps. The model was trained using 2 × 10^5 total samples and the predictive ability evaluated using 104 test images from each class. Using a small number of images under the same conditions resulted in >95% correct classification for each class of stress induced particle formation.95 Others have studied a similar problem using a modified approach. Gambe-Gilbuena et al. utilized unsupervised and supervised machine learning methods to generate image based classifiers that could identify the stress applied to an antibody formulation.93 Principal components analysis identified subgroups from particle images which were subsequently used to train decision trees, k-nearest neighbor, support vector machines and ensemble classifiers. The images were also used as direct inputs in training CNN classifiers. Classifiers were subsequently able to generate distribution profiles which were used as feature vectors in training sets of classifiers. The specific particle classifiers had accuracies of 95 and 98%.93

Pharmaceutical Excipients
Excipients are included in pharmaceutical formulations for many different reasons. Pharmaceutical excipients are used to stabilize active ingredients, aid in manufacture of the dosage form, target delivery in the body or provide tonicity to minimize pain upon injection. Examples include buffers controlling solution pH, carbohydrates as bulking agents and stabilizers during
Fig. 5. (a) Principal components scores plot. Grey circles represent 50,000 compounds from the ZINC library and the blue circles represent GRAS excipients. (b–e) Self organizing
weight maps for each element of the model input (b) MinssCH2 (c) maxHBint4 (d) maxHother and (e) MDEC-12. Maps are visualizations of the weights that connect each input to
each of the neurons with the similarity of connection inputs shown as the same color. MinssCH2 represents the descriptor minimum atom type E-state: -CH2-. maxHBint4
represents maximum E-state descriptors of strength for potential hydrogen bonds of path length 4. maxHother represents maximum atom type H E-state: H on aaCH, dCH2 or dsCH.
MDEC-12 represents the molecular distance edge between all primary and secondary carbons.
lyophilization and frozen storage, polymers as viscosity agents for topical applications and salts or sugars adjusting solution osmolality.97 The application of machine learning for excipient use in pharmaceutical formulation development has mostly been restricted to the use of linear regression and principal component analysis to classify and predict excipient type and amount. For example, Connolly et al. compared principal component analysis, partial least squares and linear regression models to quantitate excipient (trehalose) crystallization in protein formulations using Raman and near infrared spectroscopy.98 Numerous examples exist that use spectroscopic techniques combined with multivariate analysis to predict or classify formulation components including excipients.98–101

Table 4
Classification Models for Excipients that Reduce Viscosity.
Model | Features | Performance
Ensemble | 870 obs/sec.; AdaBoost; maximum number of splits 37; number of learners 13 | Acc = 98%; AUC = 1.0; MCE = 0.023
SVM | 940 obs/sec.; quadratic kernel function; box constraint level 996.7; kernel scale 1 | Acc = 98%; AUC = 0.97; MCE = 0.023
KNN | 980 obs/sec.; # of neighbors 3; Chebyshev distance metric; inverse distance weight | Acc = 98%; AUC = 1.0; MCE = 0.023
Naïve Bayes | 640 obs/sec.; Kernel distribution; Gaussian kernel type | Acc = 93%; AUC = 0.98; MCE = 0.068
Decision Trees | 1000 obs/sec.; maximum number of splits 18; Gini's diversity index for split criterion | Acc = 86%; AUC = 0.94; MCE = 0.136
Discriminant Analysis | 1000 obs/sec.; diagonal linear discriminant type | Acc = 77%; AUC = 0.91; MCE = 0.227
Characterization of various classification models developed by the authors of this review using data from reference.104 All models used Bayesian optimization for hyper-parameter tuning, principal components analysis (PCA) for feature selection and 5 fold cross validation.
Abbreviations: Acc, accuracy; AUC, area under the curve; MCE, minimum classification error.

An exciting area of research is the application of machine learning to (1) predict the effects of excipients on solution behavior and (2) identify novel excipients. A small number of studies have applied advanced machine learning methods to explore chemical space and identify novel pharmaceutical excipients for protein based therapeutics.102,103 Currently, there are relatively few excipients used in the pharmaceutical industry and these are often limited to the Food and Drug Administration's list of substances generally recognized as safe (GRAS). As proof of concept, we used machine learning methods to explore chemical space and identify novel small molecules with similar physicochemical properties to well-known and previously used excipients. A density matrix of nearly 50,000 small molecules demonstrating broad chemical space and 700 chemical descriptors for each small molecule representing their physicochemical properties was created from known compound libraries. The density matrix of small molecules with descriptors was subsequently used to cluster similar compounds in space using principal components analysis and neural network self-organizing feature maps. Fig. 5 shows the PCA scores plot and self-organizing maps which highlight diverse chemical space and cluster compounds based on similarity of physicochemical properties. Novel small molecules may be identified as potential excipients based on the self-similarity with several well-known GRAS excipients and other commonly used excipients. Both the PCA scores plot and SOMs cluster compounds based on physicochemical properties contributing to the self-similarity with strong discriminatory behavior. GRAS excipients are shown in blue with the excipient arginine highlighted in chemical space. Arginine has unique physical properties and is often used in protein formulations to help reduce solution viscosity and improve colloidal stability.97 Our analysis identified several novel compounds with high self-similarity to arginine which might prove useful for excipient selection and design. Self-organizing maps also cluster compounds
based on individual properties and combined properties. For example, Fig. 5e highlights compounds that cluster based on the molecular distance edge between all primary and secondary carbons. Although we do not experimentally evaluate the effects of novel excipients identified using these clustering methods, our analysis accentuates the power of machine learning for the identification of potential new excipients with properties similar to well-known currently used excipients. A more complete description of this approach to identify novel excipients will be published in a separate manuscript by the authors of this review.
We also used previously published data to classify the solution behavior and cluster compounds based on the physicochemical properties of several excipients used in high concentration protein formulations. Whitaker et al. screened greater than 50 excipients for their viscosity reducing effects on different monoclonal antibodies.104 The measured solution viscosities in the presence of excipients were subsequently grouped into 3 categories: preferred viscosities of 10 cP or lower, acceptable viscosities between 10 and 20 cP and unacceptable viscosities greater than 20 cP. We used the published data to develop classification models that classify excipients into 1 of 3 categories (low, intermediate or high viscosity) (Table 4 and Fig. 6). A density matrix comprised of normalized excipient molecular descriptors (input variables) and class (output) was used to train models that classify excipients using various
Fig. 6. Classification of excipients for viscosity reduction. The application of machine learning models developed by the authors of this review to previously published data.104 (a)
Confusion matrix plot of true class and predicted class for the bagged tree model. The diagonal squares shown in green correspond to observations that are correctly classified. The
off diagonal squares correspond to incorrectly classified observations (red). The numbers of observations are shown in each square. (b) Self organizing maps for each element of the
model input. Maps are visualizations of the weights that connect each input to each of the neurons with the similarity of connection inputs shown as the same color. (c) PCA biplot
representing how each variable contributes to the two principal components. The variables are represented by a vector shown in blue and the scores of observations shown as
circles in space. Table of classification models and corresponding accuracies (ACC) in percent. PCA was used for feature selection. Bayesian analysis was used for hyperparameter
optimization. Prediction speed is the number of observations per second (obs/sec).
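
The three viscosity categories and the confusion matrix of Fig. 6a can be sketched as follows. The cutoffs (10 and 20 cP) come from the text; the treatment of a value of exactly 20 cP as "intermediate", the measured viscosities and the predicted classes are assumptions made only for illustration.

```python
CLASSES = ("low", "intermediate", "high")


def viscosity_class(viscosity_cp):
    """Map a viscosity in cP to one of the three categories used in the text.

    10 cP or lower is preferred ("low"), above 10 and up to 20 cP is
    acceptable ("intermediate", boundary handling assumed), and greater
    than 20 cP is unacceptable ("high").
    """
    if viscosity_cp <= 10.0:
        return "low"
    if viscosity_cp <= 20.0:
        return "intermediate"
    return "high"


def confusion_matrix(true_labels, predicted_labels):
    """Rows are true classes, columns are predicted classes."""
    index = {name: i for i, name in enumerate(CLASSES)}
    matrix = [[0] * len(CLASSES) for _ in CLASSES]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Correct predictions (the diagonal) divided by all observations."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Hypothetical measured viscosities (cP) and hypothetical model predictions.
measured_cp = [4.2, 9.8, 12.5, 18.0, 25.3, 31.0]
true_classes = [viscosity_class(v) for v in measured_cp]
predicted_classes = ["low", "low", "intermediate", "high", "high", "high"]

matrix = confusion_matrix(true_classes, predicted_classes)
for name, row in zip(CLASSES, matrix):
    print(f"{name:>12}: {row}")
print(f"accuracy = {accuracy(matrix):.2f}")
```

Off-diagonal counts show which categories are confused with one another, information that a single accuracy value hides.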
Table 5
Characteristics of Machine Learning Models Used in Pharmaceutical Protein Development.
Machine Learning Algorithms Types of Data Input and Features Perf Ref
and Tasks
60
Asparagine RF Classification and Regression Dataset of n = 776 mAb peptides. Training set includes 64 IgG1s and 3 IgG4s. Validation set contains 10 IgG1s and 2 IgG4s Acc 100%
Aspartic Acid Binary classifier for deamidation lability and regression analysis to predict deamidation rate R2 0.93
Degradation Features include 12 descriptors for primary sequence (2), backbone orientation (3), side chain orientation (2), solvent accessibility
(2) and hydrogen bonding (3)
59
SVM, RF,NB,KNN,ANN,PLS Training set consists of 194 Asn residues and 25 proteins. Test set consists of 81 Asn residues and 3 protein structures Acc 95%
Classification Binary classifier for deamidation lability AUC 0.96
Descriptors include one experimental measure (deamidation half-life) and several structural features including 3D structure Recall 0.80
based properties, backbone orientation, side chain orientation, local secondary structure, B-factors, solvent accessibility and reaction Spec 0.96
coordinate
Feature selection with recursive feature elimination
10 fold cross validation
Data sets unbalanced toward the negative case. Therefore used recall and specificity for further characterization
62
SVM,DI,ANN,DT, RP Dataset of 37 mAbs including 55 Asn hotspots and 940 Asn non hotspots, 40 Asp hotspots and 1425 Asp non hotspots. TPR 0.94
Classification Binary classifier for hotspot
Features include 20 descriptors, secondary structure, transition state accessibility, nucleophilicity, pKa, solvent accessibility
Dataset unbalanced toward non hotspots therefore a standard weighting scheme using inverse of class frequency employed
Monte Carlo cross validation used
63
DT Classification Dataset of 5 IgG1 and 5 IgG4 mAbs
Classification 4 groups with increasing deamidation propensity
Features include backbone flexibility, C-N distance and solvent accessibility
No clear validation method or performance metrics provided
64
Protein Oxidation SVM,RF,NN Classification Dataset of 113 polypeptides containing 975 Met residues of which 122 oxidation prone and 853 oxidation resistant Acc 84%
Binary classifier for oxidation Spec 0.87
54 Features include solvent accessibility, primary structure, met-aromatic residue distance, 3D structure F-meas 0.48
Feature extraction and selection with mRMR and GI
10 fold cross validation
Stratified random sampling used due to unbalanced data
105
SVM, RF, NN, FDA Classification 1646 proteins with 2616 methionine sulfoxides Acc 82%
Binary for oxidized AUC 0.85
54 Features including 40 primary structure, 14 tertiary structure, 2 solvent accessibility, 2 entropy and 1 frequency feature Sens 77
Feature extraction and selection with AI and GI Spec 82
10 fold cross validation
Stratified random sampling used due to unbalanced data
67
RF, MLR Classification and 172 Met residues across 122 mAbs R 0.94
Regression Binary for oxidized and RF regression model trained to predict relative change in % oxidized species Acc 96%
18 descriptors initially evaluated. Final model included 2 structural and 2 dynamic features including the number of overlaps AUC 0.96
between atoms of Met residue with neighbors, solvent accessibility, and mean square fluctuation of Met C alpha atom Spec 0.98
Feature extraction and selection with GI Sens 0.83
Hold out validation
66
NN Classification 166 peptide fragments of which 32 are positive and 134 negative Acc 92%
Binary for oxidized Sens 0.84
Initially 735 features which was decreased to the top 16 features including secondary structure, solvent accessibility, disorder, Spec 0.94
amino acid properties and side chain count of carbon atom deviation from mean
Feature selection with mRMR
Jackknife cross validation
55
Protein Solution Log linear, PCR Regression 14 mAbs at a single high concentration R 0.8
Viscosity Features include Fv net charge, Fv charge symmetry and Fv hydrophobicity
Leave one out cross validation
54
Log linear, stepwise linear viscosity from concentration dependence of 16 mAbs varies
Regression Both log linear and stepwise linear regression used
Features include sequence structural attributes including charge, pI, zeta potential, dipole moment, solvent accessibility and
nonpolar surface
Slope of log linear equation (B) and stepwise linear approach used to identify correlations and predict B
Leave one out cross validation
Log linear Regression Viscosity from concentration dependence of 11 mAbs R2 0.88-0.96 53
Matrix of divergences projected using MDS
93
CNN, SVM, KNN, DT, ensemble 100,000 particles per stress source which includes friability, freeze-thawing, agitation and heating Acc 95-99%
Classification Feature extraction using PCA on 37 texture metrics from particles used to train classifiers
Images also used as direct inputs in training CNN classifiers
n = 600 randomly selected individual particles with a sampling frequency, f = 20
10 fold cross validation
CNN Classification 2 × 10^5 total training samples Acc 100% 95
CNN to identify aggregation inducing stress which includes freeze thaw stress, agitation stress, pump recirculation and protein
and silicone oil mixtures
CNN was three layers and a total of 28,640 trainable parameters
96
RF, logistic regression Protein particle concentrations ~50,000/mL varies
Classification Random Forest used to classify and count silicone oil and non-silicone oil particles in formulations
Features include the aspect ratio, circularity, intensity mean, intensity standard deviation, intensity minimum and intensity
maximum
Random forest filters compared to Logistic S-factor filters
k-fold cross validation
86
Biophysical Stability RF Classification Random Forest used to classify acid stable and neutral proteins Acc 75%
393 proteins and 889 sequence based features AUC 0.91
GI used to rank feature importance. Top 10-25 features used
five-fold cross validation
106
Bagging, LDA, PCA, PLS Raman spectroscopy used to classify four mAbs and predict concentration varies
Classification and Regression Crowdsourced machine learning analyses
Collected ~350 spectra for each of four mAbs
10 fold cross-validation
76
kNN, SVM, LDA, QDA, NB, DT, Bio-similarity assessment of 4 mAbs in 2 different formulations resulting in 8 samples in triplicate for total of 24 samples Acc 100%
RF, AdaBoost, MI 2066 features from 4 analytical methods
Classification MI for feature ranking and PCA for visualization
Parameter optimization with grid search
Leave one and leave 8 out cross-validation
ANN, PLS Regression and 6 mAbs and 24 conditions per protein resulting in 144 samples R2 0.98 84
Classification Only number of each amino acid species of the protein used as input parameter 0.94
light scattering and fluorescence used to measure melting temperature (Tm), aggregation (Tagg) and kD
ANN and PLS for prediction of Tm, Tagg
82
kNN, SVM, DT, AdaBoost, RF, 35 samples Acc 99%
NB, LDA, PCA, MI Multiple biophysical techniques including UV-Vis absorption spectroscopy, FTIR, CD and HPLC used to characterize Crofelemer
Classification Approximately 7000 features per sample
PCA for feature scaling and MI to identify important features
Monte Carlo cross-validation
RF, random forest; SVM, support vector machine; NB, Naïve Bayes; KNN, K nearest neighbor; LDA, linear discriminant analysis; QDA, quadratic discriminant analysis; NN, nearest neighbor; ANN, artificial neural network; CNN,
convolutional neural network; PLS, partial least squares regression; FDA, flexible discriminant analysis; PCR, principal component regression; KLD, Kullback-Leibler divergence; MDS, multidimensional
scaling; MI, mutual information; AI, accuracy importance; GI, Gini index importance; mRMR, maximum relevance minimum redundancy; Perf, performance.
TJK and CRM refer to the authors of this review.
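Several entries in Table 5 note that the datasets were unbalanced and were either weighted by the inverse of class frequency or characterized with recall and specificity rather than accuracy alone. The following minimal Python sketch shows both ideas. The class counts (55 Asn hotspots, 940 Asn non-hotspots) are those listed in Table 5 for ref 62; the weight normalization (total divided by the number of classes times the class count) and the confusion counts are assumptions made for illustration and are not taken from the cited studies.

```python
def inverse_frequency_weights(class_counts):
    """Weight each class by the inverse of its frequency.

    Normalization (total / (n_classes * count)) is an assumed convention;
    the cited study only states that inverse class frequency was used.
    """
    total = sum(class_counts.values())
    n_classes = len(class_counts)
    return {c: total / (n_classes * n) for c, n in class_counts.items()}


def recall_and_specificity(tp, fn, tn, fp):
    """Recall = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)


# Asn hotspot and non-hotspot counts as listed in Table 5 (ref 62).
counts = {"hotspot": 55, "non_hotspot": 940}
weights = inverse_frequency_weights(counts)

# Hypothetical confusion counts for a binary classifier on the same totals.
tp, fn, tn, fp = 44, 11, 902, 38
recall, specificity = recall_and_specificity(tp, fn, tn, fp)
overall_accuracy = (tp + tn) / (tp + fn + tn + fp)

print(weights)
print(f"recall={recall:.2f} specificity={specificity:.2f} "
      f"accuracy={overall_accuracy:.2f}")
```

With these hypothetical counts, accuracy is dominated by the majority class, while recall shows how many minority-class cases are actually recovered.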
machine learning algorithms. Combinations of hyper-parameter values were tuned using Bayesian optimization to minimize the model mean squared error (MSE), and cross validation was used to examine the predicted accuracy of the fitted models. The models' final optimized hyper-parameters are shown in Table 4. Ensemble, SVM and KNN models all classify excipient class with 98% accuracy and AUC values near 1.0. The ensemble algorithm used AdaBoost to aggregate individual predictors to form a final prediction. Bootstrap replicates of the learning set are used as new learning sets and have been shown to provide substantial gains in accuracy.26 Fig. 6a shows the confusion matrix for the ensemble model. The diagonal line of the confusion matrix highlights where the true class and predicted class match. Only one example was misclassified (red square). The Linear Discriminant Analysis classification model was the least accurate, which is not surprising because the relationships are complex and nonlinear. Nevertheless, the AUC of the discriminant analysis model was 0.91, suggesting reasonable performance.
Fig. 6b highlights the weight planes for each input vector (descriptor) from the SOM model. Lighter and darker colors represent larger and smaller weights, respectively. Excipients are clustered together on the map based on physicochemical properties. For example, SP-7, VP-7, MDEO12 and nHBint10 are significant properties contributing to the clustering of excipients in the top right corner of the maps while maxHBint7 and 4 contribute to clustering of compounds in the lower right quadrant. Excipients that cluster in the top right quadrant all display high solution viscosity. Similarly, compounds with similar properties are clustered in four quadrants as seen in the PCA plot. The majority of nonionic surfactants contribute to high viscosity behavior while sugars and disaccharides display intermediate solution viscosities. Using machine learning methods, we are able to build models that accurately classify excipients and identify important physicochemical properties contributing to class behavior.

Conclusions

In this review, we highlight the diverse application of machine learning methods for therapeutic protein design and development. Machine learning algorithms are capable of identifying complex patterns in large datasets and making accurate predictions and classifications. Several methods are now being used more commonly in protein development with improved models, although many studies continue to use antiquated models. We show that the application of several machine learning models to previously published data improves predictions and identifies novel associations. Machine learning methods are able to predict protein solution viscosity and physicochemical stability, classify sub-visible particles and identify novel excipients with high accuracy. We expect machine learning models to be applied to an increasing number of problems and facilitate decisions during development that mitigate risk, decrease development time and reduce costs. We provide a final Table 5 that highlights and characterizes several of the machine learning models used in pharmaceutical protein development and described throughout the review. The authors hope that this review can serve as a resource for others and encourage continued application of machine learning algorithms to problems in pharmaceutical protein development.

Acknowledgements

This research was supported (in whole or in part) by HCA Healthcare and/or an HCA Healthcare affiliated entity. The views expressed in this publication represent those of the author(s) and do not necessarily represent the official views of HCA Healthcare or any of its affiliated entities.

References

1. Maclean D, Kamoun S. Big data in small places. Nat Biotechnol. 2012;30:33-34.
2. Oliveira AL. Biotechnology, big data and artificial intelligence. Biotechnol J. 2019;14:e1800613.
3. Collins FS, Morgan M, Patrinos A. The Human Genome Project: lessons from large-scale biology. Science. 2003;300:286-290.
4. Hey T, Tansley S, Tolle K. The Fourth Paradigm: Data-Intensive Scientific Discovery. 2009.
5. Vamathevan J, Clark D, Czodrowski P, Dunham I, Ferran E, et al. Applications of machine learning in drug discovery and development. Nat Rev Drug Discov. 2019;18:463-477.
6. Burbidge R, Trotter M, Buxton B, Holden S. Drug design by machine learning: support vector machines for pharmaceutical data analysis. Comput Chem. 2001;26:5-14.
7. Gawehn E, Hiss JA, Schneider G. Deep learning in drug discovery. Mol Inform. 2016;35:3-14.
8. Khalid S, Nasreen S. A Survey of Feature Selection and Feature Extraction Techniques in Machine Learning. 2014. London.
9. Isabelle Guyon AE. In: Isabelle Guyon MN, Gunn S, Zadeh L, eds. Feature Extraction: Foundations and Applications. Berlin, Heidelberg: Springer; 2006.
10. Van Der Maaten L, Postma E, Van den Herik J. Dimensionality reduction: a comparative. J Mach Learn Res. 2009;10:13.
11. Jolliffe IT. Principal component analysis and factor Analysis. In: Jolliffe IT, ed. Principal Component Analysis. New York, NY: Springer New York; 1986:115-128.
12. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. New York, NY: Springer; 2009.
13. Wilmott P. Machine Learning: An Applied Mathematics Introduction. Panda Ohana Publishing; 2019.
14. Bylander T. Estimating generalization error on two-class datasets using out-of-bag estimates. Mach Learn. 2002;48:287-297.
15. Molinaro AM, Simon R, Pfeiffer RM. Prediction error estimation: a comparison of resampling methods. Bioinformatics. 2005;21:3301-3307.
16. Yang Y, Ye Z, Su Y, Zhao Q, Li X, et al. Deep learning for in vitro prediction of pharmaceutical formulations. Acta Pharm Sin B. 2019;9:177-185.
17. Degardin K, Guillemain A, Guerreiro NV, Roggo Y. Near infrared spectroscopy for counterfeit detection using a large database of pharmaceutical tablets. J Pharm Biomed Anal. 2016;128:89-97.
18. Welsh WJ, Lin W, Tersigni SH, Collantes E, Duta R, et al. Pharmaceutical fingerprinting: evaluation of neural networks and chemometric techniques for distinguishing among same-product manufacturers. Anal Chem. 1996;68:3473-3482.
19. Stanton DT, Morris TW, Roychoudhury S, Parker CN. Application of nearest-neighbor and cluster analyses in pharmaceutical lead discovery. J Chem Inf Comput Sci. 1999;39:21-27.
20. Huang S, Cai N, Pacheco PP, Narrandes S, Wang Y, et al. Applications of support vector machine (SVM) learning in cancer genomics. Cancer Genomics Proteomics. 2018;15:41-51.
21. Noble WS. What is a support vector machine? Nat Biotechnol. 2006;24:1565-1567.
22. Byvatov E, Fechner U, Sadowski J, Schneider G. Comparison of support vector machine and artificial neural network systems for drug/nondrug classification. J Chem Inf Comput Sci. 2003;43:1882-1889.
23. Zhao C, Jain A, Hailemariam L, Suresh P, Akkisetty P, et al. Toward intelligent decision support for pharmaceutical product development. J Pharm Innov. 2006;1:23-35.
24. Quinlan JR. Induction of decision trees. Mach Learn. 1986;1:81-106.
25. Breiman L, Friedman JH, Olshen RA, Stone CJ. Classification and Regression Trees. Monterey, CA: Wadsworth and Brooks; 1984.
26. Breiman L. Bagging predictors. Mach Learn. 1996;24:123-140.
27. Freund Y, Schapire RE. A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci. 1997;55:119-139.
28. Jain AK, Murty MN, Flynn PJ. Data clustering: a review. ACM Comput Surv. 1999;31:264-323.
29. Demuth HB, Beale MH, Jess OD, Hagan MT. Neural Network Design. Martin Hagan; 2014.
30. Kohonen T, Schroeder MR, Huang TS. Self-Organizing Maps. Springer-Verlag; 2001.
31. Riniker S, Wang Y, Jenkins JL, Landrum GA. Using information from historical high-throughput screens to predict active compounds. J Chem Inf Model. 2014;54:1880-1891.
32. Jeon J, Nim S, Teyra J, Datti A, Wrana JL, et al. A systematic approach to identify novel cancer drug targets using machine learning, inhibitor design and high-throughput screening. Genome Med. 2014;6:57.
33. Kueltzo LA, Ersoy B, Ralston JP, Middaugh CR. Derivative absorbance spectroscopy and protein phase diagrams as tools for comprehensive protein characterization: a bGCSF case study. J Pharm Sci. 2003;92:1805-1820.
34. Mamoshina P, Volosnikova M, Ozerov IV, Putin E, Skibina E, et al. Machine learning on human muscle transcriptomic data for biomarker discovery and tissue-specific drug target identification. Front Genet. 2018;9:242.
35. King RD, Sternberg MJ. Machine learning approach for the prediction of protein secondary structure. J Mol Biol. 1990;216:441-457.
36. Senior AW, Evans R, Jumper J, Kirkpatrick J, Sifre L, et al. Protein structure prediction using multiple deep neural networks in the 13th Critical Assessment of Protein Structure Prediction (CASP13). Proteins. 2019;87:1141-1148.
37. Ma J, Wang S, Wang Z, Xu J. Protein contact prediction by integrating joint evolutionary coupling analysis and supervised learning. Bioinformatics. 2015;31:3506-3513.
38. Marks DS, Colwell LJ, Sheridan R, Hopf TA, Pagnani A, et al. Protein 3D structure computed from evolutionary sequence variation. PLoS One. 2011;6:e28766.
39. AlQuraishi M. End-to-End differentiable learning of protein structure. Cell Syst. 2019;8:292-301.e293.
40. Behler J, Parrinello M. Generalized neural-network representation of high-dimensional potential-energy surfaces. Phys Rev Lett. 2007;98:146401.
41. Rupp M, Tkatchenko A, Muller KR, von Lilienfeld OA. Fast and accurate modeling of molecular atomization energies with machine learning. Phys Rev Lett. 2012;108:058301.
42. Snyder JC, Rupp M, Hansen K, Muller KR, Burke K. Finding density functionals with machine learning. Phys Rev Lett. 2012;108:253002.
43. Senior AW, Evans R, Jumper J, Kirkpatrick J, Sifre L, et al. Improved protein structure prediction using potentials from deep learning. Nature. 2020;577:706-710.
44. Huang PS, Boyken SE, Baker D. The coming of age of de novo protein design. Nature. 2016;537:320-327.
45. Chick H. The viscosity of protein solutions. II. Pseudoglobulin and euglobulin (horse). Biochem J. 1914;8:261-280.
46. Chick H, Lubrzynska E. The viscosity of some protein solutions. Biochem J. 1914;8:59-69.
47. Liu J, Nguyen MD, Andya JD, Shire SJ. Reversible self-association increases the viscosity of a concentrated monoclonal antibody in aqueous solution. J Pharm Sci. 2005;94:1928-1940.
48. Shire SJ, Shahrokh Z, Liu J. Challenges in the development of high protein concentration formulations. J Pharm Sci. 2004;93:1390-1402.
49. Cheng W, Joshi SB, Jain NK, He F, Kerwin BA, et al. Linking the solution viscosity of an IgG2 monoclonal antibody to its structure as a function of pH and temperature. J Pharm Sci. 2013;102:4291-4304.
50. Galush WJ, Le LN, Moore JM. Viscosity behavior of high-concentration protein mixtures. J Pharm Sci. 2012;101:1012-1020.
51. He F, Woods CE, Trilisky E, Bower KM, Litowski JR, et al. Screening of monoclonal antibody formulations based on high-throughput thermostability and viscosity measurements: design of experiment and statistical analysis. J Pharm Sci. 2011;100:1330-1340.
52. Wang S, Zhang N, Hu T, Dai W, Feng X, et al. Viscosity-lowering effect of amino acids and salts on highly concentrated solutions of two IgG1 monoclonal antibodies. Mol Pharm. 2015;12:4478-4487.
53. Li L, Kumar S, Buck PM, Burns C, Lavoie J, et al. Concentration dependent viscosity of monoclonal antibody solutions: explaining experimental behavior in terms of molecular properties. Pharm Res (N Y). 2014;31:3161-3178.
54. Tomar DS, Li L, Broulidakis MP, Luksha NG, Burns CT, et al. In-silico prediction of concentration-dependent viscosity curves for monoclonal antibody solutions. mAbs. 2017;9:476-489.
55. Sharma VK, Patapoff TW, Kabakoff B, Pai S, Hilario E, et al. In silico selection of therapeutic antibodies for development: viscosity, clearance, and chemical stability. Proc Natl Acad Sci U S A. 2014;111:18601-18606.
56. Robinson NE, Robinson AB. Deamidation of human proteins. Proc Natl Acad Sci U S A. 2001;98:12409-12413.
57. Robinson NE, Robinson AB. Prediction of protein deamidation rates from primary and three-dimensional structure. Proc Natl Acad Sci U S A. 2001;98:4367-4372.
58. Robinson AB, Westall FC, Ellison GW. Multiple sclerosis: urinary amine measurement for orthomolecular diagnosis. Life Sci. 1974;14:1747-1753.
59. Jia L, Sun Y. Protein asparagine deamidation prediction based on structures with machine learning methods. PLoS One. 2017;12:e0181347.
60. Delmar JA, Wang J, Choi SW, Martins JA, Mikhail JP. Machine learning enables accurate prediction of asparagine deamidation probability and rate. Mol Ther Methods Clin Dev. 2019;15:264-274.
61. Lorenzo JR, Alonso LG, Sanchez IE. Prediction of spontaneous protein deamidation from sequence-derived secondary structure and intrinsic disorder. PLoS One. 2015;10:e0145186.
62. Sydow JF, Lipsmeier F, Larraillet V, Hilger M, Mautz B, et al. Structure-based prediction of asparagine and aspartate degradation sites in antibody variable regions. PLoS One. 2014;9:e100736.
63. Yan Q, Huang M, Lewis MJ, Hu P. Structure based prediction of asparagine deamidation propensity in monoclonal antibodies. mAbs. 2018;10:901-912.
64. Aledo JC, Canton FR, Veredas FJ. A machine learning approach for predicting methionine oxidation sites. BMC Bioinf. 2017;18:430.
65. Chennamsetty N, Quan Y, Nashine V, Sadineni V, Lyngberg O, et al. Modeling the oxidation of methionine residues by peroxides in proteins. J Pharm Sci. 2015;104:1246-1255.
66. Niu S, Hu LL, Zheng LL, Huang T, Feng KY, et al. Predicting protein oxidation sites with feature selection and analysis approach. J Biomol Struct Dyn. 2012;29:650-658.
67. Sankar K, Hoi KH, Yin Y, Ramachandran P, Andersen N, et al. Prediction of methionine oxidation risk in monoclonal antibodies using a machine learning method. mAbs. 2018;10:1281-1290.
68. Veredas FJ, Canton FR, Aledo JC. Methionine residues around phosphorylation sites are preferentially oxidized in vivo under stress conditions. Sci Rep. 2017;7:40403.
69. Alsenaidy MA, Jain NK, Kim JH, Middaugh CR, Volkin DB. Protein comparability assessments and potential applicability of high throughput biophysical methods and data visualization tools to compare physical stability profiles. Front Pharmacol. 2014;5:39.
70. Chaudhuri R, Cheng Y, Middaugh CR, Volkin DB. High-throughput biophysical analysis of protein therapeutics to examine interrelationships between aggregate formation and conformational stability. AAPS J. 2014;16:48-64.
71. Fan H, Li H, Zhang M, Middaugh CR. Effects of solutes on empirical phase diagrams of human fibroblast growth factor 1. J Pharm Sci. 2007;96:1490-1503.
72. Fan H, Ralston J, Dibiase M, Faulkner E, Middaugh CR. Solution behavior of IFN-beta-1a: an empirical phase diagram based approach. J Pharm Sci. 2005;94:1893-1911.
73. Maddux NR, Joshi SB, Volkin DB, Ralston JP, Middaugh CR. Multidimensional methods for the formulation of biopharmaceuticals and vaccines. J Pharm Sci. 2011;100:4171-4197.
74. Ramsey JD, Gill ML, Kamerzell TJ, Price ES, Joshi SB, et al. Using empirical phase diagrams to understand the role of intramolecular dynamics in immunoglobulin G stability. J Pharm Sci. 2009;98:2432-2447.
75. Kissmann J, Ausar SF, Rudolph A, Braun C, Cape SP, et al. Stabilization of measles virus for vaccine formulation. Hum Vaccin. 2008;4:350-359.
76. Kim JH, Joshi SB, Tolbert TJ, Middaugh CR, Volkin DB, et al. Biosimilarity assessments of model IgG1-Fc glycoforms using a machine learning approach. J Pharm Sci. 2016;105:602-612.
77. Hickey JM, Toprani VM, Kaur K, Mishra RPN, Goel A, et al. Analytical comparability assessments of 5 recombinant CRM197 proteins from different manufacturers and expression systems. J Pharm Sci. 2018;107:1806-1819.
78. Kim JH, Joshi SB, Esfandiary R, Iyer V, Bishop SM, et al. Improved comparative signature diagrams to evaluate similarity of storage stability profiles of different IgG1 mAbs. J Pharm Sci. 2016;105:1028-1035.
79. More AS, Toprani VM, Okbazghi SZ, Kim JH, Joshi SB, et al. Correlating the impact of well-defined oligosaccharide structures on physical stability profiles of IgG1-Fc glycoforms. J Pharm Sci. 2016;105:588-601.
80. Toprani VM, Cheng Y, Wahome N, Khasa H, Kueltzo LA, et al. Structural characterization and formulation development of a trivalent equine encephalitis virus-like particle vaccine candidate. J Pharm Sci. 2018;107:2544-2558.
81. Ruponen M, Braun CS, Middaugh CR. Biophysical characterization of polymeric and liposomal gene delivery systems using empirical phase diagrams. J Pharm Sci. 2006;95:2101-2114.
82. Nariya MK, Kim JH, Xiong J, Kleindl PA, Hewarathna A, et al. Comparative characterization of crofelemer samples using data mining and machine learning approaches with analytical stability data sets. J Pharm Sci. 2017;106:3270-3279.
83. King AC, Woods M, Liu W, Lu Z, Gill D, et al. High-throughput measurement, correlation analysis, and machine-learning predictions for pH and thermal stabilities of Pfizer-generated antibodies. Protein Sci. 2011;20:1546-1557.
84. Gentiluomo L, Roessner D, Augustijn D, Svilenov H, Kulakova A, et al. Application of interpretable artificial neural networks to early monoclonal antibodies development. Eur J Pharm Biopharm. 2019;141:81-89.
85. Fernandez-Escamilla AM, Rousseau F, Schymkowitz J, Serrano L. Prediction of sequence-dependent and mutational effects on the aggregation of peptides and proteins. Nat Biotechnol. 2004;22:1302-1306.
86. Fang Y, Middaugh CR, Fang J. In silico classification of proteins from acidic and neutral cytoplasms. PLoS One. 2012;7:e45585.
87. Niu M, Li Y, Wang C, Han K. RFAmyloid: a web server for predicting amyloid proteins. Int J Mol Sci. 2018;19:2071.
88. Tartaglia GG, Cavalli A, Pellarin R, Caflisch A. Prediction of aggregation rate and aggregation-prone segments in polypeptide sequences. Protein Sci. 2005;14:2723-2734.
89. Tartaglia GG, Vendruscolo M. The Zyggregator method for predicting protein aggregation propensities. Chem Soc Rev. 2008;37:1395-1401.
90. Trovato A, Seno F, Tosatto SC. The PASTA server for protein aggregation prediction. Protein Eng Des Sel. 2007;20:521-523.
91. Fang Y, Gao S, Tai D, Middaugh CR, Fang J. Identification of properties important to protein aggregation using feature selection. BMC Bioinf. 2013;14:314.
92. Maddux NR, Daniels AL, Randolph TW. Microflow imaging analyses reflect mechanisms of aggregate formation: comparing protein particle data sets using the Kullback-Leibler divergence. J Pharm Sci. 2017;106:1239-1248.
93. Gambe-Gilbuena A, Shibano Y, Krayukhina E, Torisu T, Uchiyama S. Automatic identification of the stress sources of protein aggregates using flow imaging microscopy images. J Pharm Sci. 2020;109:614-623.
94. Daniels AL, Randolph TW. Flow microscopy imaging is sensitive to characteristics of subvisible particles in peginesatide formulations associated with severe adverse reactions. J Pharm Sci. 2018;107:1313-1321.
95. Calderon CP, Daniels AL, Randolph TW. Deep convolutional neural network analysis of flow imaging microscopy data to classify subvisible particles in protein formulations. J Pharm Sci. 2018;107:999-1008.
96. Saggu M, Patel AR, Koulis T. A random forest approach for counting silicone oil droplets and protein particles in antibody formulations using flow microscopy. Pharm Res (N Y). 2017;34:479-491.
97. Kamerzell TJ, Esfandiary R, Joshi SB, Middaugh CR, Volkin DB. Protein-excipient interactions: mechanisms and biophysical characterization applied to protein formulation development. Adv Drug Deliv Rev. 2011;63:1118-1159.
98. Connolly B, Patapoff TW, Wang YJ, Moore JM, Kamerzell TJ. Vibrational spectroscopy and chemometrics to characterize and quantitate trehalose crystallization. Anal Biochem. 2010;399:48-57.
99. Dave VS, Saoji SD, Raut NA, Haware RV. Excipient variability and its impact on dosage form functionality. J Pharm Sci. 2015;104:906-915.
100. Li W, Worosila GD. Quantitation of active pharmaceutical ingredients and excipients in powder blends using designed multivariate calibration models by near-infrared spectroscopy. Int J Pharm. 2005;295:213-219.
101. Griffen JA, Owen AW, Burley J, Taresco V, Matousek P. Rapid quantification of low level polymorph content in a solid dose form using transmission Raman spectroscopy. J Pharm Biomed Anal. 2016;128:35-45.
102. Tosstorff A, Menzen T, Winter G. Exploring chemical space for new substances to stabilize a therapeutic monoclonal antibody. J Pharm Sci. 2020;109:301-307.
103. Cloutier TK, Sudrik C, Mody N, Sathish HA, Trout BL. Machine learning models of antibody-excipient preferential interactions for use in computational formulation design. Mol Pharm. 2020;17:3589-3599.
104. Whitaker N, Xiong J, Pace SE, et al. A formulation development approach to identify and select stable ultra-high-concentration monoclonal antibody formulations with reduced viscosities. J Pharm Sci. 2017;106:3230-3241.
105. Veredas FJ, Canton FR, Aledo JC. Prediction of Protein Oxidation Sites. IWANN. 2017;10306. https://doi.org/10.1007/978-3-319-59147-6_1.
106. Laetitia Minh ML, et al. Optimization of classification and regression analysis of four monoclonal antibodies from Raman spectra using collaborative machine learning approach. Talanta. 2018;184:260-265.