The DeepChem Book
Democratizing Deep-Learning for Drug Discovery, Quantum Chemistry, Materials Science and Biology

Bharath Ramsundar and the DeepChem Team

The DeepChem Book is a step-by-step tutorial series for the deep life sciences. The authors, Bharath Ramsundar and the DeepChem team, cover the essential tools and techniques for mastering deep learning in the life sciences. Tailored for beginners in both machine learning and the life sciences, the book builds a repertoire of tools required to perform meaningful work in the dynamic field of life sciences. Going beyond machine learning, the tutorials cover critical aspects of data handling necessary for constructing systems within the deep life sciences. Executed on Google Colab, these tutorials prioritize accessibility and convenience, providing an open avenue for exploration.

"The DeepChem project aims to make high quality open source software for scientific machine learning more accessible to scientists and developers worldwide. We have a particular focus on molecular machine learning and drug discovery, but also support a broad range of applications in bioinformatics, materials science, and computational physics. I started DeepChem while doing my Ph.D. at Stanford, but today DeepChem operates as a global distributed community of researchers spread across many academic and industrial institutions. We hope that you will join our community and help us build!"

- Bharath Ramsundar

www.deepchem.io
www.deepforestsci.com
3. Modeling Proteins
1. Protein Deep Learning
5. Quantum Chemistry
1. Exploring Quantum Chemistry with GDB1k
2. DeepQMC tutorial
3. Training an Exchange Correlation Functional using DeepChem
6. Bioinformatics
1. Introduction to Bioinformatics
2. Multisequence Alignments
3. Deep probabilistic analysis of single-cell omics data
7. Material Sciences
1. Introduction To Material Science
10. Equivariance
1. Introduction to Equivariance
2. Modeling Protein Ligand Interactions With Atomic Convolutions
3. DeepChemXAlphafold
11. Olfaction
1. Predict Multi Label Odor Descriptors using OpenPOM
The Basic Tools of the Deep Life Sciences
Welcome to DeepChem's introductory tutorial for the deep life sciences. This series of notebooks is a step-by-step guide
for you to get to know the new tools and techniques needed to do deep learning for the life sciences. We'll start from
the basics, assuming that you're new to machine learning and the life sciences, and build up a repertoire of tools and
techniques that you can use to do meaningful work in the life sciences.
Scope: This tutorial will encompass both the machine learning and data handling needed to build systems for the deep
life sciences.
Colab
This tutorial and the rest in this sequence are designed to be done in Google colab. If you'd like to open this notebook in
colab, you can use the following link.
Open in Colab
2) Humanitarian Considerations: Disease is the oldest cause of human suffering. From the dawn of human
civilization, humans have suffered from pathogens, cancers, and neurological conditions. One of the greatest
achievements of the last few centuries has been the development of effective treatments for many diseases. By
mastering the skills in this tutorial, you will be able to stand on the shoulders of the giants of the past to help develop
new medicine.
3) Lowering the Cost of Medicine: The art of developing new medicine is currently an elite skill that can only be
practiced by a small core of expert practitioners. By enabling the growth of open source tools for drug discovery, you
can help democratize these skills and open up drug discovery to more competition. Increased competition can help
drive down the cost of medicine.
Prerequisites
This tutorial sequence will assume some basic familiarity with the Python data science ecosystem. We will assume that
you have familiarity with libraries such as Numpy, Pandas, and TensorFlow. We'll provide some brief refreshers on
basics through the tutorial so don't worry if you're not an expert.
Setup
The first step is to get DeepChem up and running. We recommend using Google Colab to work through this tutorial
series. You'll also need to run the following commands to get DeepChem installed on your Colab notebook. We are going to use a TensorFlow-based model, so we've added [tensorflow] to the pip install command to ensure the necessary dependencies are also installed.
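The installation cell itself isn't reproduced here; a minimal sketch of it (the exact package spec may vary between DeepChem releases) is:

!pip install --pre deepchem[tensorflow]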
You can of course run this tutorial locally if you prefer. In this case, don't run the above installation cell, since it will reinstall DeepChem and its dependencies on your local machine. In either case, we can now import the deepchem package to play with.
import deepchem as dc
dc.__version__
'2.5.0.dev'
1. Select the data set you will train your model on (or create a new data set if there isn't an existing suitable one).
2. Create the model.
3. Train the model on the data.
4. Evaluate the model on an independent test set to see how well it works.
5. Use the model to make predictions about new data.
With DeepChem, each of these steps can be as little as one or two lines of Python code. In this tutorial we will walk
through a basic example showing the complete workflow to solve a real world scientific problem.
The problem we will solve is predicting the solubility of small molecules given their chemical formulas. This is a very
important property in drug development: if a proposed drug isn't soluble enough, you probably won't be able to get
enough into the patient's bloodstream to have a therapeutic effect. The first thing we need is a data set of measured
solubilities for real molecules. One of the core components of DeepChem is MoleculeNet, a diverse collection of chemical
and molecular data sets. For this tutorial, we can use the Delaney solubility data set. The property of solubility in this
data set is reported in log(solubility) where solubility is measured in moles/liter.
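The loading cell is not reproduced above; a sketch of it, assuming the GraphConv featurizer that the model in the next step expects:

tasks, datasets, transformers = dc.molnet.load_delaney(featurizer='GraphConv')
train_dataset, valid_dataset, test_dataset = datasets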
I won't say too much about this code right now. We will see many similar examples in later tutorials. There are two
details I do want to draw your attention to. First, notice the featurizer argument passed to the load_delaney()
function. Molecules can be represented in many ways. We therefore tell it which representation we want to use, or in
more technical language, how to "featurize" the data. Second, notice that we actually get three different data sets: a
training set, a validation set, and a test set. Each of these serves a different function in the standard deep learning
workflow.
Now that we have our data, the next step is to create a model. We will use a particular kind of model called a "graph
convolutional network", or "graphconv" for short.
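A minimal sketch of creating such a model (the dropout value here is an illustrative choice, not a tuned setting):

model = dc.models.GraphConvModel(n_tasks=1, mode='regression', dropout=0.2)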
Here again I will not say much about the code. Later tutorials will give lots more information about GraphConvModel , as
well as other types of models provided by DeepChem.
We now need to train the model on the data set. We simply give it the data set and tell it how many epochs of training
to perform (that is, how many complete passes through the data to make).
model.fit(train_dataset, nb_epoch=100)
If everything has gone well, we should now have a fully trained model! But do we? To find out, we must evaluate the
model on the test set. We do that by selecting an evaluation metric and calling evaluate() on the model. For this
example, let's use the Pearson correlation, also known as r2, as our metric. We can evaluate it on both the training set
and test set.
metric = dc.metrics.Metric(dc.metrics.pearson_r2_score)
print("Training set score:", model.evaluate(train_dataset, [metric], transformers))
print("Test set score:", model.evaluate(test_dataset, [metric], transformers))
Notice that it has a higher score on the training set than the test set. Models usually perform better on the particular
data they were trained on than they do on similar but independent data. This is called "overfitting", and it is the reason
it is essential to evaluate your model on an independent test set.
Our model still has quite respectable performance on the test set. For comparison, a model that produced totally
random outputs would have a correlation of 0, while one that made perfect predictions would have a correlation of 1.
Our model does quite well, so now we can use it to make predictions about other molecules we care about.
Since this is just a tutorial and we don't have any other molecules we specifically want to predict, let's just use the first
ten molecules from the test set. For each one we print out the chemical structure (represented as a SMILES string) and
the predicted log(solubility). To put these predictions in context, we print out the log(solubility) values from the test set
as well.
solubilities = model.predict_on_batch(test_dataset.X[:10])
for molecule, solubility, test_solubility in zip(test_dataset.ids, solubilities, test_dataset.y):
    print(solubility, test_solubility, molecule)
@manual{Intro1,
title={The Basic Tools of the Deep Life Sciences},
organization={DeepChem},
author={Ramsundar, Bharath},
howpublished = {\url{https://github.com/deepchem/deepchem/blob/master/examples/tutorials/The_Basic_Tools_of_the_Deep
year={2021},
}
Working With Datasets
Data is central to machine learning. This tutorial introduces the Dataset class that DeepChem uses to store and
manage data. It provides simple but powerful tools for efficiently working with large amounts of data. It also is designed
to easily interact with other popular Python frameworks such as NumPy, Pandas, TensorFlow, and PyTorch.
Colab
This tutorial and the rest in this sequence can be done in Google colab. If you'd like to open this notebook in colab, you
can use the following link.
Open in Colab
import deepchem as dc
dc.__version__
'2.4.0-rc1.dev'
Anatomy of a Dataset
In the last tutorial we loaded the Delaney dataset of molecular solubilities. Let's load it again.
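The loading code is omitted here; as in the previous tutorial, it looks roughly like this (assuming the GraphConv featurizer):

tasks, datasets, transformers = dc.molnet.load_delaney(featurizer='GraphConv')
train_dataset, valid_dataset, test_dataset = datasets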
We now have three Dataset objects: the training, validation, and test sets. What information does each of them contain?
We can start to get an idea by printing out the string representation of one of them.
print(test_dataset)
<DiskDataset X.shape: (113,), y.shape: (113, 1), w.shape: (113, 1), ids: ['C1c2ccccc2c3ccc4ccccc4c13' 'COc1ccccc1Cl'
 'COP(=S)(OC)Oc1cc(Cl)c(Br)cc1Cl' ... 'CCSCCSP(=S)(OC)OC' 'CCC(C)C'
 'COP(=O)(OC)OC(=CCl)c1cc(Cl)c(Cl)cc1Cl'], task_names: ['measured log solubility in mols per litre']>
There's a lot of information there, so let's start at the beginning. It begins with the label "DiskDataset". Dataset is an
abstract class. It has a few subclasses that correspond to different ways of storing data.
DiskDataset is a dataset that has been saved to disk. The data is stored in a way that can be efficiently accessed,
even if the total amount of data is far larger than your computer's memory.
NumpyDataset is an in-memory dataset that holds all the data in NumPy arrays. It is a useful tool when
manipulating small to medium sized datasets that can fit entirely in memory.
ImageDataset is a more specialized class that stores some or all of the data in image files on disk. It is useful when
working with models that have images as their inputs or outputs.
Now let's consider the contents of the Dataset. Every Dataset stores a list of samples. Very roughly speaking, a sample
is a single data point. In this case, each sample is a molecule. In other datasets a sample might correspond to an
experimental assay, a cell line, an image, or many other things. For every sample the dataset stores the following
information.
The features, referred to as X . This is the input that should be fed into a model to represent the sample.
The labels, referred to as y . This is the desired output from the model. During training, it tries to make the model's
output for each sample as close as possible to y .
The weights, referred to as w . This can be used to indicate that some data values are more important than others.
In later tutorials we will see examples of how this is useful.
An ID, which is a unique identifier for the sample. This can be anything as long as it is unique. Sometimes it is just
an integer index, but in this dataset the ID is a SMILES string describing the molecule.
Notice that X , y , and w all have 113 as the size of their first dimension. That means this dataset contains 113
samples.
The final piece of information listed in the output is task_names . Some datasets contain multiple pieces of information
for each sample. For example, if a sample represents a molecule, the dataset might record the results of several
different experiments on that molecule. This dataset has only a single task: "measured log solubility in mols per litre".
Also notice that y and w each have shape (113, 1). The second dimension of these arrays usually matches the
number of tasks.
test_dataset.y
array([[-1.7065408738415053],
[0.2911162036252904],
[-1.4272475857596547],
[-0.9254664241210759],
[-1.9526976701170347],
[1.3514839414275706],
[-0.8591934405084332],
[-0.6509069205829855],
[-0.32900957160729316],
[0.6082797680572224],
[1.8295961803473488],
[1.6213096604219008],
[1.3751528641463715],
[0.45632528420252055],
[1.0532555151706793],
[-1.1053502367839627],
[-0.2011973889257683],
[0.3479216181504126],
[-0.9870056231899582],
[-0.8161160011602158],
[0.8402352107014712],
[0.22815686919328],
[0.06247441016167367],
[1.040947675356903],
[-0.5197810887208284],
[0.8023649343513898],
[-0.41895147793873655],
[-2.5964923680684198],
[1.7443880585596654],
[0.45206487811313645],
[0.233837410645792],
[-1.7917489956291888],
[0.7739622270888287],
[1.0011838851893173],
[-0.05445006806920272],
[1.1043803882432892],
[0.7597608734575482],
[-0.7001382798380905],
[0.8213000725264304],
[-1.3136367567094103],
[0.4567986626568967],
[-0.5732728540653187],
[0.4094608172192949],
[-0.3242757870635329],
[-0.049716283525442634],
[-0.39054877067617544],
[-0.08095926151425996],
[-0.2627365879946506],
[-0.5467636606202616],
[1.997172153196459],
[-0.03551492989416198],
[1.4508934168465344],
[-0.8639272250521937],
[0.23904457364392848],
[0.5278054308132993],
[-0.48475108309700315],
[0.2248432200126478],
[0.3431878336066523],
[1.5029650468278963],
[-0.4946920306388995],
[0.3479216181504126],
[0.7928973652638694],
[0.5609419226196206],
[-0.13965818985688602],
[-0.13965818985688602],
[0.15857023640000523],
[1.6071083067906202],
[1.9006029485037514],
[-0.7171799041956278],
[-0.8165893796145915],
[-0.13019062076936566],
[-0.24380144981960986],
[-0.14912575894440638],
[0.9538460397517154],
[-0.07811899078800374],
[-0.18226225075072758],
[0.2532459272752089],
[0.6887541053011454],
[0.044012650441008896],
[-0.5514974451640217],
[-0.2580028034508905],
[-0.021313576262881533],
[-2.4128215277705247],
[0.07336211461232214],
[0.9017744097703536],
[1.9384732248538328],
[0.8402352107014712],
[-0.10652169805056463],
[1.07692443788948],
[-0.403803367398704],
[1.2662758196398873],
[-0.2532690189071302],
[0.29064282517091444],
[0.9443784706641951],
[-0.41563782875810434],
[-0.7370617992794205],
[-1.0012069768212388],
[0.46626623174441706],
[0.3758509469585975],
[-0.46628932337633816],
[1.2662758196398873],
[-1.4968342185529295],
[-0.17800184466134344],
[0.8828392715953128],
[-0.6083028596891439],
[-2.170451759130003],
[0.32898647997537184],
[0.3005837727128107],
[0.6461500444073038],
[1.5058053175541524],
[-0.007585601085977053],
[-0.049716283525442634],
[-0.6849901692980588]], dtype=object)
This is a very easy way to access data, but you should be very careful about using it. This requires the data for all
samples to be loaded into memory at once. That's fine for small datasets like this one, but for large datasets it could
easily take more memory than you have.
A better approach is to iterate over the dataset. That lets it load just a little data at a time, process it, then free the
memory before loading the next bit. You can use the itersamples() method to iterate over samples one at a time.
for X, y, w, id in test_dataset.itersamples():
    print(y, id)
[-1.70654087] C1c2ccccc2c3ccc4ccccc4c13
[0.2911162] COc1ccccc1Cl
[-1.42724759] COP(=S)(OC)Oc1cc(Cl)c(Br)cc1Cl
[-0.92546642] ClC(Cl)CC(=O)NC2=C(Cl)C(=O)c1ccccc1C2=O
[-1.95269767] ClC(Cl)C(c1ccc(Cl)cc1)c2ccc(Cl)cc2
[1.35148394] COC(=O)C=C
[-0.85919344] CN(C)C(=O)Nc2ccc(Oc1ccc(Cl)cc1)cc2
[-0.65090692] N(=Nc1ccccc1)c2ccccc2
[-0.32900957] CC(C)c1ccc(C)cc1
[0.60827977] Oc1c(Cl)cccc1Cl
[1.82959618] OCC2OC(OC1(CO)OC(CO)C(O)C1O)C(O)C(O)C2O
[1.62130966] OC1C(O)C(O)C(O)C(O)C1O
[1.37515286] Cn2c(=O)n(C)c1ncn(CC(O)CO)c1c2=O
[0.45632528] OCC(NC(=O)C(Cl)Cl)C(O)c1ccc(cc1)N(=O)=O
[1.05325552] CCC(O)(CC)CC
[-1.10535024] CC45CCC2C(CCC3CC1SC1CC23C)C4CCC5O
[-0.20119739] Brc1ccccc1Br
[0.34792162] Oc1c(Cl)cc(Cl)cc1Cl
[-0.98700562] CCCN(CCC)c1c(cc(cc1N(=O)=O)S(N)(=O)=O)N(=O)=O
[-0.816116] C2c1ccccc1N(CCF)C(=O)c3ccccc23
[0.84023521] CC(C)C(=O)C(C)C
[0.22815687] O=C1NC(=O)NC(=O)C1(C(C)C)CC=C(C)C
[0.06247441] c1c(O)C2C(=O)C3cc(O)ccC3OC2cc1(OC)
[1.04094768] Cn1cnc2n(C)c(=O)n(C)c(=O)c12
[-0.51978109] CC(=O)SC4CC1=CC(=O)CCC1(C)C5CCC2(C)C(CCC23CCC(=O)O3)C45
[0.80236493] Cc1ccc(O)cc1C
[-0.41895148] O(c1ccccc1)c2ccccc2
[-2.59649237] Clc1cc(Cl)c(cc1Cl)c2cc(Cl)c(Cl)cc2Cl
[1.74438806] NC(=O)c1cccnc1
[0.45206488] Sc1ccccc1
[0.23383741] CNC(=O)Oc1cc(C)cc(C)c1
[-1.791749] ClC1CC2C(C1Cl)C3(Cl)C(=C(Cl)C2(Cl)C3(Cl)Cl)Cl
[0.77396223] CSSC
[1.00118389] NC(=O)c1ccccc1
[-0.05445007] Clc1ccccc1Br
[1.10438039] COC(=O)c1ccccc1OC2OC(COC3OCC(O)C(O)C3O)C(O)C(O)C2O
[0.75976087] CCCCC(O)CC
[-0.70013828] CCN2c1nc(C)cc(C)c1NC(=O)c3cccnc23
[0.82130007] Oc1cc(Cl)cc(Cl)c1
[-1.31363676] Cc1cccc2c1ccc3ccccc32
[0.45679866] CCCCC(CC)CO
[-0.57327285] CC(C)N(C(C)C)C(=O)SCC(=CCl)Cl
[0.40946082] Cc1ccccc1
[-0.32427579] Clc1cccc(n1)C(Cl)(Cl)Cl
[-0.04971628] C1CCC=CCC1
[-0.39054877] CN(C)C(=S)SSC(=S)N(C)C
[-0.08095926] COC1=CC(=O)CC(C)C13Oc2c(Cl)c(OC)cc(OC)c2C3=O
[-0.26273659] CCCCCCCCCCO
[-0.54676366] CCC(C)(C)CC
[1.99717215] CNC(=O)C(C)SCCSP(=O)(OC)(OC)
[-0.03551493] Oc1cc(Cl)c(Cl)c(Cl)c1Cl
[1.45089342] CCCC=O
[-0.86392723] CC4CC3C2CCC1=CC(=O)C=CC1(C)C2(F)C(O)CC3(C)C4(O)C(=O)COC(C)=O
[0.23904457] CCCC
[0.52780543] COc1ccccc1O
[-0.48475108] CC1CC2C3CCC(O)(C(=O)C)C3(C)CC(O)C2(F)C4(C)C=CC(=O)C=C14
[0.22484322] ClC(Cl)C(Cl)(Cl)Cl
[0.34318783] CCOC(=O)c1ccccc1C(=O)OCC
[1.50296505] CC(C)CO
[-0.49469203] CC(C)Cc1ccccc1
[0.34792162] ICI
[0.79289737] CCCC(O)CCC
[0.56094192] CCCCCOC(=O)C
[-0.13965819] Oc1c(Cl)c(Cl)cc(Cl)c1Cl
[-0.13965819] CCCc1ccccc1
[0.15857024] FC(F)(Cl)C(F)(F)Cl
[1.60710831] CC=CC=O
[1.90060295] CN(C)C(=O)N(C)C
[-0.7171799] Cc1cc(C)c(C)cc1C
[-0.81658938] CC(=O)OC3(CCC4C2CCC1=CC(=O)CCC1C2CCC34C)C#C
[-0.13019062] CCOP(=S)(OCC)N2C(=O)c1ccccc1C2=O
[-0.24380145] c1ccccc1NC(=O)c2c(O)cccc2
[-0.14912576] CCN(CC)C(=S)SCC(Cl)=C
[0.95384604] ClCC
[-0.07811899] CC(=O)Nc1cc(NS(=O)(=O)C(F)(F)F)c(C)cc1C
[-0.18226225] O=C(C=CC=Cc2ccc1OCOc1c2)N3CCCCC3
[0.25324593] CC/C=C\C
[0.68875411] CNC(=O)ON=C(CSC)C(C)(C)C
[0.04401265] O=C2NC(=O)C1(CCCCCCC1)C(=O)N2
[-0.55149745] c1(C(C)(C)C)cc(C(C)(C)C)cc(OC(=O)NC)c1
[-0.2580028] Oc2cc(O)c1C(=O)CC(Oc1c2)c3ccc(O)c(O)c3
[-0.02131358] O=C(c1ccccc1)c2ccccc2
[-2.41282153] CCCCCCCCCCCCCCCCCCCC
[0.07336211] N(Nc1ccccc1)c2ccccc2
[0.90177441] CCC(CC)CO
[1.93847322] Oc1ccncc1
[0.84023521] Cl\C=C/Cl
[-0.1065217] CC1CCCC1
[1.07692444] CC(C)CC(C)O
[-0.40380337] O2c1ccc(N)cc1N(C)C(=O)c3cc(C)ccc23
[1.26627582] CC(C)(C)CO
[-0.25326902] CC(C)(C)C(=O)C(Oc1ccc(Cl)cc1)n2cncn2
[0.29064283] Cc1cc(no1)C(=O)NNCc2ccccc2
[0.94437847] CC=C
[-0.41563783] Oc1ccc(Cl)cc1Cc2cc(Cl)ccc2O
[-0.7370618] CCOC(=O)Nc2cccc(OC(=O)Nc1ccccc1)c2
[-1.00120698] O=C1c2ccccc2C(=O)c3ccccc13
[0.46626623] CCCCCCC(C)O
[0.37585095] CC1=C(C(=O)Nc2ccccc2)S(=O)(=O)CCO1
[-0.46628932] CCCCc1ccccc1
[1.26627582] O=C1NC(=O)C(=O)N1
[-1.49683422] COP(=S)(OC)Oc1ccc(Sc2ccc(OP(=S)(OC)OC)cc2)cc1
[-0.17800184] NS(=O)(=O)c1cc(ccc1Cl)C2(O)NC(=O)c3ccccc23
[0.88283927] CC(C)COC(=O)C
[-0.60830286] CC(C)C(C)(C)C
[-2.17045176] Clc1ccc(c(Cl)c1Cl)c2c(Cl)cc(Cl)c(Cl)c2Cl
[0.32898648] N#Cc1ccccc1C#N
[0.30058377] Cc1cccc(c1)N(=O)=O
[0.64615004] FC(F)(F)C(Cl)Br
[1.50580532] CNC(=O)ON=C(SC)C(=O)N(C)C
[-0.0075856] CCSCCSP(=S)(OC)OC
[-0.04971628] CCC(C)C
[-0.68499017] COP(=O)(OC)OC(=CCl)c1cc(Cl)c(Cl)cc1Cl
Most deep learning models can process a batch of multiple samples all at once. You can use iterbatches() to iterate
over batches of samples.
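The loop that produced the shapes below is not shown above; a minimal sketch (using a batch size of 50, which matches the output) would be:

for X, y, w, ids in test_dataset.iterbatches(batch_size=50):
    print(y.shape)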
(50, 1)
(50, 1)
(13, 1)
iterbatches() has other features that are useful when training models. For example,
iterbatches(batch_size=100, epochs=10, deterministic=False) will iterate over the complete dataset ten
times, each time with the samples in a different random order.
Datasets can also expose data using the standard interfaces for TensorFlow and PyTorch. To get a
tensorflow.data.Dataset , call make_tf_dataset() . To get a torch.utils.data.IterableDataset , call
make_pytorch_dataset() . See the API documentation for more details.
The final way of accessing data is to_dataframe() . This copies the data into a Pandas DataFrame . This requires
storing all the data in memory at once, so you should only use it with small datasets.
test_dataset.to_dataframe()
X y w ids
Creating Datasets
Now let's talk about how you can create your own datasets. Creating a NumpyDataset is very simple: just pass the
arrays containing the data to the constructor. Let's create some random arrays, then wrap them in a NumpyDataset.
import numpy as np
X = np.random.random((10, 5))
y = np.random.random((10, 2))
dataset = dc.data.NumpyDataset(X=X, y=y)
print(dataset)
<NumpyDataset X.shape: (10, 5), y.shape: (10, 2), w.shape: (10, 1), ids: [0 1 2 3 4 5 6 7 8 9], task_names: [0 1
]>
Notice that we did not specify weights or IDs. These are optional, as is y for that matter. Only X is required. Since we
left them out, it automatically built w and ids arrays for us, setting all weights to 1 and setting the IDs to integer
indices.
dataset.to_dataframe()
X1 X2 X3 X4 X5 y1 y2 w ids
What about creating a DiskDataset? If you have the data in NumPy arrays, you can call DiskDataset.from_numpy() to
save it to disk. Since this is just a tutorial, we will save it to a temporary directory.
import tempfile
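Using the tempfile module imported above, a sketch of the save might look like this (reusing the X and y arrays from before):

with tempfile.TemporaryDirectory() as data_dir:
    # DiskDataset.from_numpy() writes the arrays to data_dir and returns a DiskDataset
    disk_dataset = dc.data.DiskDataset.from_numpy(X=X, y=y, data_dir=data_dir)
    print(disk_dataset)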
<DiskDataset X.shape: (10, 5), y.shape: (10, 2), w.shape: (10, 1), ids: [0 1 2 3 4 5 6 7 8 9], task_names: [0 1]
>
What about larger datasets that can't fit in memory? What if you have some huge files on disk containing data on
hundreds of millions of molecules? The process for creating a DiskDataset from them is slightly more involved.
Fortunately, DeepChem's DataLoader framework can automate most of the work for you. That is a larger subject, so
we will return to it in a later tutorial.
One of the most powerful features of DeepChem is that it comes "batteries included" with datasets to use. The
DeepChem developer community maintains the MoleculeNet [1] suite of datasets which maintains a large collection of
different scientific datasets for use in machine learning applications. The original MoleculeNet suite had 17 datasets
mostly focused on molecular properties. Over the last several years, MoleculeNet has evolved into a broader collection
of scientific datasets to facilitate the broad use and development of scientific machine learning tools.
These datasets are integrated with the rest of the DeepChem suite so you can conveniently access these through
functions in the dc.molnet submodule. You've already seen a few examples of these loaders as you've worked
through the tutorial series. The full documentation for the MoleculeNet suite is available in our docs [2].
[1] Wu, Zhenqin, et al. "MoleculeNet: a benchmark for molecular machine learning." Chemical science 9.2 (2018): 513-
530.
[2] https://deepchem.readthedocs.io/en/latest/moleculenet.html
Colab
This tutorial and the rest in this sequence can be done in Google colab. If you'd like to open this notebook in colab, you
can use the following link.
Open in Colab
Setup
To run DeepChem within Colab, you'll need to run the following installation commands. You can of course run this
tutorial locally if you prefer. In that case, don't run these cells since they will download and install DeepChem again on
your local machine.
import deepchem as dc
dc.__version__
'2.4.0-rc1.dev'
MoleculeNet Overview
In the last two tutorials we loaded the Delaney dataset of molecular solubilities. Let's load it one more time.
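The loading call is not shown above; a sketch of it (the splitter choice here is illustrative):

tasks, datasets, transformers = dc.molnet.load_delaney(featurizer='GraphConv', splitter='random')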
Notice that the loader function we invoke dc.molnet.load_delaney lives in the dc.molnet submodule of
MoleculeNet loaders. Let's take a look at the full collection of loaders available for us.
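One simple way to list them (the original cell is omitted here) is to inspect the submodule directly:

loaders = [method for method in dir(dc.molnet) if "load_" in method]
print(loaders)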
The set of MoleculeNet loaders is actively maintained by the DeepChem community and we work on adding new
datasets to the collection. Let's see how many datasets there are in MoleculeNet today:
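A quick count, consistent with the output shown below:

print(len([method for method in dir(dc.molnet) if "load_" in method]))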
46
dc.molnet.load_qm7 : V1
dc.molnet.load_qm7b_from_mat : V1
dc.molnet.load_qm8 : V1
dc.molnet.load_qm9 : V1
dc.molnet.load_delaney : V1. This dataset is also referred to as ESOL in the original paper.
dc.molnet.load_sampl : V1. This dataset is also referred to as FreeSolv in the original paper.
dc.molnet.load_lipo : V1. This dataset is also referred to as Lipophilicity in the original paper.
dc.molnet.load_thermosol : V2.
dc.molnet.load_hppb : V2.
dc.molnet.load_hopv : V2. This dataset is drawn from a recent publication [3]
dc.molnet.load_uspto
Biochemical/Biophysical Datasets
These datasets are drawn from various biochemical/biophysical datasets that measure things like the binding affinity of
compounds to proteins.
dc.molnet.load_pcba : V1
dc.molnet.load_nci : V2.
dc.molnet.load_muv : V1
dc.molnet.load_hiv : V1
dc.molnet.load_ppb : V2.
dc.molnet.load_bace_classification : V1. This loader loads the classification task for the BACE dataset from
the original MoleculeNet paper.
dc.molnet.load_bace_regression : V1. This loader loads the regression task for the BACE dataset from the
original MoleculeNet paper.
dc.molnet.load_kaggle : V2. This dataset is from Merck's drug discovery kaggle contest and is described in [4].
dc.molnet.load_factors : V2. This dataset is from [4].
dc.molnet.load_uv : V2. This dataset is from [4].
dc.molnet.load_kinase : V2. This dataset is from [4].
dc.molnet.load_zinc15 : V2
dc.molnet.load_chembl : V2
dc.molnet.load_chembl25 : V2
Physiology Datasets
These datasets measure physiological properties of how molecules interact with human patients.
dc.molnet.load_bbbp : V1
dc.molnet.load_tox21 : V1
dc.molnet.load_toxcast : V1
dc.molnet.load_sider : V1
dc.molnet.load_clintox : V1
dc.molnet.load_clearance : V2.
dc.molnet.load_pdbbind : V1
Microscopy Datasets
These datasets contain microscopy image datasets, typically of cell lines. These datasets were not in the original
MoleculeNet paper.
dc.molnet.load_bbbc001 : V2
dc.molnet.load_bbbc002 : V2
dc.molnet.load_cell_counting : V2
dc.molnet.load_bandgap : V2
dc.molnet.load_perovskite : V2
dc.molnet.load_mp_formation_energy : V2
dc.molnet.load_mp_metallicity : V2
[3] Lopez, Steven A., et al. "The Harvard organic photovoltaic dataset." Scientific data 3.1 (2016): 1-7.
[4] Ramsundar, Bharath, et al. "Is multitask deep learning practical for pharma?." Journal of chemical information and
modeling 57.8 (2017): 2068-2076.
1. tasks : This is a list of task-names. Many datasets in MoleculeNet are "multitask". That is, a given datapoint has
multiple labels associated with it. These correspond to different measurements or values associated with this
datapoint.
2. datasets : This field is a tuple of three dc.data.Dataset objects (train, valid, test) . These correspond to
the training, validation, and test set for this MoleculeNet dataset.
3. transformers : This field is a list of dc.trans.Transformer objects which were applied to this dataset during
processing.
This is abstract so let's take a look at each of these fields for the dc.molnet.load_delaney function we invoked above.
Let's start with tasks .
tasks
We have one task in this dataset which corresponds to the measured log solubility in mol/L. Let's now take a look at
datasets :
datasets
(<DiskDataset X.shape: (902,), y.shape: (902, 1), w.shape: (902, 1), ids: ['CCC(C)Cl' 'O=C1NC(=O)NC(=O)C1(C(C)C
)CC=C' 'Oc1ccccn1' ...
'CCCCCCCC(=O)OCC' 'O=Cc1ccccc1' 'CCCC=C(CC)C=O'], task_names: ['measured log solubility in mols per litre']>,
<DiskDataset X.shape: (113,), y.shape: (113, 1), w.shape: (113, 1), ids: ['CSc1nc(nc(n1)N(C)C)N(C)C' 'CC#N' 'C
CCCCCCC#C' ... 'ClCCBr'
'CCN(CC)C(=O)CSc1ccc(Cl)nn1' 'CC(=O)OC3CCC4C2CCC1=CC(=O)CCC1(C)C2CCC34C '], task_names: ['measured log solubi
lity in mols per litre']>,
<DiskDataset X.shape: (113,), y.shape: (113, 1), w.shape: (113, 1), ids: ['CCCCc1c(C)nc(nc1O)N(C)C '
'Cc3cc2nc1c(=O)[nH]c(=O)nc1n(CC(O)C(O)C(O)CO)c2cc3C'
'CSc1nc(NC(C)C)nc(NC(C)C)n1' ... 'O=c1[nH]cnc2[nH]ncc12 '
'CC(=C)C1CC=C(C)C(=O)C1' 'OC(C(=O)c1ccccc1)c2ccccc2'], task_names: ['measured log solubility in mols per litr
e']>)
As we mentioned previously, we see that datasets is a tuple of 3 datasets. Let's split them out.
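The split is a simple tuple unpacking:

train, valid, test = datasets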
train
<DiskDataset X.shape: (902,), y.shape: (902, 1), w.shape: (902, 1), ids: ['CCC(C)Cl' 'O=C1NC(=O)NC(=O)C1(C(C)C)
CC=C' 'Oc1ccccn1' ...
'CCCCCCCC(=O)OCC' 'O=Cc1ccccc1' 'CCCC=C(CC)C=O'], task_names: ['measured log solubility in mols per litre']>
valid
<DiskDataset X.shape: (113,), y.shape: (113, 1), w.shape: (113, 1), ids: ['CSc1nc(nc(n1)N(C)C)N(C)C' 'CC#N' 'CC
CCCCCC#C' ... 'ClCCBr'
'CCN(CC)C(=O)CSc1ccc(Cl)nn1' 'CC(=O)OC3CCC4C2CCC1=CC(=O)CCC1(C)C2CCC34C '], task_names: ['measured log solubil
ity in mols per litre']>
test
<DiskDataset X.shape: (113,), y.shape: (113, 1), w.shape: (113, 1), ids: ['CCCCc1c(C)nc(nc1O)N(C)C '
'Cc3cc2nc1c(=O)[nH]c(=O)nc1n(CC(O)C(O)C(O)CO)c2cc3C'
'CSc1nc(NC(C)C)nc(NC(C)C)n1' ... 'O=c1[nH]cnc2[nH]ncc12 '
'CC(=C)C1CC=C(C)C(=O)C1' 'OC(C(=O)c1ccccc1)c2ccccc2'], task_names: ['measured log solubility in mols per litre
']>
train.X[0]
<deepchem.feat.mol_graphs.ConvMol at 0x7fe1ef601438>
Note that this is a dc.feat.mol_graphs.ConvMol object produced by dc.feat.ConvMolFeaturizer . We'll say more
about how to control choice of featurization shortly. Finally let's take a look at the transformers field:
transformers
[<deepchem.trans.transformers.NormalizationTransformer at 0x7fe2029bdfd0>]
After reading through this description so far, you may be wondering what choices are made under the hood. As we've
briefly mentioned previously, datasets can be processed with different choices of "featurizers". Can we control the
choice of featurization here? In addition, how was the source dataset split into train/valid/test as three different
datasets?
You can use the 'featurizer' and 'splitter' keyword arguments and pass in different strings. Common possible choices for
'featurizer' are 'ECFP', 'GraphConv', 'Weave' and 'smiles2img' corresponding to the dc.feat.CircularFingerprint ,
dc.feat.ConvMolFeaturizer , dc.feat.WeaveFeaturizer and dc.feat.SmilesToImage featurizers. Common
possible choices for 'splitter' are None , 'index', 'random', 'scaffold' and 'stratified', corresponding to no split, dc.splits.IndexSplitter , dc.splits.RandomSplitter , dc.splits.ScaffoldSplitter and dc.splits.SingletaskStratifiedSplitter . We
haven't talked much about splitters yet, but intuitively they're a way to partition a dataset based on different criteria.
We'll say more in a future tutorial.
Instead of a string, you also can pass in any Featurizer or Splitter object. This is very useful when, for example, a
Featurizer has constructor arguments you can use to customize its behavior.
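For example, reloading Delaney with circular fingerprints might look like the following (the scaffold splitter here is an illustrative choice):

tasks, datasets, transformers = dc.molnet.load_delaney(featurizer='ECFP', splitter='scaffold')
(train, valid, test) = datasets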
train
<DiskDataset X.shape: (902, 1024), y.shape: (902, 1), w.shape: (902, 1), ids: ['CC(C)=CCCC(C)=CC(=O)' 'CCCC=C'
'CCCCCCCCCCCCCC' ...
'Nc2cccc3nc1ccccc1cc23 ' 'C1CCCCCC1' 'OC1CCCCCC1'], task_names: ['measured log solubility in mols per litre']>
train.X[0]
Note that unlike the earlier invocation we have numpy arrays produced by dc.feat.CircularFingerprint instead of
ConvMol objects produced by dc.feat.ConvMolFeaturizer .
Give it a try for yourself. Try invoking MoleculeNet to load some other datasets and experiment with different
featurizer/split options and see what happens!
Colab
This tutorial and the rest in this sequence can be done in Google colab. If you'd like to open this notebook in colab, you
can use the following link.
Open in Colab
import deepchem as dc
dc.__version__
'2.4.0-rc1.dev'
What is a Fingerprint?
Deep learning models almost always take arrays of numbers as their inputs. If we want to process molecules with them,
we somehow need to represent each molecule as one or more arrays of numbers.
Many (but not all) types of models require their inputs to have a fixed size. This can be a challenge for molecules, since
different molecules have different numbers of atoms. If we want to use these types of models, we somehow need to
represent variable sized molecules with fixed sized arrays.
Fingerprints are designed to address these problems. A fingerprint is a fixed length array, where different elements
indicate the presence of different features in the molecule. If two molecules have similar fingerprints, that indicates they
contain many of the same features, and therefore will likely have similar chemistry.
DeepChem supports a particular type of fingerprint called an "Extended Connectivity Fingerprint", or "ECFP" for short.
They also are sometimes called "circular fingerprints". The ECFP algorithm begins by classifying atoms based only on
their direct properties and bonds. Each unique pattern is a feature. For example, "carbon atom bonded to two
hydrogens and two heavy atoms" would be a feature, and a particular element of the fingerprint is set to 1 for any
molecule that contains that feature. It then iteratively identifies new features by looking at larger circular
neighborhoods. One specific feature bonded to two other specific features becomes a higher level feature, and the
corresponding element is set for any molecule that contains it. This continues for a fixed number of iterations, most
often two.
Let's take a look at a dataset that has been featurized with ECFP.
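The loading cell is omitted here; a sketch using the Tox21 loader with the ECFP featurizer (which matches the output below):

tasks, datasets, transformers = dc.molnet.load_tox21(featurizer='ECFP')
train_dataset, valid_dataset, test_dataset = datasets
print(train_dataset)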
<DiskDataset X.shape: (6264, 1024), y.shape: (6264, 12), w.shape: (6264, 12), task_names: ['NR-AR' 'NR-AR-LBD' '
NR-AhR' ... 'SR-HSE' 'SR-MMP' 'SR-p53']>
The feature array X has shape (6264, 1024). That means there are 6264 samples in the training set. Each one is
represented by a fingerprint of length 1024. Also notice that the label array y has shape (6264, 12): this is a multitask
dataset. Tox21 contains information about the toxicity of molecules. 12 different assays were used to look for signs of
toxicity. The dataset records the results of all 12 assays, each as a different task.
train_dataset.w
array([[1.0433141624730409, 1.0369942196531792, 8.53921568627451, ...,
1.060388945752303, 1.1895710249165168, 1.0700990099009902],
[1.0433141624730409, 1.0369942196531792, 1.1326397919375812, ...,
0.0, 1.1895710249165168, 1.0700990099009902],
[0.0, 0.0, 0.0, ..., 1.060388945752303, 0.0, 0.0],
...,
[0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0],
[1.0433141624730409, 1.0369942196531792, 8.53921568627451, ...,
1.060388945752303, 0.0, 0.0],
[1.0433141624730409, 1.0369942196531792, 1.1326397919375812, ...,
1.060388945752303, 1.1895710249165168, 1.0700990099009902]],
dtype=object)
Notice that some elements are 0. The weights are being used to indicate missing data. Not all assays were actually
performed on every molecule. Setting the weight for a sample or sample/task pair to 0 causes it to be ignored during
fitting and evaluation. It will have no effect on the loss function or other metrics.
Most of the other weights are close to 1, but not exactly 1. This is done to balance the overall weight of positive and
negative samples on each task. When training the model, we want each of the 12 tasks to contribute equally, and on
each task we want to put equal weight on positive and negative samples. Otherwise, the model might just learn that
most of the training samples are non-toxic, and therefore become biased toward identifying other molecules as non-
toxic.
MultitaskClassifier is a simple stack of fully connected layers. In this example we tell it to use a single hidden layer
of width 1000. We also tell it that each input will have 1024 features, and that it should produce predictions for 12
different tasks.
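A sketch of constructing such a model with those settings:

model = dc.models.MultitaskClassifier(n_tasks=12, n_features=1024, layer_sizes=[1000])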
Why not train a separate model for each task? We could do that, but it turns out that training a single model for multiple
tasks often works better. We will see an example of that in a later tutorial.
import numpy as np
model.fit(train_dataset, nb_epoch=10)
metric = dc.metrics.Metric(dc.metrics.roc_auc_score)
print('training set score:', model.evaluate(train_dataset, [metric], transformers))
print('test set score:', model.evaluate(test_dataset, [metric], transformers))
Not bad performance for such a simple model and featurization. More sophisticated models do slightly better on this
dataset, but not enormously better.
@manual{Intro4,
title={Molecular Fingerprints},
organization={DeepChem},
author={Ramsundar, Bharath},
howpublished = {\url{https://github.com/deepchem/deepchem/blob/master/examples/tutorials/Molecular_Fingerprints.ipyn
year={2021},
}
Creating Models with TensorFlow and PyTorch
In the tutorials so far, we have used standard models provided by DeepChem. This is fine for many applications, but
sooner or later you will want to create an entirely new model with an architecture you define yourself. DeepChem
provides integration with both TensorFlow (Keras) and PyTorch, so you can use it with models from either of these
frameworks.
Colab
This tutorial and the rest in this sequence are designed to be done in Google colab. If you'd like to open this notebook in
colab, you can use the following link.
Open in Colab
There are actually two different approaches you can take to using TensorFlow or PyTorch models with DeepChem. It
depends on whether you want to use TensorFlow/PyTorch APIs or DeepChem APIs for training and evaluating your
model. For the former case, DeepChem's Dataset class has methods for easily adapting it to use with other
frameworks. make_tf_dataset() returns a tensorflow.data.Dataset object that iterates over the data.
make_pytorch_dataset() returns a torch.utils.data.IterableDataset that iterates over the data. This lets you
use DeepChem's datasets, loaders, featurizers, transformers, splitters, etc. and easily integrate them into your existing
TensorFlow or PyTorch code.
But DeepChem also provides many other useful features. The other approach, which lets you use those features, is to
wrap your model in a DeepChem Model object. Let's look at how to do that.
KerasModel
KerasModel is a subclass of DeepChem's Model class. It acts as a wrapper around a tensorflow.keras.Model . Let's
see an example of using it. For this example, we create a simple sequential model consisting of two dense layers.
import deepchem as dc
import tensorflow as tf
keras_model = tf.keras.Sequential([
tf.keras.layers.Dense(1000, activation='relu'),
tf.keras.layers.Dropout(rate=0.5),
tf.keras.layers.Dense(1)
])
model = dc.models.KerasModel(keras_model, dc.models.losses.L2Loss())
For this example, we used the Keras Sequential class. Our model consists of a dense layer with ReLU activation, 50%
dropout to provide regularization, and a final layer that produces a scalar output. We also need to specify the loss
function to use when training the model, in this case L2 loss. We can now train and evaluate the model exactly as we
would with any other DeepChem model. For example, let's load the Delaney solubility dataset. How does our model do
at predicting the solubilities of molecules based on their extended-connectivity fingerprints (ECFPs)?
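The training and evaluation cell is not reproduced above; a sketch of it, assuming the usual Delaney loading with ECFP features:

tasks, datasets, transformers = dc.molnet.load_delaney(featurizer='ECFP', splitter='random')
train_dataset, valid_dataset, test_dataset = datasets
model.fit(train_dataset, nb_epoch=50)
metric = dc.metrics.Metric(dc.metrics.pearson_r2_score)
print('training set score:', model.evaluate(train_dataset, [metric]))
print('test set score:', model.evaluate(test_dataset, [metric]))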
TorchModel
TorchModel works just like KerasModel , except it wraps a torch.nn.Module . Let's use PyTorch to create another
model just like the previous one and train it on the same data.
import torch
pytorch_model = torch.nn.Sequential(
torch.nn.Linear(1024, 1000),
torch.nn.ReLU(),
torch.nn.Dropout(0.5),
torch.nn.Linear(1000, 1)
)
model = dc.models.TorchModel(pytorch_model, dc.models.losses.L2Loss())
model.fit(train_dataset, nb_epoch=50)
print('training set score:', model.evaluate(train_dataset, [metric]))
print('test set score:', model.evaluate(test_dataset, [metric]))
Computing Losses
Now let's see a more advanced example. In the above models, the loss was computed directly from the model's output.
Often that is fine, but not always. Consider a classification model that outputs a probability distribution. While it is
possible to compute the loss from the probabilities, it is more numerically stable to compute it from the logits.
To do this, we create a model that returns multiple outputs, both probabilities and logits. KerasModel and
TorchModel let you specify a list of "output types". If a particular output has type 'prediction' , that means it is a
normal output that should be returned when you call predict() . If it has type 'loss' , that means it should be
passed to the loss function in place of the normal outputs.
Sequential models do not allow multiple outputs, so instead we use a subclassing style model.
class ClassificationModel(tf.keras.Model):
    def __init__(self):
        super(ClassificationModel, self).__init__()
        self.dense1 = tf.keras.layers.Dense(1000, activation='relu')
        self.dense2 = tf.keras.layers.Dense(1)

    def call(self, inputs, training=False):
        # Return both the sigmoid probability and the raw logits
        y = self.dense1(inputs)
        if training:
            y = tf.nn.dropout(y, 0.5)
        logits = self.dense2(y)
        return tf.nn.sigmoid(logits), logits
keras_model = ClassificationModel()
output_types = ['prediction', 'loss']
model = dc.models.KerasModel(keras_model, dc.models.losses.SigmoidCrossEntropy(), output_types=output_types)
We can train our model on the BACE dataset. This is a binary classification task that tries to predict whether a molecule
will inhibit the enzyme BACE-1.
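A sketch of loading BACE and training the wrapped model (the loader arguments and epoch count are illustrative):

tasks, datasets, transformers = dc.molnet.load_bace_classification(featurizer='ECFP', splitter='scaffold')
train_dataset, valid_dataset, test_dataset = datasets
model.fit(train_dataset, nb_epoch=100)
metric = dc.metrics.Metric(dc.metrics.roc_auc_score)
print('training set score:', model.evaluate(train_dataset, [metric]))
print('test set score:', model.evaluate(test_dataset, [metric]))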
Similarly, we will create a custom ClassificationModel class to be used with TorchModel . Following the same reasoning as the KerasModel above, the custom class defines how the forward pass is done, making it easy to capture the unscaled output (the logits) of the second dense layer right before the final sigmoid is applied to produce the prediction.
Finally, an instance of ClassificationModel is coupled with a loss function that requires both the prediction and logits
to produce an instance of TorchModel to train.
class ClassificationModel(torch.nn.Module):
    def __init__(self):
        super(ClassificationModel, self).__init__()
        self.dense1 = torch.nn.Linear(1024, 1000)
        self.dense2 = torch.nn.Linear(1000, 1)

    def forward(self, inputs):
        # Capture the logits just before the final sigmoid is applied
        y = torch.nn.functional.relu(self.dense1(inputs))
        y = torch.nn.functional.dropout(y, p=0.5, training=self.training)
        logits = self.dense2(y)
        return torch.sigmoid(logits), logits
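A sketch of wrapping it, mirroring the Keras example above:

torch_model = ClassificationModel()
output_types = ['prediction', 'loss']
model = dc.models.TorchModel(torch_model, dc.models.losses.SigmoidCrossEntropy(), output_types=output_types)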
We will use the same BACE dataset. As before, the model will try to do a binary classification task that tries to predict
whether a molecule will inhibit the enzyme BACE-1.
Other Features
KerasModel and TorchModel have lots of other features. Here are some of the more important ones.
By wrapping your own models in a KerasModel or TorchModel , you get immediate access to all these features. See
the API documentation for full details on them.
@manual{Intro1,
title={Creating Models with TensorFlow and PyTorch},
organization={DeepChem},
author={Ramsundar, Bharath and Rebel, Alles},
howpublished = {\url{https://github.com/deepchem/deepchem/blob/master/examples/tutorials/Creating_Models_with_Tensor
year={2021},
}
Introduction to Graph Convolutions
In this tutorial we will learn more about "graph convolutions." These are one of the most powerful deep learning tools for
working with molecular data. The reason for this is that molecules can be naturally viewed as graphs.
Note how standard chemical diagrams of the sort we're used to from high school lend themselves naturally to
visualizing molecules as graphs. In the remainder of this tutorial, we'll dig into this relationship in significantly more
detail. This will let us get a deeper understanding of how these systems work.
Colab
This tutorial and the rest in this sequence are designed to be done in Google colab. If you'd like to open this notebook in
colab, you can use the following link.
Open in Colab
Graph convolutions are similar, but they operate on a graph. They begin with a data vector for each node of the graph
(for example, the chemical properties of the atom that node represents). Convolutional and pooling layers combine
information from connected nodes (for example, atoms that are bonded to each other) to produce a new data vector for
each node.
Training a GraphConvModel
Let's use the MoleculeNet suite to load the Tox21 dataset. To featurize the data in a way that graph convolutional
networks can use, we set the featurizer option to 'GraphConv' . The MoleculeNet call returns a training set, a validation
set, and a test set for us to use. It also returns tasks , a list of the task names, and transformers , a list of data
transformations that were applied to preprocess the dataset. (Most deep networks are quite finicky and require a set of
data transformations to ensure that training proceeds stably.)
Note: While importing deepchem, if you see any warnings, ignore them for now. Deepchem is a vast library and there
are many things that can cause minor warnings to occur. Almost always, it doesn't require any action from your side.
import deepchem as dc
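The loading cell itself is omitted above; it looks roughly like this:

tasks, datasets, transformers = dc.molnet.load_tox21(featurizer='GraphConv')
train_dataset, valid_dataset, test_dataset = datasets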
Let's now train a graph convolutional network on this dataset. DeepChem has the class GraphConvModel that wraps a
standard graph convolutional architecture underneath the hood for user convenience. Let's instantiate an object of this
class and train it on our dataset.
n_tasks = len(tasks)
num_features = train_dataset.X[0].get_atom_features().shape[1]
model = dc.models.torch_models.GraphConvModel(n_tasks, mode='classification', number_input_features=[num_features, 64])
model.fit(train_dataset, nb_epoch=50)
0.29102970123291017
Let's try to evaluate the performance of the model we've trained. For this, we need to define a metric, a measure of
model performance. dc.metrics holds a collection of metrics already. For this dataset, it is standard to use the ROC-
AUC score, the area under the receiver operating characteristic curve (which measures the tradeoff between the true positive rate and the false positive rate). Luckily, the ROC-AUC score is already available in DeepChem.
To measure the performance of the model under this metric, we can use the convenience function model.evaluate() .
metric = dc.metrics.Metric(dc.metrics.roc_auc_score)
print('Training set score:', model.evaluate(train_dataset, [metric], transformers))
print('Test set score:', model.evaluate(test_dataset, [metric], transformers))
The results are pretty good, and GraphConvModel is very easy to use. But what's going on under the hood? Could we
build GraphConvModel ourselves? Of course! DeepChem provides Keras layers for all the calculations involved in a
graph convolution. We are going to apply the following layers from DeepChem.
GraphConv layer: This layer implements the graph convolution. The graph convolution combines per-node feature
vectors in a nonlinear fashion with the feature vectors for neighboring nodes. This "blends" information in local
neighborhoods of a graph.
GraphPool layer: This layer does a max-pooling over the feature vectors of atoms in a neighborhood. You can think
of this layer as analogous to a max-pooling layer for 2D convolutions but which operates on graphs instead.
GraphGather : Many graph convolutional networks manipulate feature vectors per graph-node. For a molecule for
example, each node might represent an atom, and the network would manipulate atomic feature vectors that
summarize the local chemistry of the atom. However, at the end of the application, we will likely want to work with a
molecule level feature representation. This layer creates a graph level feature vector by combining all the node-
level feature vectors.
Apart from these, we are going to apply standard neural network layers such as Dense, BatchNormalization, and Softmax.
TensorFlow
from deepchem.models.layers import GraphConv, GraphPool, GraphGather
import tensorflow as tf
import tensorflow.keras.layers as layers
batch_size = 100
class GraphConvModelTensorflow(tf.keras.Model):
    def __init__(self):
        super(GraphConvModelTensorflow, self).__init__()
        # First convolutional block: GraphConv -> BatchNorm -> GraphPool
        self.gc1 = GraphConv(128, activation_fn=tf.nn.tanh)
        self.batch_norm1 = layers.BatchNormalization()
        self.gp1 = GraphPool()
        # Second convolutional block
        self.gc2 = GraphConv(128, activation_fn=tf.nn.tanh)
        self.batch_norm2 = layers.BatchNormalization()
        self.gp2 = GraphPool()
        # Dense layer, batch normalization and GraphGather readout
        self.dense1 = layers.Dense(256, activation=tf.nn.tanh)
        self.batch_norm3 = layers.BatchNormalization()
        self.readout = GraphGather(batch_size=batch_size, activation_fn=tf.nn.tanh)
        # Final dense layer producing per-task logits and a softmax over classes
        self.dense2 = layers.Dense(n_tasks*2)
        self.logits = layers.Reshape((n_tasks, 2))
        self.softmax = layers.Softmax()

    def call(self, inputs):
        gc1_output = self.gc1(inputs)
        batch_norm1_output = self.batch_norm1(gc1_output)
        gp1_output = self.gp1([batch_norm1_output] + inputs[1:])

        gc2_output = self.gc2([gp1_output] + inputs[1:])
        batch_norm2_output = self.batch_norm2(gc2_output)
        gp2_output = self.gp2([batch_norm2_output] + inputs[1:])

        dense1_output = self.dense1(gp2_output)
        batch_norm3_output = self.batch_norm3(dense1_output)
        readout_output = self.readout([batch_norm3_output] + inputs[1:])

        logits_output = self.logits(self.dense2(readout_output))
        return self.softmax(logits_output)
We can now see more clearly what is happening. There are two convolutional blocks, each consisting of a GraphConv ,
followed by batch normalization, followed by a GraphPool to do max pooling. We finish up with a dense layer, another
batch normalization, a GraphGather to combine the data from all the different nodes, and a final dense layer to
produce the global output.
Let's now create the DeepChem model which will be a wrapper around the Keras model that we just created. We will
also specify the loss function so the model knows the objective to minimize.
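A sketch of that wrapping step (categorical cross entropy matches the softmax output of the Keras model above):

model = dc.models.KerasModel(GraphConvModelTensorflow(), loss=dc.models.losses.CategoricalCrossEntropy())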
What are the inputs to this model? A graph convolution requires a complete description of each molecule, including the
list of nodes (atoms) and a description of which ones are bonded to each other. In fact, if we inspect the dataset we see
that the feature array contains Python objects of type ConvMol .
test_dataset.X[0]
<deepchem.feat.mol_graphs.ConvMol at 0x7bf66bfa1160>
Models expect arrays of numbers as their inputs, not Python objects. We must convert the ConvMol objects into the
particular set of arrays expected by the GraphConv , GraphPool , and GraphGather layers. Fortunately, the ConvMol
class includes the code to do this, as well as to combine all the molecules in a batch to create a single set of arrays.
The following code creates a Python generator that given a batch of data generates the lists of inputs, labels, and
weights whose values are Numpy arrays. atom_features holds a feature vector of length 75 for each atom. The other
inputs are required to support minibatching in TensorFlow. degree_slice is an indexing convenience that makes it
easy to locate atoms from all molecules with a given degree. membership determines the membership of atoms in
molecules (atom i belongs to molecule membership[i] ). deg_adjs is a list that contains adjacency lists grouped by
atom degree. For more details, check out the code.
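The generator itself is not reproduced above; a sketch following that description (it relies on the batch_size and n_tasks variables defined earlier):

from deepchem.metrics import to_one_hot
from deepchem.feat.mol_graphs import ConvMol
import numpy as np

def data_generator(dataset, epochs=1):
    for ind, (X_b, y_b, w_b, ids_b) in enumerate(
            dataset.iterbatches(batch_size, epochs, deterministic=False, pad_batches=True)):
        # Merge the ConvMol objects in the batch into a single multi-molecule graph
        multiConvMol = ConvMol.agglomerate_mols(X_b)
        inputs = [multiConvMol.get_atom_features(), multiConvMol.deg_slice, np.array(multiConvMol.membership)]
        # Append the adjacency lists grouped by atom degree
        for i in range(1, len(multiConvMol.get_deg_adjacency_lists())):
            inputs.append(multiConvMol.get_deg_adjacency_lists()[i])
        # One-hot encode the labels so they match the (n_tasks, 2) softmax output
        labels = [to_one_hot(y_b.flatten(), 2).reshape(-1, n_tasks, 2)]
        weights = [w_b]
        yield (inputs, labels, weights)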
Now, we can train the model using fit_generator(generator) which will use the generator we've defined to train the
model.
model.fit_generator(data_generator(train_dataset, epochs=50))
0.23354644775390626
Now that we have trained our graph convolutional method, let's evaluate its performance. We again have to use our
defined generator to evaluate model performance.
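One way to do this, using evaluate_generator with the ROC-AUC metric defined earlier:

print('Training set score:', model.evaluate_generator(data_generator(train_dataset), [metric], transformers))
print('Test set score:', model.evaluate_generator(data_generator(test_dataset), [metric], transformers))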
PyTorch
Before working on the PyTorch implementation, we must import a few crucial layers from the torch_models collection.
These are PyTorch implementations of GraphConv , GraphPool and GraphGather , which we used in the TensorFlow implementation as well.
import torch
import torch.nn as nn
from deepchem.models.torch_models.layers import GraphConv, GraphGather, GraphPool
PyTorch's GraphConv requires the number of input features to be specified, so we can extract that piece of information with the following steps:
sample_batch = next(data_generator(train_dataset))
node_features = sample_batch[0][0]
num_input_features = node_features.shape[1]
print(f"Number of input features: {num_input_features}")
class GraphConvModelTorch(nn.Module):
    def __init__(self):
        super(GraphConvModelTorch, self).__init__()
        # The layer definitions mirror the TensorFlow model above: two
        # GraphConv/BatchNorm/GraphPool blocks, a dense layer with a tanh
        # activation (act3), batch normalization, a GraphGather readout,
        # a final dense layer, a reshape to (n_tasks, 2) logits and a softmax.

    def forward(self, inputs):
        # ... the two graph convolution/pooling blocks produce gp2_output ...
        dense1_output = self.act3(self.dense1(gp2_output))
        batch_norm3_output = self.batch_norm3(dense1_output)
        readout_output = self.readout([batch_norm3_output] + inputs[1:])
        dense2_output = self.dense2(readout_output)
        logits_output = self.logits(dense2_output)
        softmax_output = self.softmax(logits_output)
        return softmax_output
Success! Both the models we've constructed behave nearly identically to GraphConvModel . If you're looking to build
your own custom models, you can follow the examples we've provided here to do so. We hope to see exciting
constructions from your end soon!
Colab
This tutorial and the rest in this sequence can be done in Google colab. If you'd like to open this notebook in colab, you
can use the following link.
Open in Colab
'2.6.0.dev'
Featurizers
In DeepChem, a method of featurizing a molecule (or any other sort of input) is defined by a Featurizer object. There
are three different ways of using featurizers.
1. When using the MoleculeNet loader functions, you simply pass the name of the featurization method to use. We
have seen examples of this in earlier tutorials, such as featurizer='ECFP' or featurizer='GraphConv' .
2. You also can create a Featurizer and directly apply it to molecules. For example:
import deepchem as dc
featurizer = dc.feat.CircularFingerprint()
print(featurizer(['CC', 'CCC', 'CCO']))
3. When creating a new dataset with the DataLoader framework, you can specify a Featurizer to use for processing the
data. We will see this in a future tutorial.
We use propane (CH3CH2CH3, represented by the SMILES string 'CCC' ) as a running example throughout this tutorial.
Many of the featurization methods use conformers of the molecules. A conformer can be generated using the
ConformerGenerator class in deepchem.utils.conformers .
RDKitDescriptors
RDKitDescriptors featurizes a molecule by using RDKit to compute values for a list of descriptors. These are basic
physical and chemical properties: molecular weight, polar surface area, numbers of hydrogen bond donors and
acceptors, etc. This is most useful for predicting things that depend on these high level properties rather than on
detailed molecular structure.
Intrinsic to the featurizer is a set of allowed descriptors, which can be accessed using
RDKitDescriptors.allowedDescriptors . The featurizer uses the descriptors in
rdkit.Chem.Descriptors.descList , checks if they are in the list of allowed descriptors, and computes the descriptor
value for the molecule.
Let's print the values of the first ten descriptors for propane.
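The code cell is omitted above; a sketch of it, assuming the featurizer exposes its descriptor names through a descriptors attribute:

rdkit_featurizer = dc.feat.RDKitDescriptors()
features = rdkit_featurizer(['CCC'])[0]
for name, value in list(zip(rdkit_featurizer.descriptors, features))[:10]:
    print(name, value)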
MaxAbsEStateIndex 2.125
MaxEStateIndex 2.125
MinAbsEStateIndex 1.25
MinEStateIndex 1.25
qed 0.3854706587740357
SPS 6.0
MolWt 44.097
HeavyAtomMolWt 36.033
ExactMolWt 44.062600255999996
NumValenceElectrons 20.0
DeepChem supports lots of different graph based models. Some of them require molecules to be featurized in slightly
different ways. Because of this, there are two other featurizers called WeaveFeaturizer and
MolGraphConvFeaturizer . They each convert molecules into a different type of Python object that is used by
particular models. When using any graph based model, just check the documentation to see what featurizer you need to
use with it.
CoulombMatrix
All the models we have looked at so far consider only the intrinsic properties of a molecule: the list of atoms that
compose it and the bonds connecting them. When working with flexible molecules, you may also want to consider the
different conformations the molecule can take on. For example, when a drug molecule binds to a protein, the strength of
the binding depends on specific interactions between pairs of atoms. To predict binding strength, you probably want to
consider a variety of possible conformations and use a model that takes them into account when making predictions.
The Coulomb matrix is one popular featurization for molecular conformations. Recall that the electrostatic Coulomb interaction between two charges q_i and q_j separated by a distance r_ij is proportional to q_i*q_j/r_ij, where q_i and q_j are the charges and r_ij is the distance between them. The Coulomb matrix of a molecule with N atoms is an N by N matrix where each element gives the strength of the electrostatic interaction between two atoms. It contains information both about the charges on the atoms and the distances between them. More information on the functional forms used can be found here.
To apply this featurizer, we first need a set of conformations for the molecule. We can use the ConformerGenerator
class to do this. It takes a RDKit molecule, generates a set of energy minimized conformers, and prunes the set to only
include ones that are significantly different from each other. Let's try running it for propane.
from rdkit import Chem

generator = dc.utils.ConformerGenerator(max_conformers=5)
propane_mol = generator.generate_conformers(Chem.MolFromSmiles('CCC'))
print("Number of available conformers for propane: ", len(propane_mol.GetConformers()))
Number of available conformers for propane: 1
It only found a single conformer. This shouldn't be surprising, since propane is a very small molecule with hardly any
flexibility. Let's try adding another carbon.
butane_mol = generator.generate_conformers(Chem.MolFromSmiles('CCCC'))
print("Number of available conformers for butane: ", len(butane_mol.GetConformers()))
coulomb_mat = dc.feat.CoulombMatrix(max_atoms=20)
features = coulomb_mat(propane_mol)
print(features)
[[[36.8581052 12.48684429 7.5619687 2.85945193 2.85804514
2.85804556 1.4674015 1.46740144 0.91279491 1.14239698
1.14239675 0. 0. 0. 0.
0. 0. 0. 0. 0. ]
[12.48684429 36.8581052 12.48684388 1.46551218 1.45850736
1.45850732 2.85689525 2.85689538 1.4655122 1.4585072
1.4585072 0. 0. 0. 0.
0. 0. 0. 0. 0. ]
[ 7.5619687 12.48684388 36.8581052 0.9127949 1.14239695
1.14239692 1.46740146 1.46740145 2.85945178 2.85804504
2.85804493 0. 0. 0. 0.
0. 0. 0. 0. 0. ]
[ 2.85945193 1.46551218 0.9127949 0.5 0.29325367
0.29325369 0.21256978 0.21256978 0.12268391 0.13960187
0.13960185 0. 0. 0. 0.
0. 0. 0. 0. 0. ]
[ 2.85804514 1.45850736 1.14239695 0.29325367 0.5
0.29200271 0.17113413 0.21092513 0.13960186 0.1680002
0.20540029 0. 0. 0. 0.
0. 0. 0. 0. 0. ]
[ 2.85804556 1.45850732 1.14239692 0.29325369 0.29200271
0.5 0.21092513 0.17113413 0.13960187 0.20540032
0.16800016 0. 0. 0. 0.
0. 0. 0. 0. 0. ]
[ 1.4674015 2.85689525 1.46740146 0.21256978 0.17113413
0.21092513 0.5 0.29351308 0.21256981 0.2109251
0.17113412 0. 0. 0. 0.
0. 0. 0. 0. 0. ]
[ 1.46740144 2.85689538 1.46740145 0.21256978 0.21092513
0.17113413 0.29351308 0.5 0.21256977 0.17113412
0.21092513 0. 0. 0. 0.
0. 0. 0. 0. 0. ]
[ 0.91279491 1.4655122 2.85945178 0.12268391 0.13960186
0.13960187 0.21256981 0.21256977 0.5 0.29325366
0.29325365 0. 0. 0. 0.
0. 0. 0. 0. 0. ]
[ 1.14239698 1.4585072 2.85804504 0.13960187 0.1680002
0.20540032 0.2109251 0.17113412 0.29325366 0.5
0.29200266 0. 0. 0. 0.
0. 0. 0. 0. 0. ]
[ 1.14239675 1.4585072 2.85804493 0.13960185 0.20540029
0.16800016 0.17113412 0.21092513 0.29325365 0.29200266
0.5 0. 0. 0. 0.
0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0.
0. 0. 0. 0. 0.
0. 0. 0. 0. 0.
0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0.
0. 0. 0. 0. 0.
0. 0. 0. 0. 0.
0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0.
0. 0. 0. 0. 0.
0. 0. 0. 0. 0.
0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0.
0. 0. 0. 0. 0.
0. 0. 0. 0. 0.
0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0.
0. 0. 0. 0. 0.
0. 0. 0. 0. 0.
0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0.
0. 0. 0. 0. 0.
0. 0. 0. 0. 0.
0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0.
0. 0. 0. 0. 0.
0. 0. 0. 0. 0.
0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0.
0. 0. 0. 0. 0.
0. 0. 0. 0. 0.
0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0.
0. 0. 0. 0. 0.
0. 0. 0. 0. 0.
0. 0. 0. 0. 0. ]]]
Notice that many elements are 0. To combine multiple molecules in a batch we need all the Coulomb matrices to be the
same size, even if the molecules have different numbers of atoms. We specified max_atoms=20 , so the returned matrix
has size (20, 20). The molecule only has 11 atoms, so only an 11 by 11 submatrix is nonzero.
CoulombMatrixEig
An important feature of Coulomb matrices is that they are invariant to molecular rotation and translation, since the
interatomic distances and atomic numbers do not change. Respecting symmetries like this makes learning easier.
Rotating a molecule does not change its physical properties. If the featurization does change, then the model is forced
to learn that rotations are not important, but if the featurization is invariant then the model gets this property
automatically.
Coulomb matrices are not invariant under another important symmetry: permutations of the atoms' indices. A
molecule's physical properties do not depend on which atom we call "atom 1", but the Coulomb matrix does. To deal
with this, the CoulombMatrixEig featurizer was introduced, which uses the eigenvalue spectrum of the Coulomb
matrix and is invariant to random permutations of the atoms' indices. The disadvantage of this featurization is that it
contains much less information ($N$ eigenvalues instead of an $N \times N$ matrix).
CoulombMatrixEig inherits from CoulombMatrix and featurizes a molecule by first computing the Coulomb matrices
for different conformers of the molecule and then computing the eigenvalues for each Coulomb matrix. These
eigenvalues are then padded to account for variation in number of atoms across molecules.
coulomb_mat_eig = dc.feat.CoulombMatrixEig(max_atoms=20)
features = coulomb_mat_eig(propane_mol)
print(features)
SMILES Tokenization
To prepare SMILES strings for a sequence model, we break them down into lists of substrings (called tokens) and turn
them into lists of integer values (numericalization). Sequence models use those integer values as indices of an
embedding matrix, which contains a vector of floating-point numbers for each token in the vocabulary. These
embedding vectors are updated during model training. This process allows the sequence model to learn its own
representations of the molecular properties implicit in the training data.
We will use DeepChem's BasicSmilesTokenizer and the Tox21 dataset from MoleculeNet to demonstrate the process
of tokenizing SMILES.
import numpy as np
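A minimal sketch of the loading step that produced the dataset shown below (the splitter is left at the loader's default; that choice is an assumption):
import deepchem as dc

tasks, datasets, transformers = dc.molnet.load_tox21(featurizer='Raw')
train_dataset, valid_dataset, test_dataset = datasets
train_dataset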
<DiskDataset X.shape: (6264,), y.shape: (6264, 12), w.shape: (6264, 12), task_names: ['NR-AR' 'NR-AR-LBD' 'NR-Ah
R' ... 'SR-HSE' 'SR-MMP' 'SR-p53']>
We loaded the datasets with featurizer="Raw" . Now we obtain the SMILES from their ids attributes.
train_smiles = train_dataset.ids
valid_smiles = valid_dataset.ids
test_smiles = test_dataset.ids
print(train_smiles[:5])
['CC(O)(P(=O)(O)O)P(=O)(O)O' 'CC(C)(C)OOC(C)(C)CCC(C)(C)OOC(C)(C)C'
'OC[C@H](O)[C@@H](O)[C@H](O)CO'
'CCCCCCCC(=O)[O-].CCCCCCCC(=O)[O-].[Zn+2]' 'CC(C)COC(=O)C(C)C']
Next we define our tokenizer and map it onto all our data to convert the SMILES strings into lists of tokens. The
BasicSmilesTokenizer breaks down SMILES roughly at atom level.
tokenizer = dc.feat.smiles_tokenizer.BasicSmilesTokenizer()
train_tok = list(map(tokenizer.tokenize, train_smiles))
valid_tok = list(map(tokenizer.tokenize, valid_smiles))
test_tok = list(map(tokenizer.tokenize, test_smiles))
print(train_tok[0])
len(train_tok)
['C', 'C', '(', 'O', ')', '(', 'P', '(', '=', 'O', ')', '(', 'O', ')', 'O', ')', 'P', '(', '=', 'O', ')', '(', '
O', ')', 'O']
6264
Now we have tokenized versions of all SMILES strings in our dataset. To convert those into lists of integer values we first
need to create a list of all possible tokens in our dataset. That list is called the vocabulary. We also add the empty string
"" to our vocabulary in order to correctly handle trailing zeros when decoding zero-padded numericalized SMILES.
['', '#', '(', ')', '-', '.', '/', '1', '2', '3', '4', '5'] ... ['[n+]', '[n-]', '[nH+]', '[nH]', '[o+]', '[s+]'
, '[se]', '\\', 'c', 'n', 'o', 's']
128
To numericalize tokenized SMILES strings we create a str2int dictionary which assigns a number to each token in the
dictionary. We also create the reverse int2str dictionary and define the corresponding encode and decode
functions. Finally we map the encode function on the tokenized data to obtain numericalized SMILES data.
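A sketch of those dictionaries and helper functions (the names train_num, valid_num and test_num are introduced here for the numericalized splits):
str2int = {token: i for i, token in enumerate(vocab)}
int2str = {i: token for i, token in enumerate(vocab)}

def encode(tokens):
    return [str2int[token] for token in tokens]

def decode(indices):
    return ''.join(int2str[i] for i in indices)

train_num = list(map(encode, train_tok))
valid_num = list(map(encode, valid_tok))
test_num = list(map(encode, test_tok))

# round trip: SMILES -> tokens -> integers -> SMILES -> integers
print(train_smiles[0])
print(train_num[0])
print(decode(train_num[0]))
print(encode(tokenizer.tokenize(decode(train_num[0]))))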
CC(O)(P(=O)(O)O)P(=O)(O)O
[19, 19, 2, 24, 3, 2, 25, 2, 16, 24, 3, 2, 24, 3, 24, 3, 25, 2, 16, 24, 3, 2, 24, 3, 24]
CC(O)(P(=O)(O)O)P(=O)(O)O
[19, 19, 2, 24, 3, 2, 25, 2, 16, 24, 3, 2, 24, 3, 24, 3, 25, 2, 16, 24, 3, 2, 24, 3, 24]
Lastly, we would like to combine all molecules in a dataset into an np.array so they can be served to a model in
batches. To achieve that, all sequences have to be of the same length. As in the CoulombMatrix section, we achieve that
by appending zeros up to a fixed length.
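A sketch of finding that fixed length across the numericalized splits defined above:
max_len = max(len(seq) for seq in train_num + valid_num + test_num)
print(max_len)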
240
The longest sequence across all Tox21 datasets has length 240 , so we use that as our fixed length. We create a
zero_pad function, map it to all numericalized SMILES, and turn them into np.array s.
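A sketch of the padding step (the index chosen for the round-trip check is arbitrary):
def zero_pad(encoded, max_len=max_len):
    return encoded + [0] * (max_len - len(encoded))

train_X = np.array([zero_pad(seq) for seq in train_num])
valid_X = np.array([zero_pad(seq) for seq in valid_num])
test_X = np.array([zero_pad(seq) for seq in test_num])

# decoding a padded sequence should give back the original SMILES,
# because index 0 maps to the empty string
print(train_smiles[42])
print(decode(zero_pad(train_num[42])))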
Cc1cc(C(C)(C)c2ccc(O)c(C)c2)ccc1O
Cc1cc(C(C)(C)c2ccc(O)c(C)c2)ccc1O
The padded data passes the test. It is now in the correct format to be used for training of a sequence model, but it
doesn't yet interface nicely with DeepChem's training framework. To change that, we define a tokenize_smiles
function that combines all the steps spelled out above to process a single datapoint. Additionally, we define a
SmilesFeaturizer that uses our custom tokenize_smiles function in its _featurize method and instantiate it as
smiles_featurizer , passing it our vocab and max_len .
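A sketch of such a function, reusing the tokenizer and zero-padding logic from above (the signature matches how the featurizer below calls it):
def tokenize_smiles(smiles, vocab, max_len):
    # tokenize, numericalize against the vocabulary, and zero-pad one SMILES string
    str2int = {token: i for i, token in enumerate(vocab)}
    encoded = [str2int[token] for token in tokenizer.tokenize(smiles)]
    return np.array(encoded + [0] * (max_len - len(encoded)))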
class SmilesFeaturizer(dc.feat.Featurizer):
    def __init__(self, feat_func, vocab, max_len):
        self.feat_func = feat_func
        self.vocab = vocab
        self.max_len = max_len

    def _featurize(self, datapoint, **kwargs):
        # apply the tokenize/numericalize/pad pipeline to one SMILES string
        return self.feat_func(datapoint, self.vocab, self.max_len)

smiles_featurizer = SmilesFeaturizer(tokenize_smiles, vocab, max_len)
Finally, we use the smiles_featurizer to create new Tox21 datasets that contain tokenized and numericalized
SMILES in their X attribute.
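A sketch of that final step, assuming the MoleculeNet loader accepts a Featurizer instance directly (recent DeepChem releases do):
tasks, datasets, transformers = dc.molnet.load_tox21(featurizer=smiles_featurizer)
train_dataset, valid_dataset, test_dataset = datasets
train_dataset.X.shape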
The datasets are now ready to be used with your custom DeepChem sequence model. Don't forget to wrap your model
into the appropriate DeepChem model class.
@manual{Intro7,
title={Going Deeper on Molecular Featurizations},
organization={DeepChem},
author={Ramsundar, Bharath},
howpublished = {\url{https://github.com/deepchem/deepchem/blob/master/examples/tutorials/Going_Deeper_on_Molecular_F
year={2021},
}
Working With Splitters
When using machine learning, you typically divide your data into training, validation, and test sets. The MoleculeNet
loaders do this automatically. But how should you divide up the data? This question seems simple at first, but it turns
out to be quite complicated. There are many ways of splitting up data, and which one you choose can have a big impact
on the reliability of your results. This tutorial introduces some of the splitting methods provided by DeepChem.
Colab
This tutorial and the rest in this sequence can be done in Google colab. If you'd like to open this notebook in colab, you
can use the following link.
Open in Colab
Splitters
In DeepChem, a method of splitting samples into multiple datasets is defined by a Splitter object. Choosing an
appropriate method for your data is very important. Otherwise, your trained model may seem to work much better than
it really does.
Consider a typical drug development pipeline. You might begin by screening many thousands of molecules to see if they
bind to your target of interest. Once you find one that seems to work, you try to optimize it by testing thousands of
minor variations on it, looking for one that binds more strongly. Then perhaps you test it in animals and find it has
unacceptable toxicity, so you try more variations to fix the problems.
This has an important consequence for chemical datasets: they often include lots of molecules that are very similar to
each other. If you split the data into training and test sets in a naive way, the training set will include many molecules
that are very similar to the ones in the test set, even if they are not exactly identical. As a result, the model may do very
well on the test set, but then fail badly when you try to use it on other data that is less similar to the training data.
• General Splitters
○ RandomSplitter
○ RandomGroupSplitter
○ RandomStratifiedSplitter
○ SingletaskStratifiedSplitter
○ IndexSplitter
○ SpecifiedSplitter
○ TaskSplitter
• Molecular Splitters
○ ScaffoldSplitter
○ MolecularWeightSplitter
○ MaxMinSplitter
○ ButinaSplitter
○ FingerprintSplitter
RandomSplitter
This is one of the simplest splitters. It just selects samples for the training, validation, and test sets in a completely
random way.
Didn't we just say that's a bad idea? Well, it depends on your data. If every sample is truly independent of every other,
then this is just as good a way as any to split the data. There is no universally best choice of splitter. It all depends on
your particular dataset, and for some datasets this is a fine choice.
RandomStratifiedSplitter
Some datasets are very unbalanced: only a tiny fraction of all samples are positive. In that case, random splitting may
sometimes lead to the validation or test set having few or even no positive samples for some tasks. That makes it
unable to evaluate performance.
RandomStratifiedSplitter addresses this by dividing up the positive and negative samples evenly. If you ask for an
80/10/10 split, the validation and test sets will each contain not just 10% of the samples, but also 10% of the positive
samples for each task.
ScaffoldSplitter
This splitter tries to address the problem discussed above where many molecules are very similar to each other. It
identifies the scaffold that forms the core of each molecule, and ensures that all molecules with the same scaffold are
put into the same dataset. This is still not a perfect solution, since two molecules may have different scaffolds but be
very similar in other ways, but it usually is a large improvement over random splitting.
ButinaSplitter
This is another splitter that tries to address the problem of similar molecules. It clusters them based on their molecular
fingerprints, so that ones with similar fingerprints will tend to be in the same dataset. The time required by this splitting
algorithm scales as the square of the number of molecules, so it is mainly useful for small to medium sized datasets.
SpecifiedSplitter
This splitter leaves everything up to the user. You tell it exactly which samples to put in each dataset. This is useful
when you know in advance that a particular splitting is appropriate for your data.
An example is temporal splitting. Consider a research project where you are continually generating and testing new
molecules. As you gain more data, you periodically retrain your model on the steadily growing dataset, then use it to
predict results for other not yet tested molecules. A good way of validating whether this works is to pick a particular
cutoff date, train the model on all data you had at that time, and see how well it predicts other data that was generated
later.
TaskSplitter
Provides a simple interface for splitting datasets task-wise.
For some learning problems, the training and test datasets should have different tasks entirely. This is a different
paradigm from the usual Splitter, which ensures that the split datasets contain different data points rather than different
tasks. Splitting by task is useful in multitask learning and problem decomposition settings, for example to test how well
a model generalizes to tasks it was never trained on.
SingletaskStratifiedSplitter
Another way of splitting data, particularly for classification tasks with imbalanced class distributions, is the single-task
stratified splitter. It maintains the class distribution of the original dataset across the training, validation, and test sets.
This is crucial when working with imbalanced datasets where some classes may be under-represented.
FingerprintSplitter
This splitter divides data based on the Tanimoto similarity between ECFP4 fingerprints. Tanimoto similarity measures
the overlap between two fingerprints, and ECFP4 fingerprints (based on Morgan fingerprints) encode molecular
substructures for efficient comparison. The splitter tries to divide the data so that the molecules in each dataset are as
different as possible from the ones in the other datasets. This makes it a very stringent test of models: predicting the
test and validation sets may require extrapolating far outside the training data.
MolecularWeightSplitter
This splitter performs data splits based on molecular weight.
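The comparison below was produced by a loop of roughly the following shape. This is a sketch: the Tox21 dataset, ECFP features, MultitaskClassifier model, and number of training epochs are assumptions, and the splitter names accepted by the MoleculeNet loader can vary between DeepChem versions.
import deepchem as dc

metric = dc.metrics.Metric(dc.metrics.roc_auc_score)
for splitter in ['random', 'scaffold', 'butina', 'fingerprint']:
    tasks, datasets, transformers = dc.molnet.load_tox21(featurizer='ECFP', splitter=splitter)
    train_dataset, valid_dataset, test_dataset = datasets
    model = dc.models.MultitaskClassifier(n_tasks=len(tasks), n_features=1024)
    model.fit(train_dataset, nb_epoch=10)
    print('splitter:', splitter)
    print('training set score:', model.evaluate(train_dataset, [metric], transformers))
    print('test set score:', model.evaluate(test_dataset, [metric], transformers))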
splitter: random
training set score: {'roc_auc_score': 0.9554904185889012}
test set score: {'roc_auc_score': 0.7854105497196335}
splitter: scaffold
training set score: {'roc_auc_score': 0.958752269558084}
test set score: {'roc_auc_score': 0.6849149319233084}
splitter: butina
training set score: {'roc_auc_score': 0.9584914471889929}
test set score: {'roc_auc_score': 0.6061155305251504}
splitter: fingerprint
training set score: {'roc_auc_score': 0.954193849465875}
test set score: {'roc_auc_score': 0.6235667313881933}
All of them produce very similar performance on the training set, but the random splitter has much higher performance
on the test set. Scaffold splitting has a lower test set score, and Butina splitting is even lower. Does that mean random
splitting is better? No! It means random splitting doesn't give you an accurate measure of how well your model works.
Because the test set contains lots of molecules that are very similar to ones in the training set, it isn't truly independent.
It makes the model appear to work better than it really does. Scaffold splitting and Butina splitting give a better
indication of what you can expect on independent data in the future.
@manual{Intro8,
title={Working With Splitters},
organization={DeepChem},
author={Eastman, Peter and Mohapatra, Bibhusundar and Ramsundar, Bharath},
howpublished = {\url{https://github.com/deepchem/deepchem/blob/master/examples/tutorials/Working_With_Splitters.ipyn
year={2021},
}
Advanced Model Training
In the tutorials so far we have followed a simple procedure for training models: load a dataset, create a model, call
fit() , evaluate it, and call ourselves done. That's fine for an example, but in real machine learning projects the
process is usually more complicated. In this tutorial we will look at a more realistic workflow for training a model.
Colab
This tutorial and the rest in this sequence can be done in Google colab. If you'd like to open this notebook in colab, you
can use the following link.
Open in Colab
Setup
To run DeepChem within Colab, you'll need to run the following installation commands. You can of course run this
tutorial locally if you prefer. In that case, don't run these cells since they will download and install DeepChem on your
local machine again.
Hyperparameter Optimization
Let's start by loading the HIV dataset. It classifies over 40,000 molecules based on whether they inhibit HIV replication.
import deepchem as dc
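A minimal sketch of that loading step (the featurizer and splitter choices here are assumptions; the ECFP featurizer gives the 1024 features used below):
tasks, datasets, transformers = dc.molnet.load_hiv(featurizer='ECFP', splitter='scaffold')
train_dataset, valid_dataset, test_dataset = datasets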
Now let's train a model on it. We will use a MultitaskClassifier , which is just a stack of dense layers. But that still
leaves a lot of options. How many layers should there be, and how wide should each one be? What dropout rate should
we use? What learning rate?
These are called hyperparameters. The standard way to select them is to try lots of values, train each model on the
training set, and evaluate it on the validation set. This lets us see which ones work best.
You could do that by hand, but usually it's easier to let the computer do it for you. DeepChem provides a selection of
hyperparameter optimization algorithms, which are found in the dc.hyper package. For this example we'll use
GridHyperparamOpt , which is the most basic method. We just give it a list of options for each hyperparameter and it
exhaustively tries all combinations of them.
The lists of options are defined by a dict that we provide. For each of the model's arguments, we provide a list of
values to try. In this example we consider three possible sets of hidden layers: a single layer of width 500, a single layer
of width 1000, or two layers each of width 1000. We also consider two dropout rates (20% and 50%) and two learning
rates (0.001 and 0.0001).
params_dict = {
'n_tasks': [len(tasks)],
'n_features': [1024],
'layer_sizes': [[500], [1000], [1000, 1000]],
'dropouts': [0.2, 0.5],
'learning_rate': [0.001, 0.0001]
}
optimizer = dc.hyper.GridHyperparamOpt(dc.models.MultitaskClassifier)
metric = dc.metrics.Metric(dc.metrics.roc_auc_score)
best_model, best_hyperparams, all_results = optimizer.hyperparam_search(
params_dict, train_dataset, valid_dataset, metric, transformers)
hyperparam_search() returns three values: the best model it found, the hyperparameters for that model, and a
full listing of the validation score for every model. Let's take a look at the last one.
all_results
{'_dropouts_0.200000_layer_sizes[500]_learning_rate_0.001000_n_features_1024_n_tasks_1': 0.759624393738977,
'_dropouts_0.200000_layer_sizes[500]_learning_rate_0.000100_n_features_1024_n_tasks_1': 0.7680791323731138,
'_dropouts_0.500000_layer_sizes[500]_learning_rate_0.001000_n_features_1024_n_tasks_1': 0.7623870149911817,
'_dropouts_0.500000_layer_sizes[500]_learning_rate_0.000100_n_features_1024_n_tasks_1': 0.7552282358416618,
'_dropouts_0.200000_layer_sizes[1000]_learning_rate_0.001000_n_features_1024_n_tasks_1': 0.7689915858318636,
'_dropouts_0.200000_layer_sizes[1000]_learning_rate_0.000100_n_features_1024_n_tasks_1': 0.7619292572996277,
'_dropouts_0.500000_layer_sizes[1000]_learning_rate_0.001000_n_features_1024_n_tasks_1': 0.7641491524593376,
'_dropouts_0.500000_layer_sizes[1000]_learning_rate_0.000100_n_features_1024_n_tasks_1': 0.7609877155594749,
'_dropouts_0.200000_layer_sizes[1000, 1000]_learning_rate_0.001000_n_features_1024_n_tasks_1': 0.7707169802077
21,
'_dropouts_0.200000_layer_sizes[1000, 1000]_learning_rate_0.000100_n_features_1024_n_tasks_1': 0.7750327625906
329,
'_dropouts_0.500000_layer_sizes[1000, 1000]_learning_rate_0.001000_n_features_1024_n_tasks_1': 0.7259723140799
53,
'_dropouts_0.500000_layer_sizes[1000, 1000]_learning_rate_0.000100_n_features_1024_n_tasks_1': 0.7546280986674
505}
We can see a few general patterns. Using two layers with the larger learning rate doesn't work very well. It seems the
deeper model requires a smaller learning rate. We also see that 20% dropout usually works better than 50%. Once we
narrow down the list of models based on these observations, all the validation scores are very close to each other,
probably close enough that the remaining variation is mainly noise. It doesn't seem to make much difference which of
the remaining hyperparameter sets we use, so let's arbitrarily pick a single layer of width 1000 and learning rate of
0.0001.
Early Stopping
There is one other important hyperparameter we haven't considered yet: how long we train the model for.
GridHyperparamOpt trains each for a fixed, fairly small number of epochs. That isn't necessarily the best number.
You might expect that the longer you train, the better your model will get, but that isn't usually true. If you train too
long, the model will usually start overfitting to irrelevant details of the training set. You can tell when this happens
because the validation set score stops increasing and may even decrease, while the score on the training set continues
to improve.
Fortunately, we don't need to train lots of different models for different numbers of steps to identify the optimal number.
We just train it once, monitor the validation score, and keep whichever parameters maximize it. This is called "early
stopping". DeepChem's ValidationCallback class can do this for us automatically. In the example below, we have it
compute the validation set's ROC AUC every 1000 training steps. If you add the save_dir argument, it will also save a
copy of the best model parameters to disk.
model = dc.models.MultitaskClassifier(n_tasks=len(tasks),
n_features=1024,
layer_sizes=[1000],
dropouts=0.2,
learning_rate=0.0001)
callback = dc.models.ValidationCallback(valid_dataset, 1000, metric)
model.fit(train_dataset, nb_epoch=50, callbacks=callback)
@manual{Intro9,
title={Advanced Model Training},
organization={DeepChem},
author={Eastman, Peter and Ramsundar, Bharath},
howpublished = {\url{https://github.com/deepchem/deepchem/blob/master/examples/tutorials/Advanced_Model_Training.ipy
year={2021},
}
Creating a High Fidelity Dataset from Experimental Data
In this tutorial, we will look at what is involved in creating a new Dataset from experimental data. As we will see, the
mechanics of creating the Dataset object are only a small part of the process. Most real datasets need significant cleanup
and QA before they are suitable for training models.
Colab
This tutorial and the rest in this sequence are designed to be done in Google colab. If you'd like to open this notebook in
colab, you can use the following link.
Open in Colab
How do you transform this data into a dataset capable of creating a useful model?
Building models from novel data can present several challenges. Perhaps the data was not recorded in a convenient
manner. Additionally, perhaps the data contains noise. This is a common occurrence with, for example, biological assays
due to the large number of external variables and the difficulty and cost associated with collecting multiple samples.
This is a problem because you do not want your model to fit to this noise.
Parsing data
De-noising data
In this tutorial, we will walk through an example of curating a dataset from an Excel spreadsheet of experimental drug
measurements. Before we dive into this example though, let's do a brief review of DeepChem's input file handling and
featurization capabilities.
Input Formats
DeepChem supports a whole range of input files. For example, accepted input formats include .csv, .sdf, .fasta, .png, .tif
and other file formats. The loading for a particular file format is governed by the Loader class associated with that
format. For example, to load a .csv file we use the CSVLoader class. Here's an example of a .csv file that fits the
requirements of CSVLoader .
Here the "smiles" column contains the SMILES string, the "measured log solubility in mols per litre" contains the
experimental measurement, and "Compound ID" contains the unique compound identifier.
Data Featurization
Most machine learning algorithms require that input data form vectors. However, input data for drug-discovery datasets
routinely come in the form of lists of molecules and associated experimental readouts. To load the data, we use a
subclass of dc.data.DataLoader such as dc.data.CSVLoader or dc.data.SDFLoader . Users can subclass
dc.data.DataLoader to load arbitrary file formats. All loaders must be passed a dc.feat.Featurizer object, which
specifies how to transform molecules into vectors. DeepChem provides a number of different subclasses of
dc.feat.Featurizer .
Parsing data
In order to read in the data, we will use the pandas data analysis library.
In order to convert the drug names into smiles strings, we will use pubchempy. This isn't a standard DeepChem
dependency, but you can install this library with conda install pubchempy .
import os
import pandas as pd
from pubchempy import get_cids, get_compounds
Pandas is magic but it doesn't automatically know where to find your data of interest. You likely will have to look at it
first using a GUI.
import os
from IPython.display import Image, display
current_dir = os.path.dirname(os.path.realpath('__file__'))
data_screenshot = os.path.join(current_dir, 'assets/dataset_preparation_gui.png')
display(Image(filename=data_screenshot))
We see the data of interest is on the second sheet, and contained in columns "TA ID", "N #1 (%)", and "N #2 (%)".
Additionally, it appears much of this spreadsheet was formatted for human readability (multicolumn headers, column
labels with spaces and symbols, etc.). This makes the creation of a neat dataframe object harder. For this reason we will
cut everything that is unnecessary or inconvenient.
import deepchem as dc
dc.utils.download_url(
'https://github.com/deepchem/deepchem/raw/master/datasets/Positive%20Modulators%20Summary_%20918.TUC%20_%20v1.xls
current_dir,
'Positive Modulators Summary_ 918.TUC _ v1.xlsx'
)
1   TA ##   Position   TA ID                    Mean       SD        Threshold (%) = Mean + 4xSD   N #1 (%)   N #2 (%)
2   1       1-A02      Penicillin V Potassium   -12.8689   6.74705   14.1193                       -10.404    -18.1929
3   2       1-A03      Mycophenolate Mofetil    -12.8689   6.74705   14.1193                       -12.4453   -11.7175
Note that the actual row headers are stored in row 1 and not 0 above.
# reset the index so we keep the label but number from 0 again
raw_data.reset_index(inplace=True)
## rename columns
raw_data.columns = ['label', 'drug', 'n1', 'n2']
label drug n1 n2
Now, let's take the drug names and get smiles strings for them (format needed for DeepChem).
drugs = raw_data['drug'].values
For many of these, we can retrieve the smiles string via the canonical_smiles attribute of the get_compounds object
(using pubchempy )
get_compounds(drugs[1], 'name')
[Compound(5281078)]
get_compounds(drugs[1], 'name')[0].canonical_smiles
'CC1=C2COC(=O)C2=C(C(=C1OC)CC=C(C)CCC(=O)OCCN3CCOCC3)O'
However, some of these drug names contain variable spacing and symbols (·, (±), etc.), and names that may not be
readable by pubchempy.
For this task, we will do a bit of hacking via regular expressions. Also, we notice that all ions are written in a shortened
form that will need to be expanded. For this reason we use a dictionary, mapping the shortened ion names to versions
recognizable to pubchempy.
Unfortunately you may have several corner cases that will require more hacking.
import re
ion_replacements = {
'HBr': ' hydrobromide',
'2Br': ' dibromide',
'Br': ' bromide',
'HCl': ' hydrochloride',
'2H2O': ' dihydrate',
'H20': ' hydrate',
'Na': ' sodium'
}
def compound_to_smiles(cmpd):
    # strip irregular characters from the drug name
    compound = re.sub(r'([^\s\w]|_)+', '', cmpd)
    # expand shortened ion names, then look the cleaned name up on PubChem
    for ion, replacement in ion_replacements.items():
        compound = compound.replace(ion, replacement)
    smiles = get_compounds(compound, 'name')[0].canonical_smiles
    return smiles
Now let's actually convert all these compounds to smiles. This conversion will take a few minutes so might not be a bad
spot to go grab a coffee or tea and take a break while this is running! Note that this conversion will sometimes fail so
we've added some error handling to catch these cases below.
smiles_map = {}
for i, compound in enumerate(drugs):
try:
smiles_map[compound] = compound_to_smiles(compound)
except:
print("Errored on %s" % i)
continue
Errored on 162
Errored on 303
smiles_data = raw_data
# map drug name to smiles string
smiles_data['drug'] = smiles_data['drug'].apply(lambda x: smiles_map[x] if x in smiles_map else None)
label drug n1 n2
Hooray, we have mapped each drug name to its corresponding smiles code.
Now, we need to look at the data and remove as much noise as possible.
De-noising data
In machine learning, we know that there is no free lunch. You will need to spend time analyzing and understanding your
data in order to frame your problem and determine the appropriate model framework. Treatment of your data will
depend on the conclusions you gather from this process.
I would like to build a model capable of predicting the affinity of an arbitrary small molecule drug to a particular ion
channel protein
For an input drug, data describing channel inhibition
A few hundred drugs, with n=2
Will need to look more closely at the dataset*
Nothing on this particular protein
*This will involve plotting, so we will import matplotlib and seaborn (which you can install with conda install seaborn ).
We will also need to look at molecular structures, so we will import rdkit.
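The imports used by the plots and structure drawings below (a standard set; nothing here is DeepChem-specific):
import matplotlib.pyplot as plt
import seaborn as sns
from rdkit import Chem
from rdkit.Chem import Draw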
Our goal is to build a small molecule model, so let's make sure our molecules are all small. This can be approximated by
the length of each smiles string.
Some of these look rather large, len(smiles) > 150. Let's see what they look like.
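A sketch of pulling out those long molecules (the 150-character cutoff comes from the text above, and smiles_data['drug'] holds the SMILES strings):
long_smiles = [s for s in smiles_data['drug'] if s is not None and len(s) > 150]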
# look
Draw.MolsToGridImage([Chem.MolFromSmiles(i) for i in long_smiles], molsPerRow=6)
As suspected, these are not small molecules, so we will remove them from the dataset. The argument here is that these
molecules could register as inhibitors simply because they are large. They are more likely to sterically block the
channel, rather than diffuse inside and bind (which is what we are interested in).
The lesson here is to remove data that does not fit your use case.
n1 n2
62 NaN -7.8266
df = smiles_data.dropna(axis=0, how='any')
# seaborn jointplot will allow us to compare n1 and n2, and plot each marginal
sns.jointplot(x='n1', y='n2', data=smiles_data)
<seaborn.axisgrid.JointGrid at 0x14c4e37d0>
We see that most of the data is contained in the gaussian-ish blob centered a bit below zero. We see that there are a
few clearly active datapoints located in the bottom left, and one on the top right. These are all distinguished from the
majority of the data. How do we handle the data in the blob?
Because n1 and n2 represent the same measurement, ideally they would be of the same value. This plot should be
tightly aligned to the diagonal, and the pearson correlation coefficient should be 1. We see this is not the case. This
helps give us an idea of the error of our assay.
Let's look at the error more closely, plotting the distribution of (n1-n2).
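A sketch of the difference series and a 95% interval estimate for it (treating the noise as roughly normal is an assumption; ci_95 is the name used in the text below):
diff_df = df['n1'] - df['n2']
# approximate 95% interval on the replicate difference, assuming roughly normal noise
ci_95 = 1.96 * diff_df.std()
print(ci_95)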
sns.histplot(diff_df)
plt.xlabel('difference in n')
plt.ylabel('probability')
17.75387954711914
Now, I don't trust the data outside of the confidence interval, and will therefore drop these datapoints from df.
For example, in the plot above, at least one datapoint has n1-n2 > 60. This is disconcerting.
<seaborn.axisgrid.JointGrid at 0x15a363c10>
So, let's average n1 and n2, and take the error bar to be ci_95.
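A sketch of that filtering and averaging step (the column names follow the renaming above; the exact filter is an assumption):
import numpy as np

keep = np.abs(df['n1'] - df['n2']) < ci_95
avg_df = df.loc[keep, ['label', 'drug']].copy()
avg_df['n'] = df.loc[keep, ['n1', 'n2']].mean(axis=1)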
In my case, this required domain knowledge. Having worked in this area, and having consulted with professors
specializing on this channel, I am interested in compounds where the absolute value of the activity is greater than 25.
This relates to the desired drug potency we would like to model.
If you are not certain how to draw the line between active and inactive, this cutoff could potentially be treated as a
hyperparameter.
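With that cutoff, the actives can be selected in one line (a sketch using the averaged column defined above):
actives = avg_df[np.abs(avg_df['n']) > 25]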
# summary
print (raw_data.shape, avg_df.shape, len(actives.index))
(430, 5) (392, 3) 6
In summary, we have:
Removed data that did not address the question we hope to answer (small molecules only)
Dropped NaNs
Determined the noise of our measurements
Removed exceptionally noisy datapoints
Identified actives (using domain knowledge to determine a threshold)
Given that we have 392 datapoints and 6 actives, this data will be used to build a low data one-shot classifier
(10.1021/acscentsci.6b00367). If there were datasets of similar character, transfer learning could potentially be used,
but this is not the case at the moment.
Let's apply logic to our dataframe in order to cast it into a binary format, suitable for classification.
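A sketch of that binarization, reusing the same activity cutoff:
avg_df['active'] = (np.abs(avg_df['n']) > 25).astype(int)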
avg_df.to_csv('modulators.csv', index=False)
Lastly, it is often advantageous to numerically transform the data in some way. For example, sometimes it is useful to
normalize the data, or to zero the mean. This depends on the task at hand.
Built into DeepChem are many useful transformers, located in the deepchem.transformers.transformers base class.
Because this is a classification model, and the number of actives is low, I will apply a balancing transformer. I treated
this transformer as a hyperparameter when I began training models. It proved to unambiguously improve model
performance.
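A sketch of building the DeepChem dataset from the saved CSV before balancing it (the task name 'active' and the fingerprint size are assumptions):
loader = dc.data.CSVLoader(tasks=['active'], feature_field='drug',
                           featurizer=dc.feat.CircularFingerprint(size=1024))
dataset = loader.create_dataset('modulators.csv')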
transformer = dc.trans.BalancingTransformer(dataset=dataset)
dataset = transformer.transform(dataset)
Now let's save the balanced dataset object to disk, and then reload it as a sanity check.
dc.utils.save_to_disk(dataset, 'balanced_dataset.joblib')
balanced_dataset = dc.utils.load_from_disk('balanced_dataset.joblib')
@manual{Intro10,
title={Creating a high fidelity model from experimental data},
organization={DeepChem},
author={Eastman, Peter and Ramsundar, Bharath},
howpublished = {\url{https://github.com/deepchem/deepchem/tree/master/examples/tutorials}},
year={2021},
}
Putting Multitask Learning to Work
This notebook walks through the creation of multitask models on MUV [1]. The goal is to demonstrate how multitask
methods can provide improved performance in situations with little or very unbalanced data.
Colab
This tutorial and the rest in this sequence are designed to be done in Google colab. If you'd like to open this notebook in
colab, you can use the following link.
Open in Colab
The MUV dataset is a challenging benchmark in molecular design that consists of 17 different "targets" where there are
only a few "active" compounds per target. There are 93,087 compounds in total, yet no task has more than 30 active
compounds, and many have even fewer. Training a model with such a small number of positive examples is very
challenging. Multitask models address this by training a single model that predicts all the different targets at once. If a
feature is useful for predicting one task, it often is useful for predicting several other tasks as well. Each added task
makes it easier to learn important features, which improves performance on other tasks [2].
To get started, let's load the MUV dataset. The MoleculeNet loader function automatically splits it into training,
validation, and test sets. Because there are so few positive examples, we use stratified splitting to ensure the test set
has enough of them to evaluate.
import deepchem as dc
import numpy as np
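A minimal sketch of that loading step (the splitter keyword and the names it accepts vary between DeepChem releases, so adjust for your version):
tasks, datasets, transformers = dc.molnet.load_muv(splitter='stratified')
train_dataset, valid_dataset, test_dataset = datasets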
Now let's train a model on it. We'll use a MultitaskClassifier, which is a simple stack of fully connected layers.
n_tasks = len(tasks)
n_features = train_dataset.get_data_shape()[0]
model = dc.models.MultitaskClassifier(n_tasks, n_features)
model.fit(train_dataset)
0.0004961589723825455
Let's see how well it does on the test set. We loop over the 17 tasks and compute the ROC AUC for each one.
y_true = test_dataset.y
y_pred = model.predict(test_dataset)
metric = dc.metrics.roc_auc_score
for i in range(n_tasks):
score = metric(dc.metrics.to_one_hot(y_true[:,i]), y_pred[:,i])
print(tasks[i], score)
MUV-466 0.9207684040838259
MUV-548 0.7480655561526062
MUV-600 0.9927995701235895
MUV-644 0.9974207415368082
MUV-652 0.7823481998925309
MUV-689 0.6636843990686011
MUV-692 0.6319093677234462
MUV-712 0.7787838079885365
MUV-713 0.7910711087229088
MUV-733 0.4401307540748701
MUV-737 0.34679383843811573
MUV-810 0.9564571019165323
MUV-832 0.9991044241447251
MUV-846 0.7519881783987103
MUV-852 0.8516747268493642
MUV-858 0.5906591438294824
MUV-859 0.5962954008166774
Not bad! Recall that random guessing would produce a ROC AUC score of 0.5, and a perfect predictor would score 1.0.
Most of the tasks did much better than random guessing, and many of them are above 0.9.
Congratulations! Time to join the Community!
Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue
working with DeepChem, we encourage you to finish the rest of the tutorials in this series. You can also help the
DeepChem community in the following ways:
Bibliography
[1] https://pubs.acs.org/doi/10.1021/ci8002649
[2] https://pubs.acs.org/doi/abs/10.1021/acs.jcim.7b00146
Tutorial Part 13: Modeling Protein-Ligand Interactions
By Nathan C. Frey | Twitter and Bharath Ramsundar | Twitter
In this tutorial, we'll walk you through the use of machine learning and molecular docking methods to predict the
binding energy of a protein-ligand complex. Recall that a ligand is some small molecule which interacts (usually non-
covalently) with a protein. Molecular docking performs geometric calculations to find a “binding pose” with a small
molecule interacting with a protein in a suitable binding pocket (that is, a region on the protein which has a groove in
which the small molecule can rest).
The structure of proteins can be determined experimentally with techniques like Cryo-EM or X-ray crystallography. This
can be a powerful tool for structure-based drug discovery. For more info on docking, read the AutoDock Vina paper and
the deepchem.dock documentation. There are many graphical user and command line interfaces (like AutoDock) for
performing molecular docking. Here, we show how docking can be performed programmatically with DeepChem, which
enables automation and easy integration with machine learning pipelines.
To start the tutorial, we'll use a simple pre-processed dataset file that comes in the form of a gzipped file. Each row is a
molecular system, and each column represents a different piece of information about that system. For instance, in this
example, every row reflects a protein-ligand complex, and the following columns are present: a unique complex
identifier; the SMILES string of the ligand; the binding affinity (Ki) of the ligand to the protein in the complex; a Python
list of all lines in a PDB file for the protein alone; and a Python list of all lines in a ligand file for the ligand alone.
Colab
This tutorial and the rest in this sequence are designed to be done in Google colab. If you'd like to open this notebook in
colab, you can use the following link.
Open in Colab
Setup
To run DeepChem within Colab, you'll need to run the following cell of installation commands. This will take about 5
minutes to run to completion and install your environment.
import os
import numpy as np
import pandas as pd
import tempfile
import deepchem as dc
from deepchem.utils import download_url, load_from_disk
Skipped loading modules with pytorch-geometric dependency, missing a dependency. No module named 'torch_geometri
c'
Skipped loading modules with pytorch-geometric dependency, missing a dependency. cannot import name 'DMPNN' from
'deepchem.models.torch_models' (/usr/local/lib/python3.10/site-packages/deepchem/models/torch_models/__init__.py
)
Skipped loading modules with pytorch-lightning dependency, missing a dependency. No module named 'pytorch_lightn
ing'
Skipped loading some Jax models, missing a dependency. No module named 'haiku'
To illustrate the docking procedure, here we'll use a csv that contains SMILES strings of ligands as well as PDB files for
the ligand and protein targets from PDBbind. Later, we'll use the labels to train a model to predict binding affinities.
We'll also show how to download and featurize PDBbind to train a model from scratch.
data_dir = dc.utils.get_data_dir()
dataset_file = os.path.join(data_dir, "pdbbind_core_df.csv.gz")
if not os.path.exists(dataset_file):
print('File does not exist. Downloading file...')
download_url("https://s3-us-west-1.amazonaws.com/deepchem.io/datasets/pdbbind_core_df.csv.gz")
print('File downloaded...')
raw_dataset = load_from_disk(dataset_file)
raw_dataset = raw_dataset[['pdb_id', 'smiles', 'label']]
raw_dataset.head(2)
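Before fixing a structure we need the PDB ID and ligand SMILES for one complex, plus the fixer utilities. A sketch (the import paths are assumptions; the deprecation warnings in the output come from an older prepare_inputs location):
from pdbfixer import PDBFixer
from openmm.app import PDBFile
from deepchem.utils.docking_utils import prepare_inputs

# work with a single complex; '3cyx' is the one shown in the output below
pdbid = '3cyx'
ligand = raw_dataset.loc[raw_dataset['pdb_id'] == pdbid, 'smiles'].values[0]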
%%time
fixer = PDBFixer(pdbid=pdbid)
PDBFile.writeFile(fixer.topology, fixer.positions, open('%s.pdb' % (pdbid), 'w'))
p, m = None, None
# fix protein, optimize ligand geometry, and sanitize molecules
try:
p, m = prepare_inputs('%s.pdb' % (pdbid), ligand)
except:
print('%s failed PDB fixing' % (pdbid))
<timed exec>:7: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
Warning: importing 'simtk.openmm' is deprecated. Import 'openmm' instead.
3cyx 1510
CPU times: user 2.04 s, sys: 157 ms, total: 2.2 s
Wall time: 4.32 s
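The fixed protein and ligand then get written back out as PDB files so they can be visualized and docked below (a sketch; p and m are the outputs of prepare_inputs above):
from rdkit import Chem

if p is not None and m is not None:
    # save the sanitized protein and ligand for nglview and AutoDock Vina
    Chem.rdmolfiles.MolToPDBFile(p, '%s.pdb' % (pdbid))
    Chem.rdmolfiles.MolToPDBFile(m, 'ligand_%s.pdb' % (pdbid))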
Visualization
If you're outside of Colab, you can expand these cells and use MDTraj and nglview to visualize proteins and ligands.
import mdtraj as md
import nglview
Let's take a look at the first protein ligand pair in our dataset:
protein_mdtraj = md.load_pdb('3cyx.pdb')
ligand_mdtraj = md.load_pdb('ligand_3cyx.pdb')
We'll use the convenience function nglview.show_mdtraj in order to view our proteins and ligands. Note that this will
only work if you uncommented the above cell, installed nglview, and enabled the necessary notebook extensions.
v = nglview.show_mdtraj(ligand_mdtraj)
NGLWidget()
Now that we have an idea of what the ligand looks like, let's take a look at our protein:
view = nglview.show_mdtraj(protein_mdtraj)
display(view) # interactive view outside Colab
NGLWidget()
Molecular Docking
Ok, now that we've got our data and basic visualization tools up and running, let's see if we can use molecular docking
to estimate the binding affinities between our protein ligand systems.
There are three steps to setting up a docking job, and you should experiment with different settings. The three things
we need to specify are 1) how to identify binding pockets in the target protein; 2) how to generate poses (geometric
configurations) of a ligand in a binding pocket; and 3) how to "score" a pose. Remember, our goal is to identify
candidate ligands that strongly interact with a target protein, which is reflected by the score.
DeepChem has a simple built-in method for identifying binding pockets in proteins. It is based on the convex hull
method. The method works by creating a 3D polyhedron (convex hull) around a protein structure and identifying the
surface atoms of the protein as the ones closest to the convex hull. Some biochemical properties are considered, so the
method is not purely geometrical. It has the advantage of having a low computational cost and is good enough for our
purposes.
finder = dc.dock.binding_pocket.ConvexHullPocketFinder()
pockets = finder.find_pockets('3cyx.pdb')
len(pockets) # number of identified pockets
36
Pose generation is quite complex. Luckily, using DeepChem's pose generator will install the AutoDock Vina engine under
the hood, allowing us to get up and running generating poses quickly.
vpg = dc.dock.pose_generation.VinaPoseGenerator()
We could specify a pose scoring function from deepchem.dock.pose_scoring , which includes things like repulsive and
hydrophobic interactions and hydrogen bonding. Vina will take care of this, so instead we'll allow Vina to compute scores
for poses.
!mkdir -p vina_test
%%time
complexes, scores = vpg.generate_poses(molecular_complex=('3cyx.pdb', 'ligand_3cyx.pdb'), # protein-ligand files for
out_dir='vina_test',
generate_scores=True
)
CPU times: user 41min 4s, sys: 21.9 s, total: 41min 26s
Wall time: 28min 32s
/usr/local/lib/python3.10/site-packages/vina/vina.py:260: DeprecationWarning: `np.int` is a deprecated alias for
the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is
safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If yo
u wish to review your current use, check the release note link for additional information.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#dep
recations
self._voxels = np.ceil(np.array(box_size) / self._spacing).astype(np.int)
We used the default value for num_modes when generating poses, so Vina will return the 9 lowest energy poses it found
in units of kcal/mol .
scores
Can we view the complex with both protein and ligand? Yes, but we'll need to combine the molecules into a single RDKit
molecule.
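One way to do that with RDKit (a sketch; complexes comes from generate_poses above and holds (protein, ligand) pairs):
from rdkit import Chem

# merge the first docked ligand pose and the protein into one molecule for viewing
complex_mol = Chem.CombineMols(complexes[0][0], complexes[0][1])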
Let's now visualize our complex. We can see that the ligand slots into a pocket of the protein.
v = nglview.show_rdkit(complex_mol)
display(v)
NGLWidget()
Now that we understand each piece of the process, we can put it all together using DeepChem's Docker class. Docker
creates a generator that yields tuples of posed complexes and docking scores.
docker = dc.dock.docking.Docker(pose_generator=vpg)
posed_complex, score = next(docker.dock(molecular_complex=('3cyx.pdb', 'ligand_3cyx.pdb'),
use_pose_generator_scores=True))
Next, we'll need a way to transform our protein-ligand complexes into representations which can be used by learning
algorithms. Ideally, we'd have neural protein-ligand complex fingerprints, but DeepChem doesn't yet have a good
learned fingerprint of this sort. We do however have well-tuned manual featurizers that can help us with our challenge
here.
We'll make use of two types of fingerprints in the rest of the tutorial, the CircularFingerprint and
ContactCircularFingerprint . DeepChem also has voxelizers and grid descriptors that convert a 3D volume
containing an arrangement of atoms into a fingerprint. These featurizers are really useful for understanding protein-ligand
complexes since they allow us to translate complexes into vectors that can be passed into a simple machine learning
algorithm. First, we'll create circular fingerprints. These convert small molecules into a vector of fragments.
pdbids = raw_dataset['pdb_id'].values
ligand_smiles = raw_dataset['smiles'].values
%%time
for (pdbid, ligand) in zip(pdbids, ligand_smiles):
fixer = PDBFixer(url='https://files.rcsb.org/download/%s.pdb' % (pdbid))
PDBFile.writeFile(fixer.topology, fixer.positions, open('%s.pdb' % (pdbid), 'w'))
p, m = None, None
# skip pdb fixing for speed
try:
p, m = prepare_inputs('%s.pdb' % (pdbid), ligand, replace_nonstandard_residues=False,
remove_heterogens=False, remove_water=False,
add_hydrogens=False)
except:
print('%s failed sanitization' % (pdbid))
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding function in deepchem.utils.docking_utils.
[15:11:45] UFFTYPER: Unrecognized atom type: S_5+4 (7)
3cyx failed sanitization
[The DeprecationWarning above is printed once per complex; the distinct RDKit warnings were:]
[15:12:02] UFFTYPER: Warning: hybridization set to SP3 for atom 17
[15:12:04] UFFTYPER: Warning: hybridization set to SP3 for atom 6
[15:12:06] UFFTYPER: Warning: hybridization set to SP3 for atom 1
[15:12:06] UFFTYPER: Unrecognized atom type: S_5+4 (21)
[15:12:23] UFFTYPER: Warning: hybridization set to SP3 for atom 20
[15:12:31] UFFTYPER: Warning: hybridization set to SP3 for atom 19
[15:12:35] UFFTYPER: Warning: hybridization set to SP3 for atom 29
[15:13:03] UFFTYPER: Unrecognized atom type: S_5+4 (39)
[15:13:37] UFFTYPER: Warning: hybridization set to SP3 for atom 33
[15:14:01] UFFTYPER: Unrecognized atom type: S_5+4 (11)
[15:14:02] UFFTYPER: Unrecognized atom type: S_5+4 (47)
[15:14:14] UFFTYPER: Unrecognized atom type: S_5+4 (1)
[15:14:27] UFFTYPER: Warning: hybridization set to SP3 for atom 6
[15:14:33] UFFTYPER: Unrecognized atom type: S_5+4 (47)
[15:14:43] UFFTYPER: Unrecognized atom type: S_5+4 (28)
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
[15:14:55] UFFTYPER: Warning: hybridization set to SP3 for atom 17
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
[15:14:57] UFFTYPER: Warning: hybridization set to SP3 for atom 6
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
[15:15:08] Explicit valence for atom # 388 O, 3, is greater than permitted
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
[15:15:15] UFFTYPER: Warning: hybridization set to SP3 for atom 9
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
[15:15:19] UFFTYPER: Unrecognized atom type: S_5+4 (6)
3utu failed sanitization
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
[15:15:29] UFFTYPER: Unrecognized atom type: S_5+4 (1)
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
[15:15:39] UFFTYPER: Unrecognized atom type: S_5+4 (19)
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
[15:15:43] UFFTYPER: Unrecognized atom type: S_5+4 (21)
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
[15:15:57] UFFTYPER: Unrecognized atom type: S_5+4 (9)
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
[15:16:01] UFFTYPER: Warning: hybridization set to SP3 for atom 18
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
[15:16:21] UFFTYPER: Warning: hybridization set to SP3 for atom 17
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
[15:16:42] UFFTYPER: Warning: hybridization set to SP3 for atom 10
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
[15:17:19] UFFTYPER: Unrecognized atom type: S_5+4 (13)
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
[15:17:25] UFFTYPER: Unrecognized atom type: S_5+4 (10)
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
[15:17:27] UFFTYPER: Unrecognized atom type: S_5+4 (6)
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
[15:17:28] UFFTYPER: Warning: hybridization set to SP3 for atom 11
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
[15:17:46] UFFTYPER: Unrecognized atom type: S_5+4 (8)
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
[15:17:58] UFFTYPER: Unrecognized atom type: S_5+4 (4)
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
[15:18:02] UFFTYPER: Unrecognized atom type: S_5+4 (9)
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
[15:18:15] UFFTYPER: Unrecognized atom type: S_5+4 (1)
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
[15:18:32] UFFTYPER: Unrecognized atom type: S_5+4 (23)
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
[15:18:35] UFFTYPER: Unrecognized atom type: S_5+4 (22)
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
[15:18:42] UFFTYPER: Warning: hybridization set to SP3 for atom 8
[15:18:42] UFFTYPER: Warning: hybridization set to SP3 for atom 24
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
[15:19:01] UFFTYPER: Warning: hybridization set to SP3 for atom 16
[15:19:01] UFFTYPER: Unrecognized atom type: S_5+4 (20)
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
[15:19:02] UFFTYPER: Unrecognized atom type: S_5+4 (6)
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
[15:19:05] UFFTYPER: Unrecognized atom type: S_5+4 (6)
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
1hfs failed sanitization
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
[15:19:22] UFFTYPER: Warning: hybridization set to SP3 for atom 20
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
[15:19:41] Explicit valence for atom # 1800 C, 5, is greater than permitted
[15:19:41] UFFTYPER: Unrecognized atom type: S_5+4 (11)
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
[15:19:42] UFFTYPER: Warning: hybridization set to SP3 for atom 11
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
[15:19:57] UFFTYPER: Warning: hybridization set to SP3 for atom 9
[15:19:57] UFFTYPER: Warning: hybridization set to SP3 for atom 23
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
[15:19:59] UFFTYPER: Warning: hybridization set to SP3 for atom 8
[15:19:59] UFFTYPER: Warning: hybridization set to SP3 for atom 12
[15:19:59] UFFTYPER: Warning: hybridization set to SP3 for atom 34
[15:19:59] UFFTYPER: Warning: hybridization set to SP3 for atom 41
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
CPU times: user 4min 9s, sys: 3.31 s, total: 4min 12s
Wall time: 8min 19s
We'll do some clean up to make sure we have a valid ligand file for every valid protein. The lines here will compare the
PDB IDs between the ligand and protein files and remove any proteins that don't have corresponding ligands.
(190, 190)
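The cleanup cell itself is not reproduced in this extract. Below is a minimal sketch, assuming the prepared structures were written as proteins/<pdbid>.pdb and ligands/<pdbid>.pdb (the directory layout is an assumption for illustration, not the author's exact code); the (190, 190) above is the resulting count of matched protein and ligand files.
import os
# Hypothetical layout: prepared proteins in 'proteins/', prepared ligands in 'ligands/'.
proteins = sorted(f for f in os.listdir('proteins') if f.endswith('.pdb'))
ligands = sorted(f for f in os.listdir('ligands') if f.endswith('.pdb'))
ligand_ids = {f[:4] for f in ligands}
# Keep only proteins whose PDB ID also has a prepared ligand file.
proteins = [f for f in proteins if f[:4] in ligand_ids]
len(proteins), len(ligands)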
fp_featurizer = dc.feat.CircularFingerprint(size=2048)
The convenience loader dc.molnet.load_pdbbind will take care of downloading and featurizing the pdbbind dataset
under the hood for us. This will take quite a bit of time and compute, so the code to do it is commented out. Uncomment
it and grab a cup of coffee if you'd like to featurize all of PDBbind's refined set. Otherwise, you can continue with the
small dataset we constructed above.
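The commented-out featurization cell is not reproduced here; a sketch of what it might look like (the exact arguments are an assumption based on the prose above):
# Uncomment to download and featurize PDBbind's full "refined" set (slow):
# tasks, (train_r, valid_r, test_r), transformers = dc.molnet.load_pdbbind(
#     featurizer=fp_featurizer, set_name='refined', pocket=True)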
To fit a DeepChem model, we first instantiate one of the provided (or user-written) model classes. In this case, we have created a convenience class that wraps any ML model available in scikit-learn so that it can interoperate with DeepChem. To instantiate an SklearnModel, you will need (a) task_types, (b) model_params (another dict, as illustrated below), and (c) a model_instance defining the type of model you would like to fit, in this case a RandomForestRegressor.
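The instantiation cell itself is not shown in this extract. A minimal sketch using DeepChem's SklearnModel wrapper (a simplification of the constructor described above; the RandomForestRegressor settings and the train_dataset name are assumptions for illustration):
from sklearn.ensemble import RandomForestRegressor
# Wrap a scikit-learn regressor so it can be trained and evaluated on DeepChem datasets.
sklearn_model = RandomForestRegressor(n_estimators=100)
model = dc.models.SklearnModel(sklearn_model)
model.fit(train_dataset)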
A low Pearson R² value for the test set indicates that the model isn't producing meaningful outputs. It turns out that predicting binding affinities is hard. This tutorial isn't meant to show how to create a state-of-the-art model for predicting binding affinities, but it gives you the tools to generate your own datasets with molecular docking, featurize complexes, and train models. We're using a very small dataset and an overly simplistic representation, so it's no surprise that the test set performance is quite bad.
[(6.862549999999994, 7.4),
(6.616400000000008, 6.85),
(4.852004999999995, 3.4),
(6.43060000000001, 6.72),
(8.66322999999999, 11.06)]
list(zip(model.predict(test_dataset), test_dataset.y))[:5]
[(5.960549999999999, 4.21),
(6.051305714285715, 8.7),
(5.799900000000003, 6.39),
(6.433881666666665, 4.94),
(6.7465399999999995, 9.21)]
fp_featurizer = dc.feat.ContactCircularFingerprint(size=2048)
metric = dc.metrics.Metric(dc.metrics.pearson_r2_score)
Ok, it looks like we have lower accuracy than the ligand-only dataset. Nonetheless, it's probably still useful to have a protein-ligand model, since it's likely to learn different features than the pure ligand-only model.
Further reading
So far we have used DeepChem's docking module with the AutoDock Vina backend to generate docking scores for the
PDBbind dataset. We trained a simple machine learning model to directly predict binding affinities, based on featurizing
the protein-ligand complexes. We might want to try more sophisticated docking protocols, like the deep learning
framework gnina. You can read more about using convolutional neural nets for protein-ligand scoring here. And here is a
review of machine learning-based scoring functions.
This DeepChem tutorial introduces the Atomic Convolutional Neural Network. We'll see the structure of the
AtomicConvModel and write a simple program to run Atomic Convolutions.
ACNN Architecture
ACNNs directly exploit the local three-dimensional structure of molecules to hierarchically learn more complex chemical
features by optimizing both the model and featurization simultaneously in an end-to-end fashion.
The atom type convolution makes use of a neighbor-listed distance matrix to extract features encoding local chemical
environments from an input representation (Cartesian atomic coordinates) that does not necessarily contain spatial
locality. The following methods are used to build the ACNN architecture:
Distance Matrix
The distance matrix R is constructed from the Cartesian atomic coordinate matrix C by a neighbor-list routine, yielding an (N, M) matrix, where N is the number of atoms and M is the maximum number of neighbors considered per atom.
Atom Type Convolution
The output of the atom type convolution is built from the distance matrix R and the neighbor atom types. The matrix R is fed into a (1x1) filter with stride 1 and depth N_at, where N_at is the number of unique atomic numbers (atom types) present in the molecular system. The atom type convolution kernel is a step function that operates on the neighbor distance matrix R.
Radial Pooling Layer
Radial pooling down-samples the output of the atom type convolution by pooling over slices of size (1, M, 1) with stride 1 and depth N_r, where N_r is the number of radial filters. This feature binning provides an abstracted representation and reduces the number of parameters to learn.
Atomistic Fully Connected Network
Atomic convolution layers are stacked by feeding the flattened (N, N_at x N_r) output of the radial pooling layer into the atom type convolution operation. Finally, we feed the tensor row-wise (per-atom) into a fully-connected network. The same fully connected weights and biases are used for each atom in a given molecule.
Now that we have seen the structural overview of ACNNs, we'll try to get deeper into the model and see how we can
train it and what we expect as the output.
For the training, we will use the publicly available PDBbind dataset. In this example, every row reflects a protein-ligand complex, and the target is the binding affinity of the ligand for the protein in the complex.
Colab
This tutorial and the rest in this sequence are designed to be done in Google colab. If you'd like to open this notebook in
colab, you can use the following link.
Open in Colab
Setup
To run DeepChem within Colab, you'll need to run the following cell of installation commands. This will take about 5
minutes to run to completion and install your environment.
!/usr/local/bin/conda install -c conda-forge pycosat mdtraj pdbfixer openmm -y -q # needed for AtomicConvs
import deepchem as dc
import os
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
from deepchem.molnet import load_pdbbind
from deepchem.models import AtomicConvModel
from deepchem.feat import AtomicConvFeaturizer
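The featurizer below refers to fragment sizes and a neighbor count that are defined in a cell not reproduced here. A sketch with placeholder values (the specific numbers are assumptions for illustration, not fixed by this text):
# Assumed placeholder values; the original notebook defines these elsewhere.
f1_num_atoms = 100      # maximum number of atoms to consider in the ligand
f2_num_atoms = 1000     # maximum number of atoms to consider in the protein
max_num_neighbors = 12  # maximum number of spatial neighbors per atom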
acf = AtomicConvFeaturizer(frag1_num_atoms=f1_num_atoms,
frag2_num_atoms=f2_num_atoms,
complex_num_atoms=f1_num_atoms+f2_num_atoms,
max_num_neighbors=max_num_neighbors,
neighbor_cutoff=4)
load_pdbbind allows us to specify if we want to use the entire protein or only the binding pocket ( pocket=True ) for
featurization. Using only the pocket saves memory and speeds up the featurization. We can also use the "core" dataset
of ~200 high-quality complexes for rapidly testing our model, or the larger "refined" set of nearly 5000 complexes for
more datapoints and more robust training/validation. On Colab, it takes only a minute to featurize the core PDBbind set!
This is pretty incredible, and it means you can quickly experiment with different featurizations and model architectures.
%%time
tasks, datasets, transformers = load_pdbbind(featurizer=acf,
save_dir='.',
data_dir='.',
pocket=True,
reload=False,
set_name='core')
Unfortunately, if you try to use the "refined" dataset, there are some complexes that cannot be featurized. To resolve this issue, rather than increasing complex_num_atoms, simply omit the rows of the dataset whose x value is None:
class MyTransformer(dc.trans.Transformer):
    def transform_array(x, y, w, ids):
        # Keep only rows whose featurized x value is not None.
        kept_rows = x != None
        return x[kept_rows], y[kept_rows], w[kept_rows], ids[kept_rows],
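How the transformer is applied is not shown in this extract. Since transform_array is defined without self, one plausible way (an assumption, not taken from the original notebook) is to pass the class itself to each dataset's transform method:
# Hypothetical application: drop un-featurizable rows from each split.
datasets = tuple(d.transform(MyTransformer) for d in datasets)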
datasets
(<DiskDataset X.shape: (154, 9), y.shape: (154,), w.shape: (154,), ids: ['1mq6' '3pe2' '2wtv' ... '3f3c' '4gqq'
'2x00'], task_names: [0]>,
<DiskDataset X.shape: (19, 9), y.shape: (19,), w.shape: (19,), ids: ['3ivg' '4de1' '4tmn' ... '2vw5' '1w3l' '2
zjw'], task_names: [0]>,
<DiskDataset X.shape: (20, 9), y.shape: (20,), w.shape: (20,), ids: ['1kel' '2w66' '2xnb' ... '2qbp' '3lka' '1
qi0'], task_names: [0]>)
acm = AtomicConvModel(n_tasks=1,
frag1_num_atoms=f1_num_atoms,
frag2_num_atoms=f2_num_atoms,
complex_num_atoms=f1_num_atoms+f2_num_atoms,
max_num_neighbors=max_num_neighbors,
batch_size=12,
layer_sizes=[32, 32, 16],
learning_rate=0.003,
)
%%time
max_epochs = 50
metric = dc.metrics.Metric(dc.metrics.score_function.rms_score)
train, val, test = datasets       # unpack the train/validation/test splits
losses, val_losses = [], []       # training and validation loss histories
step_cutoff = len(train)//12
def val_cb(model, step):
    if step%step_cutoff!=0:
        return
    val_losses.append(model.evaluate(val, metrics=[metric])['rms_score']**2)  # L2 Loss
    losses.append(model.evaluate(train, metrics=[metric])['rms_score']**2)  # L2 Loss
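The call that actually trains the model is not reproduced in this extract. A minimal sketch, assuming the validation callback above is passed to fit (the exact arguments are an assumption):
# Train the atomic convolution model, invoking val_cb periodically to record losses.
acm.fit(train, nb_epoch=max_epochs, max_checkpoints_to_keep=1, callbacks=[val_cb])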
CPU times: user 2min 41s, sys: 11.4 s, total: 2min 53s
Wall time: 2min 47s
The loss curves are not exactly smooth, which is unsurprising because we are using 154 training and 19 validation
datapoints. Increasing the dataset size may help with this, but will also require greater computational resources.
f, ax = plt.subplots()
ax.scatter(range(len(losses)), losses, label='train loss')
ax.scatter(range(len(val_losses)), val_losses, label='val loss')
plt.legend(loc='upper right');
The original ACNN paper reported Pearson R² scores of 0.912 and 0.448 for a random 80/20 split of the PDBbind core train/test sets. Here, we've used an 80/10/10 training/validation/test split and achieved similar performance for the training set (0.943). We can see from the
performance on the training, validation, and test sets (and from the results in the paper) that the ACNN can learn
chemical interactions from small training datasets, but struggles to generalize. Still, it is pretty amazing that we can
train an AtomicConvModel with only a few lines of code and start predicting binding affinities!
From here, you can experiment with different hyperparameters, more challenging splits, and the "refined" set of
PDBbind to see if you can reduce overfitting and come up with a more robust model.
score = dc.metrics.Metric(dc.metrics.score_function.pearson_r2_score)
for tvt, ds in zip(['train', 'val', 'test'], datasets):
    print(tvt, acm.evaluate(ds, metrics=[score]))
Further reading
We have explored the ACNN architecture and used the PDBbind dataset to train an ACNN to predict protein-ligand
binding energies. For more information, read the original paper that introduced ACNNs: Gomes, Joseph, et al. "Atomic
convolutional networks for predicting protein-ligand binding affinity." arXiv preprint arXiv:1703.10603 (2017). There are
many other methods and papers on predicting binding affinities. Here are a few interesting ones to check out:
predictions using only ligands or proteins, molecular docking with deep learning, and AtomNet.
A Conditional GAN (CGAN) provides additional inputs to the generator and discriminator on which their output is conditioned. For example, this might be a class label, and the GAN tries to learn how the data distribution varies between classes.
Colab
This tutorial and the rest in this sequence are designed to be done in Google colab. If you'd like to open this notebook in
colab, you can use the following link.
Open in Colab
Setup
To run DeepChem within Colab, you'll need to run the following cell of installation commands.
For this example, we will create a data distribution consisting of a set of ellipses in 2D, each with a random position,
shape, and orientation. Each class corresponds to a different ellipse. Let's randomly generate the ellipses. For each one
we select a random center position, X and Y size, and rotation angle. We then create a transformation matrix that maps
the unit circle to the ellipse.
import deepchem as dc
import numpy as np
import tensorflow as tf
n_classes = 4
class_centers = np.random.uniform(-4, 4, (n_classes, 2))
class_transforms = []
for i in range(n_classes):
    xscale = np.random.uniform(0.5, 2)
    yscale = np.random.uniform(0.5, 2)
    angle = np.random.uniform(0, np.pi)
    m = [[xscale*np.cos(angle), -yscale*np.sin(angle)],
         [xscale*np.sin(angle), yscale*np.cos(angle)]]
    class_transforms.append(m)
class_transforms = np.array(class_transforms)
This function generates random data from the distribution. For each point it chooses a random class, then a random
position in that class' ellipse.
def generate_data(n_points):
    classes = np.random.randint(n_classes, size=n_points)
    r = np.random.random(n_points)
    angle = 2*np.pi*np.random.random(n_points)
    points = (r*np.array([np.cos(angle), np.sin(angle)])).T
    points = np.einsum('ijk,ik->ij', class_transforms[classes], points)
    points += class_centers[classes]
    return classes, points
Let's plot a bunch of random points drawn from this distribution to see what it looks like. Points are colored based on
their class label.
%matplotlib inline
import matplotlib.pyplot as plot
classes, points = generate_data(1000)
plot.scatter(x=points[:,0], y=points[:,1], c=classes)
Now let's create the model for our CGAN. DeepChem's GAN class makes this very easy. We just subclass it and
implement a few methods. The two most important are:
create_generator() constructs a model implementing the generator. The model takes as input a batch of random
noise plus any condition variables (in our case, the one-hot encoded class of each sample). Its output is a synthetic
sample that is supposed to resemble the training data.
create_discriminator() constructs a model implementing the discriminator. The model takes as input the
samples to evaluate (which might be either real training data or synthetic samples created by the generator) and
the condition variables. Its output is a single number for each sample, which will be interpreted as the probability
that the sample is real training data.
In this case, we use very simple models. They just concatenate the inputs together and pass them through a few dense
layers. Notice that the final layer of the discriminator uses a sigmoid activation. This ensures it produces an output
between 0 and 1 that can be interpreted as a probability.
We also need to implement a few methods that define the shapes of the various inputs. We specify that the random
noise provided to the generator should consist of ten numbers for each sample; that each data sample consists of two
numbers (the X and Y coordinates of a point in 2D); and that the conditional input consists of n_classes numbers for
each sample (the one-hot encoded class index).
from tensorflow.keras.layers import Concatenate, Dense, Input

class ExampleGAN(dc.models.GAN):

    def get_noise_input_shape(self):
        return (10,)

    def get_data_input_shapes(self):
        return [(2,)]

    def get_conditional_input_shapes(self):
        return [(n_classes,)]

    def create_generator(self):
        noise_in = Input(shape=(10,))
        conditional_in = Input(shape=(n_classes,))
        gen_in = Concatenate()([noise_in, conditional_in])
        gen_dense1 = Dense(30, activation=tf.nn.relu)(gen_in)
        gen_dense2 = Dense(30, activation=tf.nn.relu)(gen_dense1)
        generator_points = Dense(2)(gen_dense2)
        return tf.keras.Model(inputs=[noise_in, conditional_in], outputs=[generator_points])

    def create_discriminator(self):
        data_in = Input(shape=(2,))
        conditional_in = Input(shape=(n_classes,))
        discrim_in = Concatenate()([data_in, conditional_in])
        discrim_dense1 = Dense(30, activation=tf.nn.relu)(discrim_in)
        discrim_dense2 = Dense(30, activation=tf.nn.relu)(discrim_dense1)
        discrim_prob = Dense(1, activation=tf.sigmoid)(discrim_dense2)
        return tf.keras.Model(inputs=[data_in, conditional_in], outputs=[discrim_prob])
gan = ExampleGAN(learning_rate=1e-4)
Now to fit the model. We do this by calling fit_gan() . The argument is an iterator that produces batches of training
data. More specifically, it needs to produce dicts that map all data inputs and conditional inputs to the values to use for
them. In our case we can easily create as much random data as we need, so we define a generator that calls the
generate_data() function defined above for each new batch.
def iterbatches(batches):
    for i in range(batches):
        classes, points = generate_data(gan.batch_size)
        classes = dc.metrics.to_one_hot(classes, n_classes)
        yield {gan.data_inputs[0]: points, gan.conditional_inputs[0]: classes}
gan.fit_gan(iterbatches(5000))
Ending global_step 999: generator average loss 0.87121, discriminator average loss 1.08472
Ending global_step 1999: generator average loss 0.968357, discriminator average loss 1.17393
Ending global_step 2999: generator average loss 0.710444, discriminator average loss 1.37858
Ending global_step 3999: generator average loss 0.699195, discriminator average loss 1.38131
Ending global_step 4999: generator average loss 0.694203, discriminator average loss 1.3871
TIMING: model fitting took 31.352 s
Have the trained model generate some data, and see how well it matches the training distribution we plotted before.
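The generation cell itself is not shown in this extract; a plausible sketch using predict_gan_generator with conditional inputs (an assumption, not necessarily the author's exact code):
# Draw 1000 random class labels, condition the generator on them, and plot the result.
classes, points = generate_data(1000)
one_hot_classes = dc.metrics.to_one_hot(classes, n_classes)
gen_points = gan.predict_gan_generator(batch_size=1000, conditional_inputs=[one_hot_classes])
plot.scatter(x=gen_points[:,0], y=gen_points[:,1], c=classes)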
Colab
This tutorial and the rest in this sequence are designed to be done in Google colab. If you'd like to open this notebook in
colab, you can use the following link.
Open in Colab
To begin, let's import all the libraries we'll need and load the dataset (which comes bundled with Tensorflow).
import deepchem as dc
import tensorflow as tf
from deepchem.models.optimizers import ExponentialDecay
from tensorflow.keras.layers import Conv2D, Conv2DTranspose, Dense, Reshape
import matplotlib.pyplot as plot
import matplotlib.gridspec as gridspec
%matplotlib inline
mnist = tf.keras.datasets.mnist.load_data(path='mnist.npz')
images = mnist[0][0].reshape((-1, 28, 28, 1))/255
dataset = dc.data.NumpyDataset(images)
Let's view some of the images to get an idea of what they look like.
def plot_digits(im):
    plot.figure(figsize=(3, 3))
    grid = gridspec.GridSpec(4, 4, wspace=0.05, hspace=0.05)
    for i, g in enumerate(grid):
        ax = plot.subplot(g)
        ax.set_xticks([])
        ax.set_yticks([])
        ax.imshow(im[i,:,:,0], cmap='gray')
plot_digits(images)
Now we can create our GAN. Like in the last tutorial, it consists of two parts:
1. The generator takes random noise as its input and produces output that will hopefully resemble the training data.
2. The discriminator takes a set of samples as input (possibly training data, possibly created by the generator), and
tries to determine which are which.
This time we will use a different style of GAN called a Wasserstein GAN (or WGAN for short). In many cases, they are
found to produce better results than conventional GANs. The main difference between the two is in the discriminator
(often called a "critic" in this context). Instead of outputting the probability of a sample being real training data, it tries
to learn how to measure the distance between the training distribution and generated distribution. That measure can
then be directly used as a loss function for training the generator.
We use a very simple model. The generator uses a dense layer to transform the input noise into a 7x7 image with eight
channels. That is followed by two convolutional layers that upsample it first to 14x14, and finally to 28x28.
The discriminator does roughly the same thing in reverse. Two convolutional layers downsample the image first to
14x14, then to 7x7. A final dense layer produces a single number as output. In the last tutorial we used a sigmoid
activation to produce a number between 0 and 1 that could be interpreted as a probability. Since this is a WGAN, we
instead use a softplus activation. It produces an unbounded positive number that can be interpreted as a distance.
class DigitGAN(dc.models.WGAN):

    def get_noise_input_shape(self):
        return (10,)

    def get_data_input_shapes(self):
        return [(28, 28, 1)]

    def create_generator(self):
        return tf.keras.Sequential([
            Dense(7*7*8, activation=tf.nn.relu),
            Reshape((7, 7, 8)),
            Conv2DTranspose(filters=16, kernel_size=5, strides=2, activation=tf.nn.relu, padding='same'),
            Conv2DTranspose(filters=1, kernel_size=5, strides=2, activation=tf.sigmoid, padding='same')
        ])

    def create_discriminator(self):
        return tf.keras.Sequential([
            Conv2D(filters=32, kernel_size=5, strides=2, activation=tf.nn.leaky_relu, padding='same'),
            Conv2D(filters=64, kernel_size=5, strides=2, activation=tf.nn.leaky_relu, padding='same'),
            Dense(1, activation=tf.math.softplus)
        ])
Now to train it. As in the last tutorial, we write a generator to produce data. This time the data is coming from a dataset,
which we loop over 100 times.
One other difference is worth noting. When training a conventional GAN, it is important to keep the generator and discriminator in balance throughout training. If either one gets too far ahead, it becomes very difficult for the other one to learn.
WGANs do not have this problem. In fact, the better the discriminator gets, the cleaner a signal it provides and the
easier it becomes for the generator to learn. We therefore specify generator_steps=0.2 so that it will only take one
step of training the generator for every five steps of training the discriminator. This tends to produce faster training and
better results.
def iterbatches(epochs):
    for i in range(epochs):
        for batch in dataset.iterbatches(batch_size=gan.batch_size):
            yield {gan.data_inputs[0]: batch[0]}
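The cells that construct the model and launch training are not reproduced in this extract. A minimal sketch consistent with the prose above (the learning-rate schedule and checkpoint interval are assumptions; generator_steps=0.2 follows from the text):
# Build the WGAN and train the generator once for every five discriminator steps.
gan = DigitGAN(learning_rate=ExponentialDecay(0.001, 0.9, 5000))
gan.fit_gan(iterbatches(100), generator_steps=0.2, checkpoint_interval=5000)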
Ending global_step 4999: generator average loss 0.340072, discriminator average loss -0.0234236
Ending global_step 9999: generator average loss 0.52308, discriminator average loss -0.00702729
Ending global_step 14999: generator average loss 0.572661, discriminator average loss -0.00635684
Ending global_step 19999: generator average loss 0.560454, discriminator average loss -0.00534357
Ending global_step 24999: generator average loss 0.556055, discriminator average loss -0.00620613
Ending global_step 29999: generator average loss 0.541958, discriminator average loss -0.00734233
Ending global_step 34999: generator average loss 0.540904, discriminator average loss -0.00736641
Ending global_step 39999: generator average loss 0.524298, discriminator average loss -0.00650514
Ending global_step 44999: generator average loss 0.503931, discriminator average loss -0.00563732
Ending global_step 49999: generator average loss 0.528964, discriminator average loss -0.00590612
Ending global_step 54999: generator average loss 0.510892, discriminator average loss -0.00562366
Ending global_step 59999: generator average loss 0.494756, discriminator average loss -0.00533636
TIMING: model fitting took 4197.860 s
Let's generate some data and see how the results look.
plot_digits(gan.predict_gan_generator(batch_size=16))
Not too bad. Many of the generated images look plausibly like handwritten digits. A larger model trained for a longer
time can do much better, of course.
Congratulations! Time to join the Community!
Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue
working with DeepChem, we encourage you to finish the rest of the tutorials in this series. You can also help the
DeepChem community in the following ways:
Colab
This tutorial and the rest in this sequence can be done in Google colab. If you'd like to open this notebook in colab, you
can use the following link.
Open in Colab
Setup
To run DeepChem and Hyperopt within Colab, you'll need to run the following installation commands. You can of course run this tutorial locally if you prefer; in that case, don't run these cells, since they would download and install DeepChem and Hyperopt on your local machine again.
Collecting deepchem
Downloading deepchem-2.6.1-py3-none-any.whl (608 kB)
import deepchem as dc
tasks, datasets, transformers = dc.molnet.load_hiv(featurizer='ECFP', split='scaffold')
train_dataset, valid_dataset, test_dataset = datasets
Now, let's import the hyperopt library, which we will be using to find the best parameters.
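The import cell is not shown in this extract; a minimal sketch covering the names used later in this tutorial (hp, fmin, tpe, and Trials):
from hyperopt import hp, fmin, tpe, Trials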
Then we have to declare a dictionary with all the hyperparameters and the ranges we want to tune them over. This dictionary will serve as the search space for hyperopt. Two basic ways of declaring ranges in the dictionary are hp.choice('label', [list of options]), which picks from a discrete set of values, and hp.uniform('label', low, high), which samples a float uniformly from a range.
Here, we are going to use a MultitaskClassifier to classify the HIV dataset, and hence the appropriate search space is as follows.
search_space = {
'layer_sizes': hp.choice('layer_sizes',[[500], [1000], [2000],[1000,1000]]),
'dropouts': hp.uniform('dropout',low=0.2, high=0.5),
'learning_rate': hp.uniform('learning_rate',high=0.001, low=0.0001)
}
Next, we declare the objective function that hyperopt will minimize. Here, the function builds and trains our MultitaskClassifier model, using a validation callback to evaluate the classifier every 1000 steps, and returns the best score. The metric used here is 'roc_auc_score', which needs to be maximized. Maximizing a non-negative value is equivalent to minimizing its negative, hence we return the negative of the validation score.
import tempfile
# tempfile is used to save the best checkpoint later in the program.
metric = dc.metrics.Metric(dc.metrics.roc_auc_score)

def fm(args):
    save_dir = tempfile.mkdtemp()
    model = dc.models.MultitaskClassifier(n_tasks=len(tasks), n_features=1024,
                                          layer_sizes=args['layer_sizes'],
                                          dropouts=args['dropouts'],
                                          learning_rate=args['learning_rate'])
    # Validation callback that saves the best checkpoint, i.e. the one with the maximum score.
    validation = dc.models.ValidationCallback(valid_dataset, 1000, [metric],
                                              save_dir=save_dir, transformers=transformers)
    model.fit(train_dataset, nb_epoch=25, callbacks=validation)
    # Restore the best checkpoint and return the negative of its validation score to be minimized.
    model.restore(model_dir=save_dir)
    valid_score = model.evaluate(valid_dataset, [metric], transformers)
    return -1*valid_score['roc_auc_score']
Here, we call hyperopt's fmin function, passing the function to be minimized, the algorithm to be followed, the maximum number of evaluations, and a Trials object. The Trials object keeps all hyperparameters, losses, and other information, which means you can access them after running the optimization; it can also be saved, loaded, and used to resume the optimization process later. For the algorithm, there are three choices that can be used without any additional configuration: random search (rand.suggest), Tree of Parzen Estimators (tpe.suggest), and Adaptive TPE (atpe.suggest). We use TPE here.
trials = Trials()
best = fmin(fm,
            space=search_space,
            algo=tpe.suggest,
            max_evals=15,
            trials=trials)
0%| | 0/15 [00:00<?, ?it/s, best loss: ?]Step 1000 validation: roc_auc_score=0.777648
Step 2000 validation: roc_auc_score=0.755485
Step 3000 validation: roc_auc_score=0.739519
Step 4000 validation: roc_auc_score=0.764756
Step 5000 validation: roc_auc_score=0.757006
Step 6000 validation: roc_auc_score=0.752609
Step 7000 validation: roc_auc_score=0.763002
Step 8000 validation: roc_auc_score=0.749202
7%|▋ | 1/15 [05:37<1:18:46, 337.58s/it, best loss: -0.7776476459925534]Step 1000 validation: roc_auc_s
core=0.750455
Step 2000 validation: roc_auc_score=0.783594
Step 3000 validation: roc_auc_score=0.775872
Step 4000 validation: roc_auc_score=0.768825
Step 5000 validation: roc_auc_score=0.769555
Step 6000 validation: roc_auc_score=0.765324
Step 7000 validation: roc_auc_score=0.771146
Step 8000 validation: roc_auc_score=0.760138
13%|█▎ | 2/15 [07:05<41:16, 190.51s/it, best loss: -0.7835939030962179] Step 1000 validation: roc_auc_s
core=0.744178
Step 2000 validation: roc_auc_score=0.765406
Step 3000 validation: roc_auc_score=0.76532
Step 4000 validation: roc_auc_score=0.769255
Step 5000 validation: roc_auc_score=0.77029
Step 6000 validation: roc_auc_score=0.768024
Step 7000 validation: roc_auc_score=0.764157
Step 8000 validation: roc_auc_score=0.756805
20%|██ | 3/15 [09:40<34:53, 174.42s/it, best loss: -0.7835939030962179]Step 1000 validation: roc_auc_sco
re=0.714572
Step 2000 validation: roc_auc_score=0.770712
Step 3000 validation: roc_auc_score=0.777914
Step 4000 validation: roc_auc_score=0.76923
Step 5000 validation: roc_auc_score=0.774823
Step 6000 validation: roc_auc_score=0.775927
Step 7000 validation: roc_auc_score=0.777054
Step 8000 validation: roc_auc_score=0.778508
27%|██▋ | 4/15 [12:12<30:22, 165.66s/it, best loss: -0.7835939030962179]Step 1000 validation: roc_auc_sco
re=0.743939
Step 2000 validation: roc_auc_score=0.759478
Step 3000 validation: roc_auc_score=0.738839
Step 4000 validation: roc_auc_score=0.751084
Step 5000 validation: roc_auc_score=0.740504
Step 6000 validation: roc_auc_score=0.753612
Step 7000 validation: roc_auc_score=0.71802
Step 8000 validation: roc_auc_score=0.761025
33%|███▎ | 5/15 [17:40<37:21, 224.16s/it, best loss: -0.7835939030962179]Step 1000 validation: roc_auc_sco
re=0.74099
Step 2000 validation: roc_auc_score=0.767516
Step 3000 validation: roc_auc_score=0.767338
Step 4000 validation: roc_auc_score=0.775691
Step 5000 validation: roc_auc_score=0.768731
Step 6000 validation: roc_auc_score=0.755029
Step 7000 validation: roc_auc_score=0.767115
Step 8000 validation: roc_auc_score=0.764744
40%|████ | 6/15 [22:48<37:54, 252.71s/it, best loss: -0.7835939030962179]Step 1000 validation: roc_auc_sco
re=0.713761
Step 2000 validation: roc_auc_score=0.759518
Step 3000 validation: roc_auc_score=0.765853
Step 4000 validation: roc_auc_score=0.771976
Step 5000 validation: roc_auc_score=0.772762
Step 6000 validation: roc_auc_score=0.773206
Step 7000 validation: roc_auc_score=0.775565
Step 8000 validation: roc_auc_score=0.768521
47%|████▋ | 7/15 [27:53<35:58, 269.84s/it, best loss: -0.7835939030962179]Step 1000 validation: roc_auc_sco
re=0.717178
Step 2000 validation: roc_auc_score=0.754258
Step 3000 validation: roc_auc_score=0.767905
Step 4000 validation: roc_auc_score=0.762917
Step 5000 validation: roc_auc_score=0.766162
Step 6000 validation: roc_auc_score=0.767581
Step 7000 validation: roc_auc_score=0.770746
Step 8000 validation: roc_auc_score=0.77597
53%|█████▎ | 8/15 [30:36<27:29, 235.64s/it, best loss: -0.7835939030962179]Step 1000 validation: roc_auc_sco
re=0.74314
Step 2000 validation: roc_auc_score=0.757408
Step 3000 validation: roc_auc_score=0.76668
Step 4000 validation: roc_auc_score=0.768104
Step 5000 validation: roc_auc_score=0.746377
Step 6000 validation: roc_auc_score=0.745282
Step 7000 validation: roc_auc_score=0.74113
Step 8000 validation: roc_auc_score=0.734482
60%|██████ | 9/15 [36:53<28:00, 280.04s/it, best loss: -0.7835939030962179]Step 1000 validation: roc_auc_sco
re=0.743204
Step 2000 validation: roc_auc_score=0.76912
Step 3000 validation: roc_auc_score=0.769981
Step 4000 validation: roc_auc_score=0.784163
Step 5000 validation: roc_auc_score=0.77536
Step 6000 validation: roc_auc_score=0.779237
Step 7000 validation: roc_auc_score=0.782344
Step 8000 validation: roc_auc_score=0.779085
67%|██████▋ | 10/15 [38:23<18:26, 221.33s/it, best loss: -0.7841634210268469]Step 1000 validation: roc_auc_sc
ore=0.743565
Step 2000 validation: roc_auc_score=0.765063
Step 3000 validation: roc_auc_score=0.75284
Step 4000 validation: roc_auc_score=0.759978
Step 5000 validation: roc_auc_score=0.74255
Step 6000 validation: roc_auc_score=0.721809
Step 7000 validation: roc_auc_score=0.729863
Step 8000 validation: roc_auc_score=0.73075
73%|███████▎ | 11/15 [44:07<17:15, 258.91s/it, best loss: -0.7841634210268469]Step 1000 validation: roc_auc_sc
ore=0.695949
Step 2000 validation: roc_auc_score=0.765082
Step 3000 validation: roc_auc_score=0.756256
Step 4000 validation: roc_auc_score=0.771923
Step 5000 validation: roc_auc_score=0.758841
Step 6000 validation: roc_auc_score=0.759393
Step 7000 validation: roc_auc_score=0.765971
Step 8000 validation: roc_auc_score=0.747064
80%|████████ | 12/15 [48:54<13:21, 267.23s/it, best loss: -0.7841634210268469]Step 1000 validation: roc_auc_sc
ore=0.757871
Step 2000 validation: roc_auc_score=0.765296
Step 3000 validation: roc_auc_score=0.769748
Step 4000 validation: roc_auc_score=0.776487
Step 5000 validation: roc_auc_score=0.775009
Step 6000 validation: roc_auc_score=0.779539
Step 7000 validation: roc_auc_score=0.763165
Step 8000 validation: roc_auc_score=0.772093
87%|████████▋ | 13/15 [50:22<07:06, 213.15s/it, best loss: -0.7841634210268469]Step 1000 validation: roc_auc_sc
ore=0.720166
Step 2000 validation: roc_auc_score=0.768489
Step 3000 validation: roc_auc_score=0.782853
Step 4000 validation: roc_auc_score=0.785556
Step 5000 validation: roc_auc_score=0.78583
Step 6000 validation: roc_auc_score=0.786569
Step 7000 validation: roc_auc_score=0.779249
Step 8000 validation: roc_auc_score=0.783423
93%|█████████▎| 14/15 [51:52<02:55, 175.93s/it, best loss: -0.7865693280913189]Step 1000 validation: roc_auc_sc
ore=0.743232
Step 2000 validation: roc_auc_score=0.762007
Step 3000 validation: roc_auc_score=0.771809
Step 4000 validation: roc_auc_score=0.755023
Step 5000 validation: roc_auc_score=0.769812
Step 6000 validation: roc_auc_score=0.769867
Step 7000 validation: roc_auc_score=0.777354
Step 8000 validation: roc_auc_score=0.775313
100%|██████████| 15/15 [56:47<00:00, 227.13s/it, best loss: -0.7865693280913189]
The code below prints the best hyperparameters found by hyperopt.
print("Best: {}".format(best))
The hyperparameters found here may not necessarily be the best ones, but they give a general idea of which parameters are effective. To get more accurate results, one has to increase the number of validation evaluations and the number of epochs the model is fit for, but doing so will also increase the time needed to find the best hyperparameters.
As a short intro, GP allows us to build up our statistical model using an infinite number of Gaussian functions over our n-
dimensional space, where n is the number of features. However, we pick these functions based on how well they fit the
data we pass it. We end up with a statistical model built from an ensemble of Gaussian functions which can actually vary
quite a bit. The result is that for points we have trained the model on, the variance in our ensemble should be very low.
For test set points close to the training set points, the variance should be higher but still low as the ensemble was
picked to predict well in its neighborhood. For points far from the training set points, however, we did not pick our
ensemble of Gaussian functions to fit them so we'd expect the variance in our ensemble to be high. In this way, we end
up with a statistical model that allows for a natural generation of uncertainty.
Colab
This tutorial and the rest in the sequences are designed to be done in Google colab. If you'd like to open this notebook in
colab, you can use the following link.
Open in Colab
Setup
The first step is to get DeepChem up and running. We recommend using Google Colab to work through this tutorial
series. You'll need to run the following commands to get DeepChem installed on your colab notebook.
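A typical installation cell looks like the following; the exact pip command may differ depending on the DeepChem release you want to use.
!pip install --pre deepchem
import deepchem
deepchem.__version__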
Gaussian Processes
As stated earlier, GP is already implemented in scikit-learn so we will be using DeepChem's scikit-learn wrapper.
SklearnModel is a subclass of DeepChem's Model class. It acts as a wrapper around a sklearn.base.BaseEstimator.
import deepchem as dc
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
import numpy as np
import matplotlib.pyplot as plt
Loading data
Next we need a dataset that presents a regression problem. For this tutorial we will be using the BACE dataset from
MoleculeNet.
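The loading cell is not shown above; a minimal sketch, assuming the standard MoleculeNet loader with ECFP features and a scaffold split (the dataset shapes printed below suggest 1024-bit fingerprints and an ~80/10/10 split):
tasks, datasets, transformers = dc.molnet.load_bace_regression(featurizer='ECFP', splitter='scaffold')
train_dataset, valid_dataset, test_dataset = datasets
print(tasks)
print(transformers)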
I always like to get a close look at what the objects in my code are storing. We see that tasks is a list of tasks that we
are trying to predict. The transformer is a NormalizationTransformer that normalizes the outputs (y values) of the
dataset.
Here we see that the data has already been split into a training set, a validation set, and a test set. We will train the
model on the training set and test the accuracy of the model on the test set. If we were to do any hyperparameter
tuning, we would use the validation set. The split was ~80/10/10 train/valid/test.
print(train_dataset)
print(valid_dataset)
print(test_dataset)
<DiskDataset X.shape: (1210, 1024), y.shape: (1210, 1), w.shape: (1210, 1), task_names: ['pIC50']>
<DiskDataset X.shape: (151, 1024), y.shape: (151, 1), w.shape: (151, 1), ids: ['Fc1ncccc1-c1cc(ccc1)C1(N=C(N)N(C
)C1=O)c1cn(nc1)CC(CC)CC'
'S1(=O)(=O)N(c2cc(cc3n(cc(CC1)c23)CC)C(=O)NC(Cc1ccccc1)C(=O)C[NH2+]C1CCOCC1)C'
's1ccnc1-c1cc(ccc1)CC(NC(=O)[C@@H](OC)C)C(O)C[NH2+]C1CC2(Oc3ncc(cc13)CC(C)(C)C)CCC2'
...
'S(=O)(=O)(Nc1cc(cc(c1)C(C)(C)C)C1([NH2+]CC(O)C(NC(=O)C)Cc2cc(F)cc(F)c2)CCCCC1)C'
'O=C1N(C)C(=N[C@]1(c1cc(nc(c1)CC)CC)c1cc(ccc1)-c1cncnc1)N'
'Clc1cc2CC(N=C(NC(Cc3ccccc3)C=3NC(=O)c4c(N=3)cccc4)c2cc1)(C)C'], task_names: ['pIC50']>
<DiskDataset X.shape: (152, 1024), y.shape: (152, 1), w.shape: (152, 1), ids: ['Clc1ccc(cc1)CC(NC(=O)C)C(O)C[NH2
+]C1CC2(Oc3ncc(cc13)CC(C)(C)C)CCC2'
'Fc1cc(cc(F)c1)CC(NC(=O)c1cc(cc(Oc2ccc(F)cc2)c1)C(=O)N(CCC)CCC)C(O)C[NH2+]Cc1cc(OC)ccc1'
'O1c2c(cc(cc2)C2CCCCC2)C2(N=C(N)N(C)C2=O)CC1(C)C' ...
'S(=O)(=O)(N(C)c1cc(cc(c1)COCC([NH3+])(Cc1ccccc1)C(F)F)C(=O)NC(C)c1ccc(F)cc1)C'
'O1CCCC1CN1C(=O)C(N=C1N)(C1CCCCC1)c1ccccc1'
'Fc1cc(cc(c1)C#C)CC(NC(=O)COC)C(O)C[NH2+]C1CC2(Oc3ncc(cc13)CC(C)(C)C)CCC2'], task_names: ['pIC50']>
As you see, the values I picked for the parameters seem awfully specific. This is because I needed to do some
hyperparameter tuning beforehand to get a model that wasn't wildly overfitting the training set. You can learn more about
how I tuned the model in the Appendix at the end of this tutorial.
output_variance = 7.908735015054668
length_scale = 6.452349252677817
noise_level = 0.10475507755839343
kernel = output_variance**2 * RBF(length_scale=length_scale, length_scale_bounds='fixed') + WhiteKernel(noise_level=noise_level, noise_level_bounds='fixed')
alpha = 4.989499481123432e-09
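The cell that constructs the model is not shown; a minimal sketch wrapping the sklearn estimator in DeepChem's SklearnModel, using the kernel and alpha defined above:
sklearn_gpr = GaussianProcessRegressor(kernel=kernel, alpha=alpha)
model = dc.models.SklearnModel(sklearn_gpr)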
Then we fit our model to the data and see how it performs both on the training set and on the test set.
model.fit(train_dataset)
metric1 = dc.metrics.Metric(dc.metrics.mean_squared_error)
metric2 = dc.metrics.Metric(dc.metrics.r2_score)
print(f'Training set score: {model.evaluate(train_dataset, [metric1, metric2])}')
print(f'Test set score: {model.evaluate(test_dataset, [metric1, metric2])}')
For our training set, we see a pretty good correlation between the measured values (x-axis) and the predicted values (y-
axis). Note that we use the transformer from earlier to untransform our predicted values.
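predict_with_error is a small helper whose definition is not shown here; a minimal sketch, assuming the wrapped sklearn estimator is reachable as model.model and that the NormalizationTransformer exposes the label standard deviations as y_stds:
def predict_with_error(model, X, transformer):
    # Query the wrapped GaussianProcessRegressor directly so we can request the
    # predictive standard deviation alongside the mean.
    y_pred, y_std = model.model.predict(X, return_std=True)
    # Undo the normalization: untransform the mean and rescale the std.
    y_pred = transformer.untransform(y_pred.reshape(-1, 1)).flatten()
    y_std = y_std * transformer.y_stds[0]
    return y_pred, y_std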
y_meas_train = transformers[0].untransform(train_dataset.y)
y_pred_train, y_pred_train_stds = predict_with_error(model, train_dataset.X, transformers[0])
plt.xlim([2.5, 10.5])
plt.ylim([2.5, 10.5])
plt.scatter(y_meas_train, y_pred_train)
<matplotlib.collections.PathCollection at 0x7fc0431b45d0>
We now do the same for our test set. We see a fairly good correlation! However, it is certainly not as tight. This is
reflected in the difference between the R2 scores calculated above.
y_meas_test = transformers[0].untransform(test_dataset.y)
y_pred_test, y_pred_test_stds = predict_with_error(model, test_dataset.X, transformers[0])
plt.xlim([2.5, 10.5])
plt.ylim([2.5, 10.5])
plt.scatter(y_meas_test, y_pred_test)
<matplotlib.collections.PathCollection at 0x7fc04023b590>
We can also write a function to calculate what fraction of the predicted values fall within the predicted error range. This is
done by counting how many samples have a true error smaller than the standard deviation calculated earlier. One
standard deviation corresponds to a 68% confidence interval.
def fraction_within_error(y_meas, y_pred, y_std):  # function name chosen here for illustration
    count_within_error = 0
    for i in range(len(y_meas)):
        if abs(y_meas[i][0] - y_pred[i]) < y_std[i]:
            count_within_error += 1
    return count_within_error / len(y_meas)
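Calls like the following, applied to the train and test splits with the variables defined earlier, produce the two fractions shown below:
print(fraction_within_error(y_meas_train, y_pred_train, y_pred_train_stds))
print(fraction_within_error(y_meas_test, y_pred_test, y_pred_test_stds))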
For the training set, more than 90% of the samples fall within one standard deviation, while only about 74% of the test-set
samples do. One standard deviation corresponds to a 68% confidence interval, so the test-set uncertainty is roughly
calibrated, whereas the model overpredicts uncertainty on the training set.
0.9355371900826446
0.7368421052631579
We can also take a look at the distributions of the standard deviations for the test set predictions. We see a very roughly
Gaussian distribution in the predicted errors.
plt.hist(y_pred_test_stds)
plt.show()
For now, this is the end of our tutorial. We plan to follow up soon with a deeper dive into uncertainty estimation and in
particular, calibrated uncertainty estimation. We will see you then!
def get_model(trial):
    output_variance = trial.suggest_float('output_variance', 0.1, 10, log=True)
    length_scale = trial.suggest_float('length_scale', 1e-5, 1e5, log=True)
    noise_level = trial.suggest_float('noise_level', 1e-5, 1e5, log=True)
    params = {
        'kernel': output_variance**2 * RBF(length_scale=length_scale, length_scale_bounds='fixed') + WhiteKernel(noise_level=noise_level, noise_level_bounds='fixed'),
        'alpha': trial.suggest_float('alpha', 1e-12, 1e-5, log=True),
    }
    sklearn_gpr = GaussianProcessRegressor(**params)
    return dc.models.SklearnModel(sklearn_gpr)

def objective(trial):
    model = get_model(trial)
    model.fit(train_dataset)
    metric = dc.metrics.Metric(dc.metrics.mean_squared_error)
    return model.evaluate(valid_dataset, [metric])['mean_squared_error']

import optuna

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=100)
print(study.best_params)
1. Multi-gpu training functionalities: pytorch-lightning provides easy multi-gpu, multi-node training. It also simplifies
the process of launching multi-gpu, multi-node jobs across different cluster infrastructure, e.g. AWS, slurm based
clusters.
2. Reducing boilerplate pytorch code: lightning takes care of details like optimizer.zero_grad(), model.train(), and
model.eval() . Lightning also provides experiment logging functionality; for example, regardless of whether training runs on CPU,
GPU, or multiple nodes, the user can call the method self.log inside the trainer and it will log the metrics appropriately.
3. Features that can speed up training: half-precision training, gradient checkpointing, code profiling.
Open in Colab
Setup
This notebook assumes that you have already installed DeepChem; if you have not, follow the instructions at the
DeepChem installation page: https://deepchem.readthedocs.io/en/latest/get_started/installation.html.
Install pytorch lightning by following the instructions on lightning's home page: https://www.pytorchlightning.ai/
import deepchem as dc
from deepchem.models import GCNModel

import torch
from torch import nn
from torch.nn import functional as F
from torch.optim import Adam

import pytorch_lightning as pl
from pytorch_lightning.core.lightning import LightningModule

import numpy as np
Deepchem Example
Below we show an example of a Graph Convolution Network (GCN). Note that this is a simple example which uses a
GCNModel to predict the label from an input sequence. We do not showcase the complete functionality of deepchem in
this example as we want to restructure the deepchem code and adapt it so that it can be easily plugged into pytorch-
lightning. This example was inspired from the GCNModel documentation present here.
Prepare the dataset: for training our deepchem models we need a dataset that we can use to train the model. Below
we prepare a sample dataset for the purposes of this tutorial, directly using the featurizer to encode the examples, as
sketched in the next cell.
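A minimal sketch of such a sample dataset; the SMILES strings and labels here are illustrative placeholders:
featurizer = dc.feat.MolGraphConvFeaturizer()
smiles = ["C1CCC1", "CCC", "C1=CC=CN=C1"]
labels = [0., 1., 0.]
X = featurizer.featurize(smiles)
dataset = dc.data.NumpyDataset(X=X, y=labels)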
Setup the model: now we initialize the Graph Convolutional Network model that we will use in our training.
model = GCNModel(
mode='classification',
n_tasks=1,
batch_size=2,
learning_rate=0.001
)
Train the model: fit the model on our training dataset, specifying the number of epochs to run, as sketched below.
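A sketch of the fit call (the epoch count here is an assumption); the number printed below is the average loss returned by fit:
loss = model.fit(dataset, nb_epoch=5)
print(loss)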
0.18830760717391967
/Users/princychahal/mambaforge/envs/keras_try_5/lib/python3.8/site-packages/torch/autocast_mode.py:141: UserWarn
ing: User provided device_type of 'cuda', but CUDA is not available. Disabling
warnings.warn('User provided device_type of \'cuda\', but CUDA is not available. Disabling')
1. LightningDataModule : This module defines how the data is prepared and fed into the model so that the model
can use it for training. The module defines the train_dataloader function, which is used directly by the trainer to
generate data for the LightningModule . To learn more about the LightningDataModule refer to the
datamodules documentation.
2. LightningModule : This module defines the training and validation steps for our model. We can use this module to
initialize our model based on the hyperparameters. There are a number of boilerplate functions which we use
directly to track our experiments; for example, we can save all the hyperparameters that we used for training using
the self.save_hyperparameters() method. For more details on how to use this module refer to the
lightningmodules documentation.
Setup the torch dataset: Note that here we need to create a custom SmilesDataset so that we can easily
interface with the deepchem featurizers. For this interface we need to define a collate method so that we can create
batches for the dataset.
# prepare LightningDataModule
class SmilesDataset(torch.utils.data.Dataset):
    def __init__(self, smiles, labels):
        assert len(smiles) == len(labels)
        featurizer = dc.feat.MolGraphConvFeaturizer()
        X = featurizer.featurize(smiles)
        self._samples = dc.data.NumpyDataset(X=X, y=labels)

    def __len__(self):
        return len(self._samples)

    def __getitem__(self, index):
        # Return the (X, y, w) triple for one sample; this accessor is assumed here
        # so that the collate function below can unpack b[0], b[1], b[2].
        return (
            self._samples.X[index],
            self._samples.y[index],
            self._samples.w[index],
        )

class SmilesDatasetBatch:
    def __init__(self, batch):
        X = [np.array([b[0] for b in batch])]
        y = [np.array([b[1] for b in batch])]
        w = [np.array([b[2] for b in batch])]
        self.batch_list = [X, y, w]

def collate_smiles_dataset_wrapper(batch):
    return SmilesDatasetBatch(batch)
Create the lightning data module: in this part we use the SmilesDataset class created above to build the
SmilesDatasetModule .
class SmilesDatasetModule(pl.LightningDataModule):
    def __init__(self, train_smiles, train_labels, batch_size):
        super().__init__()
        self._train_smiles = train_smiles
        self._train_labels = train_labels
        self._batch_size = batch_size

    def setup(self, stage=None):
        # Build the torch dataset from the raw SMILES and labels; this step is assumed
        # here so that train_dataloader below has self.train_dataset available.
        self.train_dataset = SmilesDataset(
            self._train_smiles,
            self._train_labels,
        )

    def train_dataloader(self):
        return torch.utils.data.DataLoader(
            self.train_dataset,
            batch_size=self._batch_size,
            collate_fn=collate_smiles_dataset_wrapper,
            shuffle=True,
        )
Create the lightning module: in this part we create the GCN specific lightning module. This class specifies the logic
flow for the training step. We also create the required models, optimizers and losses for the training flow.
class GCNModule(pl.LightningModule):
    # The constructor wraps a DeepChem GCNModel; the internals sketched here
    # (pt_model, loss) follow the attributes used in the methods below.
    def __init__(self, mode, n_tasks, learning_rate):
        super().__init__()
        self.save_hyperparameters("mode", "n_tasks", "learning_rate")
        self.gcn_model = GCNModel(mode=mode, n_tasks=n_tasks, learning_rate=learning_rate)
        self.pt_model = self.gcn_model.model
        self.loss = self.gcn_model._loss_fn

    def configure_optimizers(self):
        return self.gcn_model.optimizer._create_pytorch_optimizer(
            self.pt_model.parameters(),
        )

    def training_step(self, batch, batch_idx):
        # Unpack the SmilesDatasetBatch produced by the collate function and reuse
        # GCNModel's own batch preparation and loss for a standard forward pass.
        inputs, labels, weights = self.gcn_model._prepare_batch(batch.batch_list)
        outputs = self.pt_model(inputs)
        if isinstance(outputs, torch.Tensor):
            outputs = [outputs]
        if self.gcn_model._loss_outputs is not None:
            outputs = [outputs[i] for i in self.gcn_model._loss_outputs]
        loss_outputs = self.loss(outputs, labels, weights)
        self.log(
            "train_loss",
            loss_outputs,
            on_epoch=True,
            sync_dist=True,
            reduce_fx="mean",
            prog_bar=True,
        )
        return loss_outputs
gcnmodule = GCNModule(
mode="classification",
n_tasks=1,
learning_rate=1e-3,
)
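The cell that instantiates the data module passed to the trainer below is not shown; a sketch reusing the illustrative smiles and labels lists from the dataset example above:
smiles_datasetmodule = SmilesDatasetModule(
    train_smiles=smiles,
    train_labels=labels,
    batch_size=2,
)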
Lightning Trainer
Trainer is the wrapper which builds on top of the LightningDataModule and LightningModule . When constructing
the lightning trainer you can also specify the number of epochs, the maximum number of steps to run, and the number of
GPUs and nodes to be used for training. The lightning trainer acts as a wrapper over your distributed training setup, so
you can build your models the same way you would for simple local runs.
trainer = pl.Trainer(
max_epochs=5,
)
# train
trainer.fit(
model=gcnmodule,
datamodule=smiles_datasetmodule,
)
Effective optimization techniques can significantly reduce training times, lower computational costs, and improve model
performance. This makes optimization particularly crucial in research and industrial settings where faster iterations can
accelerate scientific discoveries, product development, and the deployment of AI solutions. Moreover, as models grow
larger and more sophisticated, optimization plays a vital role in making advanced AI accessible and practical for a wider
range of applications and environments.
To address the need for optimization of deep learning models, and as an improvement over existing methods, PyTorch
introduced the torch.compile() function in PyTorch 2.0 to allow faster training and inference of models.
torch.compile() works by compiling PyTorch code into optimised kernels using a JIT (Just-In-Time) compiler. Different
models show varying levels of improvement in run times depending on their architecture and batch size when compiled.
Compared to existing methods like TorchScript or FX tracing, compile() also offers advantages such as the ability to
handle arbitrary Python code and data-dependent control flow in the models' inputs by breaking the graph where needed.
This allows compile() to work with minimal or no code modification to the model.
DeepChem has builtin support for compiling PyTorch models using torch.compile() and using this feature, users can
efficiently run PyTorch models and achieve significant performance gains. This tutorial contains the steps for compiling
DeepChem PyTorch models, benchmarking and evaluating their performance with the uncompiled models.
NOTE: DeepChem contains many models with varying architecture and complexity. Not all models will show
significant improvements in run times when compiled. It is recommended to test the models with and
without compilation to determine the performance improvements.
Colab
This tutorial and the rest in this sequence can be done in Google colab. If you'd like to open this notebook in colab, you
can use the following link.
Open in Colab
Compilation Process
This section gives an introductory explanation about the compilation process of PyTorch models and assumes prior
knowledge about forward pass, backward pass and computational graphs in neural networks. If you're unfamiliar with
these concepts, you can refer to these slides for a basic understanding. Alternatively, you can proceed to the next
section to learn how to compile and benchmark DeepChem models without delving into the internal details of the
compilation process.
Image taken from PyTorch2.0 Introductory Blog
The compilation process is split into multiple steps which use several new technologies introduced in PyTorch
2.0. The process is as follows:
1. Graph Acquisition: During the compilation process, TorchDynamo and AOTAutograd are used for capturing the
forward and backward pass graphs respectively. AOTAutograd allows the backward graph to be captured ahead of
time without needing a backward pass to be performed.
2. Graph Lowering: The captured graph that could be composed of the 2000+ PyTorch operators is lowered into a
collection of ~250 Prim and ~750 ATen operators.
3. Graph Compilation: In this step optimised low-level kernels are generated for the target accelerator using a
suitable backend compiler. TorchInductor is the default backend compiler used for this purpose.
Deepchem uses the torch.compile() function that implements all the above steps internally to compile the models.
The compiled model can be used for training, evaluation and inference.
For more information on the compilation process, refer to PyTorch2.0 Introductory Blog that does a deep dive into the
compilation process, technical decisions and future features for the compile function. You can also refer to the
Huggingface blog, Optimize inference using torch.compile() that benchmarks many common PyTorch models and shows
the performance improvements when compiled.
Compiling Models
The compile function is only available in DeepChem for models that use PyTorch as the backend (i.e., models that inherit
from the TorchModel class). You can see the complete list of models available in DeepChem and their backends in the
DeepChem Documentation here.
This tutorial contains the steps to load a DeepChem model, compile it and evaluate the performance improvements
when compiled for both training and inference. Refer to the documentation of DeepChem's compile function to read
more about the different parameters you can pass to the function and their usage.
If you just want to compile the model, you can add the line model.compile() after initialising the model. You DO NOT
have to make any changes to the rest of your code.
1. Selecting the right mode: The modes can be default , reduce-overhead , max-autotune or max-autotune-
no-cudagraphs . Of these, the reduce-overhead and max-autotune modes require triton to be installed. Refer
to the PyTorch docs on torch.compile for more information on the modes.
2. Setting fullgraph parameter: If True (default False ), torch.compile will require that the entire function be
capturable into a single graph. If this is not possible (that is, if there are graph breaks), then the function will raise
an error.
3. Experimenting with different parameter configuration: Different parameter configurations can give different
speedups based on the model, batch size and the device used for training/inference. Experiment with a few
parameter combinations to check which one gives better results.
In this tutorial, we will be using the DMPNN model and the FreeSolv dataset for training and inference.
Collecting deepchem
Downloading deepchem-2.8.1.dev20240624214143-py3-none-any.whl (1.1 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 6.9 MB/s eta 0:00:00
Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from deepchem) (1.4.2)
Requirement already satisfied: numpy<2 in /usr/local/lib/python3.10/dist-packages (from deepchem) (1.25.2)
Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from deepchem) (2.0.3)
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.10/dist-packages (from deepchem) (1.2.2)
Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from deepchem) (1.12.1)
Requirement already satisfied: scipy>=1.10.1 in /usr/local/lib/python3.10/dist-packages (from deepchem) (1.11.4)
Collecting rdkit (from deepchem)
Downloading rdkit-2024.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (35.1 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 35.1/35.1 MB 14.6 MB/s eta 0:00:00
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.10/dist-packages (from pandas->d
eepchem) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas->deepchem) (
2023.4)
Requirement already satisfied: tzdata>=2022.1 in /usr/local/lib/python3.10/dist-packages (from pandas->deepchem)
(2024.1)
Requirement already satisfied: Pillow in /usr/local/lib/python3.10/dist-packages (from rdkit->deepchem) (9.4.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-lear
n->deepchem) (3.5.0)
Requirement already satisfied: mpmath<1.4.0,>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from sympy->deep
chem) (1.3.0)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.2-
>pandas->deepchem) (1.16.0)
Installing collected packages: rdkit, deepchem
Successfully installed deepchem-2.8.1.dev20240624214143 rdkit-2024.3.1
Collecting torch_geometric
Downloading torch_geometric-2.5.3-py3-none-any.whl (1.1 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 10.8 MB/s eta 0:00:00
Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from torch_geometric) (4.66.4)
Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from torch_geometric) (1.25.2)
Requirement already satisfied: scipy in /usr/local/lib/python3.10/dist-packages (from torch_geometric) (1.11.4)
Requirement already satisfied: fsspec in /usr/local/lib/python3.10/dist-packages (from torch_geometric) (2023.6.
0)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch_geometric) (3.1.4)
Requirement already satisfied: aiohttp in /usr/local/lib/python3.10/dist-packages (from torch_geometric) (3.9.5)
Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from torch_geometric) (2.31.
0)
Requirement already satisfied: pyparsing in /usr/local/lib/python3.10/dist-packages (from torch_geometric) (3.1.
2)
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.10/dist-packages (from torch_geometric) (1
.2.2)
Requirement already satisfied: psutil>=5.8.0 in /usr/local/lib/python3.10/dist-packages (from torch_geometric) (
5.9.5)
Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.10/dist-packages (from aiohttp->torch_
geometric) (1.3.1)
Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->torch_geo
metric) (23.2.0)
Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from aiohttp->torch
_geometric) (1.4.1)
Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/dist-packages (from aiohttp->tor
ch_geometric) (6.0.5)
Requirement already satisfied: yarl<2.0,>=1.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->torch_ge
ometric) (1.9.4)
Requirement already satisfied: async-timeout<5.0,>=4.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp-
>torch_geometric) (4.0.3)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch_ge
ometric) (2.1.5)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from request
s->torch_geometric) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->torch_geo
metric) (3.7)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->tor
ch_geometric) (2.0.7)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->tor
ch_geometric) (2024.6.2)
Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from scikit-learn->torc
h_geometric) (1.4.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-lear
n->torch_geometric) (3.5.0)
Installing collected packages: torch_geometric
Successfully installed torch_geometric-2.5.3
Requirement already satisfied: triton in /usr/local/lib/python3.10/dist-packages (2.3.0)
Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from triton) (3.15.3)
import torch
import datetime
import numpy as np
import deepchem as dc
torch._dynamo.config.cache_size_limit = 64
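The dataset-loading cell is not shown above; a minimal sketch, assuming FreeSolv is loaded from MoleculeNet with the DMPNN featurizer and a random split:
featurizer = dc.feat.DMPNNFeaturizer()
tasks, datasets, transformers = dc.molnet.load_freesolv(featurizer=featurizer, splitter='random')
train_dataset, valid_dataset, test_dataset = datasets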
model = dc.models.DMPNNModel()
The line below is the only addition you have to make to your code to compile the model. You can also pass other
arguments to the compile() function if they are required.
model.compile()
model.fit(train_dataset, nb_epoch=10)
metrics = [dc.metrics.Metric(dc.metrics.mean_squared_error)]
print(f"Training MSE: {model.evaluate(train_dataset, metrics=metrics)}")
print(f"Validation MSE: {model.evaluate(valid_dataset, metrics=metrics)}")
print(f"Test MSE: {model.evaluate(test_dataset, metrics=metrics)}")
To account for the initial performance overhead of kernel compilation in compiled models, median values are employed
as the performance metric throughout the tutorial for calculating speedup.
The below two functions, time_torch_function and get_time_track_callback can be used for tracking the time
taken for inference and training respectively.
The implementation of time_torch_function is taken from the PyTorch official torch.compile tutorial here.
We use get_time_track_callback to make a callback that can track the time taken for each batch during training as
DeepChem does not provide a direct way to track the time taken per batch during training. We can use this callback by
passing it as an argument to model.fit() function.
def time_torch_function(fn):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    result = fn()
    end.record()
    torch.cuda.synchronize()
    return result, start.elapsed_time(end) / 1000
track_dict = {}
prev_time_dict = {}
def get_time_track_callback(track_dict, track_name, track_interval):
    track_dict[track_name] = []
    prev_time_dict[track_name] = datetime.datetime.now()

    def callback(model, step):
        if step % track_interval == 0:
            elapsed_time = datetime.datetime.now() - prev_time_dict[track_name]
            track_dict[track_name].append(elapsed_time.total_seconds())
            prev_time_dict[track_name] = datetime.datetime.now()

    return callback
model = dc.models.DMPNNModel()
model_compiled = dc.models.DMPNNModel()
model_compiled.compile(mode='reduce-overhead')
track_interval = 20
eager_dict_name = "eager_train"
compiled_dict_name = "compiled_train"
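A sketch of the timed training runs (the epoch count is an assumption); each model gets its own callback so the per-batch times end up under separate keys in track_dict, and DeepChem's fit invokes each callback as callback(model, step):
eager_callback = get_time_track_callback(track_dict, eager_dict_name, track_interval)
model.fit(train_dataset, nb_epoch=10, callbacks=[eager_callback])

compiled_callback = get_time_track_callback(track_dict, compiled_dict_name, track_interval)
model_compiled.fit(train_dataset, nb_epoch=10, callbacks=[compiled_callback])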
0.06506308714548746
eager_train_times = track_dict[eager_dict_name]
compiled_train_times = track_dict[compiled_dict_name]
Eager Times (first 15): ['1.067', '0.112', '0.093', '0.097', '0.102', '0.098', '0.095', '0.097', '0.099', '0.098
', '0.097', '0.103', '0.095', '0.103', '0.096']
Compiled Times (first 15): ['29.184', '21.463', '11.503', '13.742', '1.951', '5.595', '7.568', '8.201', '7.761',
'0.083', '7.087', '2.421', '1.961', '0.079', '1.948']
Total Eager Time: 29.176121000000023
Total Compiled Time: 243.32460400000022
Eager Median: 0.100118
Compiled Median: 0.0843535
Median Speedup: 18.69%
model = dc.models.DMPNNModel()
model_compiled = dc.models.DMPNNModel()
model_compiled.compile(mode='reduce-overhead')
iters = 100
eager_predict_times = []
compiled_predict_times = []
for i in range(iters):
    for X, y, w, ids in test_dataset.iterbatches(64, pad_batches=True):
        with torch.no_grad():
            _, eager_time = time_torch_function(lambda: model.predict_on_batch(X))
            _, compiled_time = time_torch_function(lambda: model_compiled.predict_on_batch(X))
        eager_predict_times.append(eager_time)
        compiled_predict_times.append(compiled_time)
Eager Times (first 15): ['0.170', '0.173', '0.161', '0.160', '0.160', '0.165', '0.158', '0.159', '0.164', '0.161
', '0.162', '0.154', '0.159', '0.161', '0.162']
Compiled Times (first 15): ['47.617', '1.168', '26.927', '0.127', '0.134', '0.138', '0.130', '0.130', '0.133', '
0.125', '0.130', '0.132', '0.139', '0.128', '0.133']
Total Eager Time: 35.297711242675796
Total Compiled Time: 104.20891365814221
Eager Median: 0.1617226104736328
Compiled Median: 0.1332385482788086
Median Speedup: 21.38%
<matplotlib.legend.Legend at 0x7c7a040c9c30>
As with the training results, the first few inference runs also take significantly more time, for the same reason
mentioned before.
Colab
This tutorial and the rest in this sequence can be done in Google colab. If you'd like to open this notebook in colab, you
can use the following link.
Open in Colab
import deepchem as dc
dc.__version__
'2.4.0-rc1.dev'
What is a Fingerprint?
Deep learning models almost always take arrays of numbers as their inputs. If we want to process molecules with them,
we somehow need to represent each molecule as one or more arrays of numbers.
Many (but not all) types of models require their inputs to have a fixed size. This can be a challenge for molecules, since
different molecules have different numbers of atoms. If we want to use these types of models, we somehow need to
represent variable sized molecules with fixed sized arrays.
Fingerprints are designed to address these problems. A fingerprint is a fixed length array, where different elements
indicate the presence of different features in the molecule. If two molecules have similar fingerprints, that indicates they
contain many of the same features, and therefore will likely have similar chemistry.
DeepChem supports a particular type of fingerprint called an "Extended Connectivity Fingerprint", or "ECFP" for short.
They also are sometimes called "circular fingerprints". The ECFP algorithm begins by classifying atoms based only on
their direct properties and bonds. Each unique pattern is a feature. For example, "carbon atom bonded to two
hydrogens and two heavy atoms" would be a feature, and a particular element of the fingerprint is set to 1 for any
molecule that contains that feature. It then iteratively identifies new features by looking at larger circular
neighborhoods. One specific feature bonded to two other specific features becomes a higher level feature, and the
corresponding element is set for any molecule that contains it. This continues for a fixed number of iterations, most
often two.
Let's take a look at a dataset that has been featurized with ECFP.
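The loading cell is not shown; a sketch using the MoleculeNet Tox21 loader with ECFP featurization:
tasks, datasets, transformers = dc.molnet.load_tox21(featurizer='ECFP')
train_dataset, valid_dataset, test_dataset = datasets
print(train_dataset)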
<DiskDataset X.shape: (6264, 1024), y.shape: (6264, 12), w.shape: (6264, 12), task_names: ['NR-AR' 'NR-AR-LBD' '
NR-AhR' ... 'SR-HSE' 'SR-MMP' 'SR-p53']>
The feature array X has shape (6264, 1024). That means there are 6264 samples in the training set. Each one is
represented by a fingerprint of length 1024. Also notice that the label array y has shape (6264, 12): this is a multitask
dataset. Tox21 contains information about the toxicity of molecules. 12 different assays were used to look for signs of
toxicity. The dataset records the results of all 12 assays, each as a different task.
train_dataset.w
array([[1.0433141624730409, 1.0369942196531792, 8.53921568627451, ...,
1.060388945752303, 1.1895710249165168, 1.0700990099009902],
[1.0433141624730409, 1.0369942196531792, 1.1326397919375812, ...,
0.0, 1.1895710249165168, 1.0700990099009902],
[0.0, 0.0, 0.0, ..., 1.060388945752303, 0.0, 0.0],
...,
[0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0],
[1.0433141624730409, 1.0369942196531792, 8.53921568627451, ...,
1.060388945752303, 0.0, 0.0],
[1.0433141624730409, 1.0369942196531792, 1.1326397919375812, ...,
1.060388945752303, 1.1895710249165168, 1.0700990099009902]],
dtype=object)
Notice that some elements are 0. The weights are being used to indicate missing data. Not all assays were actually
performed on every molecule. Setting the weight for a sample or sample/task pair to 0 causes it to be ignored during
fitting and evaluation. It will have no effect on the loss function or other metrics.
Most of the other weights are close to 1, but not exactly 1. This is done to balance the overall weight of positive and
negative samples on each task. When training the model, we want each of the 12 tasks to contribute equally, and on
each task we want to put equal weight on positive and negative samples. Otherwise, the model might just learn that
most of the training samples are non-toxic, and therefore become biased toward identifying other molecules as non-
toxic.
MultitaskClassifier is a simple stack of fully connected layers. In this example we tell it to use a single hidden layer
of width 1000. We also tell it that each input will have 1024 features, and that it should produce predictions for 12
different tasks.
Why not train a separate model for each task? We could do that, but it turns out that training a single model for multiple
tasks often works better. We will see an example of that in a later tutorial.
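The model-construction cell is not shown; a minimal sketch matching the description above (one hidden layer of width 1000, 1024 input features, 12 tasks):
model = dc.models.MultitaskClassifier(n_tasks=12, n_features=1024, layer_sizes=[1000])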
import numpy as np
model.fit(train_dataset, nb_epoch=10)
metric = dc.metrics.Metric(dc.metrics.roc_auc_score)
print('training set score:', model.evaluate(train_dataset, [metric], transformers))
print('test set score:', model.evaluate(test_dataset, [metric], transformers))
Not bad performance for such a simple model and featurization. More sophisticated models do slightly better on this
dataset, but not enormously better.
@manual{Intro4,
title={Molecular Fingerprints},
organization={DeepChem},
author={Ramsundar, Bharath},
howpublished = {\url{https://github.com/deepchem/deepchem/blob/master/examples/tutorials/Molecular_Fingerprints.ipyn
year={2021},
}
Going Deeper On Molecular Featurizations
One of the most important steps of doing machine learning on molecular data is transforming the data into a form
amenable to the application of learning algorithms. This process is broadly called "featurization" and involves turning a
molecule into a vector or tensor of some sort. There are a number of different ways of doing that, and the choice of
featurization is often dependent on the problem at hand. We have already seen two such methods: molecular
fingerprints, and ConvMol objects for use with graph convolutions. In this tutorial we will look at some of the others.
Colab
This tutorial and the rest in this sequence can be done in Google colab. If you'd like to open this notebook in colab, you
can use the following link.
Open in Colab
'2.6.0.dev'
Featurizers
In DeepChem, a method of featurizing a molecule (or any other sort of input) is defined by a Featurizer object. There
are three different ways of using featurizers.
1. When using the MoleculeNet loader functions, you simply pass the name of the featurization method to use. We
have seen examples of this in earlier tutorials, such as featurizer='ECFP' or featurizer='GraphConv' .
2. You also can create a Featurizer and directly apply it to molecules. For example:
import deepchem as dc
featurizer = dc.feat.CircularFingerprint()
print(featurizer(['CC', 'CCC', 'CCO']))
3. When creating a new dataset with the DataLoader framework, you can specify a Featurizer to use for processing the
data. We will see this in a future tutorial.
We use propane (CH3CH2CH3, represented by the SMILES string 'CCC' ) as a running example throughout this tutorial.
Many of the featurization methods use conformers of the molecules. A conformer can be generated using the
ConformerGenerator class in deepchem.utils.conformers .
RDKitDescriptors
RDKitDescriptors featurizes a molecule by using RDKit to compute values for a list of descriptors. These are basic
physical and chemical properties: molecular weight, polar surface area, numbers of hydrogen bond donors and
acceptors, etc. This is most useful for predicting things that depend on these high level properties rather than on
detailed molecular structure.
Intrinsic to the featurizer is a set of allowed descriptors, which can be accessed using
RDKitDescriptors.allowedDescriptors . The featurizer uses the descriptors in
rdkit.Chem.Descriptors.descList , checks if they are in the list of allowed descriptors, and computes the descriptor
value for the molecule.
Let's print the values of the first ten descriptors for propane.
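The featurization cell is not shown; a sketch, assuming the featurizer exposes the descriptor names through a descriptors attribute:
rdkit_featurizer = dc.feat.RDKitDescriptors()
features = rdkit_featurizer(['CCC'])[0]
for name, value in zip(rdkit_featurizer.descriptors, features[:10]):
    print(name, value)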
MaxAbsEStateIndex 2.125
MaxEStateIndex 2.125
MinAbsEStateIndex 1.25
MinEStateIndex 1.25
qed 0.3854706587740357
SPS 6.0
MolWt 44.097
HeavyAtomMolWt 36.033
ExactMolWt 44.062600255999996
NumValenceElectrons 20.0
[09:07:17] DEPRECATION WARNING: please use MorganGenerator
[09:07:17] DEPRECATION WARNING: please use MorganGenerator
[09:07:17] DEPRECATION WARNING: please use MorganGenerator
DeepChem supports lots of different graph based models. Some of them require molecules to be featurized in slightly
different ways. Because of this, there are two other featurizers called WeaveFeaturizer and
MolGraphConvFeaturizer . They each convert molecules into a different type of Python object that is used by
particular models. When using any graph based model, just check the documentation to see what featurizer you need to
use with it.
CoulombMatrix
All the models we have looked at so far consider only the intrinsic properties of a molecule: the list of atoms that
compose it and the bonds connecting them. When working with flexible molecules, you may also want to consider the
different conformations the molecule can take on. For example, when a drug molecule binds to a protein, the strength of
the binding depends on specific interactions between pairs of atoms. To predict binding strength, you probably want to
consider a variety of possible conformations and use a model that takes them into account when making predictions.
The Coulomb matrix is one popular featurization for molecular conformations. Recall that the electrostatic Coulomb
interaction between two charges is proportional to q1*q2/r, where q1 and q2 are the charges and r is the distance between
them. For a molecule with N atoms, the Coulomb matrix is an N by N matrix where each element gives the strength of the
electrostatic interaction between two atoms. It contains information both about the charges on the atoms and the
distances between them. More information on the functional forms used can be found here.
To apply this featurizer, we first need a set of conformations for the molecule. We can use the ConformerGenerator
class to do this. It takes a RDKit molecule, generates a set of energy minimized conformers, and prunes the set to only
include ones that are significantly different from each other. Let's try running it for propane.
generator = dc.utils.ConformerGenerator(max_conformers=5)
propane_mol = generator.generate_conformers(Chem.MolFromSmiles('CCC'))
print("Number of available conformers for propane: ", len(propane_mol.GetConformers()))
Number of available conformers for propane: 1
It only found a single conformer. This shouldn't be surprising, since propane is a very small molecule with hardly any
flexibility. Let's try adding another carbon.
butane_mol = generator.generate_conformers(Chem.MolFromSmiles('CCCC'))
print("Number of available conformers for butane: ", len(butane_mol.GetConformers()))
coulomb_mat = dc.feat.CoulombMatrix(max_atoms=20)
features = coulomb_mat(propane_mol)
print(features)
[[[36.8581052 12.48684429 7.5619687 2.85945193 2.85804514
2.85804556 1.4674015 1.46740144 0.91279491 1.14239698
1.14239675 0. 0. 0. 0.
0. 0. 0. 0. 0. ]
[12.48684429 36.8581052 12.48684388 1.46551218 1.45850736
1.45850732 2.85689525 2.85689538 1.4655122 1.4585072
1.4585072 0. 0. 0. 0.
0. 0. 0. 0. 0. ]
[ 7.5619687 12.48684388 36.8581052 0.9127949 1.14239695
1.14239692 1.46740146 1.46740145 2.85945178 2.85804504
2.85804493 0. 0. 0. 0.
0. 0. 0. 0. 0. ]
[ 2.85945193 1.46551218 0.9127949 0.5 0.29325367
0.29325369 0.21256978 0.21256978 0.12268391 0.13960187
0.13960185 0. 0. 0. 0.
0. 0. 0. 0. 0. ]
[ 2.85804514 1.45850736 1.14239695 0.29325367 0.5
0.29200271 0.17113413 0.21092513 0.13960186 0.1680002
0.20540029 0. 0. 0. 0.
0. 0. 0. 0. 0. ]
[ 2.85804556 1.45850732 1.14239692 0.29325369 0.29200271
0.5 0.21092513 0.17113413 0.13960187 0.20540032
0.16800016 0. 0. 0. 0.
0. 0. 0. 0. 0. ]
[ 1.4674015 2.85689525 1.46740146 0.21256978 0.17113413
0.21092513 0.5 0.29351308 0.21256981 0.2109251
0.17113412 0. 0. 0. 0.
0. 0. 0. 0. 0. ]
[ 1.46740144 2.85689538 1.46740145 0.21256978 0.21092513
0.17113413 0.29351308 0.5 0.21256977 0.17113412
0.21092513 0. 0. 0. 0.
0. 0. 0. 0. 0. ]
[ 0.91279491 1.4655122 2.85945178 0.12268391 0.13960186
0.13960187 0.21256981 0.21256977 0.5 0.29325366
0.29325365 0. 0. 0. 0.
0. 0. 0. 0. 0. ]
[ 1.14239698 1.4585072 2.85804504 0.13960187 0.1680002
0.20540032 0.2109251 0.17113412 0.29325366 0.5
0.29200266 0. 0. 0. 0.
0. 0. 0. 0. 0. ]
[ 1.14239675 1.4585072 2.85804493 0.13960185 0.20540029
0.16800016 0.17113412 0.21092513 0.29325365 0.29200266
0.5 0. 0. 0. 0.
0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0.
0. 0. 0. 0. 0.
0. 0. 0. 0. 0.
0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0.
0. 0. 0. 0. 0.
0. 0. 0. 0. 0.
0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0.
0. 0. 0. 0. 0.
0. 0. 0. 0. 0.
0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0.
0. 0. 0. 0. 0.
0. 0. 0. 0. 0.
0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0.
0. 0. 0. 0. 0.
0. 0. 0. 0. 0.
0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0.
0. 0. 0. 0. 0.
0. 0. 0. 0. 0.
0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0.
0. 0. 0. 0. 0.
0. 0. 0. 0. 0.
0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0.
0. 0. 0. 0. 0.
0. 0. 0. 0. 0.
0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0.
0. 0. 0. 0. 0.
0. 0. 0. 0. 0.
0. 0. 0. 0. 0. ]]]
Notice that many elements are 0. To combine multiple molecules in a batch we need all the Coulomb matrices to be the
same size, even if the molecules have different numbers of atoms. We specified max_atoms=20 , so the returned matrix
has size (20, 20). The molecule only has 11 atoms, so only an 11 by 11 submatrix is nonzero.
CoulombMatrixEig
An important feature of Coulomb matrices is that they are invariant to molecular rotation and translation, since the
interatomic distances and atomic numbers do not change. Respecting symmetries like this makes learning easier.
Rotating a molecule does not change its physical properties. If the featurization does change, then the model is forced
to learn that rotations are not important, but if the featurization is invariant then the model gets this property
automatically.
Coulomb matrices are not invariant under another important symmetry: permutations of the atoms' indices. A
molecule's physical properties do not depend on which atom we call "atom 1", but the Coulomb matrix does. To deal
with this, the CoulombMatrixEig featurizer was introduced, which uses the eigenvalue spectrum of the Coulomb
matrix and is invariant to random permutations of the atoms' indices. The disadvantage of this featurization is that it
contains much less information (N eigenvalues instead of an N by N matrix).
CoulombMatrixEig inherits from CoulombMatrix and featurizes a molecule by first computing the Coulomb matrices
for different conformers of the molecule and then computing the eigenvalues for each Coulomb matrix. These
eigenvalues are then padded to account for variation in number of atoms across molecules.
coulomb_mat_eig = dc.feat.CoulombMatrixEig(max_atoms=20)
features = coulomb_mat_eig(propane_mol)
print(features)
To prepare SMILES strings for a sequence model, we break them down into lists of substrings (called tokens) and turn
them into lists of integer values (numericalization). Sequence models use those integer values as indices of an
embedding matrix, which contains a vector of floating-point numbers for each token in the vocabulary. These
embedding vectors are updated during model training. This process allows the sequence model to learn its own
representations of the molecular properties implicit in the training data.
We will use DeepChem's BasicSmilesTokenizer and the Tox21 dataset from MoleculeNet to demonstrate the process
of tokenizing SMILES.
import numpy as np
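The loading cell is not shown; a sketch using the MoleculeNet Tox21 loader with raw (unfeaturized) molecules so we can tokenize the SMILES ourselves:
tasks, datasets, transformers = dc.molnet.load_tox21(featurizer="Raw")
train_dataset, valid_dataset, test_dataset = datasets
print(train_dataset)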
<DiskDataset X.shape: (6264,), y.shape: (6264, 12), w.shape: (6264, 12), task_names: ['NR-AR' 'NR-AR-LBD' 'NR-Ah
R' ... 'SR-HSE' 'SR-MMP' 'SR-p53']>
We loaded the datasets with featurizer="Raw" . Now we obtain the SMILES from their ids attributes.
train_smiles = train_dataset.ids
valid_smiles = valid_dataset.ids
test_smiles = test_dataset.ids
print(train_smiles[:5])
['CC(O)(P(=O)(O)O)P(=O)(O)O' 'CC(C)(C)OOC(C)(C)CCC(C)(C)OOC(C)(C)C'
'OC[C@H](O)[C@@H](O)[C@H](O)CO'
'CCCCCCCC(=O)[O-].CCCCCCCC(=O)[O-].[Zn+2]' 'CC(C)COC(=O)C(C)C']
Next we define our tokenizer and map it onto all our data to convert the SMILES strings into lists of tokens. The
BasicSmilesTokenizer breaks down SMILES roughly at atom level.
tokenizer = dc.feat.smiles_tokenizer.BasicSmilesTokenizer()
train_tok = list(map(tokenizer.tokenize, train_smiles))
valid_tok = list(map(tokenizer.tokenize, valid_smiles))
test_tok = list(map(tokenizer.tokenize, test_smiles))
print(train_tok[0])
len(train_tok)
['C', 'C', '(', 'O', ')', '(', 'P', '(', '=', 'O', ')', '(', 'O', ')', 'O', ')', 'P', '(', '=', 'O', ')', '(', '
O', ')', 'O']
6264
Now we have tokenized versions of all SMILES strings in our dataset. To convert those into lists of integer values we first
need to create a list of all possible tokens in our dataset. That list is called the vocabulary. We also add the empty string
"" to our vocabulary in order to correctly handle trailing zeros when decoding zero-padded numericalized SMILES.
['', '#', '(', ')', '-', '.', '/', '1', '2', '3', '4', '5'] ... ['[n+]', '[n-]', '[nH+]', '[nH]', '[o+]', '[s+]'
, '[se]', '\\', 'c', 'n', 'o', 's']
128
To numericalize tokenized SMILES strings we create a str2int dictionary which assigns a number to each token in the
dictionary. We also create the reverse int2str dictionary and define the corresponding encode and decode
functions. Finally we map the encode function on the tokenized data to obtain numericalized SMILES data.
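A sketch of the numericalization helpers described above; the round-trip prints match the pattern of the output below:
str2int = {tok: i for i, tok in enumerate(vocab)}
int2str = {i: tok for tok, i in str2int.items()}

def encode(tok_list):
    return [str2int[tok] for tok in tok_list]

def decode(int_list):
    return "".join(int2str[i] for i in int_list)

train_num = list(map(encode, train_tok))
print(train_smiles[0])
print(train_num[0])
print(decode(train_num[0]))
print(encode(tokenizer.tokenize(decode(train_num[0]))))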
CC(O)(P(=O)(O)O)P(=O)(O)O
[19, 19, 2, 24, 3, 2, 25, 2, 16, 24, 3, 2, 24, 3, 24, 3, 25, 2, 16, 24, 3, 2, 24, 3, 24]
CC(O)(P(=O)(O)O)P(=O)(O)O
[19, 19, 2, 24, 3, 2, 25, 2, 16, 24, 3, 2, 24, 3, 24, 3, 25, 2, 16, 24, 3, 2, 24, 3, 24]
Lastly, we would like to combine all molecules in a dataset in an np.array so they can be served to a model in
batches. To achieve that, all sequences have to be of the same length. As in the CoulombMatrix section, we achieve that
by appending zeros up to a fixed value.
240
The longest sequence across all Tox21 datasets has length 240 , so we use that as our fixed length. We create a
zero_pad function, map it to all numericalized SMILES, and turn them into np.array s.
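A sketch of the padding step (names are assumptions); decoding a padded row at an arbitrary index should reproduce the original SMILES, because the empty string sits at index 0 of the vocabulary:
max_len = 240

def zero_pad(int_list, max_len=max_len):
    # Append zeros (the index of the empty-string token) up to the fixed length.
    return int_list + [0] * (max_len - len(int_list))

train_X = np.array(list(map(zero_pad, train_num)))
valid_X = np.array(list(map(zero_pad, map(encode, valid_tok))))
test_X = np.array(list(map(zero_pad, map(encode, test_tok))))

i = 1000
print(train_smiles[i])
print(decode(train_X[i]))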
Cc1cc(C(C)(C)c2ccc(O)c(C)c2)ccc1O
Cc1cc(C(C)(C)c2ccc(O)c(C)c2)ccc1O
The padded data passes the test. It is now in the correct format to be used for training a sequence model, but it
doesn't yet interface nicely with DeepChem's training framework. To change that, we define a tokenize_smiles
function that combines all the steps spelled out above to process a single datapoint. Additionally, we define a
SmilesFeaturizer that uses our custom tokenize_smiles function in its _featurize method and instantiate it as
smiles_featurizer , passing it our vocab and max_len .
class SmilesFeaturizer(dc.feat.Featurizer):
    def __init__(self, feat_func, vocab, max_len):
        self.feat_func = feat_func
        self.vocab = vocab
        self.max_len = max_len

    def _featurize(self, datapoint, **kwargs):
        # Assumed sketch: apply the tokenize/numericalize/pad pipeline
        # (tokenize_smiles) to a single SMILES string.
        return self.feat_func(datapoint, self.vocab, self.max_len)
Finally, we use the smiles_featurizer to create new Tox21 datasets that contain tokenized and numericalized
SMILES in their X attribute.
The datasets are now ready to be used with your custom DeepChem sequence model. Don't forget to wrap your model
into the appropriate DeepChem model class.
@manual{Intro7,
title={Going Deeper on Molecular Featurizations},
organization={DeepChem},
author={Ramsundar, Bharath},
howpublished = {\url{https://github.com/deepchem/deepchem/blob/master/examples/tutorials/Going_Deeper_on_Molecular_F
year={2021},
}
Learning Unsupervised Embeddings for Molecules
In this tutorial, we will use a SeqToSeq model to generate fingerprints for classifying molecules. This is based on the
following paper, although some of the implementation details are different: Xu et al., "Seq2seq Fingerprint: An
Unsupervised Deep Molecular Embedding for Drug Discovery" (https://doi.org/10.1145/3107411.3107424).
Colab
This tutorial and the rest in this sequence can be done in Google colab. If you'd like to open this notebook in colab, you
can use the following link.
Open in Colab
A SeqToSeq model performs sequence to sequence translation. For example, they are often used to translate text from
one language to another. It consists of two parts called the "encoder" and "decoder". The encoder is a stack of recurrent
layers. The input sequence is fed into it, one token at a time, and it generates a fixed length vector called the
"embedding vector". The decoder is another stack of recurrent layers that performs the inverse operation: it takes the
embedding vector as input, and generates the output sequence. By training it on appropriately chosen input/output
pairs, you can create a model that performs many sorts of transformations.
In this case, we will use SMILES strings describing molecules as the input sequences. We will train the model as an
autoencoder, so it tries to make the output sequences identical to the input sequences. For that to work, the encoder
must create embedding vectors that contain all information from the original sequence. That's exactly what we want in
a fingerprint, so perhaps those embedding vectors will then be useful as a way to represent molecules in other models!
Let's start by loading the data. We will use the MUV dataset. It includes 74,501 molecules in the training set, and 9313
molecules in the validation set, so it gives us plenty of SMILES strings to work with.
import deepchem as dc
tasks, datasets, transformers = dc.molnet.load_muv(split='stratified')
train_dataset, valid_dataset, test_dataset = datasets
train_smiles = train_dataset.ids
valid_smiles = valid_dataset.ids
We need to define the "alphabet" for our SeqToSeq model, the list of all tokens that can appear in sequences. (It's also
possible for input and output sequences to have different alphabets, but since we're training it as an autoencoder,
they're identical in this case.) Make a list of every character that appears in any training sequence.
tokens = set()
for s in train_smiles:
    tokens = tokens.union(set(c for c in s))
tokens = sorted(list(tokens))
Create the model and define the optimization method to use. In this case, learning works much better if we gradually
decrease the learning rate. We use an ExponentialDecay to multiply the learning rate by 0.9 after each epoch.
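The model-construction cell is not shown; a sketch using DeepChem's SeqToSeq model, where the layer counts, the embedding size (256, matching the n_features used by the classifier later in this tutorial), the batch size, and the initial learning rate are all assumptions:
from deepchem.models.optimizers import ExponentialDecay

max_length = max(len(s) for s in train_smiles)
batch_size = 100
batches_per_epoch = len(train_smiles) / batch_size
model = dc.models.SeqToSeq(tokens,
                           tokens,
                           max_length,
                           encoder_layers=2,
                           decoder_layers=2,
                           embedding_dimension=256,
                           batch_size=batch_size,
                           learning_rate=ExponentialDecay(0.001, 0.9, batches_per_epoch))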
Let's train it! The input to fit_sequences() is a generator that produces input/output pairs. On a good GPU, this
should take a few hours or less.
def generate_sequences(epochs):
    for i in range(epochs):
        for s in train_smiles:
            yield (s, s)
model.fit_sequences(generate_sequences(40))
Let's see how well it works as an autoencoder. We'll run the first 500 molecules from the validation set through it, and
see how many of them are exactly reproduced.
predicted = model.predict_from_sequences(valid_smiles[:500])
count = 0
for s, p in zip(valid_smiles[:500], predicted):
    if ''.join(p) == s:
        count += 1
print('reproduced', count, 'of 500 validation SMILES strings')
Now we'll try using the encoder as a way to generate molecular fingerprints. We compute the embedding vectors for
all molecules in the training and validation datasets, and create new datasets that have those as their feature vectors.
The amount of data is small enough that we can just store everything in memory.
import numpy as np
train_embeddings = model.predict_embeddings(train_smiles)
train_embeddings_dataset = dc.data.NumpyDataset(train_embeddings,
train_dataset.y,
train_dataset.w.astype(np.float32),
train_dataset.ids)
valid_embeddings = model.predict_embeddings(valid_smiles)
valid_embeddings_dataset = dc.data.NumpyDataset(valid_embeddings,
valid_dataset.y,
valid_dataset.w.astype(np.float32),
valid_dataset.ids)
For classification, we'll use a simple fully connected network with one hidden layer.
classifier = dc.models.MultitaskClassifier(n_tasks=len(tasks),
n_features=256,
layer_sizes=[512])
classifier.fit(train_embeddings_dataset, nb_epoch=10)
0.0014195525646209716
Find out how well it worked. Compute the ROC AUC for the training and validation datasets.
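A sketch of that evaluation, using the mean ROC AUC across the MUV tasks:
metric = dc.metrics.Metric(dc.metrics.roc_auc_score, np.mean, mode="classification")
print('Training set ROC AUC:', classifier.evaluate(train_embeddings_dataset, [metric], transformers))
print('Validation set ROC AUC:', classifier.evaluate(valid_embeddings_dataset, [metric], transformers))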
The idea of the model is to train on pairs of molecules where one molecule is "more complex" than the other. The neural
network can then produce scores that attempt to preserve this pairwise ordering of molecules. The final result is a model
which can give the relative complexity of a molecule.
The paper trains on every reaction in Reaxys, declaring products more complex than reactants. Since this training set is
prohibitively expensive, we will instead train on arbitrary molecules, declaring one more complex if its SMILES string is
longer. In the real world you can use whatever measure of complexity makes sense for the project.
In this tutorial, we'll use the Tox21 dataset to train our simple synthetic feasibility model.
Colab
This tutorial and the rest in this sequence can be done in Google colab. If you'd like to open this notebook in colab, you
can use the following link.
Open in Colab
import deepchem as dc
tasks, datasets, transformers = dc.molnet.load_tox21(featurizer='Raw', splitter=None)
molecules = datasets[0].X
Because ScScore is trained on relative complexities, we want the X tensor in our dataset to have 3 dimensions
(sample_id, molecule_id, features) . The molecule_id dimension has size 2 because a sample is a pair of
molecules. The label is 1 if the first molecule is more complex than the second molecule. The function create_dataset
we introduce below pulls random pairs of SMILES strings out of a given list and ranks them according to this complexity
measure.
In the real world you could use purchase cost, or number of reaction steps required as your complexity score.
import random
import numpy as np

# The function signature below is inferred from the call
# create_dataset(train_features, train_smiles_len) later in this tutorial;
# the ds_size default is an assumption.
def create_dataset(fingerprints, smiles_lens, ds_size=100000):
    """
    returns:
        dc.data.Dataset for input into ScScore Model

    Dataset.X
        shape is (sample_id, molecule_id, features)
    Dataset.y
        shape is (sample_id,)
        values is 1 if the 0th index molecule is more complex
                  0 if the 1st index molecule is more complex
    """
    X, y = [], []
    all_data = list(zip(fingerprints, smiles_lens))
    while len(y) < ds_size:
        i1 = random.randrange(0, len(smiles_lens))
        i2 = random.randrange(0, len(smiles_lens))
        m1 = all_data[i1]
        m2 = all_data[i2]
        if m1[1] == m2[1]:
            continue
        if m1[1] > m2[1]:
            y.append(1.0)
        else:
            y.append(0.0)
        X.append([m1[0], m2[0]])
    return dc.data.NumpyDataset(np.array(X), np.expand_dims(np.array(y), axis=1))
With our complexity ranker in place we can now construct our dataset. Let's start by randomly splitting the list of
molecules into training and test sets.
molecule_ds = dc.data.NumpyDataset(np.array(molecules))
splitter = dc.splits.RandomSplitter()
train_mols, test_mols = splitter.train_test_split(molecule_ds)
We'll featurize all our molecules with the ECFP fingerprint with chirality (matching the source paper), and will then
construct our pairwise dataset using the function defined above. We are using the CircularFingerprint featurizer and
specifying parameters such as the fingerprint size (n_features), the fingerprint radius (radius), and whether to consider
chirality (chiral). The circular fingerprint is a popular type of molecular fingerprint that encodes the structural information
of molecules.
n_features = 1024
featurizer = dc.feat.CircularFingerprint(size=n_features, radius=2, chiral=True)
train_features = featurizer.featurize(train_mols.X)
train_smiles_len = [len(Chem.MolToSmiles(x)) for x in train_mols.X]
train_dataset = create_dataset(train_features, train_smiles_len)
Now that we have our dataset created, let's train a ScScoreModel on this dataset.
model = dc.models.ScScoreModel(n_features=n_features)
model.fit(train_dataset, nb_epoch=20)
0.03494557857513428
Model Performance
Let's evaluate how well the model does on our holdout molecules. The ScScores should track the length of the SMILES
strings of never-before-seen molecules.
mol_scores = model.predict_mols(test_mols.X)
smiles_lengths = [len(Chem.MolToSmiles(x)) for x in test_mols.X]
Let's now plot the length of the SMILES string of each molecule against its ScScore using matplotlib.
plt.figure(figsize=(20,16))
plt.scatter(smiles_lengths, mol_scores)
plt.xlim(0,80)
plt.xlabel("SMILES length")
plt.ylabel("ScScore")
plt.show()
As we can see, the model generally tracks SMILES length. It shows good enrichment between 8 and 30 characters and gets
both the small and large SMILES extremes dead on.
Now you can train your own models on more meaningful metrics than SMILES length!
Bibliography:
[1] https://pubs.acs.org/doi/abs/10.1021/acs.jcim.7b00622
Calculating Atomic Contributions for Molecules Based on a
Graph Convolutional QSAR Model
In an earlier tutorial we introduced the concept of model interpretability: understanding why a model produced the
result it did. In this tutorial we will learn about atomic contributions, a useful tool for interpreting models that operate on
molecules.
The idea is simple: remove a single atom from the molecule and see how the model's prediction changes. The "atomic
contribution" for an atom is defined as the difference in activity between the whole molecule, and the fragment
remaining after atom removal. It is a measure of how much that atom affects the prediction.
Contributions are also known as "attributions", "coloration", etc. in the literature. This is a model interpretation method
[1], analogous to similarity maps [2] in the QSAR domain, or occlusion methods in other fields (image classification, etc.).
The present implementation was used in [4].
Mariia Matveieva, Pavel Polishchuk. Institute of Molecular and Translational Medicine, Palacky University, Olomouc,
Czech Republic.
Colab
This tutorial and the rest in this sequence can be done in Google colab. If you'd like to open this notebook in colab, you
can use the following link.
Open in Colab
Setup
To run DeepChem within Colab, you'll need to run the following installation commands. This will take about 5 minutes to
run to completion and install your environment. You can of course run this tutorial locally if you prefer. In that case,
don't run these cells since they will download and install Anaconda on your local machine.
First let's create the dataset. The molecules are stored in an SDF file.
import os
import pandas as pd
import deepchem as dc
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Draw, PyMol, rdFMCS
from rdkit.Chem.Draw import IPythonConsole
from rdkit import rdBase
from deepchem import metrics
from IPython.display import Image, display
from rdkit.Chem.Draw import SimilarityMaps
import tensorflow as tf
current_dir = os.path.dirname(os.path.realpath('__file__'))
dc.utils.download_url(
'https://raw.githubusercontent.com/deepchem/deepchem/master/examples/tutorials/assets/atomic_contributions_tutori
current_dir,
'logBB.sdf'
)
DATASET_FILE = os.path.join(current_dir, 'logBB.sdf')
# Create RDKit mol objects, since we will need them later.
mols = [m for m in Chem.SDMolSupplier(DATASET_FILE) if m is not None ]
loader = dc.data.SDFLoader(tasks=["logBB_class"],
featurizer=dc.feat.ConvMolFeaturizer(),
sanitize=True)
dataset = loader.create_dataset(DATASET_FILE, shard_size=2000)
np.random.seed(2020)
tf.random.set_seed(2020)
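The cell that builds and trains the classification model (referred to below as m) does not appear in this text; a minimal
sketch of what it might look like, mirroring the regression example later in this tutorial (the epoch count and the
batch_normalize setting are assumptions):
# Hypothetical reconstruction of the missing training cell; hyperparameters are assumptions.
m = dc.models.GraphConvModel(n_tasks=1, mode="classification", batch_normalize=False)
m.fit(dataset, nb_epoch=40)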
current_dir = os.path.dirname(os.path.realpath('__file__'))
dc.utils.download_url(
'https://raw.githubusercontent.com/deepchem/deepchem/master/examples/tutorials/assets/atomic_contributions_tutori
current_dir,
'logBB_test_.sdf'
)
TEST_DATASET_FILE = os.path.join(current_dir, 'logBB_test_.sdf')
loader = dc.data.SDFLoader(tasks=["p_np"], sanitize=True,
featurizer=dc.feat.ConvMolFeaturizer())
test_dataset = loader.create_dataset(TEST_DATASET_FILE, shard_size=2000)
pred = m.predict(test_dataset)
pred = np.argmax(np.squeeze(pred),axis=1)
ba = metrics.balanced_accuracy_score(y_true=test_dataset.y, y_pred=pred)
print(ba)
0.7444444444444445
The balanced accuracy is high enough. Now let's proceed to model interpretation and estimate the contributions of
individual atoms to the prediction.
A fragment dataset
Now let's prepare a dataset of fragments based on the training set. (Any other unseen data set of interest can also be
used). These fragments will be used to evaluate the contributions of individual atoms.
For each molecule we will generate a list of ConvMol objects. Specifying per_atom_fragmentation=True tells it to
iterate over all heavy atoms and featurize a single-atom-depleted version of the molecule with each one removed.
loader = dc.data.SDFLoader(tasks=[],  # don't need a task (moreover, passing the task can lead to inconsistencies in data s
featurizer=dc.feat.ConvMolFeaturizer(per_atom_fragmentation=True),
sanitize=True)
frag_dataset = loader.create_dataset(DATASET_FILE, shard_size=5000)
The dataset still has the same number of samples as the original training set, but each sample is now represented as a
list of ConvMol objects (one for each fragment) rather than a single ConvMol.
IMPORTANT: The order of fragments depends on the input format. If SDF, the fragment order is the same as the atom
order in corresponding mol blocks. If SMILES (i.e. csv with molecules represented as SMILES), then the order is given by
RDKit CanonicalRankAtoms
print(frag_dataset.X.shape)
(298,)
We really want to treat each fragment as a separate sample. We can use a FlatteningTransformer to flatten the
fragments lists.
tr = dc.trans.FlatteningTransformer(frag_dataset)
frag_dataset = tr.transform(frag_dataset)
print(frag_dataset.X.shape)
(5111,)
Note: Here, in classification context, we use the probability output of the model as the activity. So the contribution is the
probability difference, i.e. "how much a given atom increases/decreases the probability of the molecule being active."
# whole molecules
pred = np.squeeze(m.predict(dataset))[:, 1] # probability of class 1
pred = pd.DataFrame(pred, index=dataset.ids, columns=["Molecule"]) # turn to dataframe for convenience
# fragments
pred_frags = np.squeeze(m.predict(frag_dataset))[:, 1]
pred_frags = pd.DataFrame(pred_frags, index=frag_dataset.ids, columns=["Fragment"])
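The cell that combines these two dataframes into df is not shown above; a minimal sketch of the merge and per-atom
contribution calculation it presumably performs (the 'Contrib' column name is an assumption):
# Join whole-molecule and fragment predictions on molecule id, then take the difference.
df = pd.merge(pred_frags, pred, right_index=True, left_index=True)
# Contribution of an atom = activity of the whole molecule - activity of the fragment missing that atom.
df['Contrib'] = df["Molecule"] - df["Fragment"]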
df
We can use the SimilarityMaps feature of RDKit to visualize the results. Each atom is colored by how it affects activity.
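The helper vis_contribs used below is not defined in the extracted text; a minimal sketch of such a helper built on RDKit's
SimilarityMaps (the function name matches the call below, but its body and the indexing of df are assumptions):
def vis_contribs(mols, df):
    # Sketch: draw an RDKit similarity map for each molecule, weighting atoms by their contribution.
    # Assumes df is indexed by the same ids as the datasets above (SMILES strings) and that the
    # fragment rows for a molecule appear in atom order (true for SDF input, per the note above).
    maps = []
    for mol in mols:
        weights = df.loc[Chem.MolToSmiles(mol), "Contrib"]
        maps.append(SimilarityMaps.GetSimilarityMapFromWeights(mol, list(weights)))
    return maps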
np.random.seed(2000)
maps = vis_contribs(np.random.choice(np.array(mols),10), df)
We can see that aromatics or aliphatics have a positive impact on blood-brain barrier permeability, while polar or
charged heteroatoms have a negative influence. This is generally consistent with literature data.
A regression task
The example above used a classification model. The same techniques can also be used for regression models. Let's look
at a regression task, aquatic toxicity (towards the water organism T. pyriformis).
Toxicity is defined as log10(IGC50) (concentration that inhibits colony growth by 50%). Toxicophores for T. pyriformis
will be identified by atomic contributions.
All the above steps are the same: load data, featurize, build a model, create dataset of fragments, find contributions,
and visualize them.
Note: this time as it is regression, contributions will be in activity units, not probability.
current_dir = os.path.dirname(os.path.realpath('__file__'))
dc.utils.download_url(
'https://raw.githubusercontent.com/deepchem/deepchem/master/examples/tutorials/assets/atomic_contributions_tutori
current_dir,
'Tetrahymena_pyriformis_Work_set_OCHEM.sdf'
)
DATASET_FILE =os.path.join(current_dir, 'Tetrahymena_pyriformis_Work_set_OCHEM.sdf')
np.random.seed(2020)
tf.random.set_seed(2020)
m = dc.models.GraphConvModel(1, mode="regression", batch_normalize=False)
m.fit(dataset, nb_epoch=40)
current_dir = os.path.dirname(os.path.realpath('__file__'))
dc.utils.download_url(
'https://raw.githubusercontent.com/deepchem/deepchem/master/examples/tutorials/assets/atomic_contributions_tutori
current_dir,
'Tetrahymena_pyriformis_Test_set_OCHEM.sdf'
)
0.2381780323921622
0.784334539071699
Load the training set again, but this time set per_atom_fragmentation=True .
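A minimal sketch of that step, mirroring the fragment-dataset cells from the classification example above:
# Re-load the regression training set, fragmenting each molecule one atom at a time.
loader = dc.data.SDFLoader(tasks=[],
                           featurizer=dc.feat.ConvMolFeaturizer(per_atom_fragmentation=True),
                           sanitize=True)
frag_dataset = loader.create_dataset(DATASET_FILE, shard_size=5000)
# Flatten so that each fragment is treated as a separate sample.
tr = dc.trans.FlatteningTransformer(frag_dataset)
frag_dataset = tr.transform(frag_dataset)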
# whole molecules
pred = m.predict(dataset)
pred = pd.DataFrame(pred, index=dataset.ids, columns=["Molecule"]) # turn to dataframe for convenience
# fragments
pred_frags = m.predict(frag_dataset)
pred_frags = pd.DataFrame(pred_frags, index=frag_dataset.ids, columns=["Fragment"]) # turn to dataframe for convenience
Let's take some molecules with moderate activity (not extremely active/inactive) and visualize the atomic contributions.
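A minimal sketch of that selection and visualization, assuming the merged contribution dataframe is built exactly as in
the classification example (the quantile band used to pick "moderate" molecules is an assumption):
# (Assumption) RDKit mol objects for the regression training set, loaded the same way as before.
reg_mols = [m for m in Chem.SDMolSupplier(DATASET_FILE) if m is not None]

# Merge predictions and compute per-atom contributions, as in the classification example.
df = pd.merge(pred_frags, pred, right_index=True, left_index=True)
df['Contrib'] = df["Molecule"] - df["Fragment"]

# Keep molecules whose predicted activity falls in the middle of the range, then visualize a few.
mid_band = pred[(pred["Molecule"] > pred["Molecule"].quantile(0.4)) &
                (pred["Molecule"] < pred["Molecule"].quantile(0.6))]
sel = [m for m in reg_mols if Chem.MolToSmiles(m) in set(mid_band.index)][:10]
maps = vis_contribs(sel, df)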
Appendix
In this tutorial we operated on SDF files. However, if we use CSV files with SMILES as input, the order of the atoms in the
dataframe DOES NOT correspond to the original atom order. If we want to recover the original atom order for each
molecule (to have it in our main dataframe), we need to use RDKit's Chem.rdmolfiles.CanonicalRankAtoms. Here are
some utilities to do this.
We can add a column with atom ids (as in input molecules) and use the resulting dataframe for analysis with any other
software, outside the "python-rdkit-deepchem" environment.
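A minimal sketch of such a utility (a hypothetical helper, not the tutorial's exact code), which maps each canonical rank
back to the original atom index using Chem.rdmolfiles.CanonicalRankAtoms:
def get_mapping(mol):
    # For one molecule, return a list mapping canonical rank -> original atom index.
    order = list(Chem.rdmolfiles.CanonicalRankAtoms(mol))
    # order[i] is the canonical rank of atom i; invert it to recover the original index for each rank.
    inverse = [0] * len(order)
    for original_idx, rank in enumerate(order):
        inverse[rank] = original_idx
    return inverse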
Bibliography:
1. Polishchuk, P., O. Tinkov, T. Khristova, L. Ognichenko, A. Kosinskaya, A. Varnek & V. Kuz’min (2016) Structural and
Physico-Chemical Interpretation (SPCI) of QSAR Models and Its Comparison with Matched Molecular Pair Analysis.
Journal of Chemical Information and Modeling, 56, 1455-1469.
2. Riniker, S. & G. Landrum (2013) Similarity maps - a visualization strategy for molecular fingerprints and machine-
learning methods. Journal of Cheminformatics, 5, 43.
4. Matveieva, M., Polishchuk, P. Benchmarks for interpretation of QSAR models. J Cheminform 13, 41 (2021).
https://doi.org/10.1186/s13321-021-00519-x
Evaluating models on new data, including corner cases, is a critical step toward model deployment. However,
generating new molecules to test in an interactive way is rarely straightforward. Trident Chemwidgets (TCW) provides
several tools to help subset larger datasets and draw new molecules to test against your models. You can find the full
documentation for the Trident Chemwidgets library here.
Colab
This tutorial and the rest in this sequence are designed to be done in Google colab. If you'd like to open this notebook in
colab, you can use the following link.
Open in Colab
For this tutorial, you'll need Trident Chemwidgets version 0.2.0 or greater. We can check the installed version with the
following command:
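A minimal sketch of that check (assuming the package exposes a __version__ attribute):
import trident_chemwidgets as tcw
print(tcw.__version__)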
0.2.1
Throughout this tutorial, we'll use the convention tcw to call the classes from the Trident Chemwidgets package.
import deepchem as dc
We can then use RDKit to calculate some additional features for each of the training examples. Specifically, we'll
compute the logP and molecular weight of each molecule and return this new data in a dataframe.
data = []
mol_data = pd.DataFrame(data)
mol_data.head()
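The loop that populates data is omitted in the cell above; a minimal sketch of what it might look like, assuming the
combined Tox21 SMILES are available in a list called all_smiles (a hypothetical name), with 'logp' and 'mwt' as the
column names used in the plots below:
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors

data = []
for smiles in all_smiles:  # all_smiles is a hypothetical list of SMILES from the combined dataset
    mol = Chem.MolFromSmiles(smiles)
    data.append({'smiles': smiles,
                 'logp': Descriptors.MolLogP(mol),  # estimated octanol-water partition coefficient
                 'mwt': Descriptors.MolWt(mol)})    # molecular weight
mol_data = pd.DataFrame(data)
mol_data.head()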
One-dimensional distributions
We can examine one-dimensional distributions using a histogram. Unlike histograms from static plotting libraries like
Matplotlib or Seaborn, the TCW Histogram provides interactive functionality. TCW enables subsetting of the data,
plotting chemical structures in a gallery next to the plot, and saving a reference to the subset portion of the dataframe.
Unfortunately, this interactivity comes at the price of portability, so we have included screenshots for this tutorial in
addition to providing the code to generate the interactive visuals. If you run this tutorial yourself (either locally or on
Colab), you'll be able to display and interact with full demo plots.
In the plot below, you can see the histogram of the molecular weight distribution from the combined dataset on the left.
If you click and drag within the plot area in the live widget, you can subset a portion of the distribution for further
examination. The background of the selected portion will turn gray and the selected data points will be shown in teal
within the bars of the plot. The x axis of the Histogram widget is compatible with either numeric or date data types,
which makes it a convenient choice for splitting your ML datasets based on a property or the date the experimental data
were collected.
Histogram example
To generate an interactive example of the widget, run the next cell:
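A minimal sketch of that cell (the exact keyword names are assumptions; check the TCW documentation linked above):
# Plot the molecular weight distribution; the smiles column lets TCW render structures for selections.
hist = tcw.Histogram(data=mol_data, smiles='smiles', x='mwt')
hist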
If you select a subset of the data by clicking and dragging, you can view the selected structures in the gallery to the right
by pressing the SHOW STRUCTURES button beneath the plot. You can extract this subset of the original dataframe by
pressing SAVE SELECTION and accessing the hist.selection property as shown in the next cell. This workflow is
convenient for applications like data splitting based on a single dimension.
hist.selection
In the image below, we have selected a portion of dataset with large molecular weight values, but minimal training
examples (displayed points in orange), to demonstrate how the Scatter widget can be useful for outlier identification. In
addition to selection by bounding box, you can also hover over individual points to display a drawing of the underlying
structure.
Scatter example
If you select a subset of the data by clicking and dragging, you can view the selected structures in the gallery to the right
by pressing the SHOW STRUCTURES button beneath the plot. You can extract this subset of the original dataframe by
pressing SAVE SELECTION and accessing the scatter.selection property as shown in the next cell.
scatter.selection
Training a GraphConvModel
Now that we've had a look at the training data, we can train a GraphConvModel to predict the 12 Tox21 classes. We'll
replicate the training procedure exactly from the Introduction to Graph Convolutions tutorial. We'll train for 50 epochs,
just as in the original tutorial.
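The data-loading cell does not appear in this text; a minimal sketch of the standard MolNet call it presumably uses (the
featurizer choice follows the Graph Convolutions tutorial and should be treated as an assumption):
# Load Tox21 with graph-convolution featurization; MolNet returns predefined train/valid/test splits.
tasks, datasets, transformers = dc.molnet.load_tox21(featurizer='GraphConv')
train_dataset, valid_dataset, test_dataset = datasets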
# Now we'll set the tensorflow seed to make sure the results of this notebook are reproducible
import tensorflow as tf; tf.random.set_seed(27)
n_tasks = len(tasks)
model = dc.models.GraphConvModel(n_tasks, mode='classification')
model.fit(train_dataset, nb_epoch=50)
Now that we have a trained model, we can check AUROC values for the training and test datasets:
metric = dc.metrics.Metric(dc.metrics.roc_auc_score)
print(f'Training set score: {model.evaluate(train_dataset, [metric], transformers)["roc_auc_score"]:.2f}')
print(f'Test set score: {model.evaluate(test_dataset, [metric], transformers)["roc_auc_score"]:.2f}')
Just as in the original tutorial, we see that the model performs reasonably well on the predefined train/test splits. Now
we'll use this model to evaluate compounds that are outside the training distribution, just as we might in a real-world
drug discovery scenario.
We can use the JSME widget provided by TCW to quickly test our model against some molecules of interest. We'll start
with a known therapeutic molecule: ibuprofen. We can see that ibuprofen is not included in any of the datasets that we
have evaluated our model against so far:
To simulate a drug discovery application, let's say you're a chemist tasked with identifying potential new therapeutics
derived from ibuprofen. Ideally, the molecules you test would have limited toxicity. You've just developed the model
above to predict the tox outcomes from Tox21 data and now you want to use it to do some first-pass screening of your
derivatives. The standard workflow for a task like this might include drawing the molecules in a program like ChemDraw,
exporting to SMILES format, importing into the notebook, then prepping the data and running it through your model.
With TCW, we can shortcut the first few steps of that workflow by using the JSME widget to draw molecules and convert
to SMILES directly in the notebook. We can even use the base_smiles argument to specify a base molecular structure,
which is great for generating derivatives. Here we'll set the base_smiles value to 'CC(C)CC1=CC=C(C=C1)C(C)C(=O)O' ,
the SMILES string for ibuprofen. Below is a screenshot using JSME to generate a few derivative molecules to test against
our toxicity model.
JSME example
To generate your own set of derivatives, run the cell below. To add a SMILES string to the saved set, click the ADD TO
SMILES LIST button below the interface. If you want to regenerate the original base molecule, in this case ibuprofen,
click the RESET TO BASE SMILES button below the interface. By using this button, it's easy to generate distinct
derivatives from a shared starting structure. Go ahead and create some ibuprofen derivatives to test against the tox
model:
jsme = tcw.JSME(base_smiles='CC(C)CC1=CC=C(C=C1)C(C)C(=O)O')
jsme
JSME(base_smiles='CC(C)CC1=CC=C(C=C1)C(C)C(=O)O')
You can access the smiles using the jsme.smiles property. This call will return a list of the SMILES strings that have
been added to the SMILES list of the widget (the ones shown in the molecule gallery to the right of the JSME interface).
print(jsme.smiles)
[]
To ensure the rest of this notebook runs correctly, the following cell sets the new test SMILES set to the ones from the
screenshot above in the case that you have not defined your own set using the widget. Otherwise, it will use the
molecules you have drawn.
# This cell will provide a preset list of SMILES strings in case you did not create your own.
if len(jsme.smiles) > 1:
drawn_smiles = jsme.smiles
else:
drawn_smiles = [
'CC(C)Cc1ccc(C(C)C(=O)O)cc1',
'CC(C)C(S)c1ccc(C(C)C(=O)O)cc1',
'CCSC(c1ccc(C(C)C(=O)O)cc1)C(C)CC',
'CCSC(c1ccc(C(C)C(=O)O)cc1)C(C)C(=O)O',
'CC(C(=O)O)c1ccc(C(S)C(C)C(=O)O)cc1'
]
Next we have to create a dataset that is compatible with our model to test these new molecules.
featurizer = dc.feat.ConvMolFeaturizer()
loader = dc.data.InMemoryLoader(tasks=list(train_dataset.tasks), featurizer=featurizer)
dataset = loader.create_dataset(drawn_smiles, shard_size=1)
Finally, we can generate our predictions of positive results here and plot them.
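The prediction-and-plot cell is not shown here; a minimal sketch of what it might do, assuming we plot the predicted
probability of a positive (toxic) outcome for each drawn molecule and task (the variable names are assumptions):
# Predict class probabilities for the drawn molecules and keep the positive-class probability.
preds = model.predict(dataset)             # shape: (n_molecules, n_tasks, 2)
pos_probs = preds[:, :, 1]
pred_df = pd.DataFrame(pos_probs, index=drawn_smiles, columns=train_dataset.tasks)
pred_df.T.plot(kind='bar', figsize=(12, 4), legend=False)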
<AxesSubplot:>
Now we can get the predicted most toxic compound/assay result for further inspection. Below we extract the highest
predicted positive hit (most toxic) and display the assay name, SMILES string, and an image of the structure.
import numpy as np
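A minimal sketch of that lookup, continuing from the pred_df sketched above (variable names are assumptions):
# Find the molecule/assay pair with the highest predicted probability of toxicity.
mol_idx, task_idx = np.unravel_index(np.argmax(pred_df.values), pred_df.values.shape)
smiles = pred_df.index[mol_idx]      # SMILES of the predicted most toxic compound
assay = pred_df.columns[task_idx]    # assay with the highest predicted probability
print(assay, smiles)
Chem.MolFromSmiles(smiles)           # display the structure in the notebook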
Building on the tutorial Calculating Atomic Contributions for Molecules Based on a Graph Convolutional QSAR Model, we
can calculate the relative contribution of each atom in a molecule to the predicted output value. This attribution
strategy enables us to determine whether the molecular features that a chemist may identify as important and those
most affecting the predictions are in alignment. If the chemist's interpretation and the model's interpretation are
consistent, that may indicate that the model is a good fit for the task at hand. However, the converse does not
necessarily hold: a model may have the capacity to make accurate predictions that a trained chemist cannot fully
rationalize. This is just one tool in a machine learning practitioner's toolbox.
We'll start by using the built-in per_atom_fragmentation argument for the ConvMolFeaturizer . This will generate a
list of ConvMol objects that have each had a single atom removed.
featurizer = dc.feat.ConvMolFeaturizer(per_atom_fragmentation=True)
mol_list = featurizer(smiles)
loader = dc.data.InMemoryLoader(tasks=list(train_dataset.tasks),
featurizer=dc.feat.DummyFeaturizer())
dataset = loader.create_dataset(mol_list[0], shard_size=1)
We can then run these predictions through the model and retrieve the predicted values for the molecule and assay
specified in the last section.
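A minimal sketch of that step (the contribution is the whole-molecule prediction minus each fragment prediction, as in
the earlier tutorial; variable names are assumptions):
# Predict the positive-class probability for each single-atom-depleted fragment.
frag_preds = model.predict(dataset)[:, task_idx, 1]
# Contribution of each atom = whole-molecule prediction - prediction without that atom.
contribs = pred_df.values[mol_idx, task_idx] - frag_preds
contrib_df = pd.DataFrame({'contribution': contribs})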
We can use the InteractiveMolecule widget from TCW to superimpose the contribution scores on the molecule itself,
allowing us to easily assess the relative importance of each atom to the final prediction. If you click on one of the atoms,
you can retrieve the contribution data in a card shown to the right of the structure. In this panel you can also select a
variable by which to color the atoms in the plot.
InteractiveMolecule example
You can generate the interactive widget by running the cell below.
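A minimal sketch of that cell (the widget's keyword names are assumptions; see the TCW documentation):
# Overlay per-atom contribution scores on the molecule; clicking an atom shows its data card.
tcw.InteractiveMolecule(smiles, data=contrib_df)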
Wrapping up
In this tutorial, we learned how to incorporate Trident Chemwidgets into your DeepChem-based ML workflow. While TCW
was built with molecular ML workflows in mind, the library also works well for general cheminformatics notebooks.
Deep learning for chemistry and materials science remains a novel field with lots of potential. However, the transfer-learning
methods that have become popular in areas such as natural language processing (NLP) and computer vision have not
yet been widely developed in computational chemistry and machine learning. Using HuggingFace's suite of models
and the ByteLevel tokenizer, we are able to train a large transformer model, RoBERTa, on a large corpus of 10,000,000
SMILES strings from a commonly known benchmark chemistry dataset, PubChem.
Training RoBERTa over 10 epochs, the model achieves a pretty good loss of 0.198, and would likely continue to converge
if trained for a larger number of epochs. The model can predict masked/corrupted tokens within a SMILES
sequence/molecule, allowing for variants of a molecule within discoverable chemical space to be predicted.
By applying the representations of functional groups and atoms learned by the model, we can try to tackle problems of
toxicity, solubility, drug-likeness, and synthesis accessibility on smaller datasets using the learned representations as
features for graph convolution and attention models on the graph structure of molecules, as well as fine-tuning of BERT.
Finally, we propose the use of attention visualization as a helpful tool for chemistry practitioners and students to quickly
identify important substructures in various chemical properties.
Additionally, previous research has found visualization of the attention mechanism to be incredibly valuable for
chemical reaction classification. Open-sourcing large-scale transformer models such as RoBERTa with HuggingFace
may help accelerate these individual research directions.
A link to a repository which includes the training, uploading and evaluation notebook (with sample predictions on
compounds such as Remdesivir) can be found here. All of the notebooks can be copied into a new Colab runtime for
easy execution. This repository will be updated with new features, such as attention visualization, easier benchmarking
infrastructure, and more. The work behind this tutorial has been published on arXiv, and was accepted for a poster
presentation at NeurIPS 2020's ML for Molecules Workshop.
For the sake of this tutorial, we'll be fine-tuning a pre-trained ChemBERTa on a small-scale molecule dataset, ClinTox, to
show the potential and effectiveness of HuggingFace's NLP-based transfer learning applied to computational chemistry.
Output for some cells are purposely cleared for readability, so do not worry if some output messages for your cells
differ!
In short, there are three major components we'll be going over in this notebook.
Don't worry if you aren't familiar with some of these terms. We will explain them later in the tutorial!
Colab
This tutorial and the rest in this sequence are designed to be done in Google colab. If you'd like to open this notebook in
colab, you can use the following link.
Open in Colab
Setup
To run DeepChem within Colab, you'll need to run the following cell of installation commands. This will take about 5
minutes to run to completion and install your environment.
We want to install NVIDIA's Apex tool, for the training pipeline used by simple-transformers and Weights and Biases.
This package enables us to use 16-bit training, mixed precision, and distributed training without any changes to our
code. Generally, GPUs are good at doing 32-bit (single precision) math, but not 16-bit (half) or 64-bit (double precision).
Therefore, deep learning models are traditionally trained in 32-bit. By switching to 16-bit, we'll be using half the
memory and theoretically less computation, at the expense of the available number range and precision. However, pure
16-bit training creates a lot of problems (imprecise weight updates, gradient underflow and overflow). Mixed
precision training, with Apex, alleviates these problems.
We will be installing simple-transformers , a library which builds on top of HuggingFace's transformers package
specifically for fine-tuning ChemBERTa.
import sys
!test -d bertviz_repo && echo "FYI: bertviz_repo directory already exists, to pull latest version uncomment this line
# !rm -r bertviz_repo # Uncomment if you need a clean pull from repo
!test -d bertviz_repo || git clone https://github.com/jessevig/bertviz bertviz_repo
if not 'bertviz_repo' in sys.path:
sys.path += ['bertviz_repo']
!pip install regex
FYI: bertviz_repo directory already exists, to pull latest version uncomment this line: !rm -r bertviz_repo
Requirement already satisfied: regex in /usr/local/lib/python3.7/dist-packages (2019.12.20)
We're going to clone an auxiliary repository, bert-loves-chemistry, which will enable us to use the MolNet dataloader for
ChemBERTa, which automatically generates scaffold splits on any MoleculeNet dataset!
fatal: destination path 'bert-loves-chemistry' already exists and is not an empty directory.
!nvidia-smi
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Now, to ensure our model demonstrates an understanding of chemical syntax and molecular structure, we'll be testing it
on predicting a masked token/character within the SMILES molecule for benzene.
What is a tokenizer?
A tokenizer is in charge of preparing the inputs for a natural language processing model. For many scientific
applications, it is possible to treat inputs as “words”/”sentences” and use NLP methods to make meaningful predictions.
For example, SMILES strings or DNA sequences have grammatical structure and can be usefully modeled with NLP
techniques. DeepChem provides some scientifically relevant tokenizers for use in different applications. These
tokenizers are based on those from the Huggingface transformers library (which DeepChem tokenizers inherit from).
The base classes PreTrainedTokenizer and PreTrainedTokenizerFast in HuggingFace implement the common methods
for encoding string inputs into model inputs and for instantiating/saving Python tokenizers, either from a local file or
directory or from a pretrained tokenizer provided by the library (downloaded from HuggingFace's AWS S3 repository).
PreTrainedTokenizer (transformers.PreTrainedTokenizer) thus implements the main methods for using all the
tokenizers:
Tokenizing (splitting strings into sub-word token strings), converting token strings to ids and back, and
encoding/decoding (i.e. tokenizing + converting to integers),
Adding new tokens to the vocabulary in a way that is independent of the underlying structure (BPE,
SentencePiece…),
Managing special tokens like mask, beginning-of-sentence, etc. (adding them, assigning them to attributes in
the tokenizer for easy access and making sure they are not split during tokenization)
The default tokenizer used by ChemBERTa is a Byte-Pair Encoder (BPE). It is a hybrid between character and word-level
representations, which allows for the handling of large vocabularies in natural language corpora. Motivated by the
intuition that rare and unknown words can often be decomposed into multiple known subwords, BPE finds the best word
segmentation by iteratively and greedily merging frequent pairs of characters.
First, let's load the model's Byte-Pair Encoding tokenizer and the model itself, and set up a HuggingFace pipeline for
masked token prediction.
model = AutoModelForMaskedLM.from_pretrained("seyonec/PubChem10M_SMILES_BPE_450k")
tokenizer = AutoTokenizer.from_pretrained("seyonec/PubChem10M_SMILES_BPE_450k")
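The pipeline construction itself is not shown above; a minimal sketch using the standard HuggingFace fill-mask pipeline:
from transformers import pipeline

# Masked-token prediction pipeline built from the model and tokenizer loaded above.
fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)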
With the emergence of BERT by Google AI in 2018, transformers have quickly shot to the top of emerging deep learning
methods, outperforming Neural Machine Translation models such as seq2seq and recurrent neural networks at dozens
of tasks.
The biggest benefit, however, comes from how the Transformer lends itself to efficient pre-training. Using the same
pre-training procedure as RoBERTa, a follow-up work to BERT, we mask 15% of the tokens in each SMILES string and
assign a maximum sequence length of 256 characters.
The model then learns to predict masked tokens consisting of atoms and functional groups, or specific groups of
atoms within molecules which have their own characteristic properties. Through this, the model learns the relevant
molecular context for transferable tasks, such as property prediction.
ChemBERTa employs a bidirectional training context to learn context-aware representations of the PubChem 10M
dataset, downloadable through MoleculeNet for self-supervised pre-training (link). Our variant of the BERT transformer
uses 12 attention heads and 6 layers, resulting in 72 distinct attention mechanisms.
The Transformer was proposed in the paper Attention is All You Need.
Now, to ensure the ChemBERTa model demonstrates an understanding of chemical syntax and molecular structure,
we'll be testing it on predicting a masked token/character within the SMILES molecule for benzene. Using the
Huggingface pipeline we initialized earlier we can fetch a list of the model's predictions by confidence score:
smiles_mask = "C1=CC=CC<mask>C1"
smiles = "C1=CC=CC=C1"
masked_smi = fill_mask(smiles_mask)
Here, we get some interesting results. The final branch, C1=CC=CC=C1 , is a benzene ring. Since it's a pretty common
molecule, the model is easily able to predict the final double carbon bond with a score of 0.98. Let's get a list of the top
5 predictions and visualize them (with a highlighted focus on the beginning of the final benzene-like pattern). To
visualize them, we'll be using the RDKit cheminformatics package we installed earlier, specifically the rdkit.Chem.Draw
module.
import torch
import rdkit
import rdkit.Chem as Chem
from rdkit.Chem import rdFMCS
from matplotlib import colors
from rdkit.Chem import Draw
from rdkit.Chem.Draw import MolToImage
from PIL import Image
def get_mol(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    Chem.Kekulize(mol)
    return mol

def find_matches_one(mol, submol):
    # Find all matching atoms for each submol in submol_list in mol.
    match_dict = {}
    mols = [mol, submol]  # pairwise search
    res = rdFMCS.FindMCS(mols)  # , ringMatchesRingOnly=True)
    mcsp = Chem.MolFromSmarts(res.smartsString)
    matches = mol.GetSubstructMatches(mcsp)
    return matches
sequence = f"C1=CC=CC={tokenizer.mask_token}1"
substructure = "CC=CC"
image_list = []
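# The lines that encode the masked sequence appear to be missing here; a minimal sketch (assumption):
input = tokenizer.encode(sequence, return_tensors="pt")
mask_token_index = torch.where(input == tokenizer.mask_token_id)[1]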
token_logits = model(input)[0]
mask_token_logits = token_logits[0, mask_token_index, :]
C1=CC=CC=CC1
C1=CC=CC=CCC1
C1=CC=CC=CN1
C1=CC=CC=CCCC1
C1=CC=CC=CCO1
However, further training on a more specific dataset (say, leads for a specific target) may generate a stronger chemical
transformer model. Let's now fine-tune our model on a dataset of our choice, ClinTox. You can run ChemBERTa on any
MoleculeNet dataset, but for the sake of convenience, we will use ClinTox as it is small and trains quickly.
What is attention?
Previously, recurrent models struggled with generating a fixed-length vector for large sequences, leading to
deteriorating performance as the length of an input sequence increased.
Attention is, to some extent, motivated by how we pay visual attention to different regions of our vision or how we
correlate words in a sentence. Human visual attention allows us to focus on a certain subregion with a higher focus
while perceiving the surrounding image with a lower focus, and then adjust the focal point.
Similarly, we can explain the relationship between words in one sentence or close context. When we see “eating”, we
expect to read a food word very soon. The color term describes the food, but probably not as directly as “eating” does:
The attention mechanism extends on the encoder-decoder model, by taking in three values for a SMILES sequence: a
value vector (V), a query vector (Q) and a key vector (K).
Each vector is similar to a type of word embedding, specifically for determining the compatibility of neighbouring
tokens. From these vectors, a dot-product attention is derived from the dot product of the query vector of one word
and the key vector of the other.
A scaling factor of $\frac{1}{\sqrt{d_k}}$ is applied to the dot-product attention so that the value doesn't grow too large
with respect to $d_k$, the dimension of the key. The softmax normalization function is applied to return a score between
0 and 1 for each individual token:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
Using this tool, we can easily plug in ChemBERTa from the HuggingFace model hub and visualize the attention patterns
produced by one or more attention heads in a given transformer layer. This is known as the attention-head view.
Lets start by obtaining a Javascript object for d3.js and jquery to create interactive visualizations:
%%javascript
require.config({
paths: {
d3: '//cdnjs.cloudflare.com/ajax/libs/d3/3.4.8/d3.min',
jquery: '//ajax.googleapis.com/ajax/libs/jquery/2.0.0/jquery.min',
}
});
def call_html():
    import IPython
    display(IPython.core.display.HTML('''
        <script src="/static/components/requirejs/require.js"></script>
        <script>
            requirejs.config({
                paths: {
                    base: '/static/base',
                    "d3": "https://cdnjs.cloudflare.com/ajax/libs/d3/3.5.8/d3.min",
                    jquery: '//ajax.googleapis.com/ajax/libs/jquery/2.0.0/jquery.min',
                },
            });
        </script>
    '''))
Now, we create an instance of ChemBERTa, tokenize a set of SMILES strings, and compute the attention for each head in
the transformer. There are two available models hosted by DeepChem on HuggingFace's model hub, one being
seyonec/ChemBERTa-zinc-base-v1 which is the ChemBERTa model trained via masked language modelling (MLM) on
the ZINC100k dataset, and the other being seyonec/ChemBERTa-zinc250k-v1 , which is trained via MLM on the larger
ZINC250k dataset.
In the following example, we take two SMILES molecules from the ZINC database with nearly identical chemical
structure, the only difference being rooted in chiral specification (hence the additional '@' symbol). This is a feature of
molecules which indicates that there exist tetrahedral centres. '@' tells us that the neighbours of the chiral centre
appear in a counter-clockwise order, whereas '@@' indicates that the neighbours are ordered in a clockwise direction.
The model should ideally assign higher attention weight to similar substructures in each SMILES string.
m = Chem.MolFromSmiles('CCCCC[C@@H](Br)CC')
fig = Draw.MolToMPL(m, size=(200, 200))
And the second SMILES string, CCCCC[C@H](Br)CC :
m = Chem.MolFromSmiles('CCCCC[C@H](Br)CC')
fig = Draw.MolToMPL(m, size=(200,200))
The visualization below shows the attention induced by a sample input SMILES. This view visualizes attention as lines
connecting the tokens being updated (left) with the tokens being attended to (right), following the design of the figures
above. Color intensity reflects the attention weight; weights close to one show as very dark lines, while weights close to
zero appear as faint lines or are not visible at all. The user may highlight a particular SMILES character to see the
attention from that token only. This visualization is called the attention-head view. It is based on the
excellent Tensor2Tensor visualization tool, and is generated by the BertViz library.
model_version = 'seyonec/PubChem10M_SMILES_BPE_450k'
model = RobertaModel.from_pretrained(model_version, output_attentions=True)
tokenizer = RobertaTokenizer.from_pretrained(model_version)
sentence_a = "CCCCC[C@@H](Br)CC"
sentence_b = "CCCCC[C@H](Br)CC"
inputs = tokenizer.encode_plus(sentence_a, sentence_b, return_tensors='pt', add_special_tokens=True)
input_ids = inputs['input_ids']
attention = model(input_ids)[-1]
input_id_list = input_ids[0].tolist() # Batch index 0
tokens = tokenizer.convert_ids_to_tokens(input_id_list)
call_html()
head_view(attention, tokens)
Smiles-Tokenizer Attention by Head View
The visualization shows that attention is highest between words that don’t cross a boundary between the two SMILES
strings; the model seems to understand that it should relate tokens to other tokens in the same molecule in order to
best understand their context.
There are many other fascinating visualizations we can do, such as a neuron-by-neuron analysis of attention or a model
overview that visualizes all of the heads at once:
Model View:
Neuron-by-neuron view:
You can try out the ChemBERTa attention visualization demos in more detail, with custom SMILES/SELFIES strings,
tokenizers, and more in the public library, here.
By pre-training directly on SMILES strings, and teaching ChemBERTa to recognize masked tokens in each string, the
model learns a strong molecular representation. We then can take this model, trained on a structural chemistry task,
and apply it to a suite of classification tasks in the MoleculeNet suite, from Tox21 to BBBP!
The ClinTox dataset consists of 1478 binary labels for toxicity, using the SMILES representations for identifying
molecules. The computational models produced from the dataset could become decision-making tools for government
agencies in determining which drugs are of the greatest potential concern to human health. Additionally, these models
can act as drug screening tools in the drug discovery pipelines for toxicity.
Let's start by importing the MolNet dataloader from bert-loves-chemistry , before importing apex and transformers,
the tools which will allow us to import the ChemBERTa language model (LM) trained on PubChem-10M.
%cd /content/bert-loves-chemistry
/content/bert-loves-chemistry
!pwd
/content/bert-loves-chemistry
import os
import numpy as np
import pandas as pd
Though this result suggests that a more semantically relevant tokenization may provide performance benefits, further
benchmarking on additional datasets is needed to validate this finding. In this tutorial, we aim to do so by testing
this alternate model on the ClinTox dataset.
Let's fetch the Smiles Tokenizer's character-per-line vocabulary file, which can be loaded from the DeepChem S3 data
bucket:
!wget https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/vocab.txt
Let's use the MolNet dataloader to generate scaffold splits from the ClinTox dataset.
If you're only running the toxicity prediction portion of this tutorial, make sure you install transformers here. If you've
run all the cells above, you can skip this install, as we've already done pip install transformers before.
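The dataloader cell itself is not shown here; a minimal sketch of the call the text describes (the exact import path inside
bert-loves-chemistry is an assumption; check the repository). The dataframe displayed immediately below appears to be
train_df.
# Hypothetical import path; the loader generates scaffold splits for a MoleculeNet dataset.
from chemberta.utils.molnet_dataloader import load_molnet_dataset

tasks, (train_df, valid_df, test_df), transformers = load_molnet_dataset("clintox", tasks_wanted=None)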
text labels
0 CC(C)C[C@H](NC(=O)CNC(=O)c1cc(Cl)ccc1Cl)B(O)O 0
1 O=C(NCC(O)CO)c1c(I)c(C(=O)NCC(O)CO)c(I)c(N(CCO... 1
2 Clc1cc(Cl)c(OCC#CI)cc1Cl 1
3 N#Cc1cc(NC(=O)C(=O)[O-])c(Cl)c(NC(=O)C(=O)[O-])c1 1
4 NS(=O)(=O)c1cc(Cl)c(Cl)c(S(N)(=O)=O)c1 1
1177 CC(C[NH2+]C1CCCCC1)OC(=O)c1ccccc1 1
1178 CC(C(=O)[O-])c1ccc(C(=O)c2cccs2)cc1 1
1179 CC(c1cc2ccccc2s1)N(O)C(N)=O 1
1180 CC(O)C(CO)NC(=O)C1CSSCC(NC(=O)C([NH3+])Cc2cccc... 1
1181 CC(C)OC(=O)CCC/C=C\C[C@H]1[C@@H](O)C[C@@H](O)[... 1
valid_df
text labels
0 CC(C)OC(=O)CCC/C=C\C[C@H]1[C@@H](O)C[C@@H](O)[... 1
1 CC(C)Nc1cccnc1N1CCN(C(=O)c2cc3cc(NS(C)(=O)=O)c... 1
2 CC(C)n1c(/C=C/[C@H](O)C[C@H](O)CC(=O)[O-])c(-c... 1
3 CC(C)COCC(CN(Cc1ccccc1)c1ccccc1)[NH+]1CCCC1 1
4 CSCC[C@H](NC(=O)[C@H](Cc1c[nH]c2ccccc12)NC(=O)... 1
143 C[C@H](OC(=O)c1ccccc1)C1=CCC23OCC[NH+](C)CC12C... 1
144 C[C@@H](c1ncncc1F)[C@](O)(Cn1cncn1)c1ccc(F)cc1F 1
145 CC(C)C[C@@H](NC(=O)[C@H](C)NC(=O)CNC(=O)[C@@H]... 1
146 C[C@H](O)[C@H](O)[C@H]1CNc2[nH]c(N)nc(=O)c2N1 1
147 C[NH+]1C[C@H](C(=O)N[C@]2(C)O[C@@]3(O)[C@@H]4C... 1
test_df
text labels
0 C[NH+]1C[C@H](C(=O)N[C@]2(C)O[C@@]3(O)[C@@H]4C... 1
1 C[C@]1(Cn2ccnn2)[C@H](C(=O)[O-])N2C(=O)C[C@H]2... 1
2 C[NH+]1CCC[C@@H]1CCO[C@](C)(c1ccccc1)c1ccc(Cl)cc1 1
3 Nc1nc(NC2CC2)c2ncn([C@H]3C=C[C@@H](CO)C3)c2n1 1
4 OC[C@H]1O[C@@H](n2cnc3c2NC=[NH+]C[C@H]3O)C[C@@... 1
143 O=C1O[C@H]([C@@H](O)CO)C([O-])=C1O 1
144 C#CCC(Cc1cnc2nc(N)nc(N)c2n1)c1ccc(C(=O)N[C@@H]... 1
145 C#CC[NH2+][C@@H]1CCc2ccccc21 1
146 [H]/[NH+]=C(\N)c1ccc(OCCCCCOc2ccc(/C(N)=[NH+]/... 1
147 [H]/[NH+]=C(\N)C1=CC(=O)/C(=C\C=c2ccc(=C(N)[NH... 1
From here, let's set up a logger to record if any issues occur, and notify us if there are any problems with the arguments
we've set for the model.
from simpletransformers.classification import ClassificationModel
import logging
logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)
Now, using simple-transformers , let's load the pre-trained model from HuggingFace's useful model hub. We'll set the
number of epochs to 10 in the arguments, but you can train for longer and pass early stopping as an argument to
prevent overfitting. Also make sure that auto_weights is set to True to do automatic weight balancing, as we are
dealing with imbalanced toxicity datasets.
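The model-construction cell is not shown; a minimal sketch using simple-transformers (argument values other than the
model name, num_train_epochs, and auto_weights mentioned above are assumptions):
model = ClassificationModel('roberta', 'seyonec/PubChem10M_SMILES_BPE_450k',
                            args={'num_train_epochs': 10,
                                  'auto_weights': True,               # balance the imbalanced toxicity labels
                                  'evaluate_during_training': True,   # assumption
                                  'wandb_project': 'PubChem_10M_ClinTox'})  # optional, see below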
print(model.tokenizer)
# check if our train and evaluation dataframes are setup properly. There should only be two columns for the SMILES st
print("Train Dataset: {}".format(train_df.shape))
print("Eval Dataset: {}".format(valid_df.shape))
print("TEST Dataset: {}".format(test_df.shape))
Now that we've set everything up, let's get to the fun part: training the model! We use Weights and Biases, which is
optional (simply remove wandb_project from the list of args ). It's a really useful tool for monitoring the model's
training results (such as accuracy, learning rate and loss), alongside custom visualizations of attention and gradients.
When you run this cell, Weights and Biases will ask for an account, which you can set up through a GitHub account,
giving you an authorization API key which you can paste into the output of the cell. Again, this is completely optional
and it can be removed from the list of arguments.
!wandb login
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter:
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
Finally, the moment we've been waiting for! Let's train the model on the train scaffold set of ClinTox, and monitor our
runs using W&B. We will evaluate the performance of our model each epoch using the validation set.
# Create directory to store model weights (change path accordingly to where you want!)
!mkdir BPE_PubChem_10M_ClinTox_run
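The training cell is not shown here; a minimal sketch, mirroring the train_model call that appears later for the
SmilesTokenizer run (the wandb_project value is an assumption and can be dropped):
# Train on the ClinTox scaffold training set, evaluating on the validation set each epoch.
model.train_model(train_df, eval_df=valid_df,
                  output_dir='/content/BPE_PubChem_10M_ClinTox_run',
                  args={'wandb_project': 'PubChem_10M_ClinTox'})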
Let's install scikit-learn now, to evaluate the model we've trained. We will be using the accuracy and PRC-AUC metrics
(average precision score).
import sklearn
# accuracy
result, model_outputs, wrong_predictions = model.eval_model(test_df, acc=sklearn.metrics.accuracy_score)
# ROC-PRC
result, model_outputs, wrong_predictions = model.eval_model(test_df, acc=sklearn.metrics.average_precision_score)
Run summary:
lr 0.0
global_step 1450
_runtime 116
_timestamp 1616079332
_step 28
Run history:
lr ▅██▇▇▇▇▆▆▆▆▅▅▅▅▄▄▄▄▃▃▃▃▂▂▂▂▁▁
global_step ▁▁▁▂▂▂▃▃▃▃▃▄▄▄▅▅▅▅▅▆▆▆▇▇▇▇▇██
_runtime ▁▁▂▂▂▂▂▃▃▃▄▄▄▄▄▅▅▅▅▆▆▆▇▇▇▇▇██
_timestamp ▁▁▂▂▂▂▂▃▃▃▄▄▄▄▄▅▅▅▅▆▆▆▇▇▇▇▇██
_step ▁▁▁▂▂▂▃▃▃▃▃▄▄▄▅▅▅▅▅▆▆▆▇▇▇▇▇██
Synced 5 W&B file(s), 1 media file(s), 0 artifact file(s) and 0 other file(s)
_runtime 3
_timestamp 1616079341
_step 2
Run history:
_runtime ▁▁▁
_timestamp ▁▁▁
_step ▁▅█
Synced 5 W&B file(s), 3 media file(s), 0 artifact file(s) and 0 other file(s)
The model performs pretty well, averaging above 97% PRC-AUC after training on only ~1400 data samples and 150
positive leads in a couple of minutes! We can clearly see the predictive power of transfer learning, and approaches like
these are becoming increasingly popular in the pharmaceutical industry, where large datasets are scarce. By training on
more epochs and tasks, we can probably boost the accuracy as well!
Let's evaluate the model on one last string from ClinTox's test set for toxicity. The model should predict 1, meaning the
drug failed clinical trials for toxicity reasons and wasn't approved by the FDA.
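The prediction cell is not shown; a minimal sketch (the SMILES string here is a hypothetical placeholder, not the actual
test-set molecule):
# Predict toxicity (1 = failed clinical trials for toxicity) for a single SMILES string.
test_smiles = ['C1=CC=CC=C1']  # placeholder; substitute a molecule from ClinTox's test set
predictions, raw_outputs = model.predict(test_smiles)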
print(predictions)
print(raw_outputs)
[1]
[[-4.51171875 4.58203125]]
The model predicts the sample correctly! Some future tasks may include using the same model on multiple tasks (Tox21
provides multiple tasks relating to different biochemical pathways for toxicity, as an example), through multi-task
classification, as well as training on a larger dataset such as HIV, one of the other harder tasks in molecular machine
learning. This will be expanded on in future work!
print(model.tokenizer)
# check if our train and evaluation dataframes are setup properly. There should only be two columns for the SMILES st
print("Train Dataset: {}".format(train_df.shape))
print("Eval Dataset: {}".format(valid_df.shape))
print("TEST Dataset: {}".format(test_df.shape))
Now that we've set everything up, let's get to the fun part: training the model! We use Weights and Biases, which is
optional (simply remove wandb_project from the list of args ). It's a really useful tool for monitoring the model's
training results (such as accuracy, learning rate and loss), alongside custom visualizations of attention and gradients.
When you run this cell, Weights and Biases will ask for an account, which you can set up through a GitHub account,
giving you an authorization API key which you can paste into the output of the cell. Again, this is completely optional
and it can be removed from the list of arguments.
!wandb login
wandb: Currently logged in as: seyonec (use `wandb login --relogin` to force relogin)
# Create directory to store model weights (change path accordingly to where you want!)
!mkdir SmilesTokenizer_PubChem_10M_ClinTox_run
# Train the model
model.train_model(train_df, eval_df=valid_df, output_dir='/content/SmilesTokenizer_PubChem_10M_ClinTox_run', args={
Run summary:
_runtime 3
_timestamp 1616079348
_step 2
Run history:
_runtime ▁██
_timestamp ▁██
_step ▁▅█
Synced 5 W&B file(s), 3 media file(s), 0 artifact file(s) and 0 other file(s)
Let's install scikit-learn now, to evaluate the model we've trained. We will be using the accuracy and PRC-AUC metrics
(average precision score).
import sklearn
# accuracy
result, model_outputs, wrong_predictions = model.eval_model(test_df, acc=sklearn.metrics.accuracy_score)
# ROC-PRC
result, model_outputs, wrong_predictions = model.eval_model(test_df, acc=sklearn.metrics.average_precision_score)
Run summary:
lr 0.0
global_step 2200
_runtime 175
_timestamp 1616079546
_step 43
Run history:
lr ▄▆███▇▇▇▇▇▆▆▆▆▆▆▅▅▅▅▅▄▄▄▄▄▄▃▃▃▃▃▂▂▂▂▂▂▁▁
global_step ▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
_runtime ▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇▇██
_timestamp ▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇▇██
_step ▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
Synced 5 W&B file(s), 1 media file(s), 0 artifact file(s) and 0 other file(s)
Run summary:
_runtime 3
_timestamp 1616079554
_step 2
Run history:
_runtime ▁▁▁
_timestamp ▁▁▁
_step ▁▅█
Synced 5 W&B file(s), 3 media file(s), 0 artifact file(s) and 0 other file(s)
The model performs incredibly well, averaging above 96% PRC-AUC after training on only ~1400 data samples and 150
positive leads in a couple of minutes! This model was also trained on 1/10th the amount of pre-training data as the
PubChem-10M BPE model we used previously, but it still showcases robust performance. We can clearly see the
predictive power of transfer learning, and approaches like these are becoming increasingly popular in the pharmaceutical
industry, where large datasets are scarce. By training on more epochs and tasks, we can probably boost the accuracy
as well!
Let's evaluate the model on one last string from ClinTox's test set for toxicity. The model should predict 1, meaning the
drug failed clinical trials for toxicity reasons and wasn't approved by the FDA.
print(predictions)
print(raw_outputs)
[1]
[[-4.546875 4.83984375]]
The model predicts the sample correctly! Some future tasks may include using the same model on multiple tasks (Tox21
provides multiple tasks relating to different biochemical pathways for toxicity, as an example), through multi-task
classification, as well as training on a larger dataset such as HIV, one of the other harder tasks in molecular machine
learning. This will be expanded on in future work!
In this tutorial, we will train a Normalizing Flow (NF) on the QM9 dataset. The dataset comprises 133,885 stable small
organic molecules made up of CHNOF atoms. We will try to train a network that is an invertible transformation between
a simple base distribution and the distribution of molecules in QM9. One of the key advantages of normalizing flows is
that they can be constructed to efficiently sample from a distribution (generative modeling) and do probability density
calculations (exactly compute log-likelihoods), whereas other models make tradeoffs between the two or can only
approximate probability densities. This work has been published as FastFlows (see the reference).
NFs are useful whenever we need a probabilistic model with one or both of these capabilities. Note that because NFs are
completely invertible, there is no "latent space" in the sense used when referring to generative adversarial networks or
variational autoencoders. For more on NFs, we refer to this review paper.
To encode the QM9 dataset, we'll make use of the SELFIES (SELF-referencIng Embedded Strings) representation, which
is a 100% robust molecular string representation. SMILES strings produced by generative models are often syntactically
invalid (they do not correspond to a molecular graph), or they violate chemical rules like the maximum number of bonds
between atoms. SELFIES are designed so that even totally random SELFIES strings correspond to valid molecular graphs,
so they are a great framework for generative modeling. For more details about SELFIES, see the GitHub repo and the
associated paper.
Colab
This tutorial and the rest in this sequence are designed to be done in Google colab. If you'd like to open this notebook in
colab, you can use the following link.
Open in Colab
Setup
To run DeepChem within Colab, you'll need to run the following cell of installation commands. This will take about 5
minutes to run to completion and install your environment.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import os
import deepchem as dc
from deepchem.models.normalizing_flows import NormalizingFlow, NormalizingFlowModel
from deepchem.models.optimizers import Adam
from deepchem.data import NumpyDataset
from deepchem.splits import RandomSplitter
from deepchem.molnet import load_tox21
import rdkit
from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
import selfies as sf
import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions
tfb = tfp.bijectors
tfk = tf.keras
tfk.backend.set_floatx('float64')
First, let's get a dataset of 2500 small organic molecules from the QM9 dataset. We'll then convert the molecules to
SELFIES, one-hot encode them, and dequantize the inputs so they can be processed by a normalizing flow. 2000
molecules will be used for training, while the remaining 500 will be split into validation and test sets. We'll use the
validation set to see how our architecture is doing at learning the underlying distribution, and leave the test set
alone. You should feel free to experiment with this notebook to get the best model you can and evaluate it on the test
set when you're done!
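The data-loading cell is not shown in this text; a minimal sketch of one way to get 2,500 QM9 SMILES into a dataframe
(the featurizer choice, the use of MolNet dataset ids as SMILES, and the column name are all assumptions):
# Load QM9 from MoleculeNet and take the first 2,500 molecules' SMILES strings.
tasks, datasets, _ = dc.molnet.load_qm9(featurizer='ECFP')
train_qm9, _, _ = datasets
data = pd.DataFrame({'smiles': train_qm9.ids[:2500]})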
SELFIES defines a dictionary called bond_constraints that enforces how many bonds every atom or ion can make.
E.g., 'C': 4, 'H': 1, etc. The ? symbol is used for any atom or ion that isn't defined in the dictionary, and it defaults to 8
bonds. Because QM9 contains ions and we don't want to allow those ions to form up to 8 bonds, we'll constrain them to
3. This will really improve the percentage of valid molecules we generate. You can read more about setting constraints
in the SELFIES documentation.
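The cell that builds the constraints dictionary is not shown; a minimal sketch consistent with the description above and
with the printed dictionary below:
# Start from the default SELFIES constraints and cap undefined atoms/ions at 3 bonds.
constraints = sf.get_semantic_constraints()
constraints['?'] = 3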
sf.set_semantic_constraints(constraints)
constraints
{'?': 3,
'B': 3,
'B+1': 2,
'B-1': 4,
'Br': 1,
'C': 4,
'C+1': 5,
'C-1': 3,
'Cl': 1,
'F': 1,
'H': 1,
'I': 1,
'N': 3,
'N+1': 4,
'N-1': 2,
'O': 2,
'O+1': 3,
'O-1': 1,
'P': 5,
'P+1': 6,
'P-1': 4,
'S': 6,
'S+1': 7,
'S-1': 5}
def preprocess_smiles(smiles):
    return sf.encoder(smiles)

def keys_int(symbol_to_int):
    d = {}
    i = 0
    for key in symbol_to_int.keys():
        d[i] = key
        i += 1
    return d
data['selfies'] = data['smiles'].apply(preprocess_smiles)
Let's take a look at some short SMILES strings and their corresponding SELFIES representations. We can see right away
that there is a key difference in how the two representations deal with Rings and Branches. SELFIES is designed so that
branch length and ring size are stored locally with the Branch and Ring identifiers, and the SELFIES grammar
prevents invalid strings.
To convert SELFIES to a one-hot encoded representation, we need to construct an alphabet of all the characters that
occur in the list of SELFIES strings. We also have to know what the longest SELFIES string is, so that all the shorter
SELFIES can be padded with '[nop]' to be equal length.
selfies_list = np.asanyarray(data.selfies)
selfies_alphabet = sf.get_alphabet_from_selfies(selfies_list)
selfies_alphabet.add('[nop]') # Add the "no operation" symbol as a padding character
selfies_alphabet.add('.')
selfies_alphabet = list(sorted(selfies_alphabet))
largest_selfie_len = max(sf.len_selfies(s) for s in selfies_list)
symbol_to_int = dict((c, i) for i, c in enumerate(selfies_alphabet))
int_mol = keys_int(symbol_to_int)
The selfies package has a handy utility function to translate SELFIES strings into one-hot encoded vectors.
onehots = sf.batch_selfies_to_flat_hot(selfies_list, symbol_to_int, largest_selfie_len)
Next, we "dequantize" the inputs by adding random noise from the interval [0, 1) to every input in the encodings.
This allows the normalizing flow to operate on continuous inputs (rather than discrete), and the original inputs can easily
be recovered by applying a floor function.
The dequantized data is ready to be processed as a DeepChem dataset and split into training, validation, and test sets.
We'll also keep track of the SMILES strings for the training set so we can compare the training data to our generated
molecules later on.
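The dequantization and splitting cell is not shown; a minimal sketch consistent with the description above (the split
fractions are assumptions chosen to match the 2000/250/250 sizes mentioned earlier, and tracking of the training-set
SMILES is omitted for brevity):
# Dequantize: add uniform noise in [0, 1) so the one-hot inputs become continuous.
onehots_arr = np.asarray(onehots, dtype='float64')
dequantized = onehots_arr + np.random.uniform(0.0, 1.0, size=onehots_arr.shape)

# Wrap in a DeepChem dataset and split into training, validation, and test sets.
ds = NumpyDataset(dequantized)
splitter = RandomSplitter()
train, val, test = splitter.train_valid_test_split(ds, frac_train=0.8, frac_valid=0.1, frac_test=0.1)
print(train.X.shape)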
(2000, 2596)
Next we'll set up the normalizing flow model. The base distribution is a multivariate Normal distribution. The
permutation layer permutes the dimensions of the input so that the normalizing flow layers will operate along multiple
dimensions of the inputs. To understand why the permutation is needed, we need to know a bit about how the
normalizing flow architecture works.
For this simple example, we'll set up a flow of repeating Masked Autoregressive Flow (MAF) layers. The autoregressive
property is enforced by using the Masked Autoencoder for Distribution Estimation (MADE) architecture. Each layer of the
flow is a bijector, an invertible mapping between the base and target distributions.
MAF takes the inputs from the base distribution and transforms them with a simple scale-and-shift (affine) operation, but
crucially the scale and shift for each dimension of the output depend only on the previously generated dimensions of the
output. This dependence on earlier dimensions preserves the autoregressive property and ensures that the
normalizing flow is invertible. Now we can see why we need permutations to change the ordering of the inputs; otherwise
the normalizing flow would only transform certain dimensions of the inputs.
Batch Normalization layers can be added for additional stability in training, but may have strange effects on the outputs
and require some input reshaping to work properly. Increasing num_layers and hidden_units can make more
expressive flows capable of modeling more complex target distributions.
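The cell defining the base distribution and permutation used below is not shown; a minimal sketch (the exact permutation
scheme is an assumption):
dim = train.X.shape[1]  # dimensionality of the one-hot + noise encoding (2596)

# Standard multivariate normal base distribution, in float64 to match the Keras backend setting above.
base_dist = tfd.MultivariateNormalDiag(loc=np.zeros(dim), scale_diag=np.ones(dim))

# Swap the two halves of the dimensions between flow layers so every dimension gets transformed.
permutation = tf.cast(np.concatenate((np.arange(dim // 2, dim), np.arange(0, dim // 2))), tf.int32)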
num_layers = 8
flow_layers = []
Made = tfb.AutoregressiveNetwork(params=2, hidden_units=[512, 512], activation='relu')
for i in range(num_layers):
    flow_layers.append(tfb.MaskedAutoregressiveFlow(shift_and_log_scale_fn=Made))
    flow_layers.append(tfb.Permute(permutation=permutation))
    # if (i + 1) % int(2) == 0:
    #     flow_layers.append(tfb.BatchNormalization())
We can draw samples from the untrained distribution, but for now they don't have any relation to the QM9 dataset
distribution.
%%time
nf = NormalizingFlow(base_distribution=base_dist,
flow_layers=flow_layers)
CPU times: user 280 ms, sys: 10.2 ms, total: 290 ms
Wall time: 289 ms
Now to train the model! We'll try to minimize the negative log likelihood loss, which measures the likelihood that
generated samples are drawn from the target distribution, i.e. as we train the model, it should get better at modeling
the target distribution and it will generate samples that look like molecules from the QM9 dataset.
losses = []
val_losses = []
%%time
max_epochs = 10 # maximum number of epochs of the training
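# The remainder of the training cell does not appear in this text; a minimal sketch (assumptions:
# the NormalizingFlowModel hyperparameters, and validation-loss tracking is omitted here even though
# the original notebook records val_losses).
nfm = NormalizingFlowModel(nf, learning_rate=1e-4, batch_size=128)
for epoch in range(max_epochs):
    # fit returns the average training loss (negative log likelihood) for the epoch.
    loss = nfm.fit(train, nb_epoch=1)
    losses.append(loss)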
WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shap
e=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), bu
t it was called on an input with incompatible shape (1, 128, 2596).
WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shap
e=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), bu
t it was called on an input with incompatible shape (1, 128, 2596).
WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shap
e=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), bu
t it was called on an input with incompatible shape (1, 128, 2596).
WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shap
e=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), bu
t it was called on an input with incompatible shape (1, 128, 2596).
WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shap
e=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), bu
t it was called on an input with incompatible shape (1, 128, 2596).
WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shap
e=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), bu
t it was called on an input with incompatible shape (1, 128, 2596).
WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shap
e=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), bu
t it was called on an input with incompatible shape (1, 128, 2596).
WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shap
e=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), bu
t it was called on an input with incompatible shape (1, 128, 2596).
WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shap
e=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), bu
t it was called on an input with incompatible shape (1, 128, 2596).
WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shap
e=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), bu
t it was called on an input with incompatible shape (1, 128, 2596).
WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shap
e=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), bu
t it was called on an input with incompatible shape (1, 128, 2596).
WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shap
e=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), bu
t it was called on an input with incompatible shape (1, 128, 2596).
WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shap
e=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), bu
t it was called on an input with incompatible shape (1, 128, 2596).
WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shap
e=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), bu
t it was called on an input with incompatible shape (1, 128, 2596).
WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shap
e=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), bu
t it was called on an input with incompatible shape (1, 128, 2596).
WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shap
e=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), bu
t it was called on an input with incompatible shape (1, 128, 2596).
CPU times: user 13min 40s, sys: 20.9 s, total: 14min 1s
Wall time: 7min 27s
f, ax = plt.subplots()
ax.scatter(range(len(losses)), losses, label='train loss')
ax.scatter(range(len(val_losses)), val_losses, label='val loss')
plt.legend(loc='upper right');
The normalizing flow is learning a mapping between the multivariate Gaussian and the target distribution! We can see
this by visualizing the loss on the validation set. We can now use nfm.flow.sample() to generate new QM9-like
molecules and nfm.flow.log_prob() to evaluate the likelihood that a molecule was drawn from the underlying
distribution.
Now we transform the generated samples back into SELFIES. We have to quantize the outputs and add padding
characters to any one-hot encoding vector that has all zeros.
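Since the sampling cell isn't reproduced here, the following is a rough sketch of that quantization step. The sample call, the int_mol index-to-symbol mapping, and the index of the '[nop]' padding symbol are assumptions for illustration, not the tutorial's exact code.
generated = nfm.flow.sample(100).numpy()   # flattened one-hot-like vectors from the trained flow
alphabet_size = len(int_mol)               # int_mol: index -> SELFIES symbol mapping
pad_idx = 0                                # assumed index of the '[nop]' padding symbol
mols_list = []
for sample in generated:
    quantized = np.zeros_like(sample)
    for start in range(0, len(sample), alphabet_size):
        block = sample[start:start + alphabet_size]
        # snap each symbol position to a hard one-hot; all-zero blocks become padding
        hot = pad_idx if block.max() <= 0 else int(np.argmax(block))
        quantized[start + hot] = 1
    mols_list.append(quantized.tolist())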
selfies has another utility function to translate one-hot encoded representations back to SELFIES strings.
mols=sf.batch_flat_hot_to_selfies(mols_list, int_mol)
We can use RDKit to find valid generated molecules. Some have unphysical valencies and should be discarded. If you've
ever tried to generate valid SMILES strings, you'll notice right away that this model is doing much better than we would
expect! Using SELFIES, 90% of the generated molecules are valid, even though our normalizing flow architecture doesn't
know any rules that govern chemical validity.
valid_count = 0
valid_selfies, invalid_selfies = [], []
for idx, selfies in enumerate(mols):
    try:
        if Chem.MolFromSmiles(sf.decoder(mols[idx]), sanitize=True) is not None:
            valid_count += 1
            valid_selfies.append(selfies)
        else:
            invalid_selfies.append(selfies)
    except Exception:
        pass
print('%.2f%% of generated samples are valid molecules.' % (100 * valid_count / len(mols)))
Let's take a look at some of the generated molecules! We'll borrow some helper functions from the Modeling Solubility
tutorial to display molecules with RDKit.
gen_mols = [Chem.MolFromSmiles(sf.decoder(vs)) for vs in valid_selfies]

def display_images(filenames):
    """Helper to pretty-print images."""
    for file in filenames:
        display(Image(file))

display_mols = []
for i in range(10):
    display_mols.append(gen_mols[i])

display_images(mols_to_pngs(display_mols))
Finally, we can compare generated molecules with our training data via a similarity search with Tanimoto similarity. This
gives an indication of how "original" the generated samples are, versus simply producing samples that are extremely
similar to molecules the model has already seen. We have to keep in mind that QM9 contains all stable small molecules
with up to 9 heavy atoms (C, O, N, F). So anything new we generate either already exists in the full QM9 dataset, or else
will not obey the charge neutrality and stability criteria used to generate QM9.
def tanimoto_similarities(gen_fp, train_fps):  # hypothetical reconstruction; the original helper was truncated in this export
    similarities = [DataStructs.TanimotoSimilarity(gen_fp, fp) for fp in train_fps]
    return sorted(similarities, reverse=True)
We'll consider our generated molecules and look at the top 3 most similar molecules from the training data by Tanimoto
similarity. Here's an example where the Tanimoto similarity scores are medium. There are molecules in our training set
that are similar to our generated sample. This might be interesting, or it might mean that the generated molecule is
unrealistic.
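The cell that builds similar_mols isn't shown here. A hedged sketch (names such as train_mols are assumptions, not the tutorial's exact code): fingerprint the training molecules with RDKit Morgan fingerprints, score one generated molecule against all of them, and keep the three closest matches.
from rdkit import DataStructs
from rdkit.Chem import AllChem

train_fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for m in train_mols]
gen_fp = AllChem.GetMorganFingerprintAsBitVect(gen_mols[0], 2, nBits=2048)
scored = sorted(((DataStructs.TanimotoSimilarity(gen_fp, fp), i) for i, fp in enumerate(train_fps)),
                reverse=True)
similar_mols = [train_mols[i] for _, i in scored[:3]]
print(['%.3f' % s for s, _ in scored[:3]])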
display_images(mols_to_pngs(similar_mols, 'qm9_mol'))
0.521
0.471
0.468
Molecules of the previous tutorial:
These molecules were obtained through sampling.
0.243
0.243
0.241
Further reading
So far we have looked at a measure of validity and done a bit of investigation into the novelty of the generated
compounds. There are more dimensions along which we can and should evaluate the performance of a generative
model. For an example of some standard benchmarks, see the GuacaMol evaluation framework.
For more information about FastFlows, take a look at this paper, where the workflow is clearly explained.
For examples of normalizing flow-based molecular graph generation frameworks, check out the MoFlow, GraphAF, and
GraphNVP papers.
This tutorial is unlike the previous tutorials in that it's designed to be run on AWS rather than on Google Colab. That's
because we'll need access to a large machine with many cores to do this computation efficiently. We'll try to provide
details about how to do this throughout the tutorial.
WARNING:tensorflow:From /Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/ops/math_grad.py:318: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/ops/gradients_util.py:93: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
0.0
y_true = np.squeeze(valid.y)
y_pred = model.predict(valid)[:,0,1]
print("Average Precision Score:%s" % average_precision_score(y_true, y_pred))
sorted_results = sorted(zip(y_pred, y_true), reverse=True)
hit_rate_100 = sum(x[1] for x in sorted_results[:100]) / 100
print("Hit Rate Top 100: %s" % hit_rate_100)
2. Create Work-Units
1. Download All of ZINC15.
Go to http://zinc15.docking.org/tranches/home and download all non-empty tranches in .smi format. I found it easiest to
download the wget script and then run it. For the rest of this tutorial I will assume ZINC was downloaded to /tmp/zinc.
The way ZINC downloads the data isn't great for inference. We want "Work-Units" that a single CPU can execute in a
reasonable amount of time (10 minutes to an hour). To accomplish this we are going to split the ZINC data into files of
500 thousand lines each.
mkdir /tmp/zinc/screen
find /tmp/zinc -name '*.smi' -exec cat {} \; | grep -iv "smiles" \
| split -l 500000 /tmp/zinc/screen/segment
This bash command finds all of the downloaded .smi files, concatenates them while stripping the header lines containing "smiles", and splits the result into work-unit files of 500,000 lines each.
inference.py
import sys
import deepchem as dc
import numpy as np
from rdkit import Chem
import pickle
import os

def evaluate(fname):
    fout_name = "%s_out.smi" % fname
    model = dc.models.TensorGraph.load_from_dir('screen_model')
    for ds, lines in create_dataset(fname):
        y_pred = np.squeeze(model.predict(ds), axis=1)
        with open(fout_name, 'a') as fout:
            for index, line in enumerate(lines):
                line.append(y_pred[index][1])
                line = [str(x) for x in line]
                line = "\t".join(line)
                fout.write("%s\n" % line)

if __name__ == "__main__":
    evaluate(sys.argv[1])
4. Load "Work-Unit" into a "Work Queue"
We are going to use a flat file as our distribution mechanism: a bash script that calls our inference script for every work
unit. At an academic institution this would mean queueing your jobs with PBS/qsub/Slurm. In the cloud, an option would
be RabbitMQ or Kafka.
import os

work_units = os.listdir('/tmp/zinc/screen')
with open('/tmp/zinc/work_queue.sh', 'w') as fout:
    fout.write("#!/bin/bash\n")
    for work_unit in work_units:
        full_path = os.path.join('/tmp/zinc/screen', work_unit)  # join with the screen directory we listed above
        fout.write("python inference.py %s\n" % full_path)       # newline so each work unit gets its own command
process_pool.py
import multiprocessing
import sys
from multiprocessing.pool import Pool

import delegator

def run_command(args):
    q, command = args
    cpu_id = q.get()
    try:
        command = "taskset -c %s %s" % (cpu_id, command)
        print("running %s" % command)
        c = delegator.run(command)
        print(c.err)
        print(c.out)
    except Exception as e:
        print(e)
    q.put(cpu_id)

if __name__ == "__main__":
    processors = multiprocessing.cpu_count()
    main(processors, sys.argv[1])
>> python process_pool.py /tmp/zinc/work_queue.sh
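The script calls a main helper whose definition isn't included here. A minimal sketch, assuming the work queue file from the previous step and a managed queue of CPU ids, might look like the following (illustrative, not the tutorial's exact code). In the actual script it would sit above the __main__ block.
def main(n_processors, work_queue_path):
    # Read the queued commands, skipping the shebang/comment lines.
    with open(work_queue_path) as fin:
        commands = [line.strip() for line in fin
                    if line.strip() and not line.startswith("#")]
    # One token per CPU: each worker checks out a CPU id, pins its command to
    # that core with taskset (see run_command), then returns the id to the queue.
    manager = multiprocessing.Manager()
    q = manager.Queue()
    for cpu_id in range(n_processors):
        q.put(cpu_id)
    pool = Pool(processes=n_processors)
    pool.map(run_command, [(q, command) for command in commands])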
6. Gather Results
Since we logged our results to *_out.smi files, we now need to gather all of them up and sort them by our predictions. The
resulting file will be > 40GB. To analyze the data further you can use dask, or put the data in an RDKit PostgreSQL cartridge.
Here I show how to join and sort the data to get the "best" results.
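The joining-and-sorting cell isn't reproduced here; a minimal sketch (file locations and variable names are assumptions, not the tutorial's exact code) could look like this:
import glob
from rdkit import Chem

rows = []
for fname in glob.glob('/tmp/zinc/screen/*_out.smi'):
    with open(fname) as fin:
        for line in fin:
            parts = line.strip().split('\t')
            rows.append((float(parts[-1]), parts[0]))   # inference.py appended the score as the last column

rows.sort(reverse=True)                                 # highest predicted activity first
best_scores = [score for score, smiles in rows]
best_mols = [Chem.MolFromSmiles(smiles) for score, smiles in rows[:100]]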
print(best_scores[0])
best_mols[0]
0.98874843
print(best_scores[0])
best_mols[1]
0.98874843
print(best_scores[0])
best_mols[2]
0.98874843
print(best_scores[0])
best_mols[3]
0.98874843
The screen seems to favor molecules with one or multiple sulfur trioxides. The top-scoring molecules also have low
diversity. When creating a "buy list" we want to optimize for more things than just activity, for instance diversity and
drug-like MPO (multi-parameter optimization) scores.
#We use the code from https://github.com/PatWalters/rd_filters, detailed explanation is here: http://practicalcheminf
#We will run the PAINS filter on best_mols as suggested by Issue 1355 (https://github.com/deepchem/deepchem/issues/13
import os
import pandas as pd
from rdkit import Chem
from rdkit.Chem.Descriptors import MolWt, MolLogP, NumHDonors, NumHAcceptors, TPSA
from rdkit.Chem.rdMolDescriptors import CalcNumRotatableBonds

# First we get the rules from alert_collection.csv and then filter to get the PAINS rules
rule_df = pd.read_csv(os.path.join(os.path.abspath(''), 'assets', 'alert_collection.csv'))
rule_df = rule_df[rule_df['rule_set_name'] == 'PAINS']

rule_list = []
for rule_id, smarts, max_val, desc in rule_df[["rule_id", "smarts", "max", "description"]].values.tolist():
    smarts_mol = Chem.MolFromSmarts(smarts)
    if smarts_mol:
        rule_list.append((smarts_mol, max_val, desc))

def evaluate(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return [smiles, "INVALID", -999, -999, -999, -999, -999, -999]
    desc_list = [MolWt(mol), MolLogP(mol), NumHDonors(mol), NumHAcceptors(mol), TPSA(mol),
                 CalcNumRotatableBonds(mol)]
    for patt, max_val, desc in rule_list:
        if len(mol.GetSubstructMatches(patt)) > max_val:
            return [smiles, desc + " > %d" % (max_val)] + desc_list
    return [smiles, "OK"] + desc_list
In this tutorial, we will explore how to train MAT, and predict hydration enthalpy values for molecules from the freesolv
hydration enthalpy dataset with MAT.
Colab
This tutorial and the rest in this sequence are designed to be done in Google colab. If you'd like to open this notebook in
colab, you can use the following link.
Open in Colab
wandb: WARNING W&B installed but not logged in. Run `wandb login` or set the WANDB_API_KEY env variable.
featurizer = dc.feat.MATFeaturizer()
# Let us now take an example array of SMILES strings and featurize it.
smile_string = ["CCC"]
output = featurizer.featurize(smile_string)
print(type(output[0]))
print(output[0].node_features)
print(output[0].adjacency_matrix)
print(output[0].distance_matrix)
<class 'deepchem.feat.molecule_featurizers.mat_featurizer.MATEncoding'>
[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.
0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.
0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0.
0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]]
[[0. 0. 0. 0.]
[0. 0. 0. 1.]
[0. 0. 0. 1.]
[0. 1. 1. 0.]]
[[1.e+06 1.e+06 1.e+06 1.e+06]
[1.e+06 0.e+00 2.e+00 1.e+00]
[1.e+06 2.e+00 0.e+00 1.e+00]
[1.e+06 1.e+00 1.e+00 0.e+00]]
train_dataset
<DiskDataset X.shape: (513,), y.shape: (513, 1), w.shape: (513, 1), ids: ['CCCCNCCCC' 'CCOC=O' 'CCCCCCCCC' ...
'COC' 'CCCCCCCCBr'
'CCCc1ccc(c(c1)OC)O'], task_names: ['y']>
device = 'cpu'
model = MATModel(device = device)
%%time
max_epochs = 10
# The warnings are not relevant to this tutorial thus we can safely skip them.
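# A minimal sketch of the training loop that fills `losses` and `val_losses`
# (assumed, not the tutorial's exact cell; `valid_dataset` is the validation split).
losses, val_losses = [], []
metric = dc.metrics.Metric(dc.metrics.mean_squared_error)
for epoch in range(max_epochs):
    losses.append(model.fit(train_dataset, nb_epoch=1))
    val_losses.append(model.evaluate(valid_dataset, metrics=[metric])['mean_squared_error'])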
f, ax = plt.subplots()
ax.scatter(range(len(losses)), losses, label='train loss')
ax.scatter(range(len(val_losses)), val_losses, label='val loss')
plt.legend(loc='upper right');
Testing the model
Optimally, MAT should be trained for many more epochs on a GPU. Due to computational constraints, we train this
model for very few epochs in this tutorial. Let us now see how to predict hydration enthalpy values for molecules
with MAT.
# We will be predicting the enthalpy value for the smile string we featurized earlier in the MATFeaturizer section.
model.predict_on_batch(output)
The architecture consists of 3 main sections: a generator, a discriminator, and a reward network.
The generator takes a sample (z) from a standard normal distribution and uses an MLP to generate the graph all at once
(which limits the network to a fixed maximum size). Specifically, a dense adjacency tensor A (bond types) and an
annotation matrix X (atom types) are produced. Since these are probabilities, a discrete, sparse x and a are generated
through categorical sampling.
The discriminator and reward network share the same architecture and receive graphs as inputs. A relational GCN and
MLPs are used to produce the single output.
Colab
This tutorial and the rest in this sequence are designed to be done in Google colab. If you'd like to open this notebook in
colab, you can use the following link.
Open in Colab
Setup
To run DeepChem within Colab, you'll need to run the following cell of installation commands.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import os
from collections import OrderedDict
import deepchem as dc
import deepchem.models
import torch
from deepchem.models.torch_models import BasicMolGANModel as MolGAN
from deepchem.models.optimizers import ExponentialDecay
from torch.nn.functional import one_hot
from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
Download, load, and extract the SMILES strings from the tox21 dataset. The original paper used the QM9 dataset,
however we use the tox21 dataset here to save time.
Specify the maximum number of atoms to encode for the featurizer and the MolGAN network. The higher the number of
atoms, the more data you'll have in the dataset. However, this also increases the model complexity, since the input
dimensions become higher.
num_atoms = 12
df
smiles
0 CC(O)(P(=O)(O)O)P(=O)(O)O
1 CC(C)(C)OOC(C)(C)CCC(C)(C)OOC(C)(C)C
2 OC[C@H](O)[C@@H](O)[C@H](O)CO
3 CCCCCCCC(=O)[O-].CCCCCCCC(=O)[O-].[Zn+2]
4 CC(C)COC(=O)C(C)C
... ...
6259 CC1CCCCN1CCCOC(=O)c1ccc(OC2CCCCC2)cc1
6260 Cc1cc(CCCOc2c(C)cc(-c3noc(C(F)(F)F)n3)cc2C)on1
6261 O=C1OC(OC(=O)c2cccnc2Nc2cccc(C(F)(F)F)c2)c2ccc...
6262 CC(=O)C1(C)CC2=C(CCCC2(C)C)CC1C
6263 CC(C)CCC[C@@H](C)[C@H]1CC(=O)C2=C3CC[C@H]4C[C@...
Uncomment the first line if you want to subsample from the full dataset.
#data = df[['smiles']].sample(4000, random_state=42)
data = df
Initialize the featurizer with the maximum number of atoms per molecule. atom_labels is a parameter to pass the
atomic numbers of the atoms you want to be able to parse. Similar to the num_atoms parameter above, more atom_labels
means more data, though the model gets more complex/unstable.
# create featurizer
feat = dc.feat.MolGanFeaturizer(max_atom_count=num_atoms, atom_labels=[0, 5, 6, 7, 8, 9, 11, 12, 13, 14]) #15, 16, 17
smiles = data['smiles'].values
Filter out the molecules with too many atoms to reduce the number of unnecessary error messages in later steps.
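As a sketch of that filtering step (assumed, not the tutorial's exact cell), we can keep only the SMILES whose molecules fit within the size limit:
filtered_smiles = []
for s in smiles:
    mol = Chem.MolFromSmiles(s)
    # keep only molecules RDKit can parse and that fit within the MolGAN size limit
    if mol is not None and mol.GetNumAtoms() <= num_atoms:
        filtered_smiles.append(s)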
The next cell featurizes the filtered molecules, however, since we have limited the atomic numbers to [5, 6, 7, 8,
9, 11, 12, 13, 14] which is B, C, N, O, F, Na, Mg, Al and Si, the featurizer fails to featurize several molecules in the
dataset. Feel free to experiment with more atomic numbers!
# featurize molecules
features = feat.featurize(filtered_smiles)
Instantiate the MolGAN model and set the learning rate and maximum number of atoms as the size of the vertices.
Then, we create the dataset in the format of the input to MolGAN.
# create model
gan = MolGAN(learning_rate=ExponentialDecay(0.001, 0.9, 5000), vertices=num_atoms)
dataset = dc.data.NumpyDataset([x.adjacency_matrix for x in features],[x.node_features for x in features])
Define the iterbatches function because the gan_fit function requires an iterable for the batches.
def iterbatches(epochs):
    for i in range(epochs):
        for batch in dataset.iterbatches(batch_size=gan.batch_size, pad_batches=True):
            flattened_adjacency = torch.from_numpy(batch[0]).view(-1).to(dtype=torch.int64)  # flatten the input because one_hot expects a 1-D tensor
            invalid_mask = (flattened_adjacency < 0) | (flattened_adjacency >= gan.edges)  # edge type cannot be negative or >= gan.edges
            clamped_adjacency = torch.clamp(flattened_adjacency, 0, gan.edges - 1)  # clamp the input so it can be fed to one_hot
            adjacency_tensor = one_hot(clamped_adjacency, num_classes=gan.edges)  # actual one_hot
            adjacency_tensor[invalid_mask] = torch.zeros(gan.edges, dtype=torch.long)  # make the invalid entries a vector of zeros
            adjacency_tensor = adjacency_tensor.view(*batch[0].shape, -1)  # reshape to the original batch shape

            flattened_node = torch.from_numpy(batch[1]).view(-1).to(dtype=torch.int64)
            invalid_mask = (flattened_node < 0) | (flattened_node >= gan.nodes)
            clamped_node = torch.clamp(flattened_node, 0, gan.nodes - 1)
            node_tensor = one_hot(clamped_node, num_classes=gan.nodes)
            node_tensor[invalid_mask] = torch.zeros(gan.nodes, dtype=torch.long)
            node_tensor = node_tensor.view(*batch[1].shape, -1)

            # The end of this cell was truncated in this export; fit_gan expects each batch
            # keyed by the model's data inputs (assumed form, matching the MolGAN examples).
            yield {gan.data_inputs[0]: adjacency_tensor, gan.data_inputs[1]: node_tensor}
Train the model with the fit_gan function and generate molecules with the predict_gan_generator function.
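A minimal sketch of those two calls (the epoch count and hyperparameters below are illustrative assumptions, not the tutorial's exact settings):
# train the GAN on batches produced by iterbatches, then sample new graphs from the generator
gan.fit_gan(iterbatches(25), generator_steps=0.2, checkpoint_interval=5000)
generated_data = gan.predict_gan_generator(1000)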
nmols = feat.defeaturize(generated_data)
print("{} molecules generated".format(len(nmols)))
Print out the number of valid molecules; training can be unstable, so the number can vary significantly.
img
This is an example of what the molecules should look like.
Introduction to GROVER
In this tutorial, we will go over what Grover is, and how to get it up and running.
GROVER, or Graph Representation frOm self-superVised mEssage passing tRansformer, is a novel framework proposed
by Tencent AI Lab. GROVER utilizes self-supervised tasks at the node, edge, and graph levels to learn rich
structural and semantic information about molecules from large unlabelled molecular datasets. GROVER integrates Message
Passing Networks into a Transformer-style architecture to deliver more expressive molecular encoding.
Reference Paper: Rong, Yu, et al. "Grover: Self-supervised message passing transformer on large-scale molecular data."
Advances in Neural Information Processing Systems (2020).
Colab
This tutorial and the rest in this sequence are designed to be done in Google colab. If you'd like to open this notebook in
colab, you can use the following link.
Open in Colab
Setup
To run DeepChem within Colab, you'll need to run the following installation commands. This will take about 5 minutes to
run to completion and install your environment. You can of course run this tutorial locally if you prefer. In that case,
don't run these cells since they will download and install Anaconda on your local machine.
NOTE: The original GROVER repository does not contain a setup.py file, thus we are currently using a fork which does.
/content/drive/MyDrive
fatal: destination path 'grover' already exists and is not an empty directory.
/content/drive/MyDrive/grover
Obtaining file:///content/drive/MyDrive/grover
Installing collected packages: grover
Running setup.py develop for grover
Successfully installed grover-1.0.0
Collecting deepchem
Downloading deepchem-2.6.1-py3-none-any.whl (608 kB)
Predicting output
Extracting molecular features
If the fine-tuned model uses molecular features as input, we need to generate the molecular features for the target
molecules as well.
Output
The output will be saved in a file called data_pre.csv .
This DeepChem tutorial serves as a starting point for exploring the world of PROTACs and the exciting field of targeted
protein degradation. The tutorial is divided into five partitions:
1. Background literature
2. Data extraction
3. Featurization
4. Model deployment
5. References
With that in mind, let's jump into how we can predict efficacy of PROTAC degraders!
1. Background literature
Traditional drug modalities, such as small-molecule drugs or monoclonal antibodies, are limited to certain modes of
action, like targeting specific receptors or blocking particular pathways. Targeted protein degradation (TPD) represents
a promising new approach to modulate proteins that have been traditionally difficult to target. TPD has given rise to
major classes of molecules that have emerged as promising therapeutic approaches against various disease contexts.
Figure 1: Molecular structure of PROTACs molecules designed to inhibit epidermal growth factor receptor (EGFR). The
PROTAC linker connects the EGFR ligand and E3 ligase which are highlighted in yellow and gray, respectively [1].
Figure 2: The ubiquitin proteasome system is one of the cell's internal degradation mechanisms, crucial for targeting
dysfunctional proteins. Naturally, this opens up opportunities to leverage it in a therapeutic context [2].
Furthermore, after the POI is degraded by the proteasome, PROTACs can dissociate and continue to induce further
degradation, enabling low concentrations to be efficacious. This catalytic mechanism of action and event-driven
pharmacology spares PROTACs from the limitations of conventional therapeutic strategies, such as drug resistance and
off-target effects.
Figure 3: The mechanism of action of PROTACs center around the UPS. In a heterobifunctional manner, recruiting both
a target protein of interest and an E3 ligase, PROTACs are able to promote protein degradation in diseases [3].
For a more in-depth dive into PROTACs, ubiquitin proteasome system, and targeted protein degradation, readers are
referred to [5] and [6].
2. Data extraction
Before we proceed, let's install deepchem into our colab environment.
Collecting deepchem
Downloading deepchem-2.8.0-py3-none-any.whl (1.0 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.0/1.0 MB 5.8 MB/s eta 0:00:00
Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from deepchem) (1.4.2)
Requirement already satisfied: numpy>=1.21 in /usr/local/lib/python3.10/dist-packages (from deepchem) (1.25.2)
Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from deepchem) (2.0.3)
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.10/dist-packages (from deepchem) (1.2.2)
Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from deepchem) (1.12.1)
Requirement already satisfied: scipy>=1.10.1 in /usr/local/lib/python3.10/dist-packages (from deepchem) (1.11.4)
Collecting rdkit (from deepchem)
Downloading rdkit-2023.9.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.9 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 34.9/34.9 MB 12.0 MB/s eta 0:00:00
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.10/dist-packages (from pandas->d
eepchem) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas->deepchem) (
2023.4)
Requirement already satisfied: tzdata>=2022.1 in /usr/local/lib/python3.10/dist-packages (from pandas->deepchem)
(2024.1)
Requirement already satisfied: Pillow in /usr/local/lib/python3.10/dist-packages (from rdkit->deepchem) (9.4.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-lear
n->deepchem) (3.5.0)
Requirement already satisfied: mpmath<1.4.0,>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from sympy->deep
chem) (1.3.0)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.2-
>pandas->deepchem) (1.16.0)
Installing collected packages: rdkit, deepchem
Successfully installed deepchem-2.8.0 rdkit-2023.9.6
Now let's download this dataset on PROTACs, curated by [7], which includes 3270 PROTACs.
os.system('wget https://deepchemdata.s3.us-west-1.amazonaws.com/datasets/protac_10_06_24.csv')
protac_db = pd.read_csv('protac_10_06_24.csv')
Note that there exists a many-to-many mapping between PROTAC compounds and target proteins. A single PROTAC
compound can be designed to target multiple proteins, and conversely, multiple PROTAC compounds can be developed
to target the same protein. This many-to-many relationship allows for greater flexibility and adaptability in the design
and application of PROTACs.
print('''In this dataset, there are {} unique PROTAC compounds, targeting {} unique proteins for a total of {} combinations'''.format(
    len(protac_db['Compound ID'].unique()),   # column names assumed from the table shown below
    len(protac_db['Target'].unique()),
    len(protac_db)))
protac_db
In this dataset, there are 3270 unique PROTAC compounds, targeting 323 unique proteins for a total of 5388 combinations
      Compound ID  Uniprot  Target   E3 ligase  PDB  Name     Smiles                                              DC50 (nM)
5384  3267         NaN      BCR-ABL  FEM1B      NaN  NaN      CC1=NC(NC2=NC=C(C(=O)NC3=C(C)C=CC=C3Cl)S2)=CC(...  NaN
5385  3268         NaN      BCR-ABL  FEM1B      NaN  NaN      CC1=NC(NC2=NC=C(C(=O)NC3=C(C)C=CC=C3Cl)S2)=CC(...  NaN
5386  3269         P03372   ER       CRBN       NaN  ARV-471  O=C1CC[C@H](N2CC3=CC(N4CCN(CC5CCN(C6=CC=C([C@@...  2
5387  3270         P10275   AR       CRBN       NaN  ARV-110  N#CC1=CC=C(O[C@H]2CC[C@H](NC(=O)C3=CC=C(N4CCC(...  1
Taking a closer look at the dataset, each PROTAC compound has a SMILES representation along with its target protein of
interest and E3 ligase. For reference, here is an example:
example = protac_db.iloc[0]
print('''Here is the SMILEs representation of a PROTAC compound: {}
designed to target {} protein through ubiquitination by {} E3 ligase.'''.format(example['Smiles'], example['Target'], example['E3 ligase']))
Here is the SMILEs representation of a PROTAC compound: COC1=CC(C2=CN(C)C(=O)C3=CN=CC=C23)=CC(OC)=C1CN1CCN(CCOCC
OCC(=O)N[C@H](C(=O)N2C[C@H](O)C[C@H]2C(=O)NCC2=CC=C(C3=C(C)N=CS3)C=C2)C(C)(C)C)CC1
designed to target BRD7 protein through ubiquitination by VHL E3 ligase.
protac_db.columns
In general, the PROTAC-DB dataset contains information for a variety of different physiochemical and biochemical
properties of PROTAC structures. Several useful ones to point out are the dissociation constant Kd,
which measures the concentration of a ligand needed to achieve 50% occupancy of the protein binding sites, and XLogP3,
which estimates a compound's solubility, an indication of its absorption and distribution characteristics.
Before we proceed, let's plot the distribution of each of these properties to get a better sense of our PROTAC dataset
starting with ΔG values.
[]
Let's take a closer look at the distribution of PROTAC molecules around the -10 range of ΔG values.
x_min = -15
x_max = -5
bin_size = 1
bins = np.arange(x_min, x_max, bin_size)
plt.hist(delta_G, bins=bins)
plt.xlabel('ΔG (kcal/mol)')
plt.ylabel('Frequency')
plt.title('Distribution of ΔG ranged from -15 to -5 across PROTAC molecules')
plt.plot()
[]
There does not appear to be a lot of information on the spontaneity of PROTAC reactions but it is worth noting that the
ones with recorded ΔGs appear energetically favorable, as expected.
Let's now take a look at the Kd values.
[]
Similar to the ΔG values, there does not appear to be a lot of information on the affinity of formed PROTAC complexes. Since
the range is so large, let's plot a second histogram focused on the PROTACs with low Kd values.
# limit range
x_max = 1500
x_min = 0
bin_size = 25
bins = np.arange(x_min, x_max, bin_size)
plt.hist(kd_data, bins=bins)
[]
The improved resolution illustrates a much cleaner distribution of Kd values, indicating that the PROTAC linker can form
a strong connection with the E3 ligase and target protein.
Let's now take a look at XLogP3 values. Note that this is slightly different from the typical LogP partition coefficient.
Recall that LogP is defined as
LogP = log10( [solute in organic phase] / [solute in aqueous phase] )
In other words, LogP is the measured ratio of the concentration of a compound in the organic phase to its
concentration in the aqueous phase, a measure of the compound's solubility. XLogP3 is a knowledge-based method for
calculating the partition coefficient by accounting for the molecular structure, presence of functional groups, and
bonding [8]. Both properties estimate a compound's lipophilicity, giving insight into how a compound may behave in
biological systems.
plt.hist(protac_db['XLogP3'])
plt.xlabel('XLogP3 Values')
plt.ylabel('Frequency')
plt.title('Distribution of XLogP3 values across PROTAC molecules')
plt.plot()
[]
All PROTAC compounds have a recorded XLogP3 value. The distribution looks roughly normal, with few molecules
having extreme logP profiles.
Now, let's take a look at the PROTAC degradation properties. "DC50 (nM)" and "Dmax (%)" represent the half maximal
degradation concentration and maximal degradation of the target protein of interest, respectively. Let's take a quick
look at their distributions.
Notice that the values are all in string format with non-numerical characters such as '<', '/', and '>'. For the time being,
let's remove these values.
raw_dc50 = raw_dc50[~raw_dc50.str.contains('<|>|/|~|-')]
raw_dc50 = raw_dc50.astype(float)
plt.hist(raw_dc50.values, bins=75)
plt.xlabel('DC50 (nM)')
plt.ylabel('Frequency')
plt.title('DC50 for all PROTACs')
plt.plot()
[]
The distribution is certainly skewed and has a few outliers. Let's log normalize.
lognorm_dc50 = np.log(raw_dc50)
plt.hist(lognorm_dc50, bins=15)
plt.xlabel('Log normalized DC50 values (log nM)')
plt.ylabel('Frequency')
plt.title('Distribution of log normalized DC50 values')
plt.plot()
[]
Now, let's take a look at Dmax percentage which represents the maximal degradation a PROTAC can elicit relative to
the total activity of the target protein of interest [7].
# Using the same row indices as our cleaned DC50 data
dmax = protac_db.iloc[lognorm_dc50.index]['Dmax (%)']
plt.hist(dmax.values, bins=10)
plt.xlabel('Dmax (%)')
plt.ylabel('Frequency')
plt.title('Distribution of Dmax (%)')
plt.plot()
[]
Notice that Dmax is represented as a percentage. For now, let's continue with regressing on DC50. We are now ready to
featurize!
protac_smiles = cleaned_data['Smiles']
dc_vals = lognorm_dc50
3. Featurization
Let's featurize using CircularFingerprint, which is incorporated in DeepChem! CircularFingerprint is a common featurizer
for molecules that encodes local information about each atom and its neighborhood. For more information, the reader
is referred to [9].
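The featurizer instantiation cell isn't shown here; a minimal version (the radius and size values are assumed, not the tutorial's exact settings) might be:
featurizer = dc.feat.CircularFingerprint(radius=2, size=2048)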
features = featurizer.featurize(protac_smiles)
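Likewise, a sketch of wrapping the fingerprints and the log-normalized DC50 labels into a DeepChem dataset (assumed, not the tutorial's exact cell):
dataset = dc.data.NumpyDataset(X=features, y=np.array(dc_vals))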
splitter = dc.splits.RandomSplitter()
train_random, val_random, test_random = splitter.train_valid_test_split(dataset, seed=42)
Along with a random split, let's also use a scaffold split, which ensures that the splits contain a structurally diverse array
of compounds. Scaffold splitting groups molecules according to the presence of rings, linkers, combinations of rings and
linkers, as well as atomic properties. In general, scaffold splits are a good way of ensuring the generalizability of our models.
# Scaffold split
splitter = dc.splits.ScaffoldSplitter()
train_scaffold, val_scaffold, test_scaffold = splitter.train_valid_test_split(dataset, seed=42)
To see the scaffold split in action, let's visualize the chosen compounds across the splits.
There are certainly functional group differences spread throughout the splits. Notice the presence of the nitrile group in
the train set, amine group in the validation set, as well as the sulfonamide group in the test set.
Additionally, notice the structural and conformational differences among the various data splits. It will be interesting to
see how well our model generalizes.
4. Model deployment
We have successfully generated our train and test datasets. Let's now create a simple MLP model to predict PROTAC
degradation properties!
n_tasks = 1
n_features = train_random.X.shape[1]
layer_sizes = [256, 32, 1]
dropouts = [0.0, 0.2, 0]
activation_fns = [nn.ReLU(), nn.ReLU(), nn.Identity()]
optimizer = dc.models.optimizers.Adam()

# L2 loss is default
protac_model_random = dc.models.MultitaskRegressor(n_tasks, n_features, layer_sizes, dropouts=dropouts,
                                                   activation_fns=activation_fns, optimizer=optimizer,
                                                   batch_size=10, log_frequency=log_freq)
protac_model_scaffold = dc.models.MultitaskRegressor(n_tasks, n_features, layer_sizes, dropouts=dropouts,
                                                     activation_fns=activation_fns, optimizer=optimizer,
                                                     batch_size=10, log_frequency=log_freq)
Let's now wrap everything together to instantiate a DeepChem model! Note that due to the small sample size, a smaller
batch size actually helps performance.
train_losses_random = []
val_losses_random = []
train_losses_scaffold = []
val_losses_scaffold = []
metric = [dc.metrics.Metric(dc.metrics.mean_squared_error)]
n_epochs=100
for i in range(n_epochs):
    protac_model_random.fit(train_random, nb_epoch=1, all_losses=train_losses_random)
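    # (assumed continuation of the loop body; not the tutorial's exact cell)
    # Fit the scaffold model and record per-epoch validation MSE for both models.
    protac_model_scaffold.fit(train_scaffold, nb_epoch=1, all_losses=train_losses_scaffold)
    val_losses_random.append(
        protac_model_random.evaluate(val_random, metrics=metric)['mean_squared_error'])
    val_losses_scaffold.append(
        protac_model_scaffold.evaluate(val_scaffold, metrics=metric)['mean_squared_error'])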
We can easily look at how the training went by plotting the recorded losses.
plt.plot()
[]
We can see that the model performs less well on the scaffold validation set, which makes sense, as the scaffold split
ensures that more validation molecules are out of distribution relative to the training distribution.
Let's now perform some inference on our test set to evaluate our models!
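The evaluation cell itself isn't shown; a minimal sketch that builds eval_metrics for the random split (using scipy's pearsonr for the correlation entry, an assumption) could look like the following. The same pattern applies to the scaffold split with protac_model_scaffold and test_scaffold.
from scipy.stats import pearsonr

y_true = test_random.y.ravel()
y_pred = protac_model_random.predict(test_random).ravel()
eval_metrics = {
    'mean_squared_error': dc.metrics.mean_squared_error(y_true, y_pred),
    'pearsonr': pearsonr(y_true, y_pred)[0],
    'pearson_r2_score': dc.metrics.pearson_r2_score(y_true, y_pred),
}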
for k, v in eval_metrics.items():
print('{}: {}'.format(k, v))
mean_squared_error: 3.074001339280645
pearsonr: 0.818568671566446
pearson_r2_score: 0.6700546700700561
# Adjust the position of the title to avoid overlap with the plot
plt.tight_layout()
plt.show()
The random split appears to do fairly well. Let's see how well our model does on the scaffold split.
for k, v in eval_metrics.items():
print('{}: {}'.format(k, v))
mean_squared_error: 5.991774828135091
pearsonr: -0.10286796793151554
pearson_r2_score: 0.010581818826359309
# Adjust the position of the title to avoid overlap with the plot
plt.tight_layout()
plt.show()
The model does significantly worse on the held-out scaffold test set, which was expected given the simplicity of the
model. Developing far more complex models that can generalize out of distribution is a key area of focus in many
areas of research, from molecular property prediction to computer vision to natural language processing. In general, I
hope this tutorial was an informative introduction to the world of PROTACs. Follow along as we explore how we can
think about PROTAC design in the next tutorial!
5. References
[1] Kelm, J.M., Pandey, D.S., Malin, E. et al. PROTAC’ing oncoproteins: targeted protein degradation for cancer therapy.
Mol Cancer. 2023, 22, 62. https://doi.org/10.1186/s12943-022-01707-5
[2] Tu, Y., Chen, C., Pan, J., Xu, J., Zhou, Z. G., & Wang, C. Y. The Ubiquitin Proteasome Pathway (UPP) in the regulation
of cell cycle control and DNA damage repair and its implication in tumorigenesis. International journal of clinical and
experimental pathology. 2012, 5, 8.
[3] Sun, X., Gao, H., Yang, Y. et al. PROTACs: great opportunities for academia and industry. Sig Transduct Target Ther.
2019, 4, 64. https://doi.org/10.1038/s41392-019-0101-6
[4] Che Y, Gilbert AM, Shanmugasundaram V, Noe MC. Inducing protein-protein interactions with molecular glues. Bioorg
Med Chem Lett. 2018, 28, 15. https://doi.org/10.1016/j.bmcl.2018.04.046.
[5] Békés, M., Langley, D.R. & Crews, C.M. PROTAC targeted protein degraders: the past is prologue. Nat Rev Drug
Discov. 2022, 21, 181–200. https://doi.org/10.1038/s41573-021-00371-6
[6] Liu, Z., Hu, M., Yang, Y. et al. An overview of PROTACs: a promising drug discovery paradigm. Mol Biomed. 2022, 3
(46). https://doi.org/10.1186/s43556-022-00112-0
[7] Gaoqi Weng, Xuanyan Cai, Dongsheng Cao, Hongyan Du, Chao Shen, Yafeng Deng, Qiaojun He, Bo Yang, Dan Li,
Tingjun Hou, PROTAC-DB 2.0: an updated database of PROTACs, Nucleic Acids Research. 2023, 51 (D1), Pages D1367–
D1372, https://doi.org/10.1093/nar/gkac946
[8] Cheng T, Zhao Y, Li X, Lin F, Xu Y, Zhang X, Li Y, Wang R, Lai L. Computation of octanol-water partition coefficients
by guiding an additive model with knowledge. J Chem Inf Model. 2007, 47 (6), 2140-8.
https://doi.org/10.1021/ci700257y.
[9] Glem RC, Bender A, Arnby CH, Carlsson L, Boyer S, Smith J. Circular fingerprints: flexible molecular descriptors with
applications from physical chemistry to ADME. IDrugs. 2006, 9 (3).
Congratulations! Time to join the Community!
Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue
working with DeepChem, we encourage you to finish the rest of the tutorials in this series. You can also help the
DeepChem community in the following ways:
Table of Contents:
Introduction
Understanding Druggability
Methods to assess druggability
Application of Machine Learning in Druggability Assessment
Practical Application of Machine Learning for Druggability Prediction
Building a dataset
Fpocket to identify binding pockets
ML model to classify the binding pockets
Introduction
In this tutorial, we will explore the concept of druggability and its crucial role in identifying successful drug targets. We
will then apply a machine learning model to classify drug targets as highly druggable or less druggable, helping us
assess the potential of a protein to be effectively targeted by a drug.
Understanding Druggability
Protein Pockets
Protein pockets, also known as binding pockets or active sites, are regions on the surface of a protein where small
molecules, such as drugs, can bind. These pockets are formed by the three-dimensional folding of the protein. Protein
pockets are characterized by specific amino acids lining the pocket that interact with ligands through various forces,
such as hydrogen bonds, hydrophobic interactions, van der Waals forces, and ionic bonds. These pockets are crucial as
they can be active sites where catalytic activity occurs, or allosteric sites where binding modulates the protein's function
without directly involving the active site.
Identifying protein pockets or binding sites on disease-related proteins is essential for selecting targets for new drugs.
Once a binding site is known, drugs can be designed to fit precisely into these sites, enhancing their efficacy and
reducing side effects. Understanding the binding site helps in modifying drug molecules to increase their affinity and
specificity.
Binding sites are central to the concept of druggability, as they are the points of interaction between a drug and its
target protein. The characteristics of binding sites, such as their geometric and chemical properties, determine whether
a protein can be effectively targeted by a drug. By understanding and analyzing these sites, we can identify druggable
targets, design and optimize drugs, and predict the druggability of new proteins, ultimately facilitating the development
of effective and safe therapeutic agents.
To learn more about binding sites, check out this additional DeepChem tutorial on the topic: Introduction to Binding
Sites
Druggability
Druggability is the measure of whether a biological drug target, like a protein, can be effectively targeted and
modulated by a drug to treat a disease. It basically refers to how suitable a protein is for being targeted by a drug. Not
all proteins are good drug targets. A druggable protein has certain characteristics that make it possible to design a drug
to interact with it effectively. For instance, a druggable protein must have accessible and well-defined binding sites or
pockets that can interact with drug molecules.
Fig. 1: A druggable pocket corresponds to a protein region capable of binding a drug-like molecule. (source)
Structurally, a druggable target must have well-defined binding pockets where potential drugs can bind. These pockets,
identified through techniques like X-ray crystallography or computational modeling, should be of suitable size, shape,
and chemical composition to accommodate drug-like molecules.
The identification and characterization of binding pockets involve a detailed analysis of their key properties, including
volume, hydrophobicity, and the presence of polar residues. The volume of a binding pocket dictates the size of the
ligands that can be accommodated, with larger pockets able to bind larger or more complex molecules, offering more
points of interaction. However, excessively large pockets can sometimes be less selective, leading to off-target effects.
Hydrophobic regions within the binding pocket interact with non-polar parts of drug molecules through van der Waals
forces and hydrophobic interactions, crucial for the binding stability of many drugs, particularly those targeting
intracellular proteins where the environment is less aqueous. Polar residues within the pocket can form hydrogen bonds
and ionic interactions with the drug, which are often key determinants of binding affinity and specificity. The distribution
and accessibility of these polar residues are carefully analyzed to optimize drug design.
Another critical aspect of binding pockets is their dynamic behavior and flexibility. Binding pockets are not always static;
they can undergo conformational changes upon ligand binding. This dynamic behavior, known as induced fit, allows the
pocket to better accommodate different ligands, enhancing binding affinity and specificity. Molecular dynamics
simulations are particularly useful in studying these conformational changes, providing insights into how flexible pockets
can adapt to various drug molecules. Understanding this flexibility is essential for designing drugs that can bind
effectively even as the protein changes shape.
The balance between hydrophobic and hydrophilic areas within the pocket also influences the type of ligands that can
bind effectively. Hydrophobic pockets are better suited for non-polar ligands, while hydrophilic or polar pockets favor
ligands that can form hydrogen bonds and ionic interactions. The density and distribution of alpha spheres, geometric
constructs used to model the cavities within binding pockets, help in understanding the compactness and accessibility
of the pocket. A high alpha sphere density typically indicates a well-defined pocket with the potential for strong ligand
interactions. Additionally, the surface area of the pocket, particularly the solvent-accessible surface area (SASA), is
crucial as it indicates how much of the pocket is exposed and available for binding, providing further insights into the
druggability of the target.
Before investing a lot of time and money into developing a new drug, scientists want to ensure that the target they are
aiming at has a good chance of responding to a drug. If a target is druggable, it means there's a better chance that a
drug can bind to it, affect its function, and ultimately help treat the disease. This helps prioritize targets that are more
likely to lead to successful drug development.
Despite decades of experimental investigation in the drug discovery domain, an overall failure rate of about 96% has been
recorded in drug development, due to the "undruggability" of various identified disease targets and other challenges.
Druggability assessment of a target protein is crucial for several reasons:
1. Prioritizing "Druggable" Pockets: Not all regions of a promising target protein are suitable for drug binding.
Druggability assessment tools, such as fpocket or SiteMap, are employed to identify pockets on the protein surface
that are amenable to drug interaction. These pockets should be accessible, possess favorable physicochemical
properties (such as hydrophobicity and the presence of hydrogen bond donors/acceptors), and ideally, should not be
essential for the protein's normal function to avoid potential side effects if the pocket is targeted by a drug.
2. Reducing Risks of Off-Target Effects: Off-target effects occur when a drug interacts with unintended proteins,
leading to adverse side effects. By conducting a thorough druggability assessment, researchers can identify target
proteins with minimal risk of such interactions. This involves analyzing the protein’s structure and sequence to
detect potential promiscuous binding sites that might interact with a broad range of molecules, thereby increasing
the risk of off-target effects.
3. Predicting Potential Safety Issues: Druggability assessment can also highlight potential safety concerns related
to targeting specific proteins. For instance, if the target protein shares significant structural or sequence similarity
with proteins involved in critical biological processes, inhibiting it could lead to unintended consequences. This
consideration is essential to avoid disrupting essential functions that could lead to toxicity or other adverse effects.
Sequence-based methods analyze the amino acid sequence of a protein to predict its potential as a drug target. The
sequence of amino acids determines the protein's function and can reveal conserved regions that may form binding
pockets for drugs. This method also provides insights into essential physicochemical properties such as solvent
accessibility, hydrophobicity, charge, and polarity.
Sequence-based assessments are often used in machine-learning algorithms to predict druggability. They can help
identify functional domains within a protein. However, relying solely on sequence data can be limiting, as it does not
provide a full picture of the protein's structure or how accessible these domains are to drug molecules.
Examples include CHEMBL, LncRNA2Target, and MiRBase, which are databases and tools that aid in predicting
druggability based on protein sequences.
Structure-based methods examine the 3D structure of a protein to identify and evaluate potential drug-binding pockets.
For a small drug-like molecule to effectively bind, the protein must have a pocket that is appropriately sized, with a
deep hydrophobic cavity to encapsulate the drug. Large, exposed polar sites are generally less druggable compared to
smaller, more hydrophobic pockets.
This method provides a more detailed and reliable prediction compared to sequence-based methods, as it considers the
physical characteristics of the protein's binding sites. The identified pockets are then compared against a reference set
of known biological targets to assess their druggability.
Notable tools include DOGSiteScorer, Metapocket, Fpocket, PockDrug Server, SiteMap, and Open Targets, which help
identify and analyze potential drug-binding pockets.
Ligand-based methods focus on the likelihood that a protein can bind to known drug-like molecules, called ligands. By
examining endogenous compounds and their interactions with the target, researchers can predict how well a new drug
might interact with the protein.
This approach leverages existing data on ligands and their binding capabilities, which can provide valuable insights
when predicting the druggability of new targets.
Examples include BindingDB, PubChem, SwissTargetPrediction and TargetHunter, which are comprehensive databases
of ligand-protein interactions.
This method relies on historical data of proteins that have already been successfully targeted by drugs. If a similar
protein has proven to be druggable in the past, it is more likely that a new, related target will also be druggable.
Precedence-based methods offer the highest confidence in druggability predictions because they are based on proven,
established targets. However, while this method provides a strong basis for predicting success, it does not guarantee
that new drugs targeting similar proteins will succeed.
Databases such as DrugBank, ClinicalTrials.gov, and DrugCentral store detailed information on existing drug targets and
compounds currently undergoing clinical trials.
The ML methods are categorized into supervised and unsupervised learning techniques, each serving different purposes
within the drug discovery pipeline.
Supervised learning focuses on tasks where the outcome is known and involves models like decision trees,
random forests, SVMs, and Bayesian networks for tasks ranging from disease-druggability prediction to target
identification.
Unsupervised Learning involves clustering techniques like K-Means, hierarchical clustering, and HMMs for tasks like
molecular design and feature selection.
Note: For information on various ML models, please refer to the ML resources provided at the end of this tutorial. In this
section, we will concentrate on the application of ML approaches specifically in druggability assessment.
Fig 2 illustrates how various AI and machine learning (ML) techniques are applied in the assessment of druggability.
In Supervised Learning, models like Decision Trees and Random Forest are used to predict the druggability of a target,
such as a protein involved in a disease. These models can also predict the disease-drug response, which helps
determine how well a drug might work in treating a specific disease. Classification methods, like Nearest Neighbour and
SVM (Support Vector Machine), are used for drug target association, identifying which drugs are likely to interact with
which targets. NLP (Natural Language Processing) helps analyze vast amounts of scientific literature to uncover
potential drug targets, while Bayesian Networks assist in target identification, pinpointing which proteins or molecules
are the best candidates for drug development.
Unsupervised Learning focuses on Clustering techniques, such as K-Means and Hierarchical Clustering, which group
similar molecules or biological data together. This is crucial for molecular designing, where scientists design new drugs
based on the properties of these clusters. Hidden Markov Models (HMM) are used for feature selection, identifying the
most important characteristics of a protein that determine its druggability.
Sequence-based assessment:
Supervised Learning: Nearest Neighbour, SVM, and Random Forest can be used to predict the druggability of a
target based on its sequence features (like amino acid composition and conserved regions). NLP techniques can
process and extract relevant information from genetic databases or literature to predict potential binding sites
based on sequence data.
Unsupervised Learning: Hierarchical Clustering and K-Means can group protein sequences into clusters based on
their similarities, which helps in identifying conserved regions or sequence motifs linked to druggability.
Structure-based assessment:
Supervised Learning: Decision Trees and Bayesian Networks can predict whether specific structural features (like
the size and hydrophobicity of binding pockets) make a protein druggable. Random Forest models can aggregate
predictions about various structural features to give a more accurate overall assessment.
Unsupervised Learning: Hidden Markov Models (HMMs) can model protein structural dynamics and predict how likely
a given binding site is to interact with a drug. K-Means clustering can identify common structural features across
different proteins that correlate with druggability.
Ligand-based assessment:
Supervised Learning: SVM and NLP can be used to predict drug-target associations by analyzing known ligands and
their binding affinities with various proteins. Random Forest models can improve the accuracy of these predictions
by considering multiple ligand features simultaneously.
Unsupervised Learning: K-Means and Hierarchical Clustering can group ligands based on their chemical properties,
aiding in the identification of new ligand-based druggable targets.
Precedence-based assessment:
Supervised Learning: Bayesian Networks and Decision Trees can help in predicting the success of new targets by
analyzing previous data on established drug targets and their associated compounds. Random Forests can combine
different data points from established targets to predict the druggability of new, similar targets.
Unsupervised Learning: HMMs can be used to model the progression of drug development for targets with existing
precedents, helping in feature selection and identifying key characteristics of successful targets.
Our dataset consists of proteins paired with their corresponding druggability labels - 'highly druggable' and 'less
druggable'. To identify the druggable pockets within these proteins, we’ll utilize Fpocket, a structure-based druggability
assessment tool. Fpocket excels at identifying and characterizing pockets on the surface of proteins, which are potential
binding sites for small molecules. These pockets are key regions where drug molecules might interact with the protein
to exert a therapeutic effect.
1. Identifying Druggable Pockets: Using Fpocket, we will analyze the protein structures to find pockets that might
serve as effective binding sites for small molecules. Fpocket evaluates various characteristics of these pockets, such
as size, shape, depth, and hydrophobicity, which are crucial indicators of druggability.
2. Training the Random Forest Model: Once we have characterized the pockets, the Random Forest model will be
trained on these features along with the corresponding druggability labels. Random Forest is an ensemble machine
learning algorithm that combines many decision trees to obtain better predictive performance than any single tree
could achieve alone. It constructs a multitude of decision trees during training and outputs either the mode of the
classes (for classification) or the mean prediction (for regression) of the individual trees. Aggregating the results of
many trees improves accuracy and robustness, reduces the risk of overfitting, and handles complex data more
reliably. The model will learn to distinguish between 'highly druggable' and 'less druggable' pockets based on the
patterns it identifies in the training data. After training, the model can classify new protein pockets as either highly
druggable or less druggable with a high degree of accuracy.
While this model will provide valuable insights into whether a protein pocket is likely to be druggable, it’s crucial to
remember that druggability is just one aspect of a protein’s potential as a drug target. Other important factors to
consider include:
Biological Relevance: The role of the protein in disease processes and whether modulating this protein will have a
therapeutic effect.
Feasibility: Practical considerations such as how easily a drug can reach the target protein in the body, and whether
the protein is expressed in the right tissues at the right levels.
Off-target effects: The potential for off-target effects and toxicity, which could arise if the protein is similar to other
essential proteins in the body.
Building the Dataset
In this tutorial we'll use the NRLD dataset, which has been widely used to study druggability. It is a comprehensive,
nonredundant data set containing crystal structures of 71 highly druggable and 44 less druggable proteins, compiled by
literature search and data mining, and published in the paper: DrugPred: A Structure-Based Approach To Predict Protein
Druggability Developed Using an Extensive Nonredundant Data Set.
The authors have only published the list of PDB codes along with the labels, so we'll first fetch the protein structures
using Biopython. The labels are 'D' for highly druggable and 'N' for less druggable protein targets.
You can also use your own dataset if you have the labels. To obtain the structures of the proteins in your dataset, you
can refer to the DeepChem tutorial: Protein Structure Prediction with ESMFold.
Collecting biopython
Downloading biopython-1.84-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from biopython) (1.25.2)
Downloading biopython-1.84-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.2 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.2/3.2 MB 23.9 MB/s eta 0:00:00
Installing collected packages: biopython
Successfully installed biopython-1.84
proteins_list = ['1pwm', '1lox', '3etr', '3f1q', '3ia4', '2cl5', '1uou', '1t46', '1unl', '1q41', '2i1m', '1pmn', '1fk
labels = ['D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D',
import os
from Bio.PDB import PDBList

def fetch_protein_structure(pdb_code, save_dir):
    """
    Fetch a protein structure from the PDB and save it locally.

    Parameters:
    pdb_code (str): The PDB code of the protein structure to fetch.
    save_dir (str): The directory where the PDB file will be saved.
    Returns:
    dict: A dictionary with the PDB code as the key and the structure as the value.
    """
    try:
        if not os.path.exists(save_dir):
            os.makedirs(save_dir)
        pdbl = PDBList()
        # Retrieve the PDB file and save it with a .pdb extension
        pdb_file_path = pdbl.retrieve_pdb_file(pdb_code, pdir=save_dir, file_format='pdb')
        new_pdb_file_path = os.path.join(save_dir, f"{pdb_code}.pdb")
        os.rename(pdb_file_path, new_pdb_file_path)
        # The success-path return was truncated in the original cell; returning the
        # saved file path keeps the documented dictionary structure.
        return {pdb_code: new_pdb_file_path}
    except Exception as e:
        print(f'Error fetching structure for PDB code {pdb_code}: {e}')
        return {pdb_code: None}
save_directory = '/content/pdb_files/'
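The download loop itself is not shown above; a minimal sketch using the fetch_protein_structure helper and the proteins_list defined earlier could be:
for pdb_code in proteins_list:
    # Download each structure into the shared save directory
    fetch_protein_structure(pdb_code, save_directory)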
Run Fpocket for each protein and save the output in the output directory.
import subprocess

def run_fpocket(pdb_file_path):
    """
    Runs fpocket on the given PDB file to find binding pockets.

    Parameters:
    pdb_file_path (str): The path to the PDB file.
    Returns:
    str: The path to the fpocket output directory.
    """
    try:
        # Run fpocket
        command = ["bin/fpocket", "-f", pdb_file_path]
        subprocess.run(command, check=True)
        # fpocket writes its results to a <name>_out directory next to the input file
        return pdb_file_path.replace('.pdb', '_out')
    except subprocess.CalledProcessError as e:
        print(f'Error running fpocket on {pdb_file_path}: {e}')
        return None
Specify the base directory where the pdb files are stored
base_pdb_files = '/content/pdb_files/'
Run Fpocket
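The loop that invokes Fpocket on every structure is not shown; a minimal sketch, assuming each structure was saved as <pdb code>.pdb under base_pdb_files, could be:
import os

for pdb_code in proteins_list:
    pdb_file_path = os.path.join(base_pdb_files, f"{pdb_code}.pdb")
    # Each call writes a <pdb code>_out/ directory next to the input file
    run_fpocket(pdb_file_path)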
fpocket_info_dir = '/content/pdb_files/1ajs_out/1ajs_info.txt'
with open(fpocket_info_dir, 'r') as file:
for line in file:
print(line.strip())
Pocket 1 :
Score : 0.924
Druggability Score : 0.535
Number of Alpha Spheres : 95
Total SASA : 9.960
Polar SASA : 7.544
Apolar SASA : 2.415
Volume : 493.527
Mean local hydrophobic density : 21.440
Mean alpha sphere radius : 3.774
Mean alp. sph. solvent access : 0.449
Apolar alpha sphere proportion : 0.263
Hydrophobicity score: 26.957
Volume score: 4.348
Polarity score: 14
Charge score : 4
Proportion of polar atoms: 49.020
Alpha sphere density : 5.127
Cent. of mass - Alpha Sphere max dist: 11.895
Flexibility : 0.031
Pocket 2 :
Score : 0.323
Druggability Score : 0.615
Number of Alpha Spheres : 83
Total SASA : 157.049
Polar SASA : 99.083
Apolar SASA : 57.966
Volume : 479.790
Mean local hydrophobic density : 15.053
Mean alpha sphere radius : 3.795
Mean alp. sph. solvent access : 0.423
Apolar alpha sphere proportion : 0.229
Hydrophobicity score: 21.000
Volume score: 4.100
Polarity score: 6
Charge score : 1
Proportion of polar atoms: 54.167
Alpha sphere density : 5.868
Cent. of mass - Alpha Sphere max dist: 17.292
Flexibility : 0.090
Pocket 3 :
Score : 0.308
Druggability Score : 0.848
Number of Alpha Spheres : 67
Total SASA : 172.894
Polar SASA : 69.038
Apolar SASA : 103.856
Volume : 695.777
Mean local hydrophobic density : 23.312
Mean alpha sphere radius : 3.936
Mean alp. sph. solvent access : 0.537
Apolar alpha sphere proportion : 0.478
Hydrophobicity score: 29.350
Volume score: 4.450
Polarity score: 9
Charge score : 0
Proportion of polar atoms: 37.736
Alpha sphere density : 6.774
Cent. of mass - Alpha Sphere max dist: 19.126
Flexibility : 0.087
Pocket 4 :
Score : 0.170
Druggability Score : 0.001
Number of Alpha Spheres : 37
Total SASA : 77.069
Polar SASA : 32.387
Apolar SASA : 44.682
Volume : 291.443
Mean local hydrophobic density : 5.000
Mean alpha sphere radius : 3.776
Mean alp. sph. solvent access : 0.544
Apolar alpha sphere proportion : 0.162
Hydrophobicity score: 29.100
Volume score: 3.700
Polarity score: 4
Charge score : 2
Proportion of polar atoms: 44.444
Alpha sphere density : 3.532
Cent. of mass - Alpha Sphere max dist: 9.721
Flexibility : 0.067
Pocket 5 :
Score : 0.124
Druggability Score : 0.016
Number of Alpha Spheres : 39
Total SASA : 77.117
Polar SASA : 31.227
Apolar SASA : 45.890
Volume : 303.785
Mean local hydrophobic density : 16.000
Mean alpha sphere radius : 4.001
Mean alp. sph. solvent access : 0.587
Apolar alpha sphere proportion : 0.436
Hydrophobicity score: 45.154
Volume score: 5.000
Polarity score: 4
Charge score : 1
Proportion of polar atoms: 42.857
Alpha sphere density : 2.914
Cent. of mass - Alpha Sphere max dist: 6.045
Flexibility : 0.084
Pocket 6 :
Score : 0.116
Druggability Score : 0.002
Number of Alpha Spheres : 16
Total SASA : 66.472
Polar SASA : 19.375
Apolar SASA : 47.097
Volume : 273.670
Mean local hydrophobic density : 9.000
Mean alpha sphere radius : 3.896
Mean alp. sph. solvent access : 0.607
Apolar alpha sphere proportion : 0.625
Hydrophobicity score: 32.429
Volume score: 3.714
Polarity score: 3
Charge score : -1
Proportion of polar atoms: 30.000
Alpha sphere density : 3.763
Cent. of mass - Alpha Sphere max dist: 6.233
Flexibility : 0.132
Pocket 7 :
Score : 0.100
Druggability Score : 0.006
Number of Alpha Spheres : 23
Total SASA : 55.512
Polar SASA : 20.490
Apolar SASA : 35.021
Volume : 180.286
Mean local hydrophobic density : 9.000
Mean alpha sphere radius : 3.700
Mean alp. sph. solvent access : 0.507
Apolar alpha sphere proportion : 0.435
Hydrophobicity score: 18.714
Volume score: 4.857
Polarity score: 5
Charge score : 1
Proportion of polar atoms: 38.889
Alpha sphere density : 2.712
Cent. of mass - Alpha Sphere max dist: 7.423
Flexibility : 0.082
Pocket 8 :
Score : 0.096
Druggability Score : 0.040
Number of Alpha Spheres : 46
Total SASA : 120.073
Polar SASA : 45.200
Apolar SASA : 74.873
Volume : 356.338
Mean local hydrophobic density : 15.889
Mean alpha sphere radius : 3.856
Mean alp. sph. solvent access : 0.467
Apolar alpha sphere proportion : 0.391
Hydrophobicity score: 40.909
Volume score: 4.545
Polarity score: 3
Charge score : 1
Proportion of polar atoms: 35.714
Alpha sphere density : 4.130
Cent. of mass - Alpha Sphere max dist: 9.926
Flexibility : 0.114
Pocket 9 :
Score : 0.094
Druggability Score : 0.626
Number of Alpha Spheres : 59
Total SASA : 149.738
Polar SASA : 50.713
Apolar SASA : 99.026
Volume : 515.888
Mean local hydrophobic density : 30.188
Mean alpha sphere radius : 3.879
Mean alp. sph. solvent access : 0.490
Apolar alpha sphere proportion : 0.542
Hydrophobicity score: 9.812
Volume score: 4.375
Polarity score: 8
Charge score : 2
Proportion of polar atoms: 41.667
Alpha sphere density : 4.421
Cent. of mass - Alpha Sphere max dist: 10.792
Flexibility : 0.146
Pocket 10 :
Score : 0.082
Druggability Score : 0.001
Number of Alpha Spheres : 23
Total SASA : 53.189
Polar SASA : 30.244
Apolar SASA : 22.945
Volume : 215.752
Mean local hydrophobic density : 2.000
Mean alpha sphere radius : 3.912
Mean alp. sph. solvent access : 0.482
Apolar alpha sphere proportion : 0.130
Hydrophobicity score: 18.091
Volume score: 4.182
Polarity score: 5
Charge score : -1
Proportion of polar atoms: 50.000
Alpha sphere density : 2.367
Cent. of mass - Alpha Sphere max dist: 5.442
Flexibility : 0.070
Pocket 11 :
Score : 0.067
Druggability Score : 0.005
Number of Alpha Spheres : 33
Total SASA : 98.192
Polar SASA : 49.907
Apolar SASA : 48.285
Volume : 230.851
Mean local hydrophobic density : 9.000
Mean alpha sphere radius : 3.640
Mean alp. sph. solvent access : 0.400
Apolar alpha sphere proportion : 0.303
Hydrophobicity score: 20.091
Volume score: 3.818
Polarity score: 5
Charge score : 2
Proportion of polar atoms: 50.000
Alpha sphere density : 3.626
Cent. of mass - Alpha Sphere max dist: 8.639
Flexibility : 0.149
Pocket 12 :
Score : 0.066
Druggability Score : 0.000
Number of Alpha Spheres : 18
Total SASA : 70.906
Polar SASA : 40.715
Apolar SASA : 30.191
Volume : 276.108
Mean local hydrophobic density : 3.000
Mean alpha sphere radius : 3.912
Mean alp. sph. solvent access : 0.764
Apolar alpha sphere proportion : 0.222
Hydrophobicity score: -5.625
Volume score: 4.000
Polarity score: 6
Charge score : -2
Proportion of polar atoms: 42.857
Alpha sphere density : 3.259
Cent. of mass - Alpha Sphere max dist: 6.493
Flexibility : 0.248
Pocket 13 :
Score : 0.061
Druggability Score : 0.000
Number of Alpha Spheres : 16
Total SASA : 53.881
Polar SASA : 32.144
Apolar SASA : 21.737
Volume : 207.140
Mean local hydrophobic density : 0.000
Mean alpha sphere radius : 3.859
Mean alp. sph. solvent access : 0.483
Apolar alpha sphere proportion : 0.000
Hydrophobicity score: 6.857
Volume score: 3.714
Polarity score: 4
Charge score : 0
Proportion of polar atoms: 46.667
Alpha sphere density : 2.604
Cent. of mass - Alpha Sphere max dist: 5.709
Flexibility : 0.040
Pocket 14 :
Score : 0.052
Druggability Score : 0.056
Number of Alpha Spheres : 34
Total SASA : 106.933
Polar SASA : 52.590
Apolar SASA : 54.343
Volume : 337.196
Mean local hydrophobic density : 19.000
Mean alpha sphere radius : 3.914
Mean alp. sph. solvent access : 0.594
Apolar alpha sphere proportion : 0.588
Hydrophobicity score: 16.182
Volume score: 4.182
Polarity score: 5
Charge score : 0
Proportion of polar atoms: 40.000
Alpha sphere density : 3.739
Cent. of mass - Alpha Sphere max dist: 9.922
Flexibility : 0.107
Pocket 15 :
Score : 0.049
Druggability Score : 0.001
Number of Alpha Spheres : 30
Total SASA : 105.156
Polar SASA : 68.927
Apolar SASA : 36.229
Volume : 305.590
Mean local hydrophobic density : 8.000
Mean alpha sphere radius : 3.764
Mean alp. sph. solvent access : 0.478
Apolar alpha sphere proportion : 0.300
Hydrophobicity score: 32.600
Volume score: 4.000
Polarity score: 6
Charge score : -1
Proportion of polar atoms: 44.444
Alpha sphere density : 3.793
Cent. of mass - Alpha Sphere max dist: 8.473
Flexibility : 0.126
Pocket 16 :
Score : 0.038
Druggability Score : 0.002
Number of Alpha Spheres : 26
Total SASA : 87.662
Polar SASA : 35.734
Apolar SASA : 51.928
Volume : 386.685
Mean local hydrophobic density : 6.000
Mean alpha sphere radius : 3.878
Mean alp. sph. solvent access : 0.578
Apolar alpha sphere proportion : 0.269
Hydrophobicity score: 34.875
Volume score: 4.875
Polarity score: 5
Charge score : 2
Proportion of polar atoms: 40.909
Alpha sphere density : 3.155
Cent. of mass - Alpha Sphere max dist: 8.944
Flexibility : 0.084
Pocket 17 :
Score : 0.027
Druggability Score : 0.005
Number of Alpha Spheres : 22
Total SASA : 93.222
Polar SASA : 40.087
Apolar SASA : 53.136
Volume : 316.036
Mean local hydrophobic density : 12.000
Mean alpha sphere radius : 3.913
Mean alp. sph. solvent access : 0.541
Apolar alpha sphere proportion : 0.591
Hydrophobicity score: 20.750
Volume score: 4.500
Polarity score: 6
Charge score : 0
Proportion of polar atoms: 47.619
Alpha sphere density : 3.575
Cent. of mass - Alpha Sphere max dist: 8.418
Flexibility : 0.199
Pocket 18 :
Score : 0.023
Druggability Score : 0.005
Number of Alpha Spheres : 24
Total SASA : 64.654
Polar SASA : 23.594
Apolar SASA : 41.059
Volume : 215.579
Mean local hydrophobic density : 10.000
Mean alpha sphere radius : 3.658
Mean alp. sph. solvent access : 0.416
Apolar alpha sphere proportion : 0.458
Hydrophobicity score: 50.889
Volume score: 3.889
Polarity score: 2
Charge score : 1
Proportion of polar atoms: 43.750
Alpha sphere density : 2.216
Cent. of mass - Alpha Sphere max dist: 6.211
Flexibility : 0.081
Pocket 19 :
Score : 0.004
Druggability Score : 0.000
Number of Alpha Spheres : 34
Total SASA : 97.832
Polar SASA : 50.735
Apolar SASA : 47.098
Volume : 258.882
Mean local hydrophobic density : 4.000
Mean alpha sphere radius : 3.867
Mean alp. sph. solvent access : 0.559
Apolar alpha sphere proportion : 0.147
Hydrophobicity score: 33.889
Volume score: 3.333
Polarity score: 3
Charge score : 1
Proportion of polar atoms: 45.455
Alpha sphere density : 2.835
Cent. of mass - Alpha Sphere max dist: 7.099
Flexibility : 0.045
Pocket 20 :
Score : -0.001
Druggability Score : 0.001
Number of Alpha Spheres : 21
Total SASA : 90.847
Polar SASA : 30.466
Apolar SASA : 60.381
Volume : 299.652
Mean local hydrophobic density : 9.000
Mean alpha sphere radius : 4.087
Mean alp. sph. solvent access : 0.603
Apolar alpha sphere proportion : 0.476
Hydrophobicity score: 25.750
Volume score: 4.125
Polarity score: 4
Charge score : 0
Proportion of polar atoms: 40.000
Alpha sphere density : 3.251
Cent. of mass - Alpha Sphere max dist: 7.450
Flexibility : 0.080
Pocket 21 :
Score : -0.002
Druggability Score : 0.000
Number of Alpha Spheres : 23
Total SASA : 114.955
Polar SASA : 89.595
Apolar SASA : 25.360
Volume : 358.544
Mean local hydrophobic density : 0.000
Mean alpha sphere radius : 3.920
Mean alp. sph. solvent access : 0.533
Apolar alpha sphere proportion : 0.000
Hydrophobicity score: -5.375
Volume score: 4.375
Polarity score: 7
Charge score : 2
Proportion of polar atoms: 72.727
Alpha sphere density : 4.083
Cent. of mass - Alpha Sphere max dist: 8.764
Flexibility : 0.137
Pocket 22 :
Score : -0.013
Druggability Score : 0.000
Number of Alpha Spheres : 18
Total SASA : 89.372
Polar SASA : 53.143
Apolar SASA : 36.229
Volume : 317.468
Mean local hydrophobic density : 3.000
Mean alpha sphere radius : 4.007
Mean alp. sph. solvent access : 0.567
Apolar alpha sphere proportion : 0.222
Hydrophobicity score: 26.444
Volume score: 4.000
Polarity score: 4
Charge score : 1
Proportion of polar atoms: 42.105
Alpha sphere density : 3.200
Cent. of mass - Alpha Sphere max dist: 8.017
Flexibility : 0.158
Pocket 23 :
Score : -0.015
Druggability Score : 0.000
Number of Alpha Spheres : 19
Total SASA : 98.373
Polar SASA : 62.145
Apolar SASA : 36.229
Volume : 333.904
Mean local hydrophobic density : 0.000
Mean alpha sphere radius : 4.018
Mean alp. sph. solvent access : 0.736
Apolar alpha sphere proportion : 0.053
Hydrophobicity score: 0.143
Volume score: 4.143
Polarity score: 6
Charge score : -2
Proportion of polar atoms: 50.000
Alpha sphere density : 3.504
Cent. of mass - Alpha Sphere max dist: 6.972
Flexibility : 0.329
Pocket 24 :
Score : -0.015
Druggability Score : 0.035
Number of Alpha Spheres : 36
Total SASA : 105.121
Polar SASA : 50.956
Apolar SASA : 54.165
Volume : 314.308
Mean local hydrophobic density : 20.000
Mean alpha sphere radius : 3.935
Mean alp. sph. solvent access : 0.440
Apolar alpha sphere proportion : 0.583
Hydrophobicity score: 25.700
Volume score: 4.000
Polarity score: 4
Charge score : 2
Proportion of polar atoms: 41.667
Alpha sphere density : 2.846
Cent. of mass - Alpha Sphere max dist: 8.157
Flexibility : 0.076
Pocket 25 :
Score : -0.023
Druggability Score : 0.005
Number of Alpha Spheres : 26
Total SASA : 93.656
Polar SASA : 35.690
Apolar SASA : 57.966
Volume : 305.760
Mean local hydrophobic density : 11.000
Mean alpha sphere radius : 3.949
Mean alp. sph. solvent access : 0.631
Apolar alpha sphere proportion : 0.462
Hydrophobicity score: 24.429
Volume score: 4.571
Polarity score: 4
Charge score : 1
Proportion of polar atoms: 34.783
Alpha sphere density : 2.826
Cent. of mass - Alpha Sphere max dist: 7.508
Flexibility : 0.129
Pocket 26 :
Score : -0.040
Druggability Score : 0.020
Number of Alpha Spheres : 55
Total SASA : 197.532
Polar SASA : 47.786
Apolar SASA : 149.746
Volume : 676.425
Mean local hydrophobic density : 13.375
Mean alpha sphere radius : 4.011
Mean alp. sph. solvent access : 0.557
Apolar alpha sphere proportion : 0.291
Hydrophobicity score: 14.467
Volume score: 3.800
Polarity score: 8
Charge score : -2
Proportion of polar atoms: 34.146
Alpha sphere density : 4.940
Cent. of mass - Alpha Sphere max dist: 11.982
Flexibility : 0.043
Pocket 27 :
Score : -0.089
Druggability Score : 0.000
Number of Alpha Spheres : 22
Total SASA : 121.997
Polar SASA : 68.861
Apolar SASA : 53.136
Volume : 404.828
Mean local hydrophobic density : 4.000
Mean alpha sphere radius : 4.034
Mean alp. sph. solvent access : 0.488
Apolar alpha sphere proportion : 0.227
Hydrophobicity score: -1.300
Volume score: 3.700
Polarity score: 8
Charge score : 0
Proportion of polar atoms: 52.381
Alpha sphere density : 3.501
Cent. of mass - Alpha Sphere max dist: 7.081
Flexibility : 0.056
Pocket 28 :
Score : -0.116
Druggability Score : 0.001
Number of Alpha Spheres : 16
Total SASA : 81.657
Polar SASA : 32.144
Apolar SASA : 49.513
Volume : 263.256
Mean local hydrophobic density : 7.000
Mean alpha sphere radius : 4.122
Mean alp. sph. solvent access : 0.557
Apolar alpha sphere proportion : 0.500
Hydrophobicity score: -11.833
Volume score: 5.333
Polarity score: 5
Charge score : 3
Proportion of polar atoms: 28.571
Alpha sphere density : 1.896
Cent. of mass - Alpha Sphere max dist: 5.439
Flexibility : 0.071
Now, let's find the pocket with the highest druggability score in each of the target proteins and store its features in a
DataFrame for training a model in the next step.
import re
import pandas as pd

def identify_most_druggable_pocket(pocket_df):
    # Find the pocket with the highest druggability score
    pocket_df['Druggability Score'] = pocket_df['Druggability Score'].astype(float)
    best_pocket_df = pocket_df.loc[pocket_df['Druggability Score'].idxmax()]
    best_pocket_df = pd.DataFrame(best_pocket_df).T
    return best_pocket_df

def extract_features(pocket_info):
    pocket_data = []
    # Read the file content line by line
    with open(pocket_info, 'r') as file:
        current_pocket_info = {}
        for line in file:
            if "Pocket" in line:
                if current_pocket_info:
                    pocket_data.append(current_pocket_info)
                current_pocket_info = {'Pocket': line.strip()}
            else:
                if ':' in line:
                    key, value = line.split(':')
                    current_pocket_info[key.strip()] = value.strip()
        # Append the last pocket once the end of the file is reached
        if current_pocket_info:
            pocket_data.append(current_pocket_info)
    # Return the per-pocket features as a DataFrame
    return pd.DataFrame(pocket_data)
pocket_dataset = {}
for protein in proteins_list:
pocket_info = f'{base_pdb_files}{protein}_out/{protein}_info.txt'
pocket_df = extract_features(pocket_info)
best_pocket_df = identify_most_druggable_pocket(pocket_df)
pocket_dataset[protein] = best_pocket_df
# Combine all dataframes into one, with the pdb code added as a new column
dataset_fpocket = pd.concat(pocket_dataset.values(), keys=pocket_dataset.keys()).reset_index(level=0).rename(columns={'level_0': 'Key'})
dataset_fpocket.rename(columns={'Key': 'pdb code'}, inplace=True)
# Set the 'pdb code' column as the index so the Fpocket features can be aligned with
# the initial list of pdb codes and druggability labels
dataset_fpocket.set_index('pdb code', inplace=True)
import pandas as pd
from sklearn.metrics import classification_report, accuracy_score, roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
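The original training cell is not reproduced here. The sketch below shows one way the pieces could fit together, assuming the Fpocket features in dataset_fpocket are joined with the proteins_list/labels pair defined earlier; the exact feature columns, split, and hyperparameters are assumptions, not the original code.
# Build a labels DataFrame indexed by pdb code and join it with the pocket features
labels_df = pd.DataFrame({'pdb code': proteins_list, 'label': labels}).set_index('pdb code')
full_df = dataset_fpocket.join(labels_df)

# Use only the numeric pocket descriptors as inputs (drop the 'Pocket' name column)
X = full_df.drop(columns=['Pocket', 'label']).astype(float)

# Encode 'D' (highly druggable) / 'N' (less druggable) as integers
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(full_df['label'])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Classification Report:')
print(classification_report(y_test, y_pred))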
Accuracy: 0.8695652173913043
Classification Report:
precision recall f1-score support
accuracy 0.87 23
macro avg 0.83 0.86 0.84 23
weighted avg 0.88 0.87 0.87 23
The model achieved an accuracy of 86.96%. The precision, recall, and F1-score were 0.94, 0.88, and 0.91 for the highly
druggable class, and 0.71, 0.83, and 0.77 for the less druggable class, indicating strong performance overall but
slightly lower precision for the less druggable class.
Let's find out if your protein is highly druggable or less druggable by following these steps. Assuming you already have
the 3D structure of the protein in .pdb file format, we'll use the ML model to classify the protein pockets effectively.
In case you do not have the 3D structure of your protein, don't worry! You have two options:
Fetch the Structure Using PDB Code: You can use the fetch_protein_structure method defined above to fetch
the structure using the PDB code of the protein.
Predict the Structure Using ESMfold: If you have the protein sequence, you can use ESMfold to obtain the structure
of the protein. For more information, you can refer to the DeepChem tutorial: Protein Structure Prediction with
ESMFold.
By following these steps, you can systematically determine the druggability of a protein pocket, combining advanced
computational tools like Fpocket with the predictive power of machine learning. This process provides a robust method
for identifying promising drug targets.
fetch_protein_structure('1mbn', '/content/1mbn/')
run_fpocket('/content/1mbn/1mbn.pdb')
output_dir = '/content/1mbn/'
pdb_code = '1mbn'
target_pocket_info = f'{output_dir}{pdb_code}_out/{pdb_code}_info.txt'
target_pocket_df = extract_features(target_pocket_info)
best_pocket_df = identify_most_druggable_pocket(target_pocket_df)
prediction = label_encoder.inverse_transform(model.predict(best_pocket_df))
print(prediction)
Downloading PDB structure '1mbn'...
['N']
The predicted label is 'N', which means protein '1mbn' is classified as less druggable and is therefore a less promising
drug target.
ML resources:
- Supervised Learning: Regression, Classification
- Unsupervised Learning: Clustering
Feel free to explore these resources as you progress through the notebook to deepen your understanding of the ML
methods used in druggability assessment.
References
1. Hopkins, A. L., & Groom, C. R. (2002). The druggable genome. Nature Reviews Drug Discovery, 1(9), 727-730. DOI:
10.1038/nrd892
2. Yu, L., Xue, L., Liu, F., Li, Y., Jing, R., & Luo, J. (2022). The applications of deep learning algorithms on in silico
druggable proteins identification. Journal of Advanced Research, 41, 219-231. DOI: 10.1016/j.jare.2022.01.009
3. Hajduk, P. J., Huth, J. R., & Tse, C. (2005). Predicting protein druggability. Drug Discovery Today, 10(23-24), 1675-
1682. DOI: 10.1016/S1359-6446(05)03624-2
4. Halgren, T. A. (2009). Identifying and characterizing binding sites and assessing druggability. Journal of Chemical
Information and Modeling, 49(2), 377-389. DOI: 10.1021/ci800324m
5. Peters, J. U. (2013). Polypharmacology–foe or friend? Journal of Medicinal Chemistry, 56(22), 8955-8971. DOI:
10.1021/jm400856t
6. Ashley, E. A. (2016). Towards precision medicine. Nature Reviews Genetics, 17(9), 507-522. DOI:
10.1038/nrg.2016.86
7. Agoni, C., Olotu, F.A., Ramharack, P. et al. Druggability and drug-likeness concepts in drug design: are biomodelling
and predictive tools having their say?. J Mol Model 26, 120 (2020). DOI: 10.1007/s00894-020-04385-6
8. Abi Hussein H, Geneix C, Petitjean M, Borrel A, Flatters D, Camproux AC. Global vision of druggability issues:
applications and perspectives. Drug Discov Today. 22, 404–415 (2017)
9. Arrowsmith, J. Phase III and submission failures: 2007-2010. Nature Reviews Drug Discovery. 10(2), 1-2 (2011)
10. Excelra. (2024). Identifying Druggable Therapeutic Targets: Unveiling Promising Avenues in Drug Discovery. Excelra
White Paper. Retrieved from https://www.excelra.com/whitepaper/identifying-druggable-therapeutic-targets-
unveiling-promising-avenues-in-drug-discovery/
11. Aguti R, Gardini E, Bertazzo M, Decherchi S and Cavalli A (2022) Probabilistic Pocket Druggability Prediction via One-
Class Learning. Front. Pharmacol. 13:870479. DOI: 10.3389/fphar.2022.870479
@manual{Bioinformatics,
title={Druggability Assessment with Fpocket and Machine Learning},
organization={DeepChem},
author={Yadav, Anamika },
howpublished =
{\url{https://github.com/deepchem/deepchem/blob/master/examples/tutorials/Druggablity_Assessment_with_Fpo
year={2024},
}
Protein Deep Learning
by David Ricardo Figueroa Blanco
In this tutorial we will compare protein sequence featurizations such as one-hot encoding and amino acid composition.
We will use some tools from DeepChem and additional packages to create a model to predict the melting temperature of
proteins (a good measurement of protein stability).
The melting temperature (MT) of a protein is a measurement of protein stability. This measurement can vary across a
wide range of experimental conditions; however, curated databases can be found in the literature, e.g.
https://aip.scitation.org/doi/10.1063/1.4947493. This paper provides a great deal of thermodynamic information on
proteins and is therefore a valuable resource for the study of protein stability. Other information related to protein
stability could be the change in Gibbs free energy due to a mutation.
The study of protein stability is important in areas such as protein engineering and biocatalysis because catalytic
efficiency can be directly related to the tertiary structure of the protein under study.
Setup
To run DeepChem within Colab, you'll need to run the following installation commands. This will take about 5 minutes to
run to completion and install your environment. You can of course run this tutorial locally if you prefer. In that case,
don't run these cells since they will download and install Anaconda on your local machine.
Data extraction
In this cell, we download the dataset published in the paper https://aip.scitation.org/doi/10.1063/1.4947493 from the
DeepChem dataset repository
import deepchem as dc
import os
from deepchem.utils import download_url
data_dir = dc.utils.get_data_dir()
download_url("https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/pucci-proteins-appendixtable1.csv", dest_dir=data_dir)
print('Dataset downloaded at {}'.format(data_dir))
dataset_file = os.path.join(data_dir, "pucci-proteins-appendixtable1.csv")
A closer look of the dataset: Contains the PDBid and the respective mutation and change in thermodynamical properties
in each studied protein
import pandas as pd
data = pd.read_csv(dataset_file)
data
      Unnamed: 0     N      PDBid Chain  RESN RESwt RESmut  ΔTmexp  Tmexp [wt]  ΔΔHmexp  ...  ΔΔGexp(T)   T  Nres
2            NaN     3       1aky     A    77   THR    HIS    -1.1        47.6      130  ...        9.0  25   220
3            NaN     4       1aky     A   110   THR    HIS    -4.8        47.6      165  ...       11.0  25   220
4            NaN     5       1aky     A   169   ASN    ASP    -0.6        47.6      140  ...        9.0  25   220
...          ...   ...        ...   ...   ...   ...    ...     ...         ...      ...  ...        ...  ..   ...
1621         NaN  1622  5pti_m52l     A    15   LYS    SER    -1.3        91.7       -5  ...        1.2  25    58
1622         NaN  1623  5pti_m52l     A    15   LYS    THR    -1.1        91.7       -9  ...       -3.6  25    58
1623         NaN  1624  5pti_m52l     A    15   LYS    VAL    -6.3        91.7        4  ...        4.7  25    58
1624         NaN  1625  5pti_m52l     A    15   LYS    TRP    -7.5        91.7       17  ...        8.5  25    58
1625         NaN  1626  5pti_m52l     A    15   LYS    TYR    -6.6        91.7        4  ...        4.6  25    58
Here we extract a small DataFrame that contains only the PDBid and its respective wild-type melting temperature.
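The cell that builds WT_Tm is not shown; a minimal sketch, assuming the wild-type melting temperature column is named 'Tmexp [wt]' as displayed above, could be:
# Keep only the PDB id and the wild-type melting temperature, indexed by PDBid
WT_Tm = data[['PDBid', 'Tmexp [wt]']].set_index('PDBid')
WT_Tm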
           Tmexp [wt]
PDBid
1aky             47.6
1aky             47.6
1aky             47.6
1aky             47.6
1aky             47.6
...               ...
5pti_m52l        91.7
5pti_m52l        91.7
5pti_m52l        91.7
5pti_m52l        91.7
5pti_m52l        91.7
Here we create a dictionary whose keys are the PDBid of each protein and whose values are the wild-type melting
temperature.
dict_WT_TM = {}
for k,v in WT_Tm.itertuples():
if(k not in dict_WT_TM):
dict_WT_TM[k]=float(v)
pdbs = data[data['PDBid'].str.len()<5]
pdbs = pdbs[pdbs['Chain'] == "A"]
pdbs[['RESN','RESwt','RESmut']]
RESN RESwt RESmut
0 8 VAL ILE
1 48 GLN GLU
2 77 THR HIS
This cell extracts the total number of mutations and the changes in MT. In addition, we use a dictionary to convert each
residue mutation to a one-letter code.
alls = []
# Iterate over the selected columns; each iteration yields (column_name, Series).
# (pandas >= 2.0 removed DataFrame.iteritems(); items() is the equivalent.)
for colname, col in pdbs[['RESN','RESwt','RESmut','PDBid','ΔTmexp']].items():
    alls.append(col.values)
d = {'CYS': 'C', 'ASP': 'D', 'SER': 'S', 'GLN': 'Q', 'LYS': 'K',
'ILE': 'I', 'PRO': 'P', 'THR': 'T', 'PHE': 'F', 'ASN': 'N',
'GLY': 'G', 'HIS': 'H', 'LEU': 'L', 'ARG': 'R', 'TRP': 'W',
'ALA': 'A', 'VAL':'V', 'GLU': 'E', 'TYR': 'Y', 'MET': 'M'}
resnum=alls[0]
wt=[d[x.strip()] for x in alls[1]] # extract the Wildtype aminoacid with one letter code
mut=[d[x.strip()] for x in alls[2]] # extract the Mutation aminoacid with one letter code
codes=alls[3] # PDB code
tms=alls[4] # Melting temperature
PDB Download
Here we download all the pdbs by PDBID using the pdbfixer tool
!mkdir PDBs
Using the fixer from pdbfixer, we download each protein from its PDB code and fix some common problems present in
Protein Data Bank files. This process will take around 15 minutes and about 100 MB of disk space. The usage of PDBFixer is
documented at https://htmlpreview.github.io/?https://github.com/openmm/pdbfixer/blob/master/Manual.html . In our case, we
download the PDB file from the PDB code and perform some curation, such as finding nonstandard or missing residues and
adding missing atoms.
import os
import time
# PDBFixer and PDBFile are used in the loop below; in recent OpenMM versions
# PDBFile lives in openmm.app (older installs use simtk.openmm.app)
from pdbfixer import PDBFixer
from openmm.app import PDBFile
t0 = time.time()
downloaded = os.listdir("PDBs")
PDBs_ids= set(pdbs['PDBid'])
pdb_list = []
print("Start Download ")
for pdbid in PDBs_ids:
name=pdbid+".pdb"
if(name in downloaded):
continue
try:
fixer = PDBFixer(pdbid=pdbid)
fixer.findMissingResidues()
fixer.findNonstandardResidues()
fixer.replaceNonstandardResidues()
fixer.removeHeterogens(True)
fixer.findMissingAtoms()
fixer.addMissingAtoms()
PDBFile.writeFile(fixer.topology, fixer.positions, open('./PDBs/%s.pdb' % (pdbid), 'w'),keepIds=True)
except:
print("Problem with {}".format(pdbid))
print("Total Time {}".format(time.time()-t0))
The following function helps us mutate a sequence. Mutations are denoted as A###B, where A is the wild-type amino acid,
### the position, and B the new amino acid.
import re
def MutateSeq(seq,Mutant):
'''
Mutate a sequence based on a string (Mutant) that has the notation :
A###B where A is the wildtype aminoacid ### the position and B the mutation
'''
aalist = re.findall('([A-Z])([0-9]+)([A-Z])', Mutant)
#(len(aalist)==1):
newseq=seq
listseq=list(newseq)
for aas in aalist:
wildAA = aas[0]
pos = int(aas[1]) -1
if(pos >= len(listseq)):
print("Mutation not in the range of the protein")
return None
MutAA = aas[-1]
if(listseq[pos]==wildAA):
listseq[pos]=MutAA
else:
#print("WildType AA does not match")
return None
return("".join(listseq))
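For example, applying a single point mutation in the A###B notation described above:
# 'C2W' mutates position 2 (1-indexed) from C to W
print(MutateSeq("ACDEFG", "C2W"))  # -> AWDEFG
print(MutateSeq("ACDEFG", "W2C"))  # wild-type mismatch -> returns None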
The following function helps us extract the amino acid sequence from a PDB structure.
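The definition of GetSeqFromPDB is not reproduced in this text. A minimal sketch consistent with how it is used below (assuming Biopython's PPBuilder and that the files live in the PDBs/ directory created earlier) could be:
import os
from Bio.PDB import PDBParser
from Bio.PDB.Polypeptide import PPBuilder

def GetSeqFromPDB(pdb_file, pdb_dir="PDBs"):
    # Parse the structure and build its polypeptides; each Polypeptide object
    # exposes get_sequence() and a repr such as '<Polypeptide start=1 end=163>'
    parser = PDBParser(QUIET=True)
    structure = parser.get_structure(pdb_file, os.path.join(pdb_dir, pdb_file))
    return PPBuilder().build_peptides(structure)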
Some examples of the described functions: GetSeqFromPDB takes one of the PDB files that we previously downloaded and
extracts its sequence in one-letter code.
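The example cell itself is not shown; a hedged reconstruction matching the output that follows might be:
# Pick one downloaded structure and extract its sequence in one-letter code
test = '1ezm'
print(test)
seq = str(GetSeqFromPDB(test + ".pdb")[0].get_sequence())
print("Original Sequence")
print(seq)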
1ezm
Original Sequence
AEAGGPGGNQKIGKYTYGSDYGPLIVNDRCEMDDGNVITVDMNSSTDDSKTTPFRFACPTNTYKQVNGAYSPLNDAHFFGGVVFKLYRDWFGTSPLTHKLYMKVHYGRSVEN
AYWDGTAMLFGDGATMFYPLVSLDVAAHEVSHGFTEQNSGLIYRGQSGGMNEAFSDMAGEAAEFYMRGKNDFLIGYDIKKGSGALRYMDQPSRDGRSIDNASQYYNGIDVHH
SSGVYNRAFYLLANSPGWDTRKAFEVFVDANRYYWTATSNYNSGACGVIRSAQNRNYSAADVTRAFSTVGVTCPSAL
informSeq=GetSeqFromPDB(test+".pdb")[0].__repr__()
print("Seq information",informSeq)
start = re.findall('[0-9]+',informSeq)[0]
print("Reported Mutation {}{}{}".format("R",179,"A"))
numf =179 - int(start) + 1 # fix some cases of negative aminoacid numbers
mutfinal = "R{}A".format(numf)
print("Real Mutation = ",mutfinal)
mutseq = MutateSeq(seq,mutfinal)
print(mutseq)
In this for loop we extract the sequences of all proteins in the dataset. In addition, we create the mutated sequences
and record the change in MT. In some cases, gaps in the PDB files cause the MutateSeq function to fail, so those entries
are skipped. This is an important step in the whole process because it creates the final tabulated data containing the
sequence and the melting temperature (our label).
information = {}
count = 1
failures=[]
for code,tm,numr,wt_val,mut_val in zip(codes,tms,resnum,wt,mut):
count += 1
seq = GetSeqFromPDB("{}.pdb".format(code))[0].get_sequence()
mutfinal="WT"
if("{}-{}".format(code,mutfinal) not in information):
informSeq=GetSeqFromPDB(code+".pdb")[0].__repr__()
start = re.findall('[-0-9]+',informSeq)[0]
if(int(start)<0):
numf =numr - int(start) # if start is negative 0 is not used as resnumber
else:
numf =numr - int(start) + 1
mutfinal = "{}{}{}".format(wt_val,numf,mut_val)
mutseq = MutateSeq(seq,mutfinal)
if(mutseq==None):
failures.append((code,mutfinal))
continue
information["{}-{}".format(code,mutfinal)]=[mutseq,dict_WT_TM[code]-float(tm)]
Here we extract two lists: the sequences (data) and the melting temperatures (labels).
seq_list=[]
deltaTm=[]
for i in information.values():
seq_list.append(i[0])
deltaTm.append(i[1])
max_seq= 0
for i in seq_list:
if(len(i)>max_seq):
max_seq=len(i)
codes = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L',
'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y']
OneHotFeaturizer = dc.feat.OneHotFeaturizer(codes,max_length=max_seq)
features = OneHotFeaturizer.featurize(seq_list)
Note that the OneHotFeaturizer produces a matrix that contains the OneHot Vector for each sequence.
features_vector = []
for i in range(len(features)):
features_vector.append(features[i].flatten())
dc_dataset = dc.data.NumpyDataset(X=features_vector,y=deltaTm)
dc_dataset
<NumpyDataset X.shape: (1497, 13188), y.shape: (1497,), w.shape: (1497,), task_names: [0]>
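The cells that split the dataset and define the network are not shown above. The following is a minimal sketch of one plausible setup (the split fraction, layer sizes, and loss wrapper are assumptions, not the original architecture); it produces the train/test sets and the Keras model used in the compile/fit calls below.
from tensorflow import keras

# Random 80/20 split of the one-hot encoded dataset (assumed split)
splitter = dc.splits.RandomSplitter()
train, test = splitter.train_test_split(dc_dataset, frac_train=0.8, seed=42)

# Simple fully-connected regression network on the flattened one-hot vectors (assumed sizes)
model = keras.Sequential([
    keras.layers.Input(shape=(train.X.shape[1],)),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dense(1),  # predicted change in melting temperature
])

# Wrap the same Keras model in a DeepChem model so it can be evaluated with
# DeepChem metrics later on (assumed wrapper)
model_dc = dc.models.KerasModel(model, loss=dc.models.losses.L1Loss())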
model.compile(loss='mae', optimizer='adam')
print(model.summary())
history = model.fit(
train.X, train.y,
validation_data=(test.X,test.y),
batch_size=100,
epochs=30,
)
10.399123382568359
History_df = pd.DataFrame(model_dc.model.history.history)
History_df[['loss', 'val_loss']].plot()
<AxesSubplot:>
metric = dc.metrics.Metric(dc.metrics.pearson_r2_score)
print('test dataset R2:', model_dc.evaluate(test, [metric]))
In the following cell, we create a PyPro object based on each protein sequence. PyPro allows us to calculate amino acid
composition vectors as well as other descriptors such as CTD.
Here we create a list with the amino acid composition vector for each sequence used in the previous model.
import numpy as np
from propy import PyPro  # the propy3 package provides the PyPro descriptor module

aaComplist = []
CTDList = []
for seq in seq_list:
    Obj = PyPro.GetProDes(seq)
    aaComplist.append(np.array(list(Obj.GetAAComp().values())))
    CTDList.append(np.array(list(Obj.GetCTD().values())))
dc_dataset_aacomp = dc.data.NumpyDataset(X=aaComplist,y=deltaTm)
dc_dataset_ctd = dc.data.NumpyDataset(X=CTDList,y=deltaTm)
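The Random Forest cell whose scores are printed below is not shown; here is a hedged sketch mirroring the SVR cell that follows (the split, seed, and hyperparameters are assumptions).
from sklearn.ensemble import RandomForestRegressor

print("RandomForestRegressor")
seed = 42  # assumed seed; also reused by the SVR cell below

# Split the amino acid composition dataset (assumed 80/20 random split)
splitter = dc.splits.RandomSplitter()
train, test = splitter.train_test_split(dc_dataset_aacomp, frac_train=0.8, seed=seed)

rf_sklearn = RandomForestRegressor(n_estimators=100, random_state=seed)
model = dc.models.SklearnModel(rf_sklearn)
model.fit(train)

metric = dc.metrics.Metric(dc.metrics.mae_score)
print("Train score is : {}".format(model.evaluate(train, [metric])))
print("Test score is : {}".format(model.evaluate(test, [metric])))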
RandomForestRegressor
Train score is : {'mae_score': 1.7916551501995608}
Test score is : {'mae_score': 3.8967191996673947}
In the following cell we create a Support Vector Regressor wrapped in a DeepChem SklearnModel. As with the previous
models, we use the MAE score to evaluate the results of the regression.
print("SupportVectorMachineRegressor")
from sklearn.svm import SVR
svr_sklearn = SVR(kernel="poly",degree=4)
svr_sklearn.random_state = seed
model = dc.models.SklearnModel(svr_sklearn)
model.fit(train)
metric = dc.metrics.Metric(dc.metrics.mae_score)
train_score = model.evaluate(train, [metric])
test_score = model.evaluate(test, [metric])
print("Train score is : {}".format(train_score))
print("Test score is : {}".format(test_score))
SupportVectorMachineRegressor
Train score is : {'mae_score': 3.275727325767219}
Test score is : {'mae_score': 4.058136267284038}
An Introduction to Antibody Design Using Protein Language Models
This tutorial aims to provide a quick overview of the key immunology concepts needed to understand antibody structure
and function in the broader context of the immune system. We assume some familiarity with large language models; take a
look at our other tutorial if you need a refresher. For the sake of brevity, we provide links to external sources on
non-essential topics wherever possible. Follow along to learn more about the immune system and protein language models
for guided antibody (Ab) design.
Note: This tutorial is loosely based on the 2023 Nature Biotechnology paper titled "Efficient evolution of human
antibodies from general protein language models" [1] by Hie et al. We thank the authors for making their methods and
data available and accessible.
1. Immunology 101
If you would like to learn more about the complex problem of self non-self discrimination, and appreciate the theory
behind the immune system's organization and function, we recommend checking out the following works:
algorithmic
Adaptive Immune Algorithm
Note: A helpful distinction between antigens and epitopes is that an antigen is something that broadly generates an
immune response and can have multiple epitopes. Epitopes are specific molecular patterns that have a matching
paratope (binding surface of an adaptive immune receptor).
Image Source: Creative Diagnostics
Over the course of the COVID pandemic, whether we wanted to or not, we were exposed to the concept of antibodies and
learned of their association with some sort of protective capacity against SARS-CoV-2. But what are they, and where do
they originate from?
Antibodies (Abs) are typically represented as Y-shaped proteins that bind to their cognate epitope surfaces with high
specificity and affinity, similar to how TCRs and BCRs bind to their epitopes. This is because antibodies are the soluble
form of the B-cell receptor that is secreted into the blood upon B-cell activation in the presence of its cognate antigen.
The secretion of large amounts of antibodies is the primary effector function of B-cells. Upon activation, a B-cell will
divide, with the daughter cells inheriting the same BCR, and some of these cells will differentiate into plasma cells,
which are the Ab factories capable of secreting thousands of Abs/min. This is especially useful upon antigen re-
encounter where a large amount of antibodies are released by memory cells which neutralize the pathogen even before
we develop the symptoms of infection (this is what most common vaccines are designed to do).
Neutralizing mechanisms of pathogenesis is only one way that antibody tagging is useful to immune defense. Antibody
tagging plays a key role in a number of humoral immunity processes:
1. Neutralization: De-activation of pathogenic function by near-complete coating of the functional component of
pathogens or toxins by antibodies to inhibit interaction with host cells (i.e. an antibody that binds to the surface
glycoproteins on SARS-CoV-2 inhibits that virus particle's ability to enter cells expressing ACE2).
2. Opsonization: Partial coating of pathogens enhances the rate of phagocytosis and removal from the blood by cells of
the innate immune system.
3. Agglutination/Precipitation: Since antibodies have 2 arms (each arm of the Y), they can cross-link and form
antibody-antigen chains which can precipitate out of the plasma and increase their chances of being recognized as
aberrant and cleared by phagocytes.
4. Complement Activation: The complement system is a collection of inactive proteins and protein precursors that are
self-amplifying upon activation and help with multiple aspects of humoral immunity. Yet another function of antibodies
is their role in initiating the complement cascade that ends in the lysis or phagocytosis of pathogens.
Image Source: The Immune System: Innate and Adaptive Body Defenses Figure 21.15 pulled from [Source]
Given the importance of B-cell mediated immunity, as operationalized by the body's antibody repertoire, it's clear that
the diversity of BCR clones plays a critical role in our ability to mount an effective response against a pathogen. The
maintenance of a robust BCR repertoire highlights not only the complexity of the immune response but also underscores
the potential for leveraging the modularity of this mechanism to introduce new clones for their extraordinary precision
in therapeutics such as vaccine development.
Structurally, antibodies are composed of two identical light chains and two identical heavy chains, linked by disulfide
bonds. Each chain contributes to the formation of the antigen-binding site, located in the variable regions. Within these
regions, hypervariable loops known as complementarity determining regions (CDRs) dictate the specificity and affinity of
the antibody-antigen interaction. This specificity is measured in terms of affinity using the dissociation constant (Kd),
and the avidity (affinity over multi-valent binding sites, see IgM, IgA). The antibody molecule is divided into two main
functional regions:
1. Fab Region (Fragment, antigen-binding): Contains the variable regions of the light and heavy chains, responsible for
antigen recognition and binding.
2. Fc Region (Fragment, crystallizable): Composed of the constant regions of the heavy chains, mediates interactions
with innate immune cells and the complement system.
Image Source: Dianova: Antibody Structure
By harnessing selective evolutionary pressure during somatic hypermutation, the B-cell compartment further tunes
antibody specificity, producing some of the highest-affinity interactions in the known protein universe [12]. This high
precision and binding affinity has led to their broad adoption not only in therapeutics but also in commercial and
research applications, where antibodies are used to tag proteins in solution in flow cytometry, CyTOF,
immunoprecipitation, and other target identification assays.
2.1 Overview
Now that we have the minimal background needed to understand the antibody design problem and the necessary language
model concepts, we can jump right into antibody design via directed evolution, as shown in the figure below:
Image Source: Figure 1, Outeiral et al.
2.2 Setup/Methodology
In Hie et al. the authors decide to use a general protein language model instead of one trained specifically on antibody
sequences. They use the ESM-1b and ESM-1v models, which were trained on UniRef50 and UniRef90 [14], respectively.
For their directed evolution studies they select seven therapeutic antibodies associated with viral infections spanning
influenza, Ebolavirus, and SARS-CoV-2. The authors use a straightforward, exhaustive mutation scheme: every residue in
the antigen-binding region is mutated to every other residue and the likelihood of the resulting sequence is computed.
Sequences with likelihoods greater than or equal to the WT sequence were kept for experimental validation.
For our purposes, we need not be as thorough and can use a slightly expedited method by taking the top-k mutations at
a specific position.
Inspired by the work of Hie et al., we first define the pLM driven directed evolution task as simply passing in a masked
antibody sequence to a pLM that was previously trained on the masked language modeling objective and examining the
token probabilities for the masked amino acids. It really is that easy!
For reference we break the task down into the following steps:
*Modification of antibodies needs to focus only on the variable regions, as the amino acids at the binding interface are
the ones responsible for driving affinity. Making edits to the constant region would actually be detrimental to the
antibody's effector function in the complement system, and could disrupt binding to innate immune receptors.
The model used here has a hidden size of 768, a max position embedding of 160, and 12 transformer block layers, totaling ~86M parameters.
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:89: UserWarning:
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/
tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
warnings.warn(
tokenizer_config.json: 0%| | 0.00/367 [00:00<?, ?B/s]
vocab.txt: 0%| | 0.00/71.0 [00:00<?, ?B/s]
tokenizer.json: 0%| | 0.00/3.02k [00:00<?, ?B/s]
special_tokens_map.json: 0%| | 0.00/125 [00:00<?, ?B/s]
config.json: 0%| | 0.00/848 [00:00<?, ?B/s]
pytorch_model.bin: 0%| | 0.00/343M [00:00<?, ?B/s]
config.json: 0%| | 0.00/848 [00:00<?, ?B/s]
pytorch_model.bin: 0%| | 0.00/343M [00:00<?, ?B/s]
)
)
)
)
)
(lm_head): RobertaLMHead(
(dense): Linear(in_features=768, out_features=768, bias=True)
(layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(decoder): Linear(in_features=768, out_features=24, bias=True)
)
)
# Lets take the variable regions of the heavy and light chains
heavy_chain_example = 'EVQLQESGPGLVKPSETLSLTCTVSGGPINNAYWTWIRQPPGKGLEYLGYVYHTGVTNYNPSLKSRLTITIDTSRKQLSLSLKFVTAADSAVY
light_chain_example = 'GSELTQDPAVSVALGQTVRITCQGDSLRNYYASWYQQKPRQAPVLVFYGKNNRPSGIPDRFSGSSSGNTASLTISGAQAEDEADYYCNSRDSSS
def mask_seq_pos(sequence, idx, mask='[MASK]'):
    '''
    Mask a single (zero-indexed) position in an amino acid sequence and return the
    space-separated sequence expected by the tokenizer.
    '''
    cleaned_sequence = sequence.replace(' ', '')  # Get rid of extraneous spaces if any
    assert abs(idx) < len(cleaned_sequence), "Zero-indexed value needs to be less than sequence length minus one."
    cleaned_sequence = list(cleaned_sequence)     # Turn the sequence into a list
    cleaned_sequence[idx] = '*'                   # Mask the sequence at idx
    masked_sequence = ' '.join(cleaned_sequence)  # Convert list -> space-separated string
    masked_sequence = masked_sequence.replace('*', mask)
    return masked_sequence
# Test
assert mask_seq_pos('CAT', 1)=='C [MASK] T'
#TODO: Add unit tests with pytest where you can check that the assert has been hit
HuggingFace Pipelines:
1. Pipeline object is a wrapper for inference and can be treated like an object for API calls
2. There is a fill-mask pipeline that we can use, which accepts a single mask token in our input and outputs a dictionary
containing the score of that sequence, the imputed token, and the reconstructed full sequence.
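As a hedged illustration (the model and tokenizer variable names and the masked position are assumptions inferred from the output that follows), the pipeline could be built and applied to the example light chain like this:
from transformers import pipeline

# Build a fill-mask pipeline from the model and tokenizer loaded above
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Mask the residue at zero-indexed position 9 of the light chain and score candidate residues
masked_light = mask_seq_pos(light_chain_example, 9)
predictions = fill_mask(masked_light)  # top-5 scored completions by default
predictions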
[{'score': 0.13761496543884277,
'token': 7,
'token_str': 'S',
'sequence': 'G S E L T Q D P A S S V A L G Q T V R I T C Q G D S L R N Y Y A S W Y Q Q K P R Q A P V L V F Y
G K N N R P S G I P D R F S G S S S G N T A S L T I S G A Q A E D E A D Y Y C N S R D S S S N H L V F G G G T K
L T V L S Q'},
{'score': 0.1152879148721695,
'token': 6,
'token_str': 'E',
'sequence': 'G S E L T Q D P A E S V A L G Q T V R I T C Q G D S L R N Y Y A S W Y Q Q K P R Q A P V L V F Y
G K N N R P S G I P D R F S G S S S G N T A S L T I S G A Q A E D E A D Y Y C N S R D S S S N H L V F G G G T K
L T V L S Q'},
{'score': 0.0989701896905899,
'token': 9,
'token_str': 'N',
'sequence': 'G S E L T Q D P A N S V A L G Q T V R I T C Q G D S L R N Y Y A S W Y Q Q K P R Q A P V L V F Y
G K N N R P S G I P D R F S G S S S G N T A S L T I S G A Q A E D E A D Y Y C N S R D S S S N H L V F G G G T K
L T V L S Q'},
{'score': 0.08586061000823975,
'token': 14,
'token_str': 'A',
'sequence': 'G S E L T Q D P A A S V A L G Q T V R I T C Q G D S L R N Y Y A S W Y Q Q K P R Q A P V L V F Y
G K N N R P S G I P D R F S G S S S G N T A S L T I S G A Q A E D E A D Y Y C N S R D S S S N H L V F G G G T K
L T V L S Q'},
{'score': 0.07652082294225693,
'token': 8,
'token_str': 'T',
'sequence': 'G S E L T Q D P A T S V A L G Q T V R I T C Q G D S L R N Y Y A S W Y Q Q K P R Q A P V L V F Y
G K N N R P S G I P D R F S G S S S G N T A S L T I S G A Q A E D E A D Y Y C N S R D S S S N H L V F G G G T K
L T V L S Q'}]
Disclaimer: For a more thorough antibody (re)design, we will typically want to follow an approach like what was done in
Hie et al. where every point along the sequence will be mutated and the total number of sequences will be collated and
scored with the top-100 or so antibodies being expressed for validation. If you would like to explore this feel free to try it
out yourself as a challenge!
You can also refer to the real data in Hie et al. to see if any of the predicted ones were found to work well and increase
fitness.
2.3 Limitations
While promising, this approach is obviously not without its shortcomings. Key limitations include:
Fixed length antibody design since masked tokens are applied in a 1:1 fashion.
Lack of target information included during conditional sampling step which can influence choice of amino acid given
the sequence context.
Approach is sensitive to choice of protein language model
This letter [18] provides a great synopsis of Hie et al.'s work, which by extension applies to the methods presented in
this tutorial as well.
@manual{Bioinformatics,
title={An Introduction to Antibody Design Using Protein Language Models},
organization={DeepChem},
author={Karthikeyan, Dhuvarakesh and Menezes, Aaron},
howpublished =
{\url{https://github.com/deepchem/deepchem/blob/master/examples/tutorials/DeepChem_AntibodyTutorial_Simpl
year={2024},
}
Works Cited
[1] Hie, B.L., Shanker, V.R., Xu, D. et al. Efficient evolution of human antibodies from general protein language models.
Nat Biotechnol 42, 275–283 (2024). https://doi.org/10.1038/s41587-023-01763-2
[2] Bretscher, P., & Cohn, M. (1970). A Theory of Self-Nonself Discrimination. Science, 169(3950), 1042–1049.
doi:10.1126/science.169.3950.1042
[3] Cohn, M. The common sense of the self-nonself discrimination. Springer Semin Immun 27, 3–17 (2005).
https://doi.org/10.1007/s00281-005-0199-1
[5] De Boer, R. J., & Hogeweg, P. Self-Nonself Discrimination due to Immunological Nonlinearities: the Analysis of a
Series of Models by Numerical Methods, Mathematical Medicine and Biology: A Journal of the IMA, Volume 4, Issue 1, 1987,
Pages 1–32, https://doi.org/10.1093/imammb/4.1.1
[6] Cohn, M. A biological context for the self-nonself discrimination and the regulation of effector class by the immune
system. Immunol Res 31, 133–150 (2005). https://doi.org/10.1385/IR:31:2:133
[7] Janeway CA Jr, Travers P, Walport M, et al. Immunobiology: The Immune System in Health and Disease. 5th edition.
New York: Garland Science; 2001. Principles of innate and adaptive immunity. Available from:
https://www.ncbi.nlm.nih.gov/books/NBK27090/
[8] Perelson, A. Modelling viral and immune system dynamics. Nat Rev Immunol. 2. , 28–36 (2002).
https://doi.org/10.1038/nri700
[10] Janeway CA Jr, Travers P, Walport M, et al. Immunobiology: The Immune System in Health and Disease. 5th edition.
New York: Garland Science; 2001. Chapter 8, T Cell-Mediated Immunity. Available from:
https://www.ncbi.nlm.nih.gov/books/NBK10762/
[11] Glick, B., Chang, T. S., & Jaap, R. G. (1956). The Bursa of Fabricius and Antibody Production. Poultry Science, 35(1),
224–225. doi:10.3382/ps.0350224
[12] Nooren, I. M. (2003). NEW EMBO MEMBER’S REVIEW: Diversity of protein-protein interactions. EMBO Journal, 22(14),
3486–3492. https://doi.org/10.1093/emboj/cdg359
[13] Karolis Martinkus, Jan Ludwiczak, Kyunghyun Cho, Wei-Ching Liang, Julien Lafrance-Vanasse, Isidro Hotzel, Arvind
Rajpal, Yan Wu, Richard Bonneau, Vladimir Gligorijevic, & Andreas Loukas. (2024). AbDiffuser: Full-Atom Generation of
in vitro Functioning Antibodies.
[14] Baris E. Suzek, Hongzhan Huang, Peter McGarvey, Raja Mazumder, Cathy H. Wu, UniRef: comprehensive and non-
redundant UniProt reference clusters, Bioinformatics, Volume 23, Issue 10, May 2007, Pages 1282–1288,
https://doi.org/10.1093/bioinformatics/btm098
[15] Tobias H Olsen, Iain H Moal, Charlotte M Deane, AbLang: an antibody language model for completing antibody
sequences, Bioinformatics Advances, Volume 2, Issue 1, 2022, vbac046, https://doi.org/10.1093/bioadv/vbac046
[16] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke
Zettlemoyer, & Veselin Stoyanov. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach.
[17] Olsen TH, Boyles F, Deane CM. Observed Antibody Space: A diverse database of cleaned, annotated, and translated
unpaired and paired antibody sequences. Protein Sci. 2022 Jan;31(1):141-146. doi: 10.1002/pro.4205. Epub 2021 Oct
29. PMID: 34655133; PMCID: PMC8740823.
[18] Outeiral, C., Deane, C.M. Perfecting antibodies with language models. Nat Biotechnol 42, 185–186 (2024).
https://doi.org/10.1038/s41587-023-01991-6
Protein Language Models (Intuition): A First Look
at Modeling Syntax and Semantics of the Known
Protein Universe
By Dhuvi Karthikeyan, Aaron Menezes, Elisa Gómez de Lope, and Rakshit Singh
Inspired by the success of Large Language Models (LLMs) such as BERT, T5, and GPT, which have demonstrated state-of-the-art performance on sentiment analysis, summarization, question answering, and classification tasks, protein language models (pLMs) have shown similarly sweeping success across a broad array of protein-specific tasks. These tasks include contact prediction, mutational landscape fitness prediction, binding site prediction, property prediction, and much more! In this tutorial, we explore the fundamental intuition driving the success of protein language models by developing a clear picture of what is actually happening under the hood and the resulting pros and cons of their usage. If you'd like to learn more about language models in other domains, check out DeepChem's very own ChemBERTa, a large language model trained on the chemical domain, here.
Open in Colab
Table of Contents:
1. Introduction
2. What is a language model?
3. Methods for learning language
4. How do Protein Language Models (pLMs) work?
5. MSA-aware vs non-MSA-aware protein language models
6. Evolutionary statistics of Hemoglobin and its ProtBERT learned representation
7. Concluding thoughts
1. Introduction
This DeepChem tutorial is designed to serve as an introductory primer on protein language models, a powerful and
versatile method of processing protein sequence information inspired by methods from the natural language space.
Over the past decade, natural language processing has shown the strength of using learned representations to
encapsulate the semantic meaning of text data. Notable models like word2vec [1] and GloVe [2] proved that self-
supervised pre-training on large, unlabeled corpora effectively creates robust feature embeddings that maintain
similarity and analogy in language. However, these models were limited in utility by their context-free embeddings. The
advent of context-aware models, starting with BERT [3], led to numerous sequence models applicable beyond language
domains. In biology, self-supervised pre-training on protein language models has achieved state-of-the-art performance
in various tasks by deriving context-aware amino acid embeddings that can be fine-tuned to capture information on
structure [4] and function [5] of proteins.
This tutorial aims to provide an overview of the concepts and intuition behind protein language models that are needed to work with them and understand their inputs and outputs, strengths, and failure modes. We skip a detailed breakdown of their architecture, but invite the community to build upon this tutorial by contributing additional content in the form of a pull request.
Disclaimer: For brevity's sake, we assume some familiarity with the multilayer perceptron, neural networks, and learning by gradient descent. Additionally, we assume some fluency with probability theory on matters such as discrete vs. continuous distributions, likelihood, and conditional distributions. We provide links to vetted, beginner-friendly external sources on the less obvious topics and concepts wherever necessary, as a starting point for the more complicated material. Follow along for a high-level overview of why protein language models have been so successful across a broad range of tasks.
A simple way to visualize what a language model is doing in the background is to think of it as updating and indexing a huge square matrix of transition probabilities of size V x V, where V is the vocabulary size of the model. Here, vocabulary size refers to the number of unique words or sub-words that make up the state space of the categorical distribution. So a model that only knows the words ['a', 'boy', 'cute', 'is', 'student', 'the', 'walking'] has a vocabulary size of 7. If we start off with an untrained, randomly initialized model and use a uniform initialization, we get a transition matrix that looks something like the one below, where we introduce a special word to designate the end of sequence (EOS):
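To make this concrete, here is a small NumPy sketch (not part of the original notebook) of a uniformly initialized transition matrix for the toy vocabulary above, with the EOS token added:
import numpy as np

# Toy vocabulary plus a special end-of-sequence token
vocab = ['a', 'boy', 'cute', 'is', 'student', 'the', 'walking', 'EOS']
V = len(vocab)

# Uniform initialization: every word is equally likely to follow any other word
transition_matrix = np.full((V, V), 1.0 / V)
print(transition_matrix.shape)  # (8, 8); each row sums to 1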
However, if we look at some of the transition probabilities, we can immediately see that the model is not very good. For
example, the probability of the word 'a' coming after 'a' should be close to 0. Same goes for the word 'the' coming after
'a'. It's pretty clear that we need some way of training this model so that we can get some realistic transition
probabilities.
The first language models were trained on the principle of causal language modeling, where the model is tasked with
next word prediction during each training step.
After enough rounds of this training protocol the model learns a much more plausible distribution over the words -
something that looks like the following:
Here we can see that the model has learned that the words above are not typically repeated twice in a row. It assigns
subject words ['boy', 'student'] after the word 'the' with higher probability than the verbs ['is', 'walking']. If we start at
'the' and sample the most likely words at each transition we can generate the following sentence as a path through the
model: 'the' -> 'boy' -> 'is' -> 'walking' -> 'EOS'. This mode of sampling a word at every time step and then conditioning
on the previously sampled words is known as auto-regressive generation.
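As an illustration of auto-regressive generation, here is a hedged sketch that greedily follows the highest-probability transition at each step, assuming a trained transition_matrix and vocab like the toy example above:
def generate_greedy(start_word, transition_matrix, vocab, max_len=10):
    # Follow the most likely transition at each step until EOS (or a length cap)
    idx = {w: i for i, w in enumerate(vocab)}
    sentence = [start_word]
    while sentence[-1] != 'EOS' and len(sentence) < max_len:
        next_idx = transition_matrix[idx[sentence[-1]]].argmax()
        sentence.append(vocab[next_idx])
    return sentence

# With a well-trained matrix this could yield ['the', 'boy', 'is', 'walking', 'EOS']
print(generate_greedy('the', transition_matrix, vocab))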
Causal language modeling has a key drawback in that sometimes the necessary context to make sense of a word in a
sentence comes after the word and not before. Masked language modeling is like causal modeling, but makes use of the
fact that context may come before and after the word of interest.
This approach is what underlies the powerful BERT [3] language model, where they used a masking rate of about 15% of
the words. Amazingly, this approach has been tried on sequences other than language and has been shown to be a
robust model for learning the syntax and semantics of sequential data of various modalities including time series data,
videos, and yes even proteins!
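A minimal sketch of the masking step used to build such training examples is shown below (illustrative only; BERT additionally keeps or randomly swaps a fraction of the selected tokens):
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token='[MASK]'):
    # Randomly hide ~15% of tokens; the model is trained to recover the originals
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_rate:
            masked.append(mask_token)
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets

print(mask_tokens(['the', 'boy', 'is', 'walking']))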
An optional second training step known as fine-tuning can be applied on a pre-trained protein language model, to
further train it on a specific task with protein sequence examples annotated with labels. In practice, starting from the
pretrained weights has shown to have better performance than starting from randomly initialized weights as the model
simply learns how to use strong representations of the inputs (learned during pretraining) instead of jointly learning the
representation AND how to use it. PLMs finetuned on the mappings between specific protein families or functional
classes can significantly enhance predictive power compared to non-pretrained models, and can be applied in a number
of different use cases, such as predicting binding sites or the effects of mutations.
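As a rough sketch of what the starting point of fine-tuning can look like, the Hugging Face transformers API can attach a fresh classification head to the pretrained ProtBERT checkpoint used later in this tutorial; the two-class task below is purely hypothetical, and a real run would still need a labeled dataset and a training loop:
from transformers import BertTokenizer, BertForSequenceClassification

# Start from pretrained ProtBERT weights and add a randomly initialized classifier head
tokenizer = BertTokenizer.from_pretrained('Rostlab/prot_bert', do_lower_case=False)
model = BertForSequenceClassification.from_pretrained('Rostlab/prot_bert', num_labels=2)

# ProtBERT expects amino acids separated by spaces
inputs = tokenizer('M V H L T P E E K S', return_tensors='pt')
outputs = model(**inputs)
print(outputs.logits.shape)  # torch.Size([1, 2])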
One of the most compelling benefits of PLMs is their ability to capture coevolutionary relationships within and across
protein sequences [7]. In the same way that words in a sentence co-occur to convey coherent meaning, amino acid
residues in a protein sequence co-evolve to maintain the protein's structural integrity and functionality. PLMs capture
these coevolutionary patterns, allowing for the prediction of how changes in one part of a protein may affect other parts.
Thus, from a design perspective, the directed evolution task is an area where PLMs offer substantial advantages. In a
directed evolution experiment, a naturally occurring protein can be mutated according to any arbitrary heuristic and is
then checked if a desired function has improved. Since PLMs capture intra-sequence conditional distributions, this
process can be vastly streamlined by masking portions of the protein we wish to 'mutate' and sampling from the
distribution of what amino acids are strong candidates to occur given the rest of the sequence. PLMs thus have the
potential to significantly reduce experimental burden by identifying promising candidates at a higher hit rate.
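A hedged sketch of this idea using the Hugging Face fill-mask pipeline with ProtBERT (the sequence fragment and masked position are arbitrary illustrations, not a real design campaign):
from transformers import pipeline

# Ask ProtBERT which residues are plausible at a masked position, given the rest
# of the sequence -- a crude proxy for an in-silico mutagenesis step
unmasker = pipeline('fill-mask', model='Rostlab/prot_bert')
sequence = 'M V H L T P E E K S A V T A L W G K V [MASK] V D E V G G E A L G R L L'
for candidate in unmasker(sequence)[:5]:
    print(candidate['token_str'], round(candidate['score'], 3))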
Models like ESM-1b [4] and ESM-2 [8] are examples of sequence-only pLMs that do not explicitly incorporate 3D
structural information. These sequence-based pLMs have demonstrated impressive performance on a variety of protein
function prediction tasks by learning patterns from large protein sequence datasets. However, the lack of structural
information can limit the generalization capabilities of sequence-only PLMs. This is true especially for applications
heavily dependent on protein structure, such as contact prediction. Moreover, the inclusion of structural information helps overcome the distributional biases that exist in sequence training datasets.
Structure-aware pLMs like S-PLM[9] and ESM-Fold [8] are trained on both sequence and structural information, and in
turn generate protein representations that encode both sequence and structural information. These models use various
methods such as multi-view contrastive learning to align the sequence and structure representations in a shared latent
space (S-PLM). The structural awareness enables them to achieve comparable or superior performance to specialized
structure-based methods or sequence-based pLMs, particularly for applications that heavily rely on protein structure.
Interestingly, the recently released ESM-3 [10] pLM reasons over sequence, structure, and function, meaning that for
each protein, its sequence, structure, and function are extracted, tokenized, and partially masked during pre-training.
The framework of S-PLM and lightweight tuning strategies for downstream supervised learning. a, The framework of S-
PLM: During pretraining, the model inputs both the amino acid sequences and contact maps derived from protein
structures simultaneously. After pretraining, the ESM-Adapter that generates the AA-level embeddings before the
projector layer is used for downstream tasks. The entire ESM-Adapter model can be fully frozen or learnable through
lightweight tuning. b, Architecture of the ESM-Adapter. c, Adapter tuning for supervised downstream tasks. d, LoRA
tuning for supervised downstream tasks is implemented. Adapted from [9].
In the context of pLMs, MSA provides evolutionary context to the representations of protein sequences. PLMs can be
MSA-aware and non-MSA-aware:
MSA-aware models:
MSA-aware models, such as the MSA Transformer [11], Evoformer (used in AlphaFold) [12] and ESM-MSA [11], are
trained on datasets that include MSAs as input to incorporate evolutionary information and relationships between
sequences to learn richer representations. They align multiple homologous sequences to capture conserved and
variable regions. The rationale is that conserved regions often indicate functionally or structurally important parts of the
protein, while variable regions can provide insights into evolutionary divergence and adaptation.
MSA-aware models can provide deeper insights into protein function and structure due to the evolutionary context.
However, they are computationally intensive and require high-quality MSAs, which may not be available for all protein
families.
Non-MSA-aware models:
Non-MSA-aware models, such as ESMFold (ESM-2)[8], ProtBERT [6] and TAPE, treat each protein sequence
independently and do not explicitly incorporate evolutionary information from MSAs. They are trained on large datasets
of individual protein sequences, learning patterns and representations directly from the sequence data.
While they can generalize well to diverse sequences and are computationally efficient, they may miss out on the
evolutionary context that can be crucial for certain tasks.
Benefits of incorporating MSAs:
Evolutionary insight: MSAs provide evolutionary information, highlighting conserved residues that are often critical for protein function and structure.
Improved predictions: By incorporating evolutionary context, MSA-aware models can improve performance on tasks
such as secondary structure prediction, contact prediction, and function annotation.
Functional and structural understanding: MSAs help in identifying functionally important regions and understanding
the structural constraints of proteins.
Challenges of MSA-aware models:
Computational complexity: Generating and processing MSAs is computationally expensive and time-consuming.
Data availability: High-quality MSAs are not available for all protein families, especially those with few known
homologs.
Model complexity: MSA-aware models are more complex and require sophisticated architectures to effectively utilize
the evolutionary information.
Other considerations:
The comparative performance of MSA-aware and non-MSA-aware models for predicting the 3D structure of proteins, as well as their function and other properties, is currently an active topic of research.
Interestingly, MSA-free models have been reported to efficiently generate sufficiently accurate MSAs that can be used as input for MSA-aware models.
Without further ado, let's explore some of the properties of protein language models in the wild!
Image Source: Adapted from "Représentation simplifiée de l'hémoglobine et de l'hème". Wikimedia Commons.
Hemoglobin is the protein responsible for transporting oxygen from the lungs to all the cells of our body via red blood
cells. Hemoglobin is a great protein to interrogate the behaviors of protein language models as it is highly conserved in
certain regions across species, and also slightly variable in other places. What would we expect the distribution over
amino acids to look like if we mask out a highly conserved region? What about a highly diverse region? Let's find out.
Hemoglobin Sequence Homology across closely related mammals (from [13]):
hemoglobin_beta = {
'human':
"MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLA
'chimpanzee':
"MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTORFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLA
'camel':
"MVHLSGDEKNAVHGLWSKVKVDEVGGEALGRLLVVYPWTRRFFESFGDLSTADAVMNNPKVKAHGSKVLNSFGDGLNHLDNLKGTYAKLSELHCDKLHVDPENFRLLGNVLVVVLA
'rabbit':
"MVHLSSEEKSAVTALWGKVNVEEVGGEALGRLLVVYPWTQRFFESFGDLSSANAVMNNPKVKAHGKKVLAAFSEGLSHLDNLKGTFAKLSELHCDKLHVDPENFRLLGNVLVIVLS
'pig':
"MVHLSAEEKEAVLGLWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSNADAVMGNPKVKAHGKKVLQSFSDGLKHLDNLKGTFAKLSELHCDQLHVDPENFRLLGNVIVVVLA
'horse':
"*VQLSGEEKAAVLALWDKVNEEEVGGEALGRLLVVYPWTQRFFDSFGDLSNPGAVMGNPKVKAHGKKVLHSFGEGVHHLDNLKGTFAALSELHCDKLHVDPENFRLLGNVLVVVLA
'bovine':
"M**LTAEEKAAVTAFWGKVKVDEVGGEALGRLLVVYPWTQRFFESFGDLSTADAVMNNPKVKAHGKKVLDSFSNGMKHLDDLKGTFAALSELHCDKLHVDPENFKLLGNVLVVVLA
'sheep':
"M**LTAEEKAAVTGFWGKVKVDEVGAEALGRLLVVYPWTQRFFEHFGDLSNADAVMNNPKVKAHGKKVLDSFSNGMKHLDDLKGTFAQLSELHCDKLHVDPENFRLLGNVLVVVLA
}
The sequences above show the homology of hemoglobin beta subunits across the animal kingdom. The part of the hemoglobin sequence that is essential to the function of carrying oxygen is the part that binds to the heme group. This is handled by a single amino acid, namely the histidine (H) near position 92 on the beta chain. Unsurprisingly, given its functional importance, the amino acid (H) at this position is unchanged across all species. Can a language model recapitulate this?
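The cells that load ProtBERT, mask the conserved histidine, and run the forward pass are summarized by their outputs below; a minimal sketch of how those outputs could be produced is given here for reference (variable names such as softmaxed mirror the later cells, but the original notebook's exact code may differ, and the full human sequence from the dictionary above is assumed):
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('Rostlab/prot_bert', do_lower_case=False)
model = BertForMaskedLM.from_pretrained('Rostlab/prot_bert')

# Mask the conserved histidine near position 92 and space-separate the residues,
# which is the input format ProtBERT expects
residues = list(hemoglobin_beta['human'])
residues[92] = '[MASK]'
masked_sequence = ' '.join(residues)

inputs = tokenizer(masked_sequence, return_tensors='pt')
with torch.no_grad():
    logits = model(**inputs).logits[0, 1:-1]  # drop [CLS]/[SEP]: one row per residue

# Per-residue probability distribution over the 30-token vocabulary
softmaxed = torch.softmax(logits, dim=-1)
print(softmaxed.shape)  # torch.Size([147, 30])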
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:89: UserWarning:
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/
tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
warnings.warn(
tokenizer_config.json: 0%| | 0.00/86.0 [00:00<?, ?B/s]
vocab.txt: 0%| | 0.00/81.0 [00:00<?, ?B/s]
special_tokens_map.json: 0%| | 0.00/112 [00:00<?, ?B/s]
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download`
is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force
a new download, use `force_download=True`.
warnings.warn(
config.json: 0%| | 0.00/361 [00:00<?, ?B/s]
pytorch_model.bin: 0%| | 0.00/1.68G [00:00<?, ?B/s]
Some weights of the model checkpoint at Rostlab/prot_bert were not used when initializing BertForMaskedLM: ['ber
t.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another tas
k or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTrainin
g model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to
be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification mo
del).
BertForMaskedLM(
(bert): BertModel(
(embeddings): BertEmbeddings(
(word_embeddings): Embedding(30, 1024, padding_idx=0)
(position_embeddings): Embedding(40000, 1024)
(token_type_embeddings): Embedding(2, 1024)
(LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.0, inplace=False)
)
(encoder): BertEncoder(
(layer): ModuleList(
(0-29): 30 x BertLayer(
(attention): BertAttention(
(self): BertSdpaSelfAttention(
(query): Linear(in_features=1024, out_features=1024, bias=True)
(key): Linear(in_features=1024, out_features=1024, bias=True)
(value): Linear(in_features=1024, out_features=1024, bias=True)
(dropout): Dropout(p=0.0, inplace=False)
)
(output): BertSelfOutput(
(dense): Linear(in_features=1024, out_features=1024, bias=True)
(LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.0, inplace=False)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=1024, out_features=4096, bias=True)
(intermediate_act_fn): GELUActivation()
)
(output): BertOutput(
(dense): Linear(in_features=4096, out_features=1024, bias=True)
(LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.0, inplace=False)
)
)
)
)
)
(cls): BertOnlyMLMHead(
(predictions): BertLMPredictionHead(
(transform): BertPredictionHeadTransform(
(dense): Linear(in_features=1024, out_features=1024, bias=True)
(transform_act_fn): GELUActivation()
(LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
)
(decoder): Linear(in_features=1024, out_features=30, bias=True)
)
)
)
M V H L T P E E K S A V T A L W G K V N V D E V G G E A L G R L L V V Y P W T Q R F F E S F G D L S T P D A V M
G N P K V K A H G K K V L G A F S D G L A H L D N L K G T F A T L S E L [MASK] C D K L H V D P E N F R L L G N V
L V C V L A H H F G K E F T P P V Q A A Y Q K V V A G V A N A L A H K Y H
torch.Size([147, 30])
### Step 6. Decode the Logits Using Greedy Decoding (Max Probability at Each Timestep)
decoded_outputs = tokenizer.batch_decode(softmaxed.argmax(axis=1))
decoded_sequence = ''.join(decoded_outputs)
print(decoded_sequence)
print(f'The filled-in masked sequence is: {decoded_sequence[92]}')
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLKHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLV
CVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
The filled-in masked sequence is: H
Sanity Check: Whew, looks like the pLM ProtBERT was able to recapitulate the correct amino acid at that position. But
how confident was the model? Let's visualize the distribution at that position and see what other amino acids the model
was choosing between.
plt.bar(tokenizer.get_vocab().keys(), softmaxed[92])
plt.ylabel('Normalized Probability')
plt.xlabel('Model Vocabulary')
plt.title('Target Distribution at the F8 Histidine')
plt.xticks(rotation='vertical')
plt.show()
### [EXTRA] Step 8. Visualize the Logits Map Across All Positions
import seaborn as sns
plt.figure(figsize=(10,16))
sns.heatmap(softmaxed, xticklabels=tokenizer.get_vocab())
plt.show()
### [EXTRA] Step 9. Look at a Low Confidence Region
plt.bar(tokenizer.get_vocab().keys(), softmaxed[87])
plt.ylabel('Normalized Probability')
plt.xlabel('Model Vocabulary')
plt.title('Target Distribution at Position 87')
plt.xticks(rotation='vertical')
plt.show()
As we can see from the above, at the positions where the model has lower confidence, there tends to be more diversity among the different species. This aligns well with our understanding of what the categorical distribution would look like if we calculated the probabilities of each of the amino acids using all the homologous proteins in the protein universe.
7. Concluding Thoughts
We hope you liked this Tutorial 0 on protein language models. While subsequent tutorials will cover more of the architecture of protein language models, their learned representations, and the applications of this remarkable class of methods, we hope that this work helps ground you when going through all the details. Analyzing the inputs and outputs of pLMs through this lens helps explain the performance disparities seen on certain examples and the failure modes that these models can encounter. For a quick reference, some of their strengths and limitations, as they fall within the scope of this tutorial, are summarized below:
7.1 Strengths
pLMs learn co-evolutionary statistics of residues across diverse protein families [7].
They capture information on structure and function from protein sequence alone (most available and accurate
modality by far).
7.2 Limitations
pLMs demonstrate poorer performance on learning the sequence distributions of highly mutated/variable protein sequences; they are biased towards germline sequences [15].
Current pLMs are biased towards sequences derived from canonically studied model organisms [16].
@manual{Bioinformatics,
title={Protein Language Models (Intuition): A First Look at Modeling Syntax and Semantics of the Known Protein Universe},
organization={DeepChem},
author={Karthikeyan, Dhuvarakesh and Menezes, Aaron and de Lope, Elisa Gomez},
howpublished = {\url{https://github.com/deepchem/deepchem/blob/master/examples/tutorials/ProteinLM_Tutorial0.ipynb}},
year={2024},
}
References
[1] Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space.
arXiv [Cs.CL]. Retrieved from http://arxiv.org/abs/1301.3781
[2] Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global Vectors for Word Representation.
In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–
1543, Doha, Qatar. Association for Computational Linguistics.
[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, & Kristina Toutanova. (2019). BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding.
[4] Rao, R., Meier, J., Sercu, T., Ovchinnikov, S., & Rives, A. (2020). Transformer protein language models are
unsupervised structure learners. bioRxiv. doi:10.1101/2020.12.15.422761
[5] Ibtehaz, N., Kagaya, Y., & Kihara, D. (2023). Domain-PFP: Protein Function Prediction Using Function-Aware Domain
Embedding Representations. bioRxiv. doi:10.1101/2023.08.23.554486
[6] Ahmed Elnaggar, Michael Heinzinger, Christian Dallago, Ghalia Rihawi, Yu Wang, Llion Jones, Tom Gibbs, Tamas
Feher, Christoph Angerer, Martin Steinegger, Debsindhu Bhowmik, & Burkhard Rost. (2021). ProtTrans: Towards
Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing.
[7] Zhidian Zhang, Hannah K. Wayment-Steele, Garyk Brixi, Haobo Wang, Matteo Dal Peraro, Dorothee Kern, Sergey
Ovchinnikov bioRxiv 2024.01.30.577970; doi: https://doi.org/10.1101/2024.01.30.577970
[8] Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli,
Yaniv Shmueli, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Salvatore Candido, Alexander Rives bioRxiv
2022.07.20.500902; doi: https://doi.org/10.1101/2022.07.20.500902
[9] Wang D, Pourmirzaei M, Abbas UL, Zeng S, Manshour N, Esmaili F, Poudel B, Jiang Y, Shao Q, Chen J, Xu D. S-PLM:
Structure-aware Protein Language Model via Contrastive Learning between Sequence and Structure. bioRxiv [Preprint].
2024 May 13:2023.08.06.552203. doi: 10.1101/2023.08.06.552203. PMID: 37609352; PMCID: PMC10441326.
[10] Thomas Hayes, Roshan Rao, Halil Akin, Nicholas J. Sofroniew, Deniz Oktay, Zeming Lin, Robert Verkuil, Vincent Q.
Tran, Jonathan Deaton, Marius Wiggert, Rohil Badkundri, Irhum Shafkat, Jun Gong, Alexander Derry, Raul S. Molina, Neil
Thomas, Yousuf Khan, Chetan Mishra, Carolyn Kim, Liam J. Bartie, Matthew Nemeth, Patrick D. Hsu, Tom Sercu,
Salvatore Candido, Alexander Rives bioRxiv 2024.07.01.600583; doi: https://doi.org/10.1101/2024.07.01.600583
[11] Roshan M Rao, Jason Liu, Robert Verkuil, Joshua Meier, John Canny, Pieter Abbeel, Tom Sercu, Alexander Rives. MSA Transformer. Proceedings of the 38th International Conference on Machine Learning, PMLR 139:8844-8856, 2021.
[12] Jumper, J., Evans, R., Pritzel, A. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–
589 (2021). https://doi.org/10.1038/s41586-021-03819-2
[13] Ali, A., Baby, B., Soman, S.S. et al. Molecular insights into the interaction of hemorphin and its targets. Sci Rep 9,
14747 (2019). https://doi.org/10.1038/s41598-019-50619-w
[14] Baris E. Suzek, Hongzhan Huang, Peter McGarvey, Raja Mazumder, Cathy H. Wu, UniRef: comprehensive and non-
redundant UniProt reference clusters, Bioinformatics, Volume 23, Issue 10, May 2007, Pages 1282–1288,
https://doi.org/10.1093/bioinformatics/btm098
[15] Shaw, A., Spinner, H., Shin, J., Gurev, S., Rollins, N., & Marks, D. (2023). Removing bias in sequence models of
protein fitness. bioRxiv. doi:10.1101/2023.09.28.560044
[16] Ding, F., & Steinhardt, J. (2024). Protein language models are biased by unequal sequence sampling across the tree
of life. doi:10.1101/2024.03.07.584001
Congratulations! Time to join the Community!
Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue
working with DeepChem, we encourage you to finish the rest of the tutorials in this series. You can also help the
DeepChem community in the following ways:
The DeepChem Discord hosts a number of scientists, developers, and enthusiasts interested in deep learning for the life
sciences. Join the conversation!
Introduction to Binding Sites
In this tutorial, you will explore the fundamental concepts and computational methods for studying binding sites to enhance your understanding and analysis of molecular interactions.
Table of Contents:
Introduction
Basic concepts
Types of binding sites
Computational methods to study binding sites
DeepChem tools
What does a binding pocket look like?
Further Reading
This tutorial is made to run without any GPU support, and can be used in Google colab. If you'd like to open this
notebook in colab, you can use the following link.
Open in Colab
Introduction
Binding sites are specific locations on a molecule where ligands, such as substrates, inhibitors, or other molecules, can
attach through various types of molecular interactions.
Binding sites are crucial for the function of many biological molecules. They are typically located on the surface of
proteins or within their three-dimensional structure. When a ligand binds to a binding site, it can induce a
conformational change in the protein, which can either activate or inhibit the protein's function. This binding process is
essential for numerous biological processes, including enzyme catalysis, signal transduction, and molecular recognition.
For example, in enzymes, the binding site where the substrate binds is often referred to as the active site. In receptors,
the binding site for signaling molecules (such as hormones or neurotransmitters) is critical for transmitting signals inside
the cell.
Understanding binding sites is particularly relevant for the development of new drugs, as it can lead to the development
of more effective and selective drugs from multiple angles:
Target identification: Identifying binding sites on target proteins allows researchers to design molecules that can
specifically interact with these sites, leading to the development of drugs that can modulate the protein's activity.
Drug design: Knowledge of the structure and properties of binding sites enables the design of drugs with high
specificity and affinity, reducing off-target effects and increasing efficacy.
Optimization: A detailed understanding of binding interactions helps improve the binding characteristics of drug candidates, such as increasing binding affinity and selectivity.
Myoglobin (blue) with its ligand heme (orange) bound. Based on PDB: 1MBO.
Basic concepts
Here we cover some basic notions to understand the science of binding site identification.
Molecular Interactions
The specific interactions that occur at the binding site can be of various types, including (non-exhaustive list):
Hydrogen Bonding: Weak electrostatic interactions between hydrogen atoms bonded to highly electronegative
atoms (like oxygen, nitrogen, or fluorine) and other electronegative atoms or functional groups. Hydrogen bonding is
important for stabilizing protein-ligand complexes and can be enhanced by halogen bonding.
Halogen Bonding: A type of intermolecular interaction where a halogen atom (like iodine or fluorine) acts as an
acceptor, forming a bond with a hydrogen atom or a multiple bond. Halogen bonding can significantly enhance the
affinity of ligands for binding sites.
Orthogonal Multipolar Interactions: Interactions between backbone carbonyls, amide-containing side chains,
guanidinium groups, and sulphur atoms, which can also enhance binding affinity.
Van der Waals Forces: Weak, non-specific interactions arising from induced electrical interactions between closely
approaching atoms or molecules. They usually provide additional stabilization and contribute to the overall binding
affinity, especially in close-contact regions.
Metal coordination: Interactions between metal ions (e.g., zinc, magnesium) and ligands that have lone pairs of
electrons (e.g., histidine, cysteine, water). These interactions are typically coordinate covalent bonds, where both
electrons in the bond come from the ligand, and are crucial in metalloenzymes and metalloproteins, where metal
ions often play a key role in catalytic activity and structural stability.
Polar Interactions: Interactions between polar functional groups, such as hydrogen bond donors (e.g., backbone NH,
polarized Cα–H, polar side chains, and protein-bound water) and acceptors (e.g., backbone carbonyls, amide-
containing side chains, and guanidinium groups).
Hydrophobic Interactions: Non-polar interactions between lipophilic side chains, which can contribute to the binding
affinity of ligands.
Pi Interactions: Interactions between aromatic rings (Pi-Pi), and aromatic rings with other types of molecules
(Halogen-Pi, Cation-Pi,...). They occur in binding sites with aromatic residues such as phenylalanine, tyrosine, and
tryptophan, and stabilize the binding complex through stacking interactions.
Valiulin, Roman A. "Non-Covalent Molecular Interactions". Cheminfographic, 13 Apr. 2020,
https://cheminfographic.wordpress.com/2020/04/13/non-covalent-molecular-interactions
Ligands
Ligands are molecules that bind to specific sites on proteins or other molecules, facilitating various biological processes.
Substrates: Molecules that bind to an enzyme's active site and undergo a chemical reaction.
Inhibitors: Molecules that bind to an enzyme or receptor and block its activity.
Activators: Molecules that bind to an enzyme or receptor and increase its activity.
Cofactors: Non-protein molecules (metal ions, vitamins, or other small molecules) that bind to enzymes to modify
their activity. They can act as activators or inhibitors depending on the specific enzyme and the binding site.
Signaling Lipids: Lipid-based molecules that act as signaling molecules, such as steroid hormones.
Neurotransmitters: Chemical messengers that transmit signals between neurons and their target cells.
The binding of ligands to their target sites is influenced by various physicochemical properties:
Size and Shape: Ligands must be the appropriate size and shape to fit into the binding site.
Charge: Electrostatic interactions, such as ionic bonds, can contribute to ligand binding.
Hydrophobicity: Hydrophobic interactions between non-polar regions of the ligand and the binding site can stabilize
the complex.
Hydrogen Bonding: Hydrogen bonds between the ligand and the binding site can also play a crucial role in binding
affinity.
The specific interactions between a ligand and its binding site, as well as the physicochemical properties of the ligand,
are essential for understanding and predicting ligand-receptor binding events.
Binding affinity is the strength of the binding interaction between a biomolecule (e.g., a protein or DNA) and its ligand or binding partner (e.g., a drug or inhibitor). It is typically measured and reported by the equilibrium dissociation constant (Kd), which is used to evaluate and rank the strengths of bimolecular interactions. The smaller the Kd value, the greater the binding affinity of the ligand for its target. Conversely, the larger the Kd value, the more weakly the target molecule and ligand are attracted to and bind to one another.
Binding affinity is influenced by non-covalent intermolecular interactions, such as hydrogen bonding, electrostatic
interactions, hydrophobic interactions, and van der Waals forces between the two molecules. The presence of other
molecules can also affect the binding affinity between a ligand and its target.
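For intuition, the dissociation constant maps onto a standard binding free energy through ΔG° = RT ln(Kd); a quick illustrative calculation (values chosen arbitrarily) is shown below:
import math

R = 8.314    # gas constant, J / (mol K)
T = 298.15   # temperature, K
Kd = 10e-9   # a 10 nM dissociation constant (illustrative)

# More negative free energy corresponds to tighter binding
delta_G = R * T * math.log(Kd)             # J/mol
print(round(delta_G / 1000, 1), 'kJ/mol')  # about -45.7 kJ/mol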
Binding specificity refers to the selectivity of a ligand for binding to a particular site or target. Highly specific ligands will
bind tightly and selectively to their intended target, while less specific ligands may bind to multiple targets with varying
affinities. It is determined by the complementarity between the ligand and the binding site, including factors such as
size, shape, charge, and hydrophobicity (see section above on ligands). Specific interactions, like hydrogen bonding and
ionic interactions, contribute to the selectivity of the binding.
Binding specificity is crucial in various biological processes, such as enzymatic reactions or drug-target interactions, as it
allows for specific and regulated interactions, which is essential for the proper functioning of biological systems.
Antigen-antibody interactions are an example of particularly high binding specificity (often also accompanied by high affinity). The specificity of these interactions is fundamental to ensure precise immune recognition and response. Kyowa Kirin. "Specificity of Antibodies".
In summary, binding affinity measures the strength of the interaction between a ligand and its target, while binding
specificity determines the selectivity of the ligand for a particular binding site or target.
Thermodynamics of Binding
The thermodynamics of binding involves the interplay of enthalpy (ΔH), entropy (ΔS), and Gibbs free energy (ΔG) to describe the binding of ligands to binding sites. These thermodynamic parameters are crucial in understanding the binding process and the forces involved.
Enthalpy (ΔH) is a measure of the total energy change during a process. In the context of binding, enthalpy represents the energy change associated with the formation of the ligand-binding site complex. A negative enthalpy change indicates that the binding process is exothermic, meaning that heat is released during binding. Conversely, a positive enthalpy change indicates an endothermic process, where heat is absorbed during binding.
Entropy (ΔS) measures the disorder or randomness of a system. In binding, entropy represents the change in disorder associated with the formation of the ligand-binding site complex. A negative entropy change indicates a decrease in disorder, which is often associated with the formation of a more ordered complex. Conversely, a positive entropy change indicates an increase in disorder, which can be seen in the disruption of the binding site or the ligand.
Gibbs free energy (ΔG) is a measure of the energy change during a process that takes into account both enthalpy and entropy. It is defined as ΔG = ΔH - TΔS, where T is the temperature in Kelvin. Gibbs free energy represents the energy available for work during a process. In binding, a negative Gibbs free energy change indicates that the binding process is spontaneous and favorable, while a positive Gibbs free energy change indicates that the process is non-spontaneous and less favorable.
Calculated variation of the Gibbs free energy (G), entropy (S), enthalpy (H), and heat capacity (Cp) of a given reaction plotted as a function
of temperature (T). The solid curves correspond to a pressure of 1 bar; The dashed curve shows variation in ∆G at 8 GPa. Ghiorso, Mark S.,
Yang, Hexiong and Hazen, Robert M. "Thermodynamics of cation ordering in karrooite (MgTi2O5)" American Mineralogist, vol. 84, no. 9,
1999, pp. 1370-1374. https://doi.org/10.2138/am-1999-0914
Binding isotherms are models that describe the relationship between the concentration of ligand and the
occupancy of binding sites. These isotherms are crucial in understanding the binding process and the forces
involved. One example is the Langmuir isotherm, which assumes that the binding site is homogeneous and the
ligand binds to the site with a single binding constant.
Cooperative binding occurs when the binding of one ligand molecule affects the binding of subsequent ligand
molecules. This can lead to non-linear binding isotherms, where the binding of ligands is enhanced or inhibited by
the presence of other ligands. Cooperative binding is often seen in systems where multiple binding sites are
involved or where the binding site is heterogeneous.
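A small sketch of the Hill-Langmuir equation, theta = [L]^n / (Kd^n + [L]^n), which underlies cooperative binding curves like those in the figure later in this section (parameter values are illustrative):
import numpy as np

def hill_langmuir(L, Kd=10e-6, n=1.0):
    # Fraction of sites occupied at free ligand concentration L
    return L**n / (Kd**n + L**n)

L = np.logspace(-8, -3, 6)      # ligand concentrations from 10 nM to 1 mM
print(hill_langmuir(L, n=1.0))  # non-cooperative binding
print(hill_langmuir(L, n=2.0))  # positive cooperativity gives a steeper curve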
Kinetics of Binding
The kinetics of binding involves the study of the rates at which ligands bind to and dissociate from binding sites.
Rate constants for association and dissociation are essential in describing the kinetics of binding. They represent
the rate at which the ligand binds or dissociates to the site, respectively.
Kinetic models are used to describe the binding process. One commonly used kinetic model to describe enzyme
kinetics is the Michaelis-Menten model, which assumes that the enzyme has a single binding site and that the
binding of the substrate is reversible. There are other kinetic models, including the Langmuir adsorption model.
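For example, the Michaelis-Menten rate law v = Vmax [S] / (Km + [S]) can be sketched as follows (values are illustrative):
def michaelis_menten(S, Vmax=1.0, Km=5.0):
    # Initial reaction rate as a function of substrate concentration S
    return Vmax * S / (Km + S)

for S in [1.0, 5.0, 50.0]:
    print(S, round(michaelis_menten(S), 3))  # the rate approaches Vmax as S >> Km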
Binding curves for three ligands following the Hill-Langmuir model, each with a Kd (equilibrium dissociation constant) of 10 µM for its target protein. The blue ligand shows negative cooperativity of binding, meaning that binding of the first ligand reduces the binding affinity of the remaining site(s) for binding of a second ligand. The red ligand shows positive cooperativity of binding, meaning that binding of the first ligand increases the binding affinity of the remaining site(s) for binding of a second ligand. "Hill-Langmuir equation". Open Educational Alberta. https://openeducationalberta.ca/abcofpkpd/chapter/hill-langmuir/.
Protein Binding Sites: These are regions on a protein where other molecules can bind. They can be further divided
into:
Active Sites: Regions where enzymes bind substrates and catalyze chemical reactions. Example: The active site
of the enzyme hexokinase binds to glucose and ATP, catalyzing the phosphorylation of glucose.
Allosteric Sites: Regions where ligands bind and alter the protein's activity without being part of the active site.
Example: The binding of 2,3-bisphosphoglycerate (2,3-BPG) to hemoglobin enhances the ability of hemoglobin
to release oxygen where it is most needed.
Regulatory Sites: Regions where ligands bind and regulate protein activity or localization. Example: Binding of a
regulatory protein to a specific site on a receptor can modulate the receptor's activity.
Nucleic Acid Binding Sites: These are regions on DNA or RNA where other molecules can bind. They can be further
divided into:
Transcription Factor Binding Sites: Regions where transcription factors bind to regulate gene expression.
Example: The TATA box is a DNA sequence that transcription factors bind to initiate transcription.
Restriction Sites: Regions where restriction enzymes bind to cleave DNA. Example: The EcoRI restriction enzyme
recognizes and cuts the DNA sequence GAATTC.
Recombination Sites: Regions where site-specific recombinases bind to facilitate genetic recombination.
Example: The loxP sites are recognized by the Cre recombinase enzyme to mediate recombination.
Small Molecule Binding Sites: These are regions on proteins or nucleic acids where small molecules like drugs or
substrates bind. They can be further divided into:
Compound Binding Sites: Regions typically located at the active site of the enzyme where the substrate binds
and undergoes a chemical reaction, usually reversible. Example: The binding site for the drug aspirin on the
enzyme cyclooxygenase (COX) inhibits its activity.
Cofactor Binding Sites: Regions where cofactors can bind, sometimes permanently and covalently attached to
the protein, and can be located at various sites. Example: The binding site for the heme cofactor in hemoglobin,
which is essential for oxygen transport.
Ion and Water Binding Sites: These are regions on proteins or nucleic acids where ions or water molecules bind. Example: The calcium-binding sites in calmodulin, which are crucial for its role in signal transduction.
Quantum Mechanics/Molecular Mechanics (QM/MM) Methods: Combines quantum mechanical calculations for the active site with molecular mechanical calculations for the rest of the system. Used to study reaction mechanisms, electronic properties, and the role of metal ions in binding sites. Software: Gaussian, ORCA, Q-Chem. Typical outputs: detailed electronic structure information, reaction pathways, energy profiles.
Machine Learning and AI: Uses machine learning and AI techniques to predict binding affinities, identify binding sites, and generate new ligand structures. Used for enhancing the accuracy of docking predictions, predicting drug-target interactions, and designing novel compounds. Software: DeepChem, TensorFlow, PyTorch, protein language models (ESM2). Typical outputs: predictive models, binding affinity predictions, novel ligand designs.
Recently, large language models, particularly protein language models (PLMs), have emerged as powerful tools for
predicting protein properties. These models typically use a transformer-based architecture to process protein
sequences, learning relationships between amino acids and protein properties. PLMs can then be fine-tuned for specific
tasks, such as binding site prediction, reducing the need for large, specific training datasets and offering high scalability.
BindingPocketFinder: This is an abstract superclass in DeepChem that provides a template for child classes to
algorithmically locate potential binding pockets on proteins. The idea is to help identify regions of the protein that
may be good interaction sites for ligands or other molecules.
ConvexHullPocketFinder: This is a specific implementation of the BindingPocketFinder class that uses the convex
hull of the protein structure to find potential binding pockets. It takes in a protein structure and returns a list of
binding pockets represented as CoordinateBoxes.
Pose generators: Pose generation is the task of finding a “pose”, that is a geometric configuration of a small
molecule interacting with a protein. A key step in computing the binding free energy of two complexes is to find low
energy “poses”, that is energetically favorable conformations of molecules with respect to each other. This can be
useful for identifying favorable binding modes and orientations (low energy poses) of ligands within a protein's
binding site. Current implementations allow for Autodock Vina and GNINA.
Docking: There is a generic docking implementation that relies on the provided pose generation and pose scoring utilities to perform docking.
There is a tutorial on using machine learning and molecular docking methods to predict the binding energy of a protein-
ligand complex, and another tutorial on using atomic convolutions in particular to model such interactions.
# setup
!pip install py3Dmol biopython requests
import requests
import py3Dmol
from Bio.PDB import PDBParser, NeighborSearch
from io import StringIO
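The cell that downloads the structure is not reproduced above; a hedged sketch of how pdb_content could be fetched is given below. The tutorial's actual PDB entry is not specified here, so 1IEP (an Abl kinase/imatinib complex containing CL, HOH, and STI) is used purely as an illustration:
# Hypothetical example: fetch a PDB entry that contains the STI (imatinib) ligand
pdb_id = "1IEP"
response = requests.get(f"https://files.rcsb.org/download/{pdb_id}.pdb")
pdb_content = response.text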
We have a homodimer of two identical chains and several ligands: CL, HOH, and STI. CL (chloride) and HOH (water) are common ions and solvent molecules found bound to protein structures, while STI is the ligand of interest (imatinib).
For visualization, let's extract the information of chain A and its ligands:
chain_id = "A"
chain_lines = []
for line in pdb_content.splitlines():
if line.startswith("HETATM") or line.startswith("ATOM"):
if line[21] == chain_id:
chain_lines.append(line)
elif line.startswith("TER"):
if chain_lines and chain_lines[-1][21] == chain_id:
chain_lines.append(line)
chain_A = "\n".join(chain_lines)
view = py3Dmol.view()
view.addModel(chain_A, 'pdb')                      # load only chain A and its ligands
view.setStyle({'cartoon': {'color': 'spectrum'}})  # draw the protein as a ribbon
view.show()
Now let's highlight the binding pocket in the protein ribbon. For this purpose we need to parse the PDB content again to
identify residues belonging to chain A and STI ligand:
parser = PDBParser(QUIET=True)
structure = parser.get_structure('protein', StringIO(pdb_content))
# Atoms of chain A and of its bound STI (imatinib) ligand
chain_A_atoms = [atom for atom in structure[0]['A'].get_atoms()]
sti_atoms = [atom for res in structure[0]['A'] if res.get_resname() == 'STI' for atom in res]
# Residues with any atom within 5 A of the ligand are taken as the binding pocket
distance_threshold = 5.0
ns = NeighborSearch(chain_A_atoms)
binding_residues = {nb.get_parent() for atom in sti_atoms
                    for nb in ns.search(atom.coord, distance_threshold)}
And let's see the visualization, now with the binding pocket:
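The original visualization cell is not reproduced here; a hedged py3Dmol sketch of how the pocket residues found with NeighborSearch could be highlighted (style choices are arbitrary) is:
view = py3Dmol.view()
view.addModel(chain_A, 'pdb')
view.setStyle({'cartoon': {'color': 'white'}})                              # protein ribbon
view.addStyle({'resn': 'STI'}, {'stick': {'colorscheme': 'orangeCarbon'}})  # ligand
pocket_ids = [res.get_id()[1] for res in binding_residues]                  # residue numbers
view.addStyle({'resi': pocket_ids}, {'stick': {'color': 'red'}})            # binding pocket
view.zoomTo()
view.show()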
Further Reading
For further reading on computational methods for binding sites and protein language models here are a couple of great
resources:
Exploring the computational methods for protein-ligand binding site prediction
Getting started with protein language models
@manual{Bioinformatics,
title={Introduction to Binding Sites},
organization={DeepChem},
author={Gómez de Lope, Elisa},
howpublished =
{\url{https://github.com/deepchem/deepchem/blob/master/examples/tutorials/Introduction_to_Binding_Sites.i
year={2024},
}
Tutorial Part 13: Modeling Protein-Ligand Interactions
By Nathan C. Frey | Twitter and Bharath Ramsundar | Twitter
In this tutorial, we'll walk you through the use of machine learning and molecular docking methods to predict the
binding energy of a protein-ligand complex. Recall that a ligand is some small molecule which interacts (usually non-
covalently) with a protein. Molecular docking performs geometric calculations to find a “binding pose” with a small
molecule interacting with a protein in a suitable binding pocket (that is, a region on the protein which has a groove in
which the small molecule can rest).
The structure of proteins can be determined experimentally with techniques like Cryo-EM or X-ray crystallography. This
can be a powerful tool for structure-based drug discovery. For more info on docking, read the AutoDock Vina paper and
the deepchem.dock documentation. There are many graphical user and command line interfaces (like AutoDock) for
performing molecular docking. Here, we show how docking can be performed programmatically with DeepChem, which
enables automation and easy integration with machine learning pipelines.
To start the tutorial, we'll use a simple pre-processed dataset file that comes in the form of a gzipped file. Each row is a
molecular system, and each column represents a different piece of information about that system. For instance, in this
example, every row reflects a protein-ligand complex, and the following columns are present: a unique complex
identifier; the SMILES string of the ligand; the binding affinity (Ki) of the ligand to the protein in the complex; a Python
list of all lines in a PDB file for the protein alone; and a Python list of all lines in a ligand file for the ligand alone.
Colab
This tutorial and the rest in this sequence are designed to be done in Google colab. If you'd like to open this notebook in
colab, you can use the following link.
Open in Colab
Setup
To run DeepChem within Colab, you'll need to run the following cell of installation commands. This will take about 5
minutes to run to completion and install your environment.
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the syst
em package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
✨ ✨ Everything looks OK!
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the syst
em package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
import os
import numpy as np
import pandas as pd
import tempfile

import deepchem as dc
from deepchem.utils import download_url, load_from_disk
Skipped loading modules with pytorch-geometric dependency, missing a dependency. No module named 'torch_geometri
c'
Skipped loading modules with pytorch-geometric dependency, missing a dependency. cannot import name 'DMPNN' from
'deepchem.models.torch_models' (/usr/local/lib/python3.10/site-packages/deepchem/models/torch_models/__init__.py
)
Skipped loading modules with pytorch-lightning dependency, missing a dependency. No module named 'pytorch_lightn
ing'
Skipped loading some Jax models, missing a dependency. No module named 'haiku'
To illustrate the docking procedure, here we'll use a csv that contains SMILES strings of ligands as well as PDB files for
the ligand and protein targets from PDBbind. Later, we'll use the labels to train a model to predict binding affinities.
We'll also show how to download and featurize PDBbind to train a model from scratch.
data_dir = dc.utils.get_data_dir()
dataset_file = os.path.join(data_dir, "pdbbind_core_df.csv.gz")
if not os.path.exists(dataset_file):
print('File does not exist. Downloading file...')
download_url("https://s3-us-west-1.amazonaws.com/deepchem.io/datasets/pdbbind_core_df.csv.gz")
print('File downloaded...')
raw_dataset = load_from_disk(dataset_file)
raw_dataset = raw_dataset[['pdb_id', 'smiles', 'label']]
raw_dataset.head(2)
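The next cell assumes a single complex has already been selected and that PDBFixer, OpenMM, and prepare_inputs are importable; a hedged sketch of that setup is shown below (the original notebook imported prepare_inputs from a now-deprecated location, hence the warning after the cell; deepchem.utils.docking_utils is the suggested replacement, and the 3cyx selection matches the output shown):
from pdbfixer import PDBFixer
from openmm.app import PDBFile
from deepchem.utils.docking_utils import prepare_inputs

# Pick one protein-ligand pair from the dataset; 3cyx appears in the output below
pdbid = '3cyx'
ligand = raw_dataset.loc[raw_dataset['pdb_id'] == pdbid, 'smiles'].values[0]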
%%time
fixer = PDBFixer(pdbid=pdbid)
PDBFile.writeFile(fixer.topology, fixer.positions, open('%s.pdb' % (pdbid), 'w'))
p, m = None, None
# fix protein, optimize ligand geometry, and sanitize molecules
try:
p, m = prepare_inputs('%s.pdb' % (pdbid), ligand)
except:
print('%s failed PDB fixing' % (pdbid))
<timed exec>:7: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding fun
ction in deepchem.utils.docking_utils.
Warning: importing 'simtk.openmm' is deprecated. Import 'openmm' instead.
3cyx 1510
CPU times: user 2.04 s, sys: 157 ms, total: 2.2 s
Wall time: 4.32 s
Visualization
If you're outside of Colab, you can expand these cells and use MDTraj and nglview to visualize proteins and ligands.
import mdtraj as md
import nglview
Let's take a look at the first protein ligand pair in our dataset:
protein_mdtraj = md.load_pdb('3cyx.pdb')
ligand_mdtraj = md.load_pdb('ligand_3cyx.pdb')
We'll use the convenience function nglview.show_mdtraj in order to view our proteins and ligands. Note that this will
only work if you uncommented the above cell, installed nglview, and enabled the necessary notebook extensions.
v = nglview.show_mdtraj(ligand_mdtraj)
NGLWidget()
Now that we have an idea of what the ligand looks like, let's take a look at our protein:
view = nglview.show_mdtraj(protein_mdtraj)
display(view) # interactive view outside Colab
NGLWidget()
Molecular Docking
Ok, now that we've got our data and basic visualization tools up and running, let's see if we can use molecular docking
to estimate the binding affinities between our protein ligand systems.
There are three steps to setting up a docking job, and you should experiment with different settings. The three things
we need to specify are 1) how to identify binding pockets in the target protein; 2) how to generate poses (geometric
configurations) of a ligand in a binding pocket; and 3) how to "score" a pose. Remember, our goal is to identify
candidate ligands that strongly interact with a target protein, which is reflected by the score.
DeepChem has a simple built-in method for identifying binding pockets in proteins. It is based on the convex hull
method. The method works by creating a 3D polyhedron (convex hull) around a protein structure and identifying the
surface atoms of the protein as the ones closest to the convex hull. Some biochemical properties are considered, so the
method is not purely geometrical. It has the advantage of having a low computational cost and is good enough for our
purposes.
finder = dc.dock.binding_pocket.ConvexHullPocketFinder()
pockets = finder.find_pockets('3cyx.pdb')
len(pockets) # number of identified pockets
36
Pose generation is quite complex. Luckily, using DeepChem's pose generator will install the AutoDock Vina engine under
the hood, allowing us to get up and running generating poses quickly.
vpg = dc.dock.pose_generation.VinaPoseGenerator()
We could specify a pose scoring function from deepchem.dock.pose_scoring , which includes things like repulsive and
hydrophobic interactions and hydrogen bonding. Vina will take care of this, so instead we'll allow Vina to compute scores
for poses.
!mkdir -p vina_test
%%time
complexes, scores = vpg.generate_poses(molecular_complex=('3cyx.pdb', 'ligand_3cyx.pdb'), # protein-ligand files for
out_dir='vina_test',
generate_scores=True
)
CPU times: user 41min 4s, sys: 21.9 s, total: 41min 26s
Wall time: 28min 32s
/usr/local/lib/python3.10/site-packages/vina/vina.py:260: DeprecationWarning: `np.int` is a deprecated alias for
the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is
safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If yo
u wish to review your current use, check the release note link for additional information.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#dep
recations
self._voxels = np.ceil(np.array(box_size) / self._spacing).astype(np.int)
We used the default value for num_modes when generating poses, so Vina will return the 9 lowest energy poses it found
in units of kcal/mol .
scores
Can we view the complex with both protein and ligand? Yes, but we'll need to combine the molecules into a single RDkit
molecule.
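That combination cell is not shown here; one way to do it, assuming the complexes returned by generate_poses above are (protein, ligand) pairs of RDKit molecules, is sketched below:
from rdkit import Chem

# Merge the first posed protein-ligand pair into a single molecule for visualization
protein_mol, ligand_mol = complexes[0]
complex_mol = Chem.CombineMols(protein_mol, ligand_mol)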
Let's now visualize our complex. We can see that the ligand slots into a pocket of the protein.
v = nglview.show_rdkit(complex_mol)
display(v)
NGLWidget()
Now that we understand each piece of the process, we can put it all together using DeepChem's Docker class. Docker
creates a generator that yields tuples of posed complexes and docking scores.
docker = dc.dock.docking.Docker(pose_generator=vpg)
posed_complex, score = next(docker.dock(molecular_complex=('3cyx.pdb', 'ligand_3cyx.pdb'),
use_pose_generator_scores=True))
Next, we'll need a way to transform our protein-ligand complexes into representations which can be used by learning
algorithms. Ideally, we'd have neural protein-ligand complex fingerprints, but DeepChem doesn't yet have a good
learned fingerprint of this sort. We do however have well-tuned manual featurizers that can help us with our challenge
here.
We'll make use of two types of fingerprints in the rest of the tutorial, the CircularFingerprint and
ContactCircularFingerprint . DeepChem also has voxelizers and grid descriptors that convert a 3D volume
containing an arrangement of atoms into a fingerprint. These featurizers are really useful for understanding protein-ligand
complexes since they allow us to translate complexes into vectors that can be passed into a simple machine learning
algorithm. First, we'll create circular fingerprints. These convert small molecules into a vector of fragments.
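As a quick illustration of the first featurizer (a sketch; the fingerprint size is an arbitrary choice):
# ECFP-style circular fingerprints computed from the ligand SMILES strings
ligand_featurizer = dc.feat.CircularFingerprint(size=2048)
ligand_features = ligand_featurizer.featurize(raw_dataset['smiles'].values[:2])
print(ligand_features.shape)  # (2, 2048)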
pdbids = raw_dataset['pdb_id'].values
ligand_smiles = raw_dataset['smiles'].values
%%time
for (pdbid, ligand) in zip(pdbids, ligand_smiles):
    fixer = PDBFixer(url='https://files.rcsb.org/download/%s.pdb' % (pdbid))
    PDBFile.writeFile(fixer.topology, fixer.positions, open('%s.pdb' % (pdbid), 'w'))
    p, m = None, None
    # skip pdb fixing for speed
    try:
        p, m = prepare_inputs('%s.pdb' % (pdbid), ligand, replace_nonstandard_residues=False,
                              remove_heterogens=False, remove_water=False,
                              add_hydrogens=False)
    except:
        print('%s failed sanitization' % (pdbid))
<timed exec>:8: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding function in deepchem.utils.docking_utils.
[... the DeprecationWarning above, along with assorted RDKit UFFTYPER warnings ("Unrecognized atom type: S_5+4", "hybridization set to SP3 for atom N") and occasional "Explicit valence ... is greater than permitted" errors, repeats for each complex; the repeated lines are omitted here ...]
3cyx failed sanitization
3utu failed sanitization
1hfs failed sanitization
CPU times: user 4min 9s, sys: 3.31 s, total: 4min 12s
Wall time: 8min 19s
We'll do some clean up to make sure we have a valid ligand file for every valid protein. The lines here will compare the
PDB IDs between the ligand and protein files and remove any proteins that don't have corresponding ligands.
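A hedged sketch of that clean-up is shown below; the 'protein_<pdbid>.pdb' / 'ligand_<pdbid>.pdb' naming convention is an assumption for illustration, and the pair of counts printed by the original notebook appears underneath.
import os

# prepared files written by the loop above (naming convention assumed)
protein_files = [f for f in os.listdir('.') if f.startswith('protein_') and f.endswith('.pdb')]
ligand_files = [f for f in os.listdir('.') if f.startswith('ligand_') and f.endswith('.pdb')]

protein_ids = {f[len('protein_'):-len('.pdb')] for f in protein_files}
ligand_ids = {f[len('ligand_'):-len('.pdb')] for f in ligand_files}

# drop any protein lacking a ligand, and any ligand lacking a protein
for pdbid in protein_ids - ligand_ids:
    os.remove('protein_%s.pdb' % pdbid)
for pdbid in ligand_ids - protein_ids:
    os.remove('ligand_%s.pdb' % pdbid)

# recount after clean-up
len([f for f in os.listdir('.') if f.startswith('protein_') and f.endswith('.pdb')]), \
len([f for f in os.listdir('.') if f.startswith('ligand_') and f.endswith('.pdb')])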
(190, 190)
fp_featurizer = dc.feat.CircularFingerprint(size=2048)
The convenience loader dc.molnet.load_pdbbind will take care of downloading and featurizing the pdbbind dataset
under the hood for us. This will take quite a bit of time and compute, so the code to do it is commented out. Uncomment
it and grab a cup of coffee if you'd like to featurize all of PDBbind's refined set. Otherwise, you can continue with the
small dataset we constructed above.
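For reference, a commented-out sketch of what that loader call could look like (argument names follow the dc.molnet.load_pdbbind usage later in this tutorial; treat the exact values as assumptions).
# Uncomment to download and featurize the full PDBbind refined set (slow).
# tasks, datasets, transformers = dc.molnet.load_pdbbind(featurizer=fp_featurizer,
#                                                        set_name='refined',
#                                                        save_dir='.',
#                                                        data_dir='.')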
To fit a DeepChem model, first we instantiate one of the provided (or user-written) model classes. In this case, we have created a convenience class that wraps any ML model available in scikit-learn so it can interoperate with DeepChem. To instantiate an SklearnModel , you will need (a) task_types, (b) model_params (another dict, as illustrated below), and (c) a model_instance defining the type of model you would like to fit, in this case a RandomForestRegressor .
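Here is a minimal, hedged sketch of what fitting and evaluating such a model might look like, assuming the featurized train_dataset and test_dataset splits from above and the Pearson R2 metric used later in this tutorial.
from sklearn.ensemble import RandomForestRegressor
import deepchem as dc

# wrap a scikit-learn regressor so it can consume DeepChem datasets
sklearn_model = RandomForestRegressor(n_estimators=100)
model = dc.models.SklearnModel(sklearn_model)
model.fit(train_dataset)

metric = dc.metrics.Metric(dc.metrics.pearson_r2_score)
print('train:', model.evaluate(train_dataset, [metric]))
print('test:', model.evaluate(test_dataset, [metric]))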
A low R2 score for the test set indicates that the model isn't producing meaningful outputs. It turns out that predicting binding
affinities is hard. This tutorial isn't meant to show how to create a state-of-the-art model for predicting binding affinities,
but it gives you the tools to generate your own datasets with molecular docking, featurize complexes, and train models.
We're using a very small dataset and an overly simplistic representation, so it's no surprise that the test set
performance is quite bad.
[(6.862549999999994, 7.4),
(6.616400000000008, 6.85),
(4.852004999999995, 3.4),
(6.43060000000001, 6.72),
(8.66322999999999, 11.06)]
list(zip(model.predict(test_dataset), test_dataset.y))[:5]
[(5.960549999999999, 4.21),
(6.051305714285715, 8.7),
(5.799900000000003, 6.39),
(6.433881666666665, 4.94),
(6.7465399999999995, 9.21)]
fp_featurizer = dc.feat.ContactCircularFingerprint(size=2048)
metric = dc.metrics.Metric(dc.metrics.pearson_r2_score)
Ok, it looks like we have lower accuracy than with the ligand-only model. Nonetheless, it's probably still useful to have a protein-ligand model, since it's likely to learn different features than the pure ligand-only model.
Further reading
So far we have used DeepChem's docking module with the AutoDock Vina backend to generate docking scores for the
PDBbind dataset. We trained a simple machine learning model to directly predict binding affinities, based on featurizing
the protein-ligand complexes. We might want to try more sophisticated docking protocols, like the deep learning
framework gnina. You can read more about using convolutional neural nets for protein-ligand scoring here. And here is a
review of machine learning-based scoring functions.
This DeepChem tutorial introduces the Atomic Convolutional Neural Network. We'll see the structure of the
AtomicConvModel and write a simple program to run Atomic Convolutions.
ACNN Architecture
ACNN’s directly exploit the local three-dimensional structure of molecules to hierarchically learn more complex chemical
features by optimizing both the model and featurization simultaneously in an end-to-end fashion.
The atom type convolution makes use of a neighbor-listed distance matrix to extract features encoding local chemical
environments from an input representation (Cartesian atomic coordinates) that does not necessarily contain spatial
locality. The following methods are used to build the ACNN architecture:
Distance Matrix
The distance matrix R is constructed from the Cartesian coordinate matrix X and the neighbor list L. The matrix R has shape (N, M), where N is the number of atoms and M is the maximum number of neighbors per atom.
Atom Type Convolution
The output of the atom type convolution is constructed from the distance matrix R and the atomic number matrix Z. R is fed into a (1x1) filter with stride 1 and depth N_at, where N_at is the number of unique atomic numbers (atom types) present in the molecular system. The atom type convolution kernel is a step function that operates on the neighbor distance matrix R.
Radial Pooling Layer
Radial pooling down-samples the output of the atom type convolution, pooling over slices of size (1, M, 1) with stride 1 and depth N_r, where N_r is the number of radial filters. This reduces the number of parameters and provides an abstracted representation that helps prevent overfitting.
Atomistic Fully Connected Network
Atomic convolution layers can be stacked by feeding the flattened (N, N_at x N_r) output of the radial pooling layer into the atom type convolution operation. Finally, we feed the tensor row-wise (per-atom) into a fully-connected network. The same fully connected weights and biases are used for each atom in a
given molecule.
Now that we have seen the structural overview of ACNNs, we'll try to get deeper into the model and see how we can
train it and what we expect as the output.
For the training, we will use the publicly available PDBbind dataset. In this example, every row reflects a protein-ligand complex and the target is the binding affinity (Ki) of the ligand to the protein in the complex.
Colab
This tutorial and the rest in this sequence are designed to be done in Google colab. If you'd like to open this notebook in
colab, you can use the following link.
Open in Colab
Setup
To run DeepChem within Colab, you'll need to run the following cell of installation commands. This will take about 5
minutes to run to completion and install your environment.
!/usr/local/bin/conda install -c conda-forge pycosat mdtraj pdbfixer openmm -y -q # needed for AtomicConvs
import deepchem as dc
import os
import numpy as np
import tensorflow as tf
acf = AtomicConvFeaturizer(frag1_num_atoms=f1_num_atoms,
frag2_num_atoms=f2_num_atoms,
complex_num_atoms=f1_num_atoms+f2_num_atoms,
max_num_neighbors=max_num_neighbors,
neighbor_cutoff=4)
load_pdbbind allows us to specify if we want to use the entire protein or only the binding pocket ( pocket=True ) for
featurization. Using only the pocket saves memory and speeds up the featurization. We can also use the "core" dataset
of ~200 high-quality complexes for rapidly testing our model, or the larger "refined" set of nearly 5000 complexes for
more datapoints and more robust training/validation. On Colab, it takes only a minute to featurize the core PDBbind set!
This is pretty incredible, and it means you can quickly experiment with different featurizations and model architectures.
%%time
tasks, datasets, transformers = load_pdbbind(featurizer=acf,
save_dir='.',
data_dir='.',
pocket=True,
reload=False,
set_name='core')
Unfortunately, if you try to use the "refined" dataset, there are some complexes that cannot be featurized. To resolve this issue, rather than increasing complex_num_atoms , simply omit the rows of the dataset that have an x value of None .
class MyTransformer(dc.trans.Transformer):
    def transform_array(self, x, y, w, ids):
        # keep only the rows whose featurization succeeded (x is not None)
        kept_rows = x != None
        return x[kept_rows], y[kept_rows], w[kept_rows], ids[kept_rows]
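If you do work with the refined set, the transformer could be applied to each split roughly as follows (a sketch; it assumes Dataset.transform accepts a Transformer instance and that transform_X=True is the right flag).
# Drop the rows whose featurization failed before training.
datasets = tuple(d.transform(MyTransformer(transform_X=True)) for d in datasets)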
datasets
(<DiskDataset X.shape: (154, 9), y.shape: (154,), w.shape: (154,), ids: ['1mq6' '3pe2' '2wtv' ... '3f3c' '4gqq'
'2x00'], task_names: [0]>,
<DiskDataset X.shape: (19, 9), y.shape: (19,), w.shape: (19,), ids: ['3ivg' '4de1' '4tmn' ... '2vw5' '1w3l' '2
zjw'], task_names: [0]>,
<DiskDataset X.shape: (20, 9), y.shape: (20,), w.shape: (20,), ids: ['1kel' '2w66' '2xnb' ... '2qbp' '3lka' '1
qi0'], task_names: [0]>)
acm = AtomicConvModel(n_tasks=1,
frag1_num_atoms=f1_num_atoms,
frag2_num_atoms=f2_num_atoms,
complex_num_atoms=f1_num_atoms+f2_num_atoms,
max_num_neighbors=max_num_neighbors,
batch_size=12,
layer_sizes=[32, 32, 16],
learning_rate=0.003,
)
%%time
max_epochs = 50
metric = dc.metrics.Metric(dc.metrics.score_function.rms_score)
step_cutoff = len(train)//12
def val_cb(model, step):
    if step%step_cutoff!=0:
        return
    val_losses.append(model.evaluate(val, metrics=[metric])['rms_score']**2)  # L2 Loss
    losses.append(model.evaluate(train, metrics=[metric])['rms_score']**2)  # L2 Loss
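The timing below comes from a training loop along these lines (a hedged sketch: it assumes the losses and val_losses lists and fits one epoch at a time so the callback above gets a chance to record losses).
losses, val_losses = [], []
for epoch in range(max_epochs):
    # fit a single epoch at a time; val_cb periodically records train/val RMS^2
    acm.fit(train, nb_epoch=1, max_checkpoints_to_keep=1, callbacks=[val_cb])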
CPU times: user 2min 41s, sys: 11.4 s, total: 2min 53s
Wall time: 2min 47s
The loss curves are not exactly smooth, which is unsurprising because we are using 154 training and 19 validation
datapoints. Increasing the dataset size may help with this, but will also require greater computational resources.
import matplotlib.pyplot as plt

f, ax = plt.subplots()
ax.scatter(range(len(losses)), losses, label='train loss')
ax.scatter(range(len(val_losses)), val_losses, label='val loss')
plt.legend(loc='upper right');
The ACNN paper reported a Pearson R2 score of 0.912 and 0.448 for a random 80/20 split of the PDBbind core train/test sets. Here, we've used an 80/10/10 training/validation/test split and achieved similar performance for the training set (0.943). We can see from the
performance on the training, validation, and test sets (and from the results in the paper) that the ACNN can learn
chemical interactions from small training datasets, but struggles to generalize. Still, it is pretty amazing that we can
train an AtomicConvModel with only a few lines of code and start predicting binding affinities!
From here, you can experiment with different hyperparameters, more challenging splits, and the "refined" set of
PDBbind to see if you can reduce overfitting and come up with a more robust model.
score = dc.metrics.Metric(dc.metrics.score_function.pearson_r2_score)
for tvt, ds in zip(['train', 'val', 'test'], datasets):
    print(tvt, acm.evaluate(ds, metrics=[score]))
Further reading
We have explored the ACNN architecture and used the PDBbind dataset to train an ACNN to predict protein-ligand
binding energies. For more information, read the original paper that introduced ACNNs: Gomes, Joseph, et al. "Atomic
convolutional networks for predicting protein-ligand binding affinity." arXiv preprint arXiv:1703.10603 (2017). There are
many other methods and papers on predicting binding affinities. Here are a few interesting ones to check out:
predictions using only ligands or proteins, molecular docking with deep learning, and AtomNet.
In this tutorial, we explore a potential use case where we combine the capabilities of AlphaFold and DeepChem. AlphaFold2 has made immense strides in predicting protein structure folding without the use of costly lab equipment, and DeepChem comprises a repertoire of easy-to-use modules which can then be applied to these protein structures for further analysis. In the first part of our tutorial we will predict the protein structure from a given protein sequence. Then, in the second part of our tutorial, we sample a few ligands from the protein-ligand complex dataset (PDBbind) and perform programmatic docking to estimate binding affinities between our protein and a number of ligands.
This tutorial is meant to be run in Google Colab. You can follow the link below to open this notebook in Colab.
Open in Colab
Setup
We start off with all the installations and configurations. If you would like to skip reading this part, you can head over to the Input Query section below.
We will first install DeepChem, as a runtime restart might be required. Along with that, let's also install condacolab, vina, and pdbfixer, which will be used in later parts of this tutorial.
Part 1: Predict Protein Structure in pdb format from a given input sequence
Note: The cells up to Part 2 of this tutorial are taken directly from ColabFold's Google Colab implementation and have been further annotated for this tutorial.
For more details, checkout the ColabFold GitHub and read the below manuscript.
Mirdita M, Schütze K, Moriwaki Y, Heo L, Ovchinnikov S, Steinegger M. ColabFold: Making protein folding accessible to
all. Nature Methods, 2022
We then create a job-name folder and store all the sequences as queries in a text file which we will reference later when calling model inference. AlphaFold is structured to interact with a job directory to process the inputs and templates and to save the outputs. Hence, it is important to create a unique job folder.
#@title Input protein sequence(s), then hit `Runtime` -> `Run all`
from google.colab import files
import os
import re
import hashlib
import random
AlphaFold uses templates from a vast protein structure database to guide its predictions. It aligns the target protein's
sequence with similar sequences from the database to generate structural constraints, aiding in the accurate prediction
of the protein's 3D structure.
none = no template information is used. In this case the sequence alignment step is skipped and prediction is made
without it.
pdb100 = templates are detected from the pdb100 database, a version of the Protein Data Bank (PDB) clustered at 100% sequence identity (i.e., deduplicated) that is commonly used for template search in structural biology and bioinformatics research.
custom = the user has the option to upload their own templates database to search on (PDB or mmCIF format)
Let's specify a hashing function which we will use to create a job name. Hashing is mainly used to shorten the folder name so that we get a reasonable and unique job name every time we run the notebook.
def add_hash(x,y):
    return x+"_"+hashlib.sha1(y.encode()).hexdigest()[:5]
# remove whitespaces
query_sequence = "".join(query_sequence.split())
basejobname = "".join(jobname.split())
basejobname = re.sub(r'\W+', '', basejobname)
jobname = add_hash(basejobname, query_sequence)
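As a quick illustration, after running the cell above you can print the resulting job name; the five-character suffix comes from the SHA-1 hash of the query sequence, so the exact value will differ per sequence.
# e.g. something like "myjob_1a2b3"
print(jobname)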
We then create the directory where we will store our protein sequence (query) and the respective files which will be generated in later parts of this tutorial. We also define a check function below which prevents us from creating duplicate directories.
Based on the template_mode specified earlier, we automatically adjust a few parameters such as use_templates and
custom_template_path. If the template_mode is custom, we will create an additional directory for it.
if template_mode == "pdb100":
    use_templates = True
    custom_template_path = None
elif template_mode == "custom":
    custom_template_path = os.path.join(jobname,f"template")
    os.makedirs(custom_template_path, exist_ok=True)
    uploaded = files.upload()
    use_templates = True
    for fn in uploaded.keys():
        os.rename(fn,os.path.join(custom_template_path,fn))
else:
    custom_template_path = None
    use_templates = False
Install dependencies
Based on the parameters mentioned above we will respectively install the dependencies with the code below.
1. First we have to install the latest version of ColabFold from their github repo. After a successful installation, a file
called COLABFOLD_READY will be created to mark its completion.
%%time
import os
USE_AMBER = use_amber
USE_TEMPLATES = use_templates
PYTHON_VERSION = python_version
if not os.path.isfile("COLABFOLD_READY"):
    print("installing colabfold...")
    os.system("pip install -q --no-warn-conflicts 'colabfold[alphafold-minus-jax] @ git+https://github.com/sokrypton/Co
    os.system("pip install --upgrade dm-haiku")
    os.system("ln -s /usr/local/lib/python3.*/dist-packages/colabfold colabfold")
    os.system("ln -s /usr/local/lib/python3.*/dist-packages/alphafold alphafold")
    # patch for jax > 0.3.25
    os.system("sed -i 's/weights = jax.nn.softmax(logits)/logits=jnp.clip(logits,-1e8,1e8);weights=jax.nn.softmax(logit
    os.system("touch COLABFOLD_READY")
2. Next, if we need amber relaxation or protein templates, we will install mamba, which will help us install further packages.
if USE_AMBER or USE_TEMPLATES:
    if not os.path.isfile("CONDA_READY"):
        print("installing conda...")
        os.system("wget -qnc https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Linux-x86_64.sh
        os.system("bash Mambaforge-Linux-x86_64.sh -bfp /usr/local")
        os.system("mamba config --set auto_update_conda false")
        os.system("touch CONDA_READY")
3. Then, we will install HH-suite for database search / template retrieval and OpenMM for amber relaxation.
HH-suite is a widely used open source software suite for sensitive sequence similarity searches and protein fold recognition.
OpenMM is a high-performance toolkit for molecular simulation. In our tutorial, this toolkit helps us simulate the AMBER force field, which is used to "relax" the positions of atoms with respect to each other in order to remove clashes between them. This helps us better model protein folding in edge cases and offers better refinement of the protein structures.
Now let's specify the different MSA options available.
MSA
Multiple Sequence Alignment (MSA) is a step in which AlphaFold aligns multiple amino acid sequences from different sources that are similar to the input sequence. In this step, AlphaFold2 forms a grid aligning identical amino acids in the same columns and leaving gaps where there are differences. We can choose how to pair the respective MSA with the options: "unpaired_paired" to pair sequences from the same species plus an unpaired MSA, "unpaired" to build a separate MSA for each chain, and "paired" to only use paired sequences. Additionally, we have multiple options for the way AlphaFold searches for the respective sequences, a few of which are mentioned below.
MMseqs2: This is a sequence searching tool which finds sequences similar to our input sequence in a large database.
single_sequence: This option restricts AlphaFold from searching for any similar amino acid sequences and restricts it to using only the given one.
custom: This option lets AlphaFold do the sequence search over a user-defined sequence search space.
#@markdown ### MSA options (custom MSA upload, single sequence, pairing mode)
msa_mode = "mmseqs2_uniref_env" #@param ["mmseqs2_uniref_env", "mmseqs2_uniref","single_sequence","custom"]
pair_mode = "unpaired_paired" #@param ["unpaired_paired","paired","unpaired"] {type:"string"}
#@markdown - "unpaired_paired" = pair sequences from same species + unpaired MSA, "unpaired" = seperate MSA for each
Based on the above MSA parameters, we will set the path to the A3M file. An A3M file (Alignment to Multiple Models) is a type of input file used in the protein structure prediction process which contains multiple sequence alignments (MSAs) of related protein sequences that are used as input data for the AlphaFold model. The A3M file format is an extension of the FASTA file format and can be read about over at FASTA format extension.
Additionally, for the purpose of this tutorial we don't need to get into the details of a custom MSA (where the user inputs their own template database for search).
Advanced settings
Below we can specify more advanced AlphaFold settings. We can choose which model parameters to use from the options given below (i.e., alphafold2, alphafold2_multimer_v1, etc.), the recycle early stop tolerance, saving to Google Drive, and image resolution options. Also note that there is no need to fully understand these parameters; you can just stick to the defaults. But here are a few details about the parameters which can be changed.
model_type: If auto selected, will use alphafold2_ptm for monomer prediction and alphafold2_multimer_v3 for complex
prediction. Any of the mode_types can be used (regardless if input is monomer or complex).
num_recycles: "auto" with other options available as ["auto", "0", "1", "3", "6", "12", "24", "48"]
recycle_early_stop_tolerance: "auto" with other options available as ["auto", "0.0", "0.5", "1.0"]
By "recycling" Alphafold refers to the process of iterative refinement of protein structure prediction to imporve accuracy.
We can also set the maximum length of Multiple Sequence Alignment, Number of Seeds, and whether to use dropout or
not.
max_msa = "auto" with other optinos as ["auto", "512:1024", "256:512", "64:128", "32:64", "16:32"]. Here left is the
minimum and right is the maximum msa length.
if save_to_google_drive:
    from pydrive.drive import GoogleDrive
    from pydrive.auth import GoogleAuth
    from google.colab import auth
    from oauth2client.client import GoogleCredentials
    auth.authenticate_user()
    gauth = GoogleAuth()
    gauth.credentials = GoogleCredentials.get_application_default()
    drive = GoogleDrive(gauth)
    print("You are logged into Google Drive and are good to go!")
Now we will run the prediction model using all the inputs and specifications from above.
We import the respective files from Colabfold's package for inference and plotting, and we then check if we have a
specific GPU. We also define two helper functions, input_features_callback and prediction_callback, which help us visualize the respective input features and prediction results.
import sys
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
from Bio import BiopythonDeprecationWarning
warnings.simplefilter(action='ignore', category=BiopythonDeprecationWarning)
from pathlib import Path
from colabfold.download import download_alphafold_params, default_data_dir
from colabfold.utils import setup_logging
from colabfold.batch import get_queries, run, set_model_type
from colabfold.plot import plot_msa_v2
import os
import numpy as np
try:
    K80_chk = os.popen('nvidia-smi | grep "Tesla K80" | wc -l').read()
except:
    K80_chk = "0"
    pass
if "1" in K80_chk:
    print("WARNING: found GPU Tesla K80: limited to total length < 1000")
if "TF_FORCE_UNIFIED_MEMORY" in os.environ:
    del os.environ["TF_FORCE_UNIFIED_MEMORY"]
if "XLA_PYTHON_CLIENT_MEM_FRACTION" in os.environ:
    del os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"]
Now, let's define some helper functions to visualize the input features and output predictions.
def input_features_callback(input_features):
    if display_images:
        plot_msa_v2(input_features)
        plt.show()
        plt.close()
Let's define our logging environment and retrieve the input queries from our job folder.
result_dir = jobname
log_filename = os.path.join(jobname,"log.txt")
setup_logging(Path(log_filename))
We then store our query_sequence in a csv file inside the jobname directory. This facilitates running multiple queries and provides input in the format expected by ColabFold.
# save queries
queries_path = os.path.join(jobname, f"{jobname}.csv")
with open(queries_path, "w") as text_file:
    text_file.write(f"id,sequence\n{jobname},{query_sequence}")
We utilize the get_queries function, which is a colabfold utility function, to fetch the queries from the directory.
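A minimal sketch of that step is below; it assumes queries_path and a model_type setting from the advanced-settings cell are in scope, and uses the get_queries and set_model_type utilities imported above.
# Read the queries csv written earlier, detect whether this is a complex
# (multi-chain) prediction, and pick a matching model type.
queries, is_complex = get_queries(queries_path)
model_type = set_model_type(is_complex, model_type)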
One downside to having a high number of sequence alignments (~128) detected in the MSA step is that the later inference steps become quadratically more expensive. To reduce this cost, the MSAs are clustered by sequence similarity, which lowers the computational cost while ensuring that each sequence still has some influence on the final prediction. This clustering step is controlled by use_cluster_profile, which is also configured below.
Inference
Our final step is to download the alphafold parameters for the model and run inference using all the previous given
specifications and inputs.
We will then save the results in a zip file and download it.
download_alphafold_params(model_type, Path("."))
results = run(
queries=queries,
result_dir=result_dir,
use_templates=use_templates,
custom_template_path=custom_template_path,
num_relax=num_relax,
msa_mode=msa_mode,
model_type=model_type,
num_models=5,
num_recycles=num_recycles,
relax_max_iterations=relax_max_iterations,
recycle_early_stop_tolerance=recycle_early_stop_tolerance,
num_seeds=num_seeds,
use_dropout=use_dropout,
model_order=[1,2,3,4,5],
is_complex=is_complex,
data_dir=Path("."),
keep_existing_results=False,
rank_by="auto",
pair_mode=pair_mode,
pairing_strategy=pairing_strategy,
stop_at_score=float(100),
prediction_callback=prediction_callback,
dpi=dpi,
zip_results=False,
save_all=save_all,
max_msa=max_msa,
use_cluster_profile=use_cluster_profile,
input_features_callback=input_features_callback,
save_recycles=save_recycles,
user_agent="colabfold/google-colab-main",
)
results_zip = f"{jobname}.result.zip"
os.system(f"zip -r {results_zip} {jobname}")
Display the 3D structure of the generated protein file, based on a few options, using the py3Dmol package
AlphaFold generates the top n ranked model estimates of the protein structure. Here we have 5 ranked structures. The lower the rank number, the higher the accuracy and quality of the predicted model.
Here we can display the structure with various color schemes: chain, lDDT, and rainbow. We also have options to show the sidechains and mainchains of the protein's structure.
Let's import a few important visualization libraries and set the visualization variables.
tag = results["rank"][0][rank_num - 1]
jobname_prefix = ".custom" if msa_mode == "custom" else ""
pdb_filename = f"{jobname}/{jobname}{jobname_prefix}_unrelaxed_{tag}.pdb"
pdb_file = glob.glob(pdb_filename)
Now let's define the visualization function show_pdb using py3Dmol. This function takes in the PDB file and various visualization parameters such as the rank number, sidechains, and mainchains, and visualizes the protein accordingly. Here we have passed "lDDT" for the color parameter of the function. lDDT is short for local Distance Difference Test. We color each part of the protein based on its lDDT score, as shown in the key of the plot.
if color == "lDDT":
view.setStyle({'cartoon': {'colorscheme': {'prop':'b','gradient': 'roygb','min':50,'max':90}}})
elif color == "rainbow":
view.setStyle({'cartoon': {'color':'spectrum'}})
elif color == "chain":
chains = len(queries[0][1]) + 1 if is_complex else 1
for n,chain,color in zip(range(chains),alphabet_list,pymol_color_list):
view.setStyle({'chain':chain},{'cartoon': {'color':color}})
if show_sidechains:
BB = ['C','O','N']
view.addStyle({'and':[{'resn':["GLY","PRO"],'invert':True},{'atom':BB,'invert':True}]},
{'stick':{'colorscheme':f"WhiteCarbon",'radius':0.3}})
view.addStyle({'and':[{'resn':"GLY"},{'atom':'CA'}]},
{'sphere':{'colorscheme':f"WhiteCarbon",'radius':0.3}})
view.addStyle({'and':[{'resn':"PRO"},{'atom':['C','O'],'invert':True}]},
{'stick':{'colorscheme':f"WhiteCarbon",'radius':0.3}})
if show_mainchains:
BB = ['C','O','N','CA']
view.addStyle({'atom':BB},{'stick':{'colorscheme':f"WhiteCarbon",'radius':0.3}})
view.zoomTo()
return view
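With the function in place, a call along the following lines renders the chosen ranked structure; the parameter values shown here are illustrative defaults, not necessarily the exact ones used in the notebook.
# Render the top-ranked model, coloring each residue by its lDDT confidence.
show_pdb(rank_num=1, show_sidechains=False, show_mainchains=False, color="lDDT").show()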
Plots from AlphaFold
The following 3 types of plots are generated with AlphaFold.
1. PAE (Predicted Aligned Error): It estimates the expected error in the relative positions of residue pairs in the
predicted 3D structure. In our experiment below, the PAE plot of the top 5 ranked structures is mostly blue, which
means that we have low errors.
2. COV (Sequence Coverage): It indicates how much of the amino acid sequence is covered by the aligned sequences
used in the MSA, and to what degree. In the plot below, the x-axis represents the amino acid sequence and the
y-axis represents the number of aligned sequences covering each position.
3. lDDT (local Distance Difference Test): It provides a per-residue measure of the predicted model's confidence,
assigning a score to each residue in the protein structure that indicates the reliability of the prediction at
that location. For our protein sequence the lDDT is high throughout the amino acid sequence but tends to get lower
towards the right end.
# see: https://stackoverflow.com/a/53688522
def image_to_data_url(filename):
ext = filename.split('.')[-1]
prefix = f'data:image/{ext};base64,'
with open(filename, 'rb') as f:
img = f.read()
return prefix + base64.b64encode(img).decode('utf-8')
pae = image_to_data_url(os.path.join(jobname,f"{jobname}{jobname_prefix}_pae.png"))
cov = image_to_data_url(os.path.join(jobname,f"{jobname}{jobname_prefix}_coverage.png"))
plddt = image_to_data_url(os.path.join(jobname,f"{jobname}{jobname_prefix}_plddt.png"))
display(HTML(f"""
<style>
img {{
float:left;
}}
.full {{
max-width:100%;
}}
.half {{
max-width:50%;
}}
@media (max-width:640px) {{
.half {{
max-width:100%;
}}
}}
</style>
<div style="max-width:90%; padding:2em;">
<h1>Plots for {escape(jobname)}</h1>
<img src="{pae}" class="full" />
<img src="{cov}" class="half" />
<img src="{plddt}" class="half" />
</div>
"""))
if msa_mode == "custom":
print("Don't forget to cite your custom MSA generation method.")
files.download(f"{jobname}.result.zip")
import os
import numpy as np
import pandas as pd
import tempfile
import deepchem as dc
from deepchem.utils import download_url, load_from_disk
To sample a set of ligands we will use PDBBind. We will download the corresponding dataset file and store it in a variable
called raw_dataset.
data_dir = dc.utils.get_data_dir()
dataset_file = os.path.join(data_dir, "pdbbind_core_df.csv.gz")
if not os.path.exists(dataset_file):
print('File does not exist. Downloading file...')
download_url("https://s3-us-west-1.amazonaws.com/deepchem.io/datasets/pdbbind_core_df.csv.gz")
print('File downloaded...')
raw_dataset = load_from_disk(dataset_file)
raw_dataset = raw_dataset[['pdb_id', 'smiles', 'label']]
ligands10 = raw_dataset['smiles'].iloc[0:10]
# %%time
import os
#'test_a5e17/test_a5e17_unrelaxed_rank_001_alphafold2_ptm_model_3_seed_000.pdb'
generated_pdb = pdb_filename_captured
generated_pdb_no_extension = os.path.splitext(os.path.basename(generated_pdb))[0]
finder = dc.dock.binding_pocket.ConvexHullPocketFinder()
pockets = finder.find_pockets(generated_pdb)
vpg = dc.dock.pose_generation.VinaPoseGenerator()
count=0
scores_matrix =[]
complex_mol_array = []
for count in range(0,3):
print("Docking ligand "+str(count))
ligand = ligands10[count]
p, m = None, None
vpg = dc.dock.pose_generation.VinaPoseGenerator()
try:
p, m = prepare_inputs('%s' % (generated_pdb), ligand)
except:
print('%s failed PDB fixing' % (generated_pdb))
Docking ligand 0
<ipython-input-42-a86da5d11cfe>:17: DeprecationWarning: Call to deprecated function prepare_inputs. Please use t
he corresponding function in deepchem.utils.docking_utils.
p, m = prepare_inputs('%s' % (generated_pdb), ligand)
[00:47:58] UFFTYPER: Unrecognized atom type: S_5+4 (7)
test_a5e17_0/test_a5e17_0_unrelaxed_rank_001_alphafold2_ptm_model_3_seed_000.pdb 448
2023-11-14 00:48:01,498 Pockets not specified. Will use whole protein to dock
2023-11-14 00:48:03,344 Docking in pocket 1/1
2023-11-14 00:48:03,345 Docking with center: [0.28462623 1.04385902 1.65269617]
2023-11-14 00:48:03,345 Box dimensions: [45.593 35.786 38.447]
2023-11-14 00:48:03,346 About to call Vina
/usr/local/lib/python3.10/site-packages/vina/vina.py:260: DeprecationWarning: `np.int` is a deprecated alias for
the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is
safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If yo
u wish to review your current use, check the release note link for additional information.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#dep
recations
self._voxels = np.ceil(np.array(box_size) / self._spacing).astype(np.int)
<ipython-input-42-a86da5d11cfe>:17: DeprecationWarning: Call to deprecated function prepare_inputs. Please use t
he corresponding function in deepchem.utils.docking_utils.
p, m = prepare_inputs('%s' % (generated_pdb), ligand)
[-4.321, -4.142, -4.135, -4.109, -4.083, -4.069, -4.046, -4.036, -3.993]
Docking ligand 1
test_a5e17_0/test_a5e17_0_unrelaxed_rank_001_alphafold2_ptm_model_3_seed_000.pdb 448
2023-11-14 00:49:13,834 Pockets not specified. Will use whole protein to dock
2023-11-14 00:49:15,455 Docking in pocket 1/1
2023-11-14 00:49:15,457 Docking with center: [0.28062951 1.0434776 1.64895082]
2023-11-14 00:49:15,457 Box dimensions: [45.662 35.909 38.417]
2023-11-14 00:49:15,458 About to call Vina
/usr/local/lib/python3.10/site-packages/vina/vina.py:260: DeprecationWarning: `np.int` is a deprecated alias for
the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is
safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If yo
u wish to review your current use, check the release note link for additional information.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#dep
recations
self._voxels = np.ceil(np.array(box_size) / self._spacing).astype(np.int)
<ipython-input-42-a86da5d11cfe>:17: DeprecationWarning: Call to deprecated function prepare_inputs. Please use t
he corresponding function in deepchem.utils.docking_utils.
p, m = prepare_inputs('%s' % (generated_pdb), ligand)
[-6.083, -6.022, -5.811, -5.797, -5.796, -5.73, -5.689, -5.654, -5.643]
Docking ligand 2
test_a5e17_0/test_a5e17_0_unrelaxed_rank_001_alphafold2_ptm_model_3_seed_000.pdb 448
2023-11-14 00:55:55,258 Pockets not specified. Will use whole protein to dock
2023-11-14 00:55:56,761 Docking in pocket 1/1
2023-11-14 00:55:56,762 Docking with center: [0.28300874 1.0426 1.64975847]
2023-11-14 00:55:56,762 Box dimensions: [45.657 35.819 38.429]
2023-11-14 00:55:56,763 About to call Vina
[-5.96, -5.9, -5.791, -5.733, -5.704, -5.662, -5.617, -5.605, -5.591]
/usr/local/lib/python3.10/site-packages/vina/vina.py:260: DeprecationWarning: `np.int` is a deprecated alias for
the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is
safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If yo
u wish to review your current use, check the release note link for additional information.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#dep
recations
self._voxels = np.ceil(np.array(box_size) / self._spacing).astype(np.int)
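The pose generation and scoring inside the loop are not reproduced above. The following is a minimal sketch of how they could be done with DeepChem's VinaPoseGenerator; the file names, the output directory, and the choice of the best-scoring pose are assumptions rather than the notebook's exact code.
from rdkit import Chem

if p is not None and m is not None:
    # Write the fixed protein and ligand to disk so AutoDock Vina can read them.
    Chem.rdmolfiles.MolToPDBFile(p, 'protein.pdb')
    Chem.rdmolfiles.MolToPDBFile(m, 'ligand.pdb')
    # Generate up to 9 poses and their Vina scores for this protein-ligand pair.
    complexes, scores = vpg.generate_poses(
        molecular_complex=('protein.pdb', 'ligand.pdb'),
        out_dir='vina_output',
        generate_scores=True)
    scores_matrix.append(scores)
    # Combine the protein and the best-scoring ligand pose into a single RDKit molecule for visualization.
    complex_mol_array.append(Chem.CombineMols(*complexes[0]))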
import mdtraj as md
import nglview
Now let's visualize the first 3 protein-ligand complexes, which we have stored in complex_mol_array.
v = nglview.show_rdkit(complex_mol_array[0])
display(v)
v = nglview.show_rdkit(complex_mol_array[1])
display(v)
v = nglview.show_rdkit(complex_mol_array[2])
display(v)
print(scores_matrix)
[[-4.321, -4.142, -4.135, -4.109, -4.083, -4.069, -4.046, -4.036, -3.993], [-6.083, -6.022, -5.811, -5.797, -5.7
96, -5.73, -5.689, -5.654, -5.643], [-5.96, -5.9, -5.791, -5.733, -5.704, -5.662, -5.617, -5.605, -5.591]]
Next, we can see that all the scores generated by the Vina pose generator for the respective complexes are negative.
This is because protein–ligand binding occurs only when the change in Gibbs free energy (ΔG) of the system is negative,
and the more negative the free energy, the more stable the complex, as shown in Ref. Additionally, the molecular
docking evaluation in the paper referenced here showed that the binding affinities of all the derivatives range from
-3.2 to -18.5 kcal/mol.
Hence, based on our experiment, we can predict the potential affinity between a protein and a ligand even if we only
have the protein sequence!
@manual{DeepChemXAlphafold,
title={Applications of DeepChem with Alphafold: Docking and protein-ligand interaction from
protein sequence},
organization={DeepChem},
author={Bellamkonda, Sriphani Vardhan},
howpublished =
{\url{https://github.com/deepchem/deepchem/blob/master/examples/tutorials/DeepChemXAlphafold.ipynb}},
year={2023},
}
UniProt data pre-processing for binding site prediction
downstream task
This notebook guides you through:
Downloading Data: Retrieve information from the UniProt website, including details on protein families, binding
sites, active sites, and amino acid sequences.
Processing Data: Handle special symbols (angle brackets and question marks) in binding/active site information
and convert this data into binary labels. Each amino acid position in the protein sequences is marked as 1
(binding/active site) or 0 (non-binding/active site).
✂ Splitting Data: Divide amino acid sequences and their labels into stratified train/test sets based on UniProt
protein families.
Chunking Sequences: Split sequences and their labels into non-overlapping chunks of a specified length to
define a context window for the ESM-2 model.
This tutorial is made to run without any GPU support, and can be used in Google colab. If you'd like to open this
notebook in colab, you can use the following link.
Open in Colab
Go to the UniProt website and perform a search to query for the proteins of interest (you can search by organism,
protein name, function, etc). Filter your results with the filters on the left-hand side to refine your results further if
necessary. Here I performed the search: (organism_id:9606) AND (family:kinase) AND (existence:1 OR existence:2)
in UniProtKB.
Select columns: Above the search results, there is an option to select the columns you want to be included in your
download. Click on the 'Columns' button and a dropdown menu will appear.
Customize columns: In the dropdown menu, you can check the boxes next to the columns you want to include in
your TSV file. Look for the 'Protein families', 'Binding site', 'Active site', and 'Sequence' options. I also added further
info such as entry name, protein name, gene name, organism, sequence length and whether the entry has been
reviewed.
Download the file: After selecting the desired columns, click the 'Download' button located above the search results.
Choose the 'Tab-separated' format from the list of available formats. You may also have the option to select the
number of entries you want to download (e.g., all entries, displayed entries, or a custom range). Click on the
'Download' button to start the download process and your browser will prompt you to save the TSV file.
Process data
Now, let's process the downloaded UniProt TSV file with columns (Protein families, Binding site, Active site, Sequence). If
the family annotation or binding sites are missing, the code will filter out this sequence. If the Active site annotation is
missing, the sequence will be included without issue. Missing sequences are not handled by this notebook.
# I/O
import pandas as pd
import numpy as np
import re
import random
import pickle
import os
import requests
import xml.etree.ElementTree as ET
# set seed
random.seed(42)
np.random.seed(42)
If you upload the downloaded file from UniProt to Google Drive, you should be able to access it by first mounting your
Google Drive and then loading it:
Mounted at /content/gdrive
  | Entry | Reviewed | Entry Name | Protein names | Gene Names | Organism | Protein families | Sequence
0 | A0A087WV00 | unreviewed | A0A087WV00_HUMAN | Diacylglycerol kinase (DAG kinase) (EC 2.7.1.107) | DGKI | Homo sapiens (Human) | Eukaryotic diacylglycerol kinase family | MDAAGRGCHLLPLPAA
1 | A0A090N7W4 | unreviewed | A0A090N7W4_HUMAN | Cell division protein kinase 5 | CDK5 hCG_18690 tcag7.772 | Homo sapiens (Human) | Protein kinase superfamily, CMGC Ser/Thr prote... | MQKYEKLEKIGEGTYG
2 | A0A0S2Z310 | unreviewed | A0A0S2Z310_HUMAN | Serine/threonine-protein kinase receptor (EC 2... | ACVRL1 | Homo sapiens (Human) | Protein kinase superfamily, TKL Ser/Thr protei... | MTLGSPRKGLLMLLMA
3 | A0A0S2Z4D1 | unreviewed | A0A0S2Z4D1_HUMAN | non-specific serine/threonine protein kinase (... | STK11 | Homo sapiens (Human) | Protein kinase superfamily, CAMK Ser/Thr prote... | MEVVDPQQLGMFTE
4 | A0A2P9DU05 | unreviewed | A0A2P9DU05_HUMAN | Rho-associated protein kinase (EC 2.7.11.1) | ROCK2 | Homo sapiens (Human) | Protein kinase superfamily, AGC Ser/Thr protei... | MSRPPPTGKMPGAPE
Now let's extract the required information for the purposes of this task: Protein families, Binding site, Active site,
Sequence. Also, let's filter out entries without binding site or protein families information.
data["Binding site"]
0 NaN
1 BINDING
33; /ligand="ATP"; /ligand_id="ChEBI:C...
2 BINDING
229; /ligand="ATP"; /ligand_id="ChEBI:...
3 BINDING
78; /ligand="ATP"; /ligand_id="ChEBI:C...
4 BINDING
121; /ligand="ATP"; /ligand_id="ChEBI:...
...
2186 NaN
2187 NaN
2188 NaN
2189 BINDING 73; /ligand="ATP"; /ligand_id="ChEBI:C...
2190 BINDING 165; /ligand="ATP"; /ligand_id="ChEBI:...
Name: Binding site, Length: 2191, dtype: object
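The filtering step itself is not reproduced above; a minimal sketch of how it could be done with pandas, assuming the column names shown in the tables:
# Keep only the columns needed for this task and drop entries that lack
# a protein family or a binding site annotation.
data = data[['Entry', 'Protein families', 'Binding site', 'Active site', 'Sequence']]
data = data.dropna(subset=['Protein families', 'Binding site'])
data.shape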
(1406, 5)
  | Entry | Protein families | Binding site | Active site | Sequence
1 | A0A090N7W4 | Protein kinase superfamily, CMGC Ser/Thr prote... | BINDING 33; /ligand="ATP"; /ligand_id="ChEBI:C... | NaN | MQKYEKLEKIGEGTYGTVFKAKNRETHEIVALKRVRLDDDDEGVPS...
2 | A0A0S2Z310 | Protein kinase superfamily, TKL Ser/Thr protei... | BINDING 229; /ligand="ATP"; /ligand_id="ChEBI:... | NaN | MTLGSPRKGLLMLLMALVTQGDPVKPSRGPLVTCTCESPHCKGPTC...
3 | A0A0S2Z4D1 | Protein kinase superfamily, CAMK Ser/Thr prote... | BINDING 78; /ligand="ATP"; /ligand_id="ChEBI:C... | NaN | MEVVDPQQLGMFTEGELMSVGMDTFIHRIDSTEVIYQPRRKRAKLI...
4 | A0A2P9DU05 | Protein kinase superfamily, AGC Ser/Thr protei... | BINDING 121; /ligand="ATP"; /ligand_id="ChEBI:... | ACT_SITE 214; /note="Proton acceptor"; /eviden... | MSRPPPTGKMPGAPETAPGDGAGASRQRKLEALIRDPRSPINVESL...
5 | A3QNQ0 | Protein kinase superfamily, TKL Ser/Thr protei... | BINDING 250..258; /ligand="ATP"; /ligand_id="C... | ACT_SITE 379; /note="Proton acceptor"; /eviden... | MGRGLLRGLWPLHIVLWTRIASTIPPHVQKSVNNDMIVTDNNGAVK...
So we have a dataset of 1406 proteins, all having a binding site as well as information on the amino acid sequence and
the protein family. We downloaded human proteins from the kinase family; however, there may still exist subgroups
of protein families:
# Group the data by 'Protein families' and get the size of each group
family_sizes = data.groupby('Protein families').size()
print(family_sizes.sort_values(ascending=False))
# Create a new column with the size of each family and sort by 'Family size' in descending order and then by 'Protein families'
data['Family size'] = data['Protein families'].map(family_sizes)
data = data.sort_values(by=['Family size', 'Protein families'], ascending=[False, True])
data.drop(columns='Family size', inplace=True) # Drop the 'Family size' column as it is no longer needed
data
Protein families
Protein kinase superfamily 164
Protein kinase superfamily, CMGC Ser/Thr protein kinase family, CDC2/CDKX subfamily 96
Protein kinase superfamily, STE Ser/Thr protein kinase family, STE20 subfamily 78
Protein kinase superfamily, Tyr protein kinase family, Insulin receptor subfamily 73
Protein kinase superfamily, CAMK Ser/Thr protein kinase family 56
...
GHMP kinase family, Mevalonate kinase subfamily 1
Protein kinase superfamily, TKL Ser/Thr protein kinase family, ROCO subfamily 1
Glutamate 5-kinase family; Gamma-glutamyl phosphate reductase family 1
Guanylate kinase family 1
GHMP kinase family 1
Length: 126, dtype: int64
     | Entry | Protein families | Binding site | Active site | Sequence
359  | Q504Y2 | Protein kinase superfamily | BINDING 144..152; /ligand="ATP"; /ligand_id="C... | ACT_SITE 278; /note="Proton acceptor"; /eviden... | MRRRRAAVAAGFCASFLLGSVLNVLFAPGSEPPRPGQSPEPSPAPG
1770 | M1VPF4 | Protein kinase superfamily, Tyr protein kinase... | BINDING 358; /ligand="ATP"; /ligand_id="ChEBI:... | NaN | MMEAIKKKMQMLKLDKENALDRAEQAEAEQKQAEERSKQLEDELAA
21   | O00764 | Pyridoxine kinase family | BINDING 12; /ligand="pyridoxal"; /ligand_id="C... | ACT_SITE 235; /note="Proton acceptor"; /eviden... | MEEECRVLSIQSHVIRGYVGNRAATFPLQVLGFEIDAVNSVQFSNH
1017 | M1V485 | SLC34A transporter family; Protein kinase supe... | BINDING 906; /ligand="ATP"; /ligand_id="ChEBI:... | NaN | MAPWPELGDAQPNPDKYLEGAAGQQPTAPDKSKETNKTDNTEAPVT
82   | P04183 | Thymidine kinase family | BINDING 26..33; /ligand="ATP"; /ligand_id="ChE... | ACT_SITE 98; /note="Proton acceptor"; /evidenc... | MSCINLPTVLPGSPSKTRGQIQVILGPMFSGKSTELMRRVRRFQIA
542  | Q9NVE7 | Type II pantothenate kinase family; Damage-con... | BINDING 196; /ligand="acetyl-CoA"; /ligand_id=... | NaN | MAECGASGSGSSGDSLDKSITLPPDEIFRNLENAKRFAIDIGGSLT
Now let's make the binding and active sites information clearer:
# Extract the location from the binding and active site columns
def extract_location(site_info):
if pd.isnull(site_info):
return None
locations = []
for info in site_info.split(';'):
if 'BINDING' in info or 'ACT_SITE' in info:
locations.append(info.split()[1])
return '; '.join(locations)
# Apply the function to the 'Binding site' and 'Active site' columns to extract the locations
data['Binding site'] = data['Binding site'].apply(extract_location)
data['Active site'] = data['Active site'].apply(extract_location)
    | Entry | Protein families | Binding site | Active site | Sequence
778 | A0A7P0T838 | Protein kinase superfamily | 71 | None | MPRVKAAQAGRQSSAKRHLAEQFAVGEIITDMAKKEWKVGLPIGQG...
779 | A0A7P0T952 | Protein kinase superfamily | 71 | None | MPRVKAAQAGRQSSAKRHLAEQFAVGEIITDMAKKEWKVGLPIGQG...
# Create a new column that combines the 'Binding site' and 'Active site' columns
data['Binding-Active site'] = data['Binding site'].astype(str) + '; ' + data['Active site'].astype(str)
# Replace 'nan' values with None
data['Binding-Active site'] = data['Binding-Active site'].replace('nan; nan', None)
data.head()
    | Entry | Protein families | Binding site | Active site | Sequence | Binding-Active site
359 | Q504Y2 | Protein kinase superfamily | 144..152; 166 | 278 | MRRRRAAVAAGFCASFLLGSVLNVLFAPGSEPPRPGQSPEPSPAPG... | 144..152; 166; 278
414 | Q8IWB6 | Protein kinase superfamily | 233..241; 273 | None | MSRAVRLPVPCPVQLGTLRNDSLEAQLHEYVKQGNYVKVKKILKKG... | 233..241; 273; None
427 | Q8NB16 | Protein kinase superfamily | 209..217; 230 | None | MENLKHIITLGQVIHKRCEEMKYCKKQCRRLGHRVLGLIKPLEMLQ... | 209..217; 230; None
778 | A0A7P0T838 | Protein kinase superfamily | 71 | None | MPRVKAAQAGRQSSAKRHLAEQFAVGEIITDMAKKEWKVGLPIGQG... | 71; None
779 | A0A7P0T952 | Protein kinase superfamily | 71 | None | MPRVKAAQAGRQSSAKRHLAEQFAVGEIITDMAKKEWKVGLPIGQG... | 71; None
'<': This symbol is used to indicate that the feature (such as a binding or active site) starts before the position given.
For example, if you see "<5" in the context of a binding site, it suggests that the binding site starts before amino
acid position 5 in the protein sequence.
'>': Conversely, this symbol is used to show that the feature extends beyond the position given. If you see ">200"
for an active site, it implies that the active site extends beyond amino acid position 200.
These annotations provide information about the location of certain functional sites within a protein, but with an
acknowledgment of some level of uncertainty or incompleteness in the data that could be due to various reasons, such
as limitations in experimental data, partial protein sequences, or predictions based on related proteins rather than direct
evidence.
We will filter out entries containing these symbols so as to work with a dataset with certainty on the binding/active sites.
# Find entries containing '<' or '>'
entries_angles = data['Binding-Active site'].str.contains('<|>', na=False)
print(f"Number of entries with angle brackets: {entries_angles.sum()}")
# Remove all rows where the "Binding-Active site" column contains '<' or '>'
data = data[~entries_angles]
print(f"Number of remaining rows: {data.shape[0]}")
# Find rows where the "Binding-Active site" column contains the character "?", treating "?" as a literal character
entries_question_mark = data[data['Binding-Active site'].str.contains('\?', na=False, regex=True)]
print(f"Number of entries with question marks: {entries_question_mark.shape[0]}")
def expand_ranges(s):
    """Expand ranges into a comma-separated string."""
    return re.sub(r'(\d+)\.\.(\d+)', lambda m: ', '.join(map(str, range(int(m.group(1)), int(m.group(2)) + 1))), str(s))
Sequence \
359 MRRRRAAVAAGFCASFLLGSVLNVLFAPGSEPPRPGQSPEPSPAPG...
414 MSRAVRLPVPCPVQLGTLRNDSLEAQLHEYVKQGNYVKVKKILKKG...
427 MENLKHIITLGQVIHKRCEEMKYCKKQCRRLGHRVLGLIKPLEMLQ...
778 MPRVKAAQAGRQSSAKRHLAEQFAVGEIITDMAKKEWKVGLPIGQG...
779 MPRVKAAQAGRQSSAKRHLAEQFAVGEIITDMAKKEWKVGLPIGQG...
Binding-Active site
359 144, 145, 146, 147, 148, 149, 150, 151, 152; 1...
414 233, 234, 235, 236, 237, 238, 239, 240, 241; 2...
427 209, 210, 211, 212, 213, 214, 215, 216, 217; 2...
778 71; None
779 71; None
You can now convert the binding/active site information into a binary label: 1 where there is a binding/active site, 0
where there is not. Retrieve the indices in the 'Binding-Active site' column and set their corresponding positions in the
protein sequence to 1; all other amino acids of the sequence are set to 0:
return binary_list
# Apply the function to both datasets
data['Binding-Active site'] = data.apply(lambda row: convert_to_binary_list(row['Binding-Active site'], len(row['Sequence'])), axis=1)
data.head()
    | Entry | Protein families | Binding site | Active site | Sequence | Binding-Active site
359 | Q504Y2 | Protein kinase superfamily | 144..152; 166 | 278 | MRRRRAAVAAGFCASFLLGSVLNVLFAPGSEPPRPGQSPEPSPAPG... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
414 | Q8IWB6 | Protein kinase superfamily | 233..241; 273 | None | MSRAVRLPVPCPVQLGTLRNDSLEAQLHEYVKQGNYVKVKKILKKG... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
427 | Q8NB16 | Protein kinase superfamily | 209..217; 230 | None | MENLKHIITLGQVIHKRCEEMKYCKKQCRRLGHRVLGLIKPLEMLQ... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
778 | A0A7P0T838 | Protein kinase superfamily | 71 | None | MPRVKAAQAGRQSSAKRHLAEQFAVGEIITDMAKKEWKVGLPIGQG... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
779 | A0A7P0T952 | Protein kinase superfamily | 71 | None | MPRVKAAQAGRQSSAKRHLAEQFAVGEIITDMAKKEWKVGLPIGQG... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
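The helper convert_to_binary_list used above is not shown in full. A minimal sketch of what it could look like, assuming 1-based UniProt positions and the ';'-separated strings produced earlier:
def convert_to_binary_list(site_info, seq_length):
    # Start with every position labeled 0 (not a binding/active site).
    binary_list = [0] * seq_length
    if site_info is None or pd.isnull(site_info):
        return binary_list
    # Positions are separated by ';' and ','; non-numeric tokens such as 'None' are skipped.
    for token in str(site_info).replace(';', ',').split(','):
        token = token.strip()
        if token.isdigit():
            binary_list[int(token) - 1] = 1  # UniProt positions are 1-based
    return binary_list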
Notably, this is different from the traditional stratified split, which aims to preserve the distribution of classes across
both sets.
Parameters:
- data: pandas DataFrame containing the dataset with a 'Protein families' column.
- test_ratio: float, the proportion of the dataset to include in the test split.
Returns:
- test_df: pandas DataFrame containing the test set.
- train_df: pandas DataFrame containing the training set.
"""
# Get unique protein families and shuffle them to randomize the selection
unique_families = data['Protein families'].unique()
np.random.shuffle(unique_families)
# Loop through the shuffled families and add rows to the test set
test_rows = []
current_test_rows = 0
for family in unique_families:
family_rows = data[data['Protein families'] == family].index.tolist()
if current_test_rows + len(family_rows) <= int(test_ratio * data.shape[0]):
test_rows.extend(family_rows)
current_test_rows += len(family_rows)
else:
# If adding the current family exceeds the target, stop adding
test_rows.extend(family_rows)
break
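A minimal sketch of how the final train and test frames might then be assembled from the selected rows (variable names follow the usage below; the exact code in the original notebook may differ):
# Rows belonging to the selected families form the test set; the remaining rows form the training set.
test_df = data.loc[test_rows]
train_df = data.drop(index=test_rows)
print(test_df.shape[0], train_df.shape[0])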
392 1014
test_df.head()
   | Entry | Protein families | Binding site | Active site | Sequence | Binding-Active site
39 | O43252 | APS kinase family; Sulfate adenylyltransferase... | 62..67; 89..92; 101; 106..109; 132..133; 171; ... | None | MEIPGSLCKKVKLSNNAQNWGMQRATNVTYQAHHVSRNKRGQVVGT... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
68 | O95340 | APS kinase family; Sulfate adenylyltransferase... | 52..57; 79..82; 91; 96..99; 122..123; 161; 174... | None | MSGIKKQKTENQQKSTNVVYQAHHVSRNKRGQVVGTRGGFRGCTVW... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
4  | A0A2P9DU05 | Protein kinase superfamily, AGC Ser/Thr protei... | 121 | 214 | MSRPPPTGKMPGAPETAPGDGAGASRQRKLEALIRDPRSPINVESL... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
12 | O00141 | Protein kinase superfamily, AGC Ser/Thr protei... | 104..112; 127 | 222 | MTVKTEAAKGTLTYSRMRGMVAILIAFMKQRRMGLNDFIQKIANNS... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
22 | O14578 | Protein kinase superfamily, AGC Ser/Thr protei... | 103..111; 126 | 221 | MLKFKYGARNPLDAGAAEPIASRASRLNLFFQGKPPFMTQQQMSPL... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
In case you don't want to keep the entire train/test datasets, you can create a smaller version (with a random
representation of the original dataset). Uncomment the code below if that is the case:
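A sketch of what such a down-sampling step might look like; the 25% fraction is purely illustrative:
# train_df = train_df.sample(frac=0.25, random_state=42)
# test_df = test_df.sample(frac=0.25, random_state=42)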
# Apply the function to create new datasets with chunks of size "chunk_size" or less
chunk_size = 1000
test_seq_chunked, test_labels_chunked = split_into_chunks(test_seq, test_labels)
train_seq_chunked, train_labels_chunked = split_into_chunks(train_seq, train_labels)
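The helper split_into_chunks used above is not shown; a minimal sketch of such a helper, assuming test_seq/test_labels and train_seq/train_labels hold the 'Sequence' and 'Binding-Active site' columns as Python lists:
def split_into_chunks(sequences, labels, chunk_size=1000):
    # Split each sequence and its per-residue labels into non-overlapping chunks of at most chunk_size.
    seq_chunks, label_chunks = [], []
    for seq, lab in zip(sequences, labels):
        for start in range(0, len(seq), chunk_size):
            seq_chunks.append(seq[start:start + chunk_size])
            label_chunks.append(lab[start:start + chunk_size])
    return seq_chunks, label_chunks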
The resulting train and test files will be exported to the same path where the input data file was located:
filename = os.path.splitext(os.path.basename(file_path))[0]
dir = os.path.dirname(file_path)
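One way the four pickle files could be written is sketched below; save_pickle is a hypothetical helper whose file naming follows the pattern of the paths shown in the output:
def save_pickle(obj, suffix):
    # e.g. <input filename>_test_labels_chunked_1000.pkl, stored next to the input file
    out_path = os.path.join(dir, f"{filename}_{suffix}_chunked_{chunk_size}.pkl")
    with open(out_path, "wb") as f:
        pickle.dump(obj, f)
    return out_path

(save_pickle(test_labels_chunked, "test_labels"),
 save_pickle(test_seq_chunked, "test_sequences"),
 save_pickle(train_labels_chunked, "train_labels"),
 save_pickle(train_seq_chunked, "train_sequences"))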
('/content/gdrive/MyDrive/ESMbind/data/uniprotkb_data_2024_05_29_test_labels_chunked_1000.pkl',
'/content/gdrive/MyDrive/ESMbind/data/uniprotkb_data_2024_05_29_test_sequences_chunked_1000.pkl',
'/content/gdrive/MyDrive/ESMbind/data/uniprotkb_data_2024_05_29_train_labels_chunked_1000.pkl',
'/content/gdrive/MyDrive/ESMbind/data/uniprotkb_data_2024_05_29_train_sequences_chunked_1000.pkl')
@manual{Bioinformatics,
title={UniProt data pre-processing for binding site prediction downstream task},
organization={DeepChem},
author={Gómez de Lope, Elisa},
howpublished =
{\url{https://github.com/deepchem/deepchem/blob/master/examples/tutorials/UniProt_Data_Preprocessing_for_
year={2024},
}
Exploring Quantum Chemistry with GDB1k
Most of the tutorials we've walked you through so far have focused on applications to drug discovery, but DeepChem's
tool suite works for molecular design problems generally. In this tutorial, we're going to walk through an example of how
to train a simple molecular machine learning model for the task of predicting the atomization energy of a molecule.
(Remember that the atomization energy is the energy required to form 1 mol of gaseous atoms from 1 mol of the molecule
in its standard state under standard conditions.)
Colab
This tutorial and the rest in this sequence can be done in Google colab. If you'd like to open this notebook in colab, you
can use the following link.
Open in Colab
With our setup in place, let's do a few standard imports to get the ball rolling.
import deepchem as dc
from sklearn.ensemble import RandomForestRegressor
from sklearn.kernel_ridge import KernelRidge
The next step is to load our dataset. We're using a small dataset we've prepared that's pulled out of the larger GDB
benchmarks. The dataset contains the atomization energies for 1K small molecules.
tasks = ["atomization_energy"]
dataset_file = "../../datasets/gdb1k.sdf"
smiles_field = "smiles"
mol_field = "mol"
We now need a way to transform molecules into a representation that is useful for predicting atomization energies. This
representation draws on foundational work [1] that represents a molecule's 3D electrostatic structure as a 2D matrix.
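For reference, the Coulomb matrix of a molecule with nuclear charges Z_i at positions R_i is conventionally defined entry by entry as

C_{ij} = \begin{cases} \tfrac{1}{2} Z_i^{2.4} & \text{if } i = j \\ \dfrac{Z_i Z_j}{\lVert \mathbf{R}_i - \mathbf{R}_j \rVert} & \text{if } i \neq j \end{cases}

so the diagonal encodes each atom's nuclear charge while the off-diagonal entries encode the Coulomb repulsion between pairs of nuclei.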
If you're observing carefully, you might ask: doesn't this mean that molecules with different numbers of atoms generate
matrices of different sizes? In practice, the trick to get around this is that the matrices are "zero-padded." That is, if
you're making Coulomb matrices for a set of molecules, you pick a maximum number of atoms and set all the extra entries
to zero for molecules with fewer atoms. (There are a couple of extra tricks done under the hood beyond this. Check out
reference [1] or read the source code in DeepChem!)
DeepChem has a built in featurization class dc.feat.CoulombMatrixEig that can generate these featurizations for
you.
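A minimal sketch of how this featurizer can be constructed; the max_atoms value here is an assumption chosen to comfortably cover the small GDB1k molecules:
# Eigenvalues of the Coulomb matrix give a fixed-length, permutation-invariant representation.
featurizer = dc.feat.CoulombMatrixEig(max_atoms=23)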
Let's now load our dataset file into DeepChem. As in the previous tutorials, we use a Loader class, in particular
dc.data.SDFLoader, to load our .sdf file into DeepChem. The following snippet shows how we do this:
loader = dc.data.SDFLoader(
tasks=["atomization_energy"],
featurizer=featurizer)
dataset = loader.create_dataset(dataset_file)
RDKit WARNING: [17:25:11] Warning: molecule is tagged as 3D, but all Z coords are zero
/Users/peastman/workspace/deepchem/deepchem/feat/molecule_featurizers/coulomb_matrices.py:141: RuntimeWarning: d
ivide by zero encountered in true_divide
m = np.outer(z, z) / d
For the purposes of this tutorial, we're going to do a random split of the dataset into training, validation, and test sets.
In general, this kind of split is weak and will considerably overestimate the accuracy of our models, but it isn't a bad
place to get started for this simple tutorial.
random_splitter = dc.splits.RandomSplitter()
train_dataset, valid_dataset, test_dataset = random_splitter.train_valid_test_split(dataset)
One issue that Coulomb matrix featurizations have is that the range of entries in the matrix can vary very widely. In
general, a wide range of input values can throw off learning for the neural network. For this, a common fix is to
normalize the input values so that they fall into a more standard range. Recall that the normalization transform applies
to each feature X_j of datapoint X_i:

\hat{X}_{ij} = \frac{X_{ij} - \mu_j}{\sigma_j}

where \mu_j and \sigma_j are the mean and standard deviation of the j-th feature. This transformation enables the
learning to proceed smoothly. A second point is that the atomization energies also fall across a wide range, so we apply
an analogous normalization transformation to the output to scale the energies better. We use DeepChem's transformation
API to make this happen:
transformers = [
dc.trans.NormalizationTransformer(transform_X=True, dataset=train_dataset),
dc.trans.NormalizationTransformer(transform_y=True, dataset=train_dataset)]
Now that we have the data cleanly transformed, let's do some simple machine learning. We'll start by constructing a
random forest on top of the data. We'll use DeepChem's hyperparameter tuning module to do this.
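The model builder and hyperparameter grid are not shown above; a sketch of what they might look like is given below. The hyperparameter values mirror the results printed afterwards, while the builder signature (plain keyword arguments wrapped in a dc.models.SklearnModel) is an assumption that may need adjusting for your DeepChem version.
def rf_model_builder(n_estimators=100, max_features='auto', **kwargs):
    # Wrap a scikit-learn random forest so DeepChem's hyperparameter search can drive it.
    sklearn_model = RandomForestRegressor(n_estimators=n_estimators, max_features=max_features)
    return dc.models.SklearnModel(sklearn_model)

params_dict = {
    "n_estimators": [10, 100],
    "max_features": ["auto", "sqrt", "log2", None],
}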
metric = dc.metrics.Metric(dc.metrics.mean_absolute_error)
optimizer = dc.hyper.GridHyperparamOpt(rf_model_builder)
best_rf, best_rf_hyperparams, all_rf_results = optimizer.hyperparam_search(
params_dict, train_dataset, valid_dataset, output_transformers=transformers,
metric=metric, use_max=False)
for key, value in all_rf_results.items():
print(f'{key}: {value}')
print('Best hyperparams:', best_rf_hyperparams)
_max_featuresauto_n_estimators_10: 91166.92046422893
_max_featuressqrt_n_estimators_10: 90145.02789928475
_max_featureslog2_n_estimators_10: 85589.77206099383
_max_featuresNone_n_estimators_10: 86870.06019336461
_max_featuresauto_n_estimators_100: 86385.9006447343
_max_featuressqrt_n_estimators_100: 85051.76415912053
_max_featureslog2_n_estimators_100: 86443.79468510246
_max_featuresNone_n_estimators_100: 85464.79840440316
Best hyperparams: (100, 'sqrt')
Let's build one more model, a kernel ridge regression, on top of this raw data.
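As above, the kernel ridge model builder is a sketch under the same assumptions about the builder signature:
def krr_model_builder(kernel='laplacian', alpha=0.0001, gamma=0.0001, **kwargs):
    # Wrap a scikit-learn kernel ridge regressor as a DeepChem model.
    sklearn_model = KernelRidge(kernel=kernel, alpha=alpha, gamma=gamma)
    return dc.models.SklearnModel(sklearn_model)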
params_dict = {
"kernel": ["laplacian"],
"alpha": [0.0001],
"gamma": [0.0001]
}
metric = dc.metrics.Metric(dc.metrics.mean_absolute_error)
optimizer = dc.hyper.GridHyperparamOpt(krr_model_builder)
best_krr, best_krr_hyperparams, all_krr_results = optimizer.hyperparam_search(
params_dict, train_dataset, valid_dataset, output_transformers=transformers,
metric=metric, use_max=False)
for key, value in all_krr_results.items():
print(f'{key}: {value}')
print('Best hyperparams:', best_krr_hyperparams)
_alpha_0.000100_gamma_0.000100_kernellaplacian: 94056.64820129865
Best hyperparams: ('laplacian', 0.0001, 0.0001)
Bibliography:
[1] https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.98.146401
DeepQMC tutorial
Background:
The electrons in a molecule are quantum mechanical in nature, meaning they do not follow classical physical laws.
Quantum mechanics only gives the probability of where an electron will be found; it cannot tell us exactly where it is.
This probability is given by the squared magnitude of a property of the molecular system called the wavefunction, which
is different for every molecule.
For many purposes, the nuclei of the atoms in a molecule can be considered stationary, and we then solve for the
wavefunction of the electrons. These probabilities, when modelled in 3-dimensional space, take the shape of the orbitals,
like those shown in the images below, which were taken with an electron microscope.
Don't worry if you cannot remember or relate to the concept of orbitals; just remember that these are the regions of
space where electrons are most likely to be found.
Using these wavefunctions, the electronic structure of a system (a model containing the electrons at their most probable
positions) can be obtained, which can be used to calculate the ground-state energy. This value can then be used to
calculate various properties like the ionization energy, electron affinity, etc.
The wavefunctions of simple one-electron systems like the hydrogen atom or the helium cation can be found easily, but
for heavier atoms and molecules, electron-electron repulsion comes into play and makes the wavefunctions hard to
compute. Calculating these wavefunctions exactly would require an infeasible amount of computing resources and time.
Hence, various techniques for approximating the wavefunction have been introduced, each with a different tradeoff
between speed and accuracy. One such method is variational Monte Carlo, which aims to include the effects of electron
correlation in the solution without the cost of an exact calculation.
Since deep neural networks act as universal function approximators, they can be used to approximate wavefunctions as
well!! One such approach is the DNN-based variational Monte Carlo ansatz called PauliNet. In this tutorial we will look at
how to use PauliNet, which is part of a package called DeepQMC.
Colab
This tutorial and the rest in this sequence can be done in Google colab. If you'd like to open this notebook in colab, you
can use the following link.
Open in Colab
Setup:
Then, we create our own custom Molecule, which is the format in which DeepQMC accepts the parameters of the
molecular system. It should contain the list of coordinates of each nucleus (coords), the list of the number of protons in
each nucleus (charges), the total ionic charge (charge), and the spin.
Since we are testing the molecule at several different inter-nuclear distances, we can build the Molecule in a loop,
keeping one nucleus at the origin and varying the x-axis position of the other nucleus, as given below. The coordinates
should be given in bohr, where one bohr is equal to 0.52917721092 angstroms.
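A minimal sketch of how such a set of Molecule objects could be built for H2 at the inter-nuclear distances used in the plot further below; the keyword names follow DeepQMC's 0.x Molecule constructor and may differ in other versions:
from deepqmc import Molecule

angstroms_to_bohr = 1 / 0.52917721092
distances = [0.4, 0.5, 0.6, 0.7, 0.9, 1.1, 1.3, 1.5]  # inter-nuclear distances in angstroms
molecules = [
    Molecule(
        coords=[[0.0, 0.0, 0.0], [d * angstroms_to_bohr, 0.0, 0.0]],  # one nucleus at the origin
        charges=[1, 1],   # two hydrogen nuclei
        charge=0,         # neutral molecule
        spin=0,           # singlet
    )
    for d in distances
]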
The molecule is then loaded and passed to training. Here we have modified a particular set of training parameters to get
a solution of reasonable accuracy in a short time: n_steps (the number of steps over which electrons are sampled),
batch_size (the number of samples in a single step) and epoch_size (the number of steps between samplings from the
wavefunction).
Now, after training the model, we will evaluate it, which means the model is run again with the weights and biases that
the neural network ends up with after training. The result of this evaluation is given as an uncertainty data type like
'3.14±0.01'; for the sake of graphing, we ignore the uncertainty and take the central value, called the nominal value
(i.e. 3.14 in the previous case). To do this, we use the 'uncertainties' library.
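Putting the pieces together, the train-and-evaluate loop could look roughly like this. PauliNet.from_hf, train and evaluate follow the DeepQMC 0.x API, and the batch_size and epoch_size values are assumptions chosen only to keep the run short:
from deepqmc import train, evaluate
from deepqmc.wf import PauliNet

energies = []
for mol in molecules:
    net = PauliNet.from_hf(mol).cuda()   # build the ansatz from a Hartree-Fock baseline; drop .cuda() on CPU-only machines
    train(net, n_steps=200, batch_size=256, epoch_size=5)
    result = evaluate(net)               # a dict whose 'energy' entry is an uncertainties value
    energies.append(result['energy'].nominal_value)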
angstroms_to_bohr=1/(0.52917721092)
/usr/local/lib/python3.7/dist-packages/torch/cuda/memory.py:274: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
converged SCF energy = -1.11127760265224
Reducing cusp-correction cutoffs due to overlaps
converged SCF energy = -1.12727489687867
Reducing cusp-correction cutoffs due to overlaps
converged SCF energy = -1.11346928280952
Reducing cusp-correction cutoffs due to overlaps
converged SCF energy = -1.07787156827504
converged SCF energy = -1.03805266783613
converged SCF energy = -1.00006529201883
Let's plot the results and draw some conclusions from them. The result of each evaluation is a dictionary with the key
"energy", whose values are of the uncertainties type, so the uncertainties library has been used to handle them.
plt.ylabel("Energy")
plt.xlabel("Inter-nuclear distance")
xpoints = np.array([0.4,0.5,0.6,0.7,0.9,1.1,1.3,1.5])
ypoints = np.array(energies)
#nominal value refers to the principal value excluding the error
plt.plot(xpoints, ypoints)
plt.show()
Here we examined the stability of different hypothetical molecules by varying the coordinates of one nucleus of the
molecule. The configuration with the lowest ground-state energy is the most stable one. The equilibrium inter-nuclear
distance therefore lies approximately between 0.7 and 0.8 angstroms, where the curve has a visible minimum.
As you can see, this result has a lot of applications! If you also calculate the ground-state energy of the hydrogen
molecular cation (H2+), the difference between the two energies gives you the ionization energy; the same can be done
with the hydrogen molecular anion to calculate the electron affinity, all from simulations! Also, via this method, the
molecular electronic structure can be determined, which can be used to examine various properties like conductivity
and optical and chemical behaviour. This helps in finding better materials for specific applications.
For this, we're going to use the venerable Biopython library to do some basic bioinformatics. A lot of the material in this
notebook is adapted from the extensive official [Biopython tutorial](http://biopython.org/DIST/docs/tutorial/Tutorial.html).
We strongly recommend checking out the official tutorial after you work through this notebook!
Colab
This tutorial and the rest in this sequence are designed to be done in Google colab. If you'd like to open this notebook in
colab, you can use the following link.
Open in Colab
Setup
To run DeepChem within Colab, you'll need to run the following cell of installation commands. This will take about 5
minutes to run to completion and install your environment.
Collecting biopython
Downloading biopython-1.81-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.1/3.1 MB 12.1 MB/s eta 0:00:00
Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from biopython) (1.22.4)
Installing collected packages: biopython
Successfully installed biopython-1.81
import Bio
Bio.__version__
'1.81'
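The sequence object displayed below is presumably created along these lines:
from Bio.Seq import Seq

my_seq = Seq("AGTACACATTG")
my_seq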
Seq('AGTACACATTG')
The complement() method in Biopython's Seq object returns the complement of a DNA sequence. It replaces each base
with its complement according to the Watson-Crick base pairing rules. Adenine (A) is complemented by thymine (T), and
guanine (G) is complemented by cytosine (C).
The reverse_complement() method in Biopython's Seq object returns the reverse complement of a DNA sequence. It first
reverses the sequence and then replaces each base with its complement according to the Watson-Crick base pairing
rules.
But why is direction important? Many cellular processes occur only along a particular direction. To understand what
gives a sense of directionality to a strand of DNA, take a look at the pictures below. Carbon atoms in the backbone of
DNA are numbered from 1' to 5' (usually pronounced as "5 prime") in a clockwise direction. One might notice that the
strand on the left has the 5' carbon above the 3' carbon in every nucleotide, resulting in a strand starting with a 5' end
and ending with a 3' end. The strand on the right runs from the 3' end to the 5' end. As hinted earlier, reading of a DNA
strand during replication and transcription only occurs from the 3' end to the 5' end.
my_seq.complement()
Seq('TCATGTGTAAC')
my_seq.reverse_complement()
Seq('CAATGTGTACT')
!wget https://raw.githubusercontent.com/biopython/biopython/master/Doc/examples/ls_orchid.fasta
Let's take a look at what the contents of this file look like:
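The listing below can be produced by iterating over the records in the FASTA file with Biopython's SeqIO module, for example:
from Bio import SeqIO

# Print the identifier, (truncated) sequence and length of every record in the file.
for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
    print(seq_record.id)
    print(repr(seq_record.seq))
    print(len(seq_record))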
gi|2765658|emb|Z78533.1|CIZ78533
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC')
740
gi|2765657|emb|Z78532.1|CCZ78532
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAACAG...GGC')
753
gi|2765656|emb|Z78531.1|CFZ78531
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGCAG...TAA')
748
gi|2765655|emb|Z78530.1|CMZ78530
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAAACAACAT...CAT')
744
gi|2765654|emb|Z78529.1|CLZ78529
Seq('ACGGCGAGCTGCCGAAGGACATTGTTGAGACAGCAGAATATACGATTGAGTGAA...AAA')
733
gi|2765652|emb|Z78527.1|CYZ78527
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGTAG...CCC')
718
gi|2765651|emb|Z78526.1|CGZ78526
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGTAG...TGT')
730
gi|2765650|emb|Z78525.1|CAZ78525
Seq('TGTTGAGATAGCAGAATATACATCGAGTGAATCCGGAGGACCTGTGGTTATTCG...GCA')
704
gi|2765649|emb|Z78524.1|CFZ78524
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATAGTAG...AGC')
740
gi|2765648|emb|Z78523.1|CHZ78523
Seq('CGTAACCAGGTTTCCGTAGGTGAACCTGCGGCAGGATCATTGTTGAGACAGCAG...AAG')
709
gi|2765647|emb|Z78522.1|CMZ78522
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGCAG...GAG')
700
gi|2765646|emb|Z78521.1|CCZ78521
Seq('GTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGTAGAATATATGATCGAGT...ACC')
726
gi|2765645|emb|Z78520.1|CSZ78520
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGCAG...TTT')
753
gi|2765644|emb|Z78519.1|CPZ78519
Seq('ATATGATCGAGTGAATCTGGTGGACTTGTGGTTACTCAGCTCGCCATAGGCTTT...TTA')
699
gi|2765643|emb|Z78518.1|CRZ78518
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGGAGGATCATTGTTGAGATAGTAG...TCC')
658
gi|2765642|emb|Z78517.1|CFZ78517
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGTAG...AGC')
752
gi|2765641|emb|Z78516.1|CPZ78516
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGTAT...TAA')
726
gi|2765640|emb|Z78515.1|MXZ78515
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGCTGAGACCGTAG...AGC')
765
gi|2765639|emb|Z78514.1|PSZ78514
Seq('CGTAACAAGGTTTCCGTAGGTGGACCTTCGGGAGGATCATTTTTGAAGCCCCCA...CTA')
755
gi|2765638|emb|Z78513.1|PBZ78513
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACCGCCA...GAG')
742
gi|2765637|emb|Z78512.1|PWZ78512
Seq('CGTAACAAGGTTTCCGTAGGTGGACCTTCGGGAGGATCATTTTTGAAGCCCCCA...AGC')
762
gi|2765636|emb|Z78511.1|PEZ78511
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTTCGGAAGGATCATTGTTGAGACCCCCA...GGA')
745
gi|2765635|emb|Z78510.1|PCZ78510
Seq('CTAACCAGGGTTCCGAGGTGACCTTCGGGAGGATTCCTTTTTAAGCCCCCGAAA...TTA')
750
gi|2765634|emb|Z78509.1|PPZ78509
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACCGCCA...GGA')
731
gi|2765633|emb|Z78508.1|PLZ78508
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACCGCCA...TGA')
741
gi|2765632|emb|Z78507.1|PLZ78507
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACCCCCA...TGA')
740
gi|2765631|emb|Z78506.1|PLZ78506
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACCGCAA...TGA')
727
gi|2765630|emb|Z78505.1|PSZ78505
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACCGCCA...TTT')
711
gi|2765629|emb|Z78504.1|PKZ78504
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTTCGGAAGGATCATTGTTGAGACCGCAA...TAA')
743
gi|2765628|emb|Z78503.1|PCZ78503
Seq('CGTAACCAGGTTTCCGTAGGTGAACCTCCGGAAGGATCCTTGTTGAGACCGCCA...TAA')
727
gi|2765627|emb|Z78502.1|PBZ78502
Seq('CGTAACCAGGTTTCCGTAGGTGAACCTCCGGAAGGATCATTGTTGAGACCGCCA...CGC')
757
gi|2765626|emb|Z78501.1|PCZ78501
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACCGCAA...AGA')
770
gi|2765625|emb|Z78500.1|PWZ78500
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGCTCATTGTTGAGACCGCAA...AAG')
767
gi|2765624|emb|Z78499.1|PMZ78499
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAGGGATCATTGTTGAGATCGCAT...ACC')
759
gi|2765623|emb|Z78498.1|PMZ78498
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAAGGTCATTGTTGAGATCACAT...AGC')
750
gi|2765622|emb|Z78497.1|PDZ78497
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...AGC')
788
gi|2765621|emb|Z78496.1|PAZ78496
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCGCAT...AGC')
774
gi|2765620|emb|Z78495.1|PEZ78495
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTCCGGAAGGATCATTGTTGAGATCACAT...GTG')
789
gi|2765619|emb|Z78494.1|PNZ78494
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGGTCGCAT...AAG')
688
gi|2765618|emb|Z78493.1|PGZ78493
Seq('CGTAACAAGGATTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCGCAT...CCC')
719
gi|2765617|emb|Z78492.1|PBZ78492
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCGCAT...ATA')
743
gi|2765616|emb|Z78491.1|PCZ78491
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCGCAT...AGC')
737
gi|2765615|emb|Z78490.1|PFZ78490
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...TGA')
728
gi|2765614|emb|Z78489.1|PDZ78489
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...GGC')
740
gi|2765613|emb|Z78488.1|PTZ78488
Seq('CTGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACGCAATAATTGATCGA...GCT')
696
gi|2765612|emb|Z78487.1|PHZ78487
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...TAA')
732
gi|2765611|emb|Z78486.1|PBZ78486
Seq('CGTCACGAGGTTTCCGTAGGTGAATCTGCGGGAGGATCATTGTTGAGATCACAT...TGA')
731
gi|2765610|emb|Z78485.1|PHZ78485
Seq('CTGAACCTGGTGTCCGAAGGTGAATCTGCGGATGGATCATTGTTGAGATATCAT...GTA')
735
gi|2765609|emb|Z78484.1|PCZ78484
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGGGGAAGGATCATTGTTGAGATCACAT...TTT')
720
gi|2765608|emb|Z78483.1|PVZ78483
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...GCA')
740
gi|2765607|emb|Z78482.1|PEZ78482
Seq('TCTACTGCAGTGACCGAGATTTGCCATCGAGCCTCCTGGGAGCTTTCTTGCTGG...GCA')
629
gi|2765606|emb|Z78481.1|PIZ78481
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...TGA')
572
gi|2765605|emb|Z78480.1|PGZ78480
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...TGA')
587
gi|2765604|emb|Z78479.1|PPZ78479
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...AGT')
700
gi|2765603|emb|Z78478.1|PVZ78478
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTCCGGAAGGATCAGTGTTGAGATCACAT...GGC')
636
gi|2765602|emb|Z78477.1|PVZ78477
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...TGC')
716
gi|2765601|emb|Z78476.1|PGZ78476
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...CCC')
592
gi|2765600|emb|Z78475.1|PSZ78475
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...GGT')
716
gi|2765599|emb|Z78474.1|PKZ78474
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACGT...CTT')
733
gi|2765598|emb|Z78473.1|PSZ78473
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...AGG')
626
gi|2765597|emb|Z78472.1|PLZ78472
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...AGC')
737
gi|2765596|emb|Z78471.1|PDZ78471
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...AGC')
740
gi|2765595|emb|Z78470.1|PPZ78470
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...GTT')
574
gi|2765594|emb|Z78469.1|PHZ78469
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...GTT')
594
gi|2765593|emb|Z78468.1|PAZ78468
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCGCAT...GTT')
610
gi|2765592|emb|Z78467.1|PSZ78467
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...TGA')
730
gi|2765591|emb|Z78466.1|PPZ78466
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...CCC')
641
gi|2765590|emb|Z78465.1|PRZ78465
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...TGC')
702
gi|2765589|emb|Z78464.1|PGZ78464
Seq('CGTAACAAGGTTTCCGTAGGTGAGCGGAAGGGTCATTGTTGAGATCACATAATA...AGC')
733
gi|2765588|emb|Z78463.1|PGZ78463
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGTTCATTGTTGAGATCACAT...AGC')
738
gi|2765587|emb|Z78462.1|PSZ78462
Seq('CGTCACGAGGTCTCCGGATGTGACCCTGCGGAAGGATCATTGTTGAGATCACAT...CAT')
736
gi|2765586|emb|Z78461.1|PWZ78461
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTCCGGAAGGATCATTGTTGAGATCACAT...TAA')
732
gi|2765585|emb|Z78460.1|PCZ78460
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTCCGGAAGGATCATTGTTGAGATCACAT...TTA')
745
gi|2765584|emb|Z78459.1|PDZ78459
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...TTT')
744
gi|2765583|emb|Z78458.1|PHZ78458
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...TTG')
738
gi|2765582|emb|Z78457.1|PCZ78457
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTCCGGAAGGATCATTGTTGAGATCACAT...GAG')
739
gi|2765581|emb|Z78456.1|PTZ78456
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...AGC')
740
gi|2765580|emb|Z78455.1|PJZ78455
Seq('CGTAACCAGGTTTCCGTAGGTGGACCTTCGGGAGGATCATTTTTGAGATCACAT...GCA')
745
gi|2765579|emb|Z78454.1|PFZ78454
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...AAC')
695
gi|2765578|emb|Z78453.1|PSZ78453
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...GCA')
745
gi|2765577|emb|Z78452.1|PBZ78452
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...GCA')
743
gi|2765576|emb|Z78451.1|PHZ78451
Seq('CGTAACAAGGTTTCCGTAGGTGTACCTCCGGAAGGATCATTGTTGAGATCACAT...AGC')
730
gi|2765575|emb|Z78450.1|PPZ78450
Seq('GGAAGGATCATTGCTGATATCACATAATAATTGATCGAGTTAAGCTGGAGGATC...GAG')
706
gi|2765574|emb|Z78449.1|PMZ78449
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...TGC')
744
gi|2765573|emb|Z78448.1|PAZ78448
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...AGG')
742
gi|2765572|emb|Z78447.1|PVZ78447
Seq('CGTAACAAGGATTCCGTAGGTGAACCTGCGGGAGGATCATTGTTGAGATCACAT...AGC')
694
gi|2765571|emb|Z78446.1|PAZ78446
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTCCGGAAGGATCATTGTTGAGATCACAT...CCC')
712
gi|2765570|emb|Z78445.1|PUZ78445
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...TGT')
715
gi|2765569|emb|Z78444.1|PAZ78444
Seq('CGTAACAAGGTTTCCGTAGGGTGAACTGCGGAAGGATCATTGTTGAGATCACAT...ATT')
688
gi|2765568|emb|Z78443.1|PLZ78443
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...AGG')
784
gi|2765567|emb|Z78442.1|PBZ78442
Seq('GTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACATAATAATTGATCGAGT...AGT')
721
gi|2765566|emb|Z78441.1|PSZ78441
Seq('GGAAGGTCATTGCCGATATCACATAATAATTGATCGAGTTAATCTGGAGGATCT...GAG')
703
gi|2765565|emb|Z78440.1|PPZ78440
Seq('CGTAACAAGGTTTCCGTAGGTGGACCTCCGGGAGGATCATTGTTGAGATCACAT...GCA')
744
gi|2765564|emb|Z78439.1|PBZ78439
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACT...GCC')
592
Sequence Objects
A large part of the Biopython infrastructure deals with tools for handling sequences. These could be DNA sequences,
RNA sequences, amino acid sequences, or even more exotic constructs. Generally, a Seq object can be treated like a
normal Python string.
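The input cell for the two outputs below isn't shown in this extract. Based on the outputs, and on the later concatenation example that uses a variable named my_seq, it presumably looked something like this sketch:
from Bio.Seq import Seq

my_seq = Seq("ACAGTAGAC")  # a short made-up DNA sequence
print(my_seq)              # prints the plain string
my_seq                     # the cell's last expression shows the Seq repr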
ACAGTAGAC
Seq('ACAGTAGAC')
my_prot = Seq("AAAAA")
my_prot
Seq('AAAAA')
We can take the length of sequences and index into them like strings.
print(len(my_prot))
5
my_prot[0]
'A'
my_prot[0:3]
Seq('AAA')
You can concatenate sequences of the same type, so this works:
my_prot + my_prot
Seq('AAAAAAAAAA')
Biopython handles the concatenation automatically; since Seq objects are not tied to a specific alphabet, we can even join our protein sequence with the DNA sequence my_seq defined earlier.
my_prot + my_seq
Seq('AAAAAACAGTAGAC')
Transcription
Transcription is the process by which a DNA sequence is converted into messenger RNA. Remember that this is part of
the "central dogma" of biology in which DNA engenders messenger RNA which engenders proteins. Here's a nice
representation of this cycle borrowed from a Khan academy lesson.
Note from the image above that DNA has two strands. The top strand is typically called the coding strand, and the
bottom the template strand. The template strand is used for the actual transcription process of conversion into
messenger RNA, but in bioinformatics, it's more common to work with the coding strand because this strand has the
same sequence as the RNA transcript (except that RNA has uracil (U) instead of thymine (T)). Let's now see how we can
execute a transcription computationally using Biopython.
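The cell defining the coding strand isn't shown here; based on the outputs that follow, it presumably was:
coding_dna = Seq("ATGATCTCGTAA")  # a short coding-strand DNA sequence
print(coding_dna)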
ATGATCTCGTAA
template_dna = coding_dna.reverse_complement()
template_dna
Seq('TTACGAGATCAT')
Note that these sequences match those in the image below. You might be confused about why the template_dna
sequence is shown reversed. The reason is that by convention, the template strand is read in the reverse direction.
Let's now see how we can transcribe our coding_dna strand into messenger RNA. This simply swaps 'T' for 'U' in the
sequence.
messenger_rna = coding_dna.transcribe()
messenger_rna
Seq('AUGAUCUCGUAA')
We can also perform a "back-transcription" to recover the original coding strand from the messenger RNA.
messenger_rna.back_transcribe()
Seq('ATGATCTCGTAA')
Translation
Translation is the next step in the process, whereby a messenger RNA is transformed into a protein sequence. Here's a
beautiful diagram from Wikipedia (File:Ribosome_mRNA_translation_en.svg) that lays out the basics of this process.
Note how 3 nucleotides at a time correspond to one new amino acid added to the growing protein chain. A set of 3
nucleotides which codes for a given amino acid is called a "codon". We can use the translate() method on the
messenger RNA to perform this transformation in code.
messenger_rna.translate()
Seq('MIS*')
The translation can also be performed directly from the coding-strand DNA.
coding_dna.translate()
Seq('MIS*')
Let's now consider a longer genetic sequence that has some more interesting structure for us to look at.
coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
coding_dna.translate()
Seq('MAIVMGR*KGAR*')
In both of the sequences above, '*' represents the stop codon. A stop codon is a sequence of 3 nucleotides that signals
the protein-synthesis machinery to stop. In DNA, the stop codons are 'TGA', 'TAA', and 'TAG'. Note that this latest
sequence has multiple stop codons. It's also possible to translate only up to the first stop codon.
coding_dna.translate(to_stop=True)
Seq('MAIVMGR')
We're going to introduce a bit of terminology here. A complete coding sequence (CDS) is a nucleotide sequence of
messenger RNA made up of a whole number of codons (that is, its length is a multiple of 3), which starts with a "start
codon" and ends with a "stop codon". A start codon is essentially the opposite of a stop codon and is most commonly
the sequence "AUG", but it can differ (especially if you're dealing with something like bacterial DNA).
Let's see how we can translate a complete CDS of bacterial messenger RNA. (The cell defining gene, a Seq object holding
the full bacterial coding sequence, isn't shown in this extract.) Translating it without stopping at the stop codon yields
the full protein, ending in the stop symbol:
VKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDHGWWKQHYEWRGNRWHLHGPPPPPRHHKKAPHDHHGGHGPGKHHR*
gene.translate(table="Bacterial", to_stop=True)
Seq('VKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HHR')
SeqRecord objects
The SeqRecord class wraps a Seq object together with identifiers and annotations. Its built-in documentation, as
displayed by help(SeqRecord), gives a good overview:
class SeqRecord(builtins.object)
| SeqRecord(seq, id='<unknown id>', name='<unknown name>', description='<unknown description>', dbxrefs=None,
features=None, annotations=None, letter_annotations=None)
|
| A SeqRecord object holds a sequence and information about it.
|
| Main attributes:
| - id - Identifier such as a locus tag (string)
| - seq - The sequence itself (Seq object or similar)
|
| Additional attributes:
| - name - Sequence name, e.g. gene name (string)
| - description - Additional text (string)
| - dbxrefs - List of database cross references (list of strings)
| - features - Any (sub)features defined (list of SeqFeature objects)
| - annotations - Further information about the whole sequence (dictionary).
| Most entries are strings, or lists of strings.
| - letter_annotations - Per letter/symbol annotation (restricted
| dictionary). This holds Python sequences (lists, strings
| or tuples) whose length matches that of the sequence.
| A typical use would be to hold a list of integers
| representing sequencing quality scores, or a string
| representing the secondary structure.
|
| You will typically use Bio.SeqIO to read in sequences from files as
| SeqRecord objects. However, you may want to create your own SeqRecord
| objects directly (see the __init__ method for further details):
|
| >>> from Bio.Seq import Seq
| >>> from Bio.SeqRecord import SeqRecord
| >>> record = SeqRecord(Seq("MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF"),
| ... id="YP_025292.1", name="HokC",
| ... description="toxic membrane protein")
| >>> print(record)
| ID: YP_025292.1
| Name: HokC
| Description: toxic membrane protein
| Number of features: 0
| Seq('MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF')
|
| If you want to save SeqRecord objects to a sequence file, use Bio.SeqIO
| for this. For the special case where you want the SeqRecord turned into
| a string in a particular file format there is a format method which uses
| Bio.SeqIO internally:
|
| >>> print(record.format("fasta"))
| >YP_025292.1 toxic membrane protein
| MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF
| <BLANKLINE>
|
| You can also do things like slicing a SeqRecord, checking its length, etc
|
| >>> len(record)
| 44
| >>> edited = record[:10] + record[11:]
| >>> print(edited.seq)
| MKQHKAMIVAIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF
| >>> print(record.seq)
| MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF
|
| Methods defined here:
|
| __add__(self, other)
| Add another sequence or string to this sequence.
|
| The other sequence can be a SeqRecord object, a Seq object (or
| similar, e.g. a MutableSeq) or a plain Python string. If you add
| a plain string or a Seq (like) object, the new SeqRecord will simply
| have this appended to the existing data. However, any per letter
| annotation will be lost:
|
| >>> from Bio import SeqIO
| >>> record = SeqIO.read("Quality/solexa_faked.fastq", "fastq-solexa")
| >>> print("%s %s" % (record.id, record.seq))
| slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
| >>> print(list(record.letter_annotations))
| ['solexa_quality']
|
| >>> new = record + "ACT"
| >>> print("%s %s" % (new.id, new.seq))
| slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNNACT
| >>> print(list(new.letter_annotations))
| []
|
| The new record will attempt to combine the annotation, but for any
| ambiguities (e.g. different names) it defaults to omitting that
| annotation.
|
| >>> from Bio import SeqIO
| >>> with open("GenBank/pBAD30.gb") as handle:
| ... plasmid = SeqIO.read(handle, "gb")
| >>> print("%s %i" % (plasmid.id, len(plasmid)))
| pBAD30 4923
|
| Now let's cut the plasmid into two pieces, and join them back up the
| other way round (i.e. shift the starting point on this plasmid, have
| a look at the annotated features in the original file to see why this
| particular split point might make sense):
|
| >>> left = plasmid[:3765]
| >>> right = plasmid[3765:]
| >>> new = right + left
| >>> print("%s %i" % (new.id, len(new)))
| pBAD30 4923
| >>> str(new.seq) == str(right.seq + left.seq)
| True
| >>> len(new.features) == len(left.features) + len(right.features)
| True
|
| When we add the left and right SeqRecord objects, their annotation
| is all consistent, so it is all conserved in the new SeqRecord:
|
| >>> new.id == left.id == right.id == plasmid.id
| True
| >>> new.name == left.name == right.name == plasmid.name
| True
| >>> new.description == plasmid.description
| True
| >>> new.annotations == left.annotations == right.annotations
| True
| >>> new.letter_annotations == plasmid.letter_annotations
| True
| >>> new.dbxrefs == left.dbxrefs == right.dbxrefs
| True
|
| However, we should point out that when we sliced the SeqRecord,
| any annotations dictionary or dbxrefs list entries were lost.
| You can explicitly copy them like this:
|
| >>> new.annotations = plasmid.annotations.copy()
| >>> new.dbxrefs = plasmid.dbxrefs[:]
|
| __bool__(self)
| Boolean value of an instance of this class (True).
|
| This behaviour is for backwards compatibility, since until the
| __len__ method was added, a SeqRecord always evaluated as True.
|
| Note that in comparison, a Seq object will evaluate to False if it
| has a zero length sequence.
|
| WARNING: The SeqRecord may in future evaluate to False when its
| sequence is of zero length (in order to better match the Seq
| object behaviour)!
|
| __bytes__(self)
|
| __contains__(self, char)
| Implement the 'in' keyword, searches the sequence.
|
| e.g.
|
| >>> from Bio import SeqIO
| >>> record = SeqIO.read("Fasta/sweetpea.nu", "fasta")
| >>> "GAATTC" in record
| False
| >>> "AAA" in record
| True
|
| This essentially acts as a proxy for using "in" on the sequence:
|
| >>> "GAATTC" in record.seq
| False
| >>> "AAA" in record.seq
| True
|
| Note that you can also use Seq objects as the query,
|
| >>> from Bio.Seq import Seq
| >>> Seq("AAA") in record
| True
|
| See also the Seq object's __contains__ method.
|
| __eq__(self, other)
| Define the equal-to operand (not implemented).
|
| __format__(self, format_spec)
| Return the record as a string in the specified file format.
|
| This method supports the Python format() function and f-strings.
| The format_spec should be a lower case string supported by
| Bio.SeqIO as a text output file format. Requesting a binary file
| format raises a ValueError. e.g.
|
| >>> from Bio.Seq import Seq
| >>> from Bio.SeqRecord import SeqRecord
| >>> record = SeqRecord(Seq("MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF"),
| ... id="YP_025292.1", name="HokC",
| ... description="toxic membrane protein")
| ...
| >>> format(record, "fasta")
| '>YP_025292.1 toxic membrane protein\nMKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF\n'
| >>> print(f"Here is {record.id} in FASTA format:\n{record:fasta}")
| Here is YP_025292.1 in FASTA format:
| >YP_025292.1 toxic membrane protein
| MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF
| <BLANKLINE>
|
| See also the SeqRecord's format() method.
|
| __ge__(self, other)
| Define the greater-than-or-equal-to operand (not implemented).
|
| __getitem__(self, index)
| Return a sub-sequence or an individual letter.
|
| Slicing, e.g. my_record[5:10], returns a new SeqRecord for
| that sub-sequence with some annotation preserved as follows:
|
| * The name, id and description are kept as-is.
| * Any per-letter-annotations are sliced to match the requested
| sub-sequence.
| * Unless a stride is used, all those features which fall fully
| within the subsequence are included (with their locations
| adjusted accordingly). If you want to preserve any truncated
| features (e.g. GenBank/EMBL source features), you must
| explicitly add them to the new SeqRecord yourself.
| * With the exception of any molecule type, the annotations
| dictionary and the dbxrefs list are not used for the new
| SeqRecord, as in general they may not apply to the
| subsequence. If you want to preserve them, you must explicitly
| copy them to the new SeqRecord yourself.
|
| Using an integer index, e.g. my_record[5] is shorthand for
| extracting that letter from the sequence, my_record.seq[5].
|
| For example, consider this short protein and its secondary
| structure as encoded by the PDB (e.g. H for alpha helices),
| plus a simple feature for its histidine self phosphorylation
| site:
|
| >>> from Bio.Seq import Seq
| >>> from Bio.SeqRecord import SeqRecord
| >>> from Bio.SeqFeature import SeqFeature, SimpleLocation
| >>> rec = SeqRecord(Seq("MAAGVKQLADDRTLLMAGVSHDLRTPLTRIRLAT"
| ... "EMMSEQDGYLAESINKDIEECNAIIEQFIDYLR"),
| ... id="1JOY", name="EnvZ",
| ... description="Homodimeric domain of EnvZ from E. coli")
| >>> rec.letter_annotations["secondary_structure"] = " S SSSSSSHHHHHTTTHHHHHHHHHHHHHHHHHHHHHHTHHHHHHHHH
HHHHHHHHHHHHTT "
| >>> rec.features.append(SeqFeature(SimpleLocation(20, 21),
| ... type = "Site"))
|
| Now let's have a quick look at the full record,
|
| >>> print(rec)
| ID: 1JOY
| Name: EnvZ
| Description: Homodimeric domain of EnvZ from E. coli
| Number of features: 1
| Per letter annotation for: secondary_structure
| Seq('MAAGVKQLADDRTLLMAGVSHDLRTPLTRIRLATEMMSEQDGYLAESINKDIEE...YLR')
| >>> rec.letter_annotations["secondary_structure"]
| ' S SSSSSSHHHHHTTTHHHHHHHHHHHHHHHHHHHHHHTHHHHHHHHHHHHHHHHHHHHHTT '
| >>> print(rec.features[0].location)
| [20:21]
|
| Now let's take a sub sequence, here chosen as the first (fractured)
| alpha helix which includes the histidine phosphorylation site:
|
| >>> sub = rec[11:41]
| >>> print(sub)
| ID: 1JOY
| Name: EnvZ
| Description: Homodimeric domain of EnvZ from E. coli
| Number of features: 1
| Per letter annotation for: secondary_structure
| Seq('RTLLMAGVSHDLRTPLTRIRLATEMMSEQD')
| >>> sub.letter_annotations["secondary_structure"]
| 'HHHHHTTTHHHHHHHHHHHHHHHHHHHHHH'
| >>> print(sub.features[0].location)
| [9:10]
|
| You can also of course omit the start or end values, for
| example to get the first ten letters only:
|
| >>> print(rec[:10])
| ID: 1JOY
| Name: EnvZ
| Description: Homodimeric domain of EnvZ from E. coli
| Number of features: 0
| Per letter annotation for: secondary_structure
| Seq('MAAGVKQLAD')
|
| Or for the last ten letters:
|
| >>> print(rec[-10:])
| ID: 1JOY
| Name: EnvZ
| Description: Homodimeric domain of EnvZ from E. coli
| Number of features: 0
| Per letter annotation for: secondary_structure
| Seq('IIEQFIDYLR')
|
| If you omit both, then you get a copy of the original record (although
| lacking the annotations and dbxrefs):
|
| >>> print(rec[:])
| ID: 1JOY
| Name: EnvZ
| Description: Homodimeric domain of EnvZ from E. coli
| Number of features: 1
| Per letter annotation for: secondary_structure
| Seq('MAAGVKQLADDRTLLMAGVSHDLRTPLTRIRLATEMMSEQDGYLAESINKDIEE...YLR')
|
| Finally, indexing with a simple integer is shorthand for pulling out
| that letter from the sequence directly:
|
| >>> rec[5]
| 'K'
| >>> rec.seq[5]
| 'K'
|
| __gt__(self, other)
| Define the greater-than operand (not implemented).
|
| __init__(self, seq, id='<unknown id>', name='<unknown name>', description='<unknown description>', dbxrefs=N
one, features=None, annotations=None, letter_annotations=None)
| Create a SeqRecord.
|
| Arguments:
| - seq - Sequence, required (Seq or MutableSeq)
| - id - Sequence identifier, recommended (string)
| - name - Sequence name, optional (string)
| - description - Sequence description, optional (string)
| - dbxrefs - Database cross references, optional (list of strings)
| - features - Any (sub)features, optional (list of SeqFeature objects)
| - annotations - Dictionary of annotations for the whole sequence
| - letter_annotations - Dictionary of per-letter-annotations, values
| should be strings, list or tuples of the same length as the full
| sequence.
|
| You will typically use Bio.SeqIO to read in sequences from files as
| SeqRecord objects. However, you may want to create your own SeqRecord
| objects directly.
|
| Note that while an id is optional, we strongly recommend you supply a
| unique id string for each record. This is especially important
| if you wish to write your sequences to a file.
|
| You can create a 'blank' SeqRecord object, and then populate the
| attributes later.
|
| __iter__(self)
| Iterate over the letters in the sequence.
|
| For example, using Bio.SeqIO to read in a protein FASTA file:
|
| >>> from Bio import SeqIO
| >>> record = SeqIO.read("Fasta/loveliesbleeding.pro", "fasta")
| >>> for amino in record:
| ... print(amino)
| ... if amino == "L": break
| X
| A
| G
| L
| >>> print(record.seq[3])
| L
|
| This is just a shortcut for iterating over the sequence directly:
|
| >>> for amino in record.seq:
| ... print(amino)
| ... if amino == "L": break
| X
| A
| G
| L
| >>> print(record.seq[3])
| L
|
| Note that this does not facilitate iteration together with any
| per-letter-annotation. However, you can achieve that using the
| python zip function on the record (or its sequence) and the relevant
| per-letter-annotation:
|
| >>> from Bio import SeqIO
| >>> rec = SeqIO.read("Quality/solexa_faked.fastq", "fastq-solexa")
| >>> print("%s %s" % (rec.id, rec.seq))
| slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
| >>> print(list(rec.letter_annotations))
| ['solexa_quality']
| >>> for nuc, qual in zip(rec, rec.letter_annotations["solexa_quality"]):
| ... if qual > 35:
| ... print("%s %i" % (nuc, qual))
| A 40
| C 39
| G 38
| T 37
| A 36
|
| You may agree that using zip(rec.seq, ...) is more explicit than using
| zip(rec, ...) as shown above.
|
| __le__(self, other)
| Define the less-than-or-equal-to operand (not implemented).
|
| __len__(self)
| Return the length of the sequence.
|
| For example, using Bio.SeqIO to read in a FASTA nucleotide file:
|
| >>> from Bio import SeqIO
| >>> record = SeqIO.read("Fasta/sweetpea.nu", "fasta")
| >>> len(record)
| 309
| >>> len(record.seq)
| 309
|
| __lt__(self, other)
| Define the less-than operand (not implemented).
|
| __ne__(self, other)
| Define the not-equal-to operand (not implemented).
|
| __radd__(self, other)
| Add another sequence or string to this sequence (from the left).
|
| This method handles adding a Seq object (or similar, e.g. MutableSeq)
| or a plain Python string (on the left) to a SeqRecord (on the right).
| See the __add__ method for more details, but for example:
|
| >>> from Bio import SeqIO
| >>> record = SeqIO.read("Quality/solexa_faked.fastq", "fastq-solexa")
| >>> print("%s %s" % (record.id, record.seq))
| slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
| >>> print(list(record.letter_annotations))
| ['solexa_quality']
|
| >>> new = "ACT" + record
| >>> print("%s %s" % (new.id, new.seq))
| slxa_0001_1_0001_01 ACTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
| >>> print(list(new.letter_annotations))
| []
|
| __repr__(self)
| Return a concise summary of the record for debugging (string).
|
| The python built in function repr works by calling the object's __repr__
| method. e.g.
|
| >>> from Bio.Seq import Seq
| >>> from Bio.SeqRecord import SeqRecord
| >>> rec = SeqRecord(Seq("MASRGVNKVILVGNLGQDPEVRYMPNGGAVANITLATSESWRDKAT"
| ... "GEMKEQTEWHRVVLFGKLAEVASEYLRKGSQVYIEGQLRTRKWTDQ"
| ... "SGQDRYTTEVVVNVGGTMQMLGGRQGGGAPAGGNIGGGQPQGGWGQ"
| ... "PQQPQGGNQFSGGAQSRPQQSAPAAPSNEPPMDFDDDIPF"),
| ... id="NP_418483.1", name="b4059",
| ... description="ssDNA-binding protein",
| ... dbxrefs=["ASAP:13298", "GI:16131885", "GeneID:948570"])
| >>> print(repr(rec))
| SeqRecord(seq=Seq('MASRGVNKVILVGNLGQDPEVRYMPNGGAVANITLATSESWRDKATGEMKEQTE...IPF'), id='NP_418483.1', nam
e='b4059', description='ssDNA-binding protein', dbxrefs=['ASAP:13298', 'GI:16131885', 'GeneID:948570'])
|
| At the python prompt you can also use this shorthand:
|
| >>> rec
| SeqRecord(seq=Seq('MASRGVNKVILVGNLGQDPEVRYMPNGGAVANITLATSESWRDKATGEMKEQTE...IPF'), id='NP_418483.1', nam
e='b4059', description='ssDNA-binding protein', dbxrefs=['ASAP:13298', 'GI:16131885', 'GeneID:948570'])
|
| Note that long sequences are shown truncated. Also note that any
| annotations, letter_annotations and features are not shown (as they
| would lead to a very long string).
|
| __str__(self)
| Return a human readable summary of the record and its annotation (string).
|
| The python built in function str works by calling the object's __str__
| method. e.g.
|
| >>> from Bio.Seq import Seq
| >>> from Bio.SeqRecord import SeqRecord
| >>> record = SeqRecord(Seq("MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF"),
| ... id="YP_025292.1", name="HokC",
| ... description="toxic membrane protein, small")
| >>> print(str(record))
| ID: YP_025292.1
| Name: HokC
| Description: toxic membrane protein, small
| Number of features: 0
| Seq('MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF')
|
| In this example you don't actually need to call str explicitly, as the
| print command does this automatically:
|
| >>> print(record)
| ID: YP_025292.1
| Name: HokC
| Description: toxic membrane protein, small
| Number of features: 0
| Seq('MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF')
|
| Note that long sequences are shown truncated.
|
| count(self, sub, start=None, end=None)
| Return the number of non-overlapping occurrences of sub in seq[start:end].
|
| Optional arguments start and end are interpreted as in slice notation.
| This method behaves as the count method of Python strings.
|
| format(self, format)
| Return the record as a string in the specified file format.
|
| The format should be a lower case string supported as an output
| format by Bio.SeqIO, which is used to turn the SeqRecord into a
| string. e.g.
|
| >>> from Bio.Seq import Seq
| >>> from Bio.SeqRecord import SeqRecord
| >>> record = SeqRecord(Seq("MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF"),
| ... id="YP_025292.1", name="HokC",
| ... description="toxic membrane protein")
| >>> record.format("fasta")
| '>YP_025292.1 toxic membrane protein\nMKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF\n'
| >>> print(record.format("fasta"))
| >YP_025292.1 toxic membrane protein
| MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF
| <BLANKLINE>
|
| The Python print function automatically appends a new line, meaning
| in this example a blank line is shown. If you look at the string
| representation you can see there is a trailing new line (shown as
| slash n) which is important when writing to a file or if
| concatenating multiple sequence strings together.
|
| Note that this method will NOT work on every possible file format
| supported by Bio.SeqIO (e.g. some are for multiple sequences only,
| and binary formats are not supported).
|
| islower(self)
| Return True if all ASCII characters in the record's sequence are lowercase.
|
| If there are no cased characters, the method returns False.
|
| isupper(self)
| Return True if all ASCII characters in the record's sequence are uppercase.
|
| If there are no cased characters, the method returns False.
|
| lower(self)
| Return a copy of the record with a lower case sequence.
|
| All the annotation is preserved unchanged. e.g.
|
| >>> from Bio import SeqIO
| >>> record = SeqIO.read("Fasta/aster.pro", "fasta")
| >>> print(record.format("fasta"))
| >gi|3298468|dbj|BAA31520.1| SAMIPF
| GGHVNPAVTFGAFVGGNITLLRGIVYIIAQLLGSTVACLLLKFVTNDMAVGVFSLSAGVG
| VTNALVFEIVMTFGLVYTVYATAIDPKKGSLGTIAPIAIGFIVGANI
| <BLANKLINE>
| >>> print(record.lower().format("fasta"))
| >gi|3298468|dbj|BAA31520.1| SAMIPF
| gghvnpavtfgafvggnitllrgivyiiaqllgstvaclllkfvtndmavgvfslsagvg
| vtnalvfeivmtfglvytvyataidpkkgslgtiapiaigfivgani
| <BLANKLINE>
|
| To take a more annotation rich example,
|
| >>> from Bio import SeqIO
| >>> old = SeqIO.read("EMBL/TRBG361.embl", "embl")
| >>> len(old.features)
| 3
| >>> new = old.lower()
| >>> len(old.features) == len(new.features)
| True
| >>> old.annotations["organism"] == new.annotations["organism"]
| True
| >>> old.dbxrefs == new.dbxrefs
| True
|
| reverse_complement(self, id=False, name=False, description=False, features=True, annotations=False, letter_a
nnotations=True, dbxrefs=False)
| Return new SeqRecord with reverse complement sequence.
|
| By default the new record does NOT preserve the sequence identifier,
| name, description, general annotation or database cross-references -
| these are unlikely to apply to the reversed sequence.
|
| You can specify the returned record's id, name and description as
| strings, or True to keep that of the parent, or False for a default.
|
| You can specify the returned record's features with a list of
| SeqFeature objects, or True to keep that of the parent, or False to
| omit them. The default is to keep the original features (with the
| strand and locations adjusted).
|
| You can also specify both the returned record's annotations and
| letter_annotations as dictionaries, True to keep that of the parent,
| or False to omit them. The default is to keep the original
| annotations (with the letter annotations reversed).
|
| To show what happens to the pre-letter annotations, consider an
| example Solexa variant FASTQ file with a single entry, which we'll
| read in as a SeqRecord:
|
| >>> from Bio import SeqIO
| >>> record = SeqIO.read("Quality/solexa_faked.fastq", "fastq-solexa")
| >>> print("%s %s" % (record.id, record.seq))
| slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
| >>> print(list(record.letter_annotations))
| ['solexa_quality']
| >>> print(record.letter_annotations["solexa_quality"])
| [40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15,
14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2, -3, -4, -5]
|
| Now take the reverse complement, here we explicitly give a new
| identifier (the old identifier with a suffix):
|
| >>> rc_record = record.reverse_complement(id=record.id + "_rc")
| >>> print("%s %s" % (rc_record.id, rc_record.seq))
| slxa_0001_1_0001_01_rc NNNNNNACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT
|
| Notice that the per-letter-annotations have also been reversed,
| although this may not be appropriate for all cases.
|
| >>> print(rc_record.letter_annotations["solexa_quality"])
| [-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 2
3, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40]
|
| Now for the features, we need a different example. Parsing a GenBank
| file is probably the easiest way to get an nice example with features
| in it...
|
| >>> from Bio import SeqIO
| >>> with open("GenBank/pBAD30.gb") as handle:
| ... plasmid = SeqIO.read(handle, "gb")
| >>> print("%s %i" % (plasmid.id, len(plasmid)))
| pBAD30 4923
| >>> plasmid.seq
| Seq('GCTAGCGGAGTGTATACTGGCTTACTATGTTGGCACTGATGAGGGTGTCAGTGA...ATG')
| >>> len(plasmid.features)
| 13
|
| Now, let's take the reverse complement of this whole plasmid:
|
| >>> rc_plasmid = plasmid.reverse_complement(id=plasmid.id+"_rc")
| >>> print("%s %i" % (rc_plasmid.id, len(rc_plasmid)))
| pBAD30_rc 4923
| >>> rc_plasmid.seq
| Seq('CATGGGCAAATATTATACGCAAGGCGACAAGGTGCTGATGCCGCTGGCGATTCA...AGC')
| >>> len(rc_plasmid.features)
| 13
|
| Let's compare the first CDS feature - it has gone from being the
| second feature (index 1) to the second last feature (index -2), its
| strand has changed, and the location switched round.
|
| >>> print(plasmid.features[1])
| type: CDS
| location: [1081:1960](-)
| qualifiers:
| Key: label, Value: ['araC']
| Key: note, Value: ['araC regulator of the arabinose BAD promoter']
| Key: vntifkey, Value: ['4']
| <BLANKLINE>
| >>> print(rc_plasmid.features[-2])
| type: CDS
| location: [2963:3842](+)
| qualifiers:
| Key: label, Value: ['araC']
| Key: note, Value: ['araC regulator of the arabinose BAD promoter']
| Key: vntifkey, Value: ['4']
| <BLANKLINE>
|
| You can check this new location, based on the length of the plasmid:
|
| >>> len(plasmid) - 1081
| 3842
| >>> len(plasmid) - 1960
| 2963
|
| Note that if the SeqFeature annotation includes any strand specific
| information (e.g. base changes for a SNP), this information is not
| amended, and would need correction after the reverse complement.
|
| Note trying to reverse complement a protein SeqRecord raises an
| exception:
|
| >>> from Bio.Seq import Seq
| >>> from Bio.SeqRecord import SeqRecord
| >>> protein_rec = SeqRecord(Seq("MAIVMGR"), id="Test",
| ... annotations={"molecule_type": "protein"})
| >>> protein_rec.reverse_complement()
| Traceback (most recent call last):
| ...
| ValueError: Proteins do not have complements!
|
| If you have RNA without any U bases, it must be annotated as RNA
| otherwise it will be treated as DNA by default with A mapped to T:
|
| >>> from Bio.Seq import Seq
| >>> from Bio.SeqRecord import SeqRecord
| >>> rna1 = SeqRecord(Seq("ACG"), id="Test")
| >>> rna2 = SeqRecord(Seq("ACG"), id="Test", annotations={"molecule_type": "RNA"})
| >>> print(rna1.reverse_complement(id="RC", description="unk").format("fasta"))
| >RC unk
| CGT
| <BLANKLINE>
| >>> print(rna2.reverse_complement(id="RC", description="RNA").format("fasta"))
| >RC RNA
| CGU
| <BLANKLINE>
|
| Also note you can reverse complement a SeqRecord using a MutableSeq:
|
| >>> from Bio.Seq import MutableSeq
| >>> from Bio.SeqRecord import SeqRecord
| >>> rec = SeqRecord(MutableSeq("ACGT"), id="Test")
| >>> rec.seq[0] = "T"
| >>> print("%s %s" % (rec.id, rec.seq))
| Test TCGT
| >>> rc = rec.reverse_complement(id=True)
| >>> print("%s %s" % (rc.id, rc.seq))
| Test ACGA
|
| translate(self, table='Standard', stop_symbol='*', to_stop=False, cds=False, gap=None, id=False, name=False,
description=False, features=False, annotations=False, letter_annotations=False, dbxrefs=False)
| Return new SeqRecord with translated sequence.
|
| This calls the record's .seq.translate() method (which describes
| the translation related arguments, like table for the genetic code),
|
| By default the new record does NOT preserve the sequence identifier,
| name, description, general annotation or database cross-references -
| these are unlikely to apply to the translated sequence.
|
| You can specify the returned record's id, name and description as
| strings, or True to keep that of the parent, or False for a default.
|
| You can specify the returned record's features with a list of
| SeqFeature objects, or False (default) to omit them.
|
| You can also specify both the returned record's annotations and
| letter_annotations as dictionaries, True to keep that of the parent
| (annotations only), or False (default) to omit them.
|
| e.g. Loading a FASTA gene and translating it,
|
| >>> from Bio import SeqIO
| >>> gene_record = SeqIO.read("Fasta/sweetpea.nu", "fasta")
| >>> print(gene_record.format("fasta"))
| >gi|3176602|gb|U78617.1|LOU78617 Lathyrus odoratus phytochrome A (PHYA) gene, partial cds
| CAGGCTGCGCGGTTTCTATTTATGAAGAACAAGGTCCGTATGATAGTTGATTGTCATGCA
| AAACATGTGAAGGTTCTTCAAGACGAAAAACTCCCATTTGATTTGACTCTGTGCGGTTCG
| ACCTTAAGAGCTCCACATAGTTGCCATTTGCAGTACATGGCTAACATGGATTCAATTGCT
| TCATTGGTTATGGCAGTGGTCGTCAATGACAGCGATGAAGATGGAGATAGCCGTGACGCA
| GTTCTACCACAAAAGAAAAAGAGACTTTGGGGTTTGGTAGTTTGTCATAACACTACTCCG
| AGGTTTGTT
| <BLANKLINE>
|
| And now translating the record, specifying the new ID and description:
|
| >>> protein_record = gene_record.translate(table=11,
| ... id="phya",
| ... description="translation")
| >>> print(protein_record.format("fasta"))
| >phya translation
| QAARFLFMKNKVRMIVDCHAKHVKVLQDEKLPFDLTLCGSTLRAPHSCHLQYMANMDSIA
| SLVMAVVVNDSDEDGDSRDAVLPQKKKRLWGLVVCHNTTPRFV
| <BLANKLINE>
|
| upper(self)
| Return a copy of the record with an upper case sequence.
|
| All the annotation is preserved unchanged. e.g.
|
| >>> from Bio.Seq import Seq
| >>> from Bio.SeqRecord import SeqRecord
| >>> record = SeqRecord(Seq("acgtACGT"), id="Test",
| ... description = "Made up for this example")
| >>> record.letter_annotations["phred_quality"] = [1, 2, 3, 4, 5, 6, 7, 8]
| >>> print(record.upper().format("fastq"))
| @Test Made up for this example
| ACGTACGT
| +
| "#$%&'()
| <BLANKLINE>
|
| Naturally, there is a matching lower method:
|
| >>> print(record.lower().format("fastq"))
| @Test Made up for this example
| acgtacgt
| +
| "#$%&'()
| <BLANKLINE>
|
| ----------------------------------------------------------------------
| Data descriptors defined here:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
|
| letter_annotations
| Dictionary of per-letter-annotation for the sequence.
|
| For example, this can hold quality scores used in FASTQ or QUAL files.
| Consider this example using Bio.SeqIO to read in an example Solexa
| variant FASTQ file as a SeqRecord:
|
| >>> from Bio import SeqIO
| >>> record = SeqIO.read("Quality/solexa_faked.fastq", "fastq-solexa")
| >>> print("%s %s" % (record.id, record.seq))
| slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
| >>> print(list(record.letter_annotations))
| ['solexa_quality']
| >>> print(record.letter_annotations["solexa_quality"])
| [40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15,
14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2, -3, -4, -5]
|
| The letter_annotations get sliced automatically if you slice the
| parent SeqRecord, for example taking the last ten bases:
|
| >>> sub_record = record[-10:]
| >>> print("%s %s" % (sub_record.id, sub_record.seq))
| slxa_0001_1_0001_01 ACGTNNNNNN
| >>> print(sub_record.letter_annotations["solexa_quality"])
| [4, 3, 2, 1, 0, -1, -2, -3, -4, -5]
|
| Any python sequence (i.e. list, tuple or string) can be recorded in
| the SeqRecord's letter_annotations dictionary as long as the length
| matches that of the SeqRecord's sequence. e.g.
|
| >>> len(sub_record.letter_annotations)
| 1
| >>> sub_record.letter_annotations["dummy"] = "abcdefghij"
| >>> len(sub_record.letter_annotations)
| 2
|
| You can delete entries from the letter_annotations dictionary as usual:
|
| >>> del sub_record.letter_annotations["solexa_quality"]
| >>> sub_record.letter_annotations
| {'dummy': 'abcdefghij'}
|
| You can completely clear the dictionary easily as follows:
|
| >>> sub_record.letter_annotations = {}
| >>> sub_record.letter_annotations
| {}
|
| Note that if replacing the record's sequence with a sequence of a
| different length you must first clear the letter_annotations dict.
|
| seq
| The sequence itself, as a Seq or MutableSeq object.
|
| ----------------------------------------------------------------------
| Data and other attributes defined here:
|
| __hash__ = None
Let's write a bit of code involving SeqRecord and see how it comes out looking.
simple_seq = Seq("GATC")
simple_seq_r = SeqRecord(simple_seq)
simple_seq_r.id = "AC12345"
simple_seq_r.description = "Made up sequence"
print(simple_seq_r.id)
print(simple_seq_r.description)
AC12345
Made up sequence
Let's now see how we can use Bio.SeqIO to parse a large FASTA file into SeqRecord objects. We'll pull down a file hosted on the Biopython site.
!wget https://raw.githubusercontent.com/biopython/biopython/master/Tests/GenBank/NC_005816.fna
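The cell that reads the downloaded FASTA file into a SeqRecord isn't shown in this extract; a minimal sketch that produces the repr below is:
from Bio import SeqIO

record = SeqIO.read("NC_005816.fna", "fasta")  # read() works because the file holds a single record
record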
SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG'), id='gi|45478711|ref|NC_00581
6.1|', name='gi|45478711|ref|NC_005816.1|', description='gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Mi
crotus str. 91001 plasmid pPCP1, complete sequence', dbxrefs=[])
record.id
'gi|45478711|ref|NC_005816.1|'
record.name
'gi|45478711|ref|NC_005816.1|'
record.description
'gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence'
Let's now look at the same sequence, but downloaded from GenBank. We'll download the hosted file from the biopython
tutorial website as before.
!wget https://raw.githubusercontent.com/biopython/biopython/master/Tests/GenBank/NC_005816.gb
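The corresponding read cell and its output are not shown in this extract; a minimal sketch for parsing the GenBank file, which carries annotations and features in addition to the raw sequence, would be:
from Bio import SeqIO

record = SeqIO.read("NC_005816.gb", "genbank")
print(record.id, len(record.features))  # GenBank records include feature annotations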
SeqIO Objects
.count() Method
The .count() method of Biopython's Seq object behaves similarly to the .count() method of Python strings: it returns the
number of non-overlapping occurrences of a specific subsequence within the sequence.
4
1
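The input cells for the two counts above are not shown. A minimal sketch that reproduces values of 4 and 1, reusing the my_seq object from earlier (the exact calls in the original notebook may differ):
from Bio.Seq import Seq

my_seq = Seq("ACAGTAGAC")
print(my_seq.count("A"))   # 4 non-overlapping occurrences of "A"
print(my_seq.count("GT"))  # 1 occurrence of "GT"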
MutableSeq objects
Just like a normal Python string, the Seq object is "read only", or in Python terminology, immutable. Apart from
wanting the Seq object to act like a string, this is a useful default since in many biological applications you want to
ensure you are not changing your sequence data. If you do need to edit a sequence, however, you can convert it into a
mutable sequence (a MutableSeq object) and do pretty much anything you want with it.
from Bio.Seq import MutableSeq
mutable_seq = MutableSeq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA")
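A quick illustration of in-place editing, which is not possible with a plain Seq object (this cell is a sketch, not part of the original notebook):
mutable_seq[5] = "C"   # point mutation at index 5 (T -> C)
print(mutable_seq)     # GCCATCGTAATGGGCCGCTGAAAGGGTGCCCGA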
References:
[1] https://www.khanacademy.org/science/ap-biology/gene-expression-and-regulation/transcription-and-rna-processing/a/overview-of-transcription
[2] From DNA to RNA. https://www.ncbi.nlm.nih.gov/books/NBK26887/
Multisequence Alignment (MSA)
Proteins are made up of sequences of amino acids chained together. Their amino acid sequence determines their
structure and function. Finding proteins with similar sequences, or homologous proteins, is very useful in identifying the
structures and functions of newly discovered proteins as well as identifying their ancestry. Below is an example of what
a protein amino acid multisequence alignment may look like, taken from [2].
To understand Multiple Sequence Alignment (MSA), it's helpful to first grasp pairwise sequence alignment. A pairwise
sequence alignment is a hypothesis about how two sequences may have evolved from a common ancestor through
events such as mutation, insertion, and deletion. When a nucleotide is aligned with a gap, it represents an indel event:
either a deletion in one sequence or an insertion in the other. When two different nucleotides are aligned, this is
typically interpreted as a substitution or mutation event introduced in one or both of the lineages since the time they
diverged from one another. If identical nucleotides are aligned, it suggests a conserved region, which may indicate
functional importance and possibly homology, i.e. evidence that the sequences share a common ancestor. Pairwise
alignment also provides an optimal alignment of two sequences by strategically introducing gaps, making it useful for
comparing sequences and identifying conserved regions. The alignment is an optimal hypothesis but may not reflect the
actual evolutionary path.
Using profile-sequence comparison instead of just sequence-sequence comparison when constructing a multiple sequence
alignment makes use of more information. Sequence profiles are based on the frequencies of each of the 20 amino acids
at each position in a sequence. Sequence profiles (or PSSMs, position-specific scoring matrices) tell us how likely it is
that a particular amino acid (or nucleotide, in DNA/RNA sequences) at a specific position is due to conservation rather
than random chance. This is achieved through the use of log-odds scores that compare the observed frequency of an
amino acid at a particular position in the multiple sequence alignment (MSA) to its expected frequency under random
conditions (i.e., the background probability).
Here is how the position frequency matrix looks; the sequence profile is constructed from the log-odds scores of these
frequencies.
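To make the log-odds idea concrete, here is a small sketch (not from the original notebook) that scores a single alignment column against rough background frequencies; the numbers are illustrative placeholders only:
import math
from collections import Counter

column = list("AAAAGAAVAA")  # amino acids observed at one MSA position across 10 sequences
background = {"A": 0.074, "G": 0.074, "V": 0.069}  # approximate background probabilities

counts = Counter(column)
for aa, n in counts.items():
    observed = n / len(column)
    # Positive log-odds: the residue is enriched relative to chance (likely conserved);
    # negative log-odds: it occurs less often than expected by chance.
    score = math.log2(observed / background[aa])
    print(f"{aa}: observed={observed:.2f}, log-odds={score:+.2f}")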
A Profile Hidden Markov Model is a probabilistic model that represents the sequence conservation at each position
(including insertions and deletions) and the likelihood of transitioning between different sequence states.
Colab
This tutorial and the rest in this sequence can be done in Google colab. If you'd like to open this notebook in colab, you
can use the following link.
Open in Colab
HH-suite
This tutorial will show you the basics of how to use hh-suite. hh-suite is an open source package for searching protein
sequence alignments for homologous proteins. It is the current state of the art for building highly accurate
multisequence alignments (MSA) from a single sequence or from MSAs.
HH-suite leverages profile HMMs to improve the accuracy of detecting remote homologs (sequences that share a
common ancestor but are highly diverged).
Instead of comparing a single sequence against a database of sequences, it aligns one profile HMM against another
profile HMM. The idea is that by comparing two profile HMMs, it can detect relationships between sequence families that
are not apparent when comparing sequences alone. This is particularly useful when dealing with highly diverged proteins
or detecting remote homologs.
Setup
Let's start by importing the deepchem sequence_utils module and downloading a database to compare our query
sequence to.
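The import cell isn't shown in this extract. Based on the calls used later in this tutorial, the setup presumably looks like the following sketch; the data_dir value is an assumption about where the downloaded database and the hh-suite output should live:
from deepchem.utils import sequence_utils

data_dir = 'hh'  # assumption: working directory used for the database download and results below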
hh-suite provides a set of HMM databases that will work with the software, which you can find here:
http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs
dbCAN is a good one for this tutorial because it is a relatively smaller download.
%%bash
mkdir hh
cd hh
mkdir databases; cd databases
wget http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/dbCAN-fam-V9.tar.gz
tar xzvf dbCAN-fam-V9.tar.gz
dbCAN-fam-V9_a3m.ffdata
dbCAN-fam-V9_a3m.ffindex
dbCAN-fam-V9_hhm.ffdata
dbCAN-fam-V9_hhm.ffindex
dbCAN-fam-V9_cs219.ffdata
dbCAN-fam-V9_cs219.ffindex
dbCAN-fam-V9.md5sum
--2022-02-11 12:47:57-- http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/dbCAN-fam-V9.tar.gz
Resolving wwwuser.gwdg.de (wwwuser.gwdg.de)... 134.76.10.111
Connecting to wwwuser.gwdg.de (wwwuser.gwdg.de)|134.76.10.111|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25882327 (25M) [application/x-gzip]
Saving to: ‘dbCAN-fam-V9.tar.gz’
Using hhsearch
hhblits and hhsearch are the main functions in hhsuite which identify homologous proteins. They do this by calculating a
profile hidden Markov model (HMM) from a given alignment and searching over a reference HMM proteome database
using the Viterbi algorithm. The most similar HMMs are then realigned and output to the user. To learn more, check out
the original paper in the references below [1].
!hhsearch
HHsearch 3.3.0
Search a database of HMMs with a query alignment or query HMM
(c) The HH-suite development team
Steinegger M, Meier M, Mirdita M, Vöhringer H, Haunsberger S J, and Söding J (2019)
HH-suite3 for fast remote homology detection and deep protein annotation.
BMC Bioinformatics, doi:10.1186/s12859-019-3019-7
Output options:
-o <file> write results in standard format to file (default=<infile.hhr>)
-oa3m <file> write result MSA with significant matches in a3m format
-blasttab <name> write result in tabular BLAST format (compatible to -m 8 or -outfmt 6 output)
1 2 3 4 5 6 7 8 9 10 11 12
query target #match/tLen alnLen #mismatch #gapOpen qstart qend tstart tend eval score
-add_cons generate consensus sequence as master sequence of query MSA (default=don't)
-hide_cons don't show consensus sequence in alignments (default=show)
-hide_pred don't show predicted 2ndary structure in alignments (default=show)
-hide_dssp don't show DSSP 2ndary structure in alignments (default=show)
-show_ssconf show confidences for predicted 2ndary structure in alignments
Filter options applied to query MSA, database MSAs, and result MSA
-all show all sequences in result MSA; do not filter result MSA
-id [0,100] maximum pairwise sequence identity (def=90)
-diff [0,inf[ filter MSAs by selecting most diverse set of sequences, keeping
at least this many seqs in each MSA block of length 50
Zero and non-numerical values turn off the filtering. (def=100)
-cov [0,100] minimum coverage with master sequence (%) (def=0)
-qid [0,100] minimum sequence identity with master sequence (%) (def=0)
-qsc [0,100] minimum score per column with master sequence (default=-20.0)
-neff [1,inf] target diversity of multiple sequence alignment (default=off)
-mark do not filter out sequences marked by ">@"in their name line
Let's do an example. Say we have a protein which we want to compare to an MSA in order to identify any homologous
regions. For this we can use hhsearch.
Now let's take some protein sequence and search through the dbCAN database to see if we can find any potential
homologous regions. First we will specify the sequence and save it as a FASTA file or a3m file in order to be readable by
hhsearch. I pulled this sequence from the example query.a3m in the hhsuite data directory.
Then we can call hhsearch, specifying the query sequence with the -i flag, the database to search through with -d, and
the output with -o.
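The cell that runs this first search isn't shown in the extract; it presumably mirrors the call made for the second query below, with the query filename here being an assumption:
dataset_path = 'protein1.fasta'  # hypothetical filename for the randomly chosen query saved above
sequence_utils.hhsearch(dataset_path, database='dbCAN-fam-V9', data_dir=data_dir)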
- 12:48:13.331 INFO: NOTE: Use the '-add_cons' option to calculate a consensus sequence as first sequence of the
alignment with hhconsensus or hhmake.
- 12:48:13.692 INFO: 0 sequences belonging to 0 database HMMs found with an E-value < 0.001
No Hit Prob E-value P-value Score SS Cols Query HMM Template HMM
1 ABJ15796.1|231-344|9.6e-33 8.2 2.9 0.0042 25.2 0.0 13 224-236 40-52 (116)
2 lcl|consensus 5.1 5.2 0.0076 17.1 0.0 14 182-195 1-14 (21)
3 ABW08129.1|GT4|GT97||563-891 4.8 5.7 0.0084 26.6 0.0 46 104-150 93-140 (329)
4 AEO62162.1|AA13||19-250 4.6 6 0.0087 25.5 0.0 18 330-347 139-156 (232)
5 BAF49076.1|GH5_26.hmm|8.3e-11| 2.4 13 0.02 21.9 0.0 12 287-298 45-56 (141)
6 BBD44721.1 Hypothetical protei 2.3 14 0.02 25.7 0.0 81 110-221 326-406 (552)
7 AAU92474.1|CBM2|2-82|1.9e-23 2.3 14 0.02 19.1 0.0 19 222-240 13-33 (104)
8 BAX82587.1 hypothetical protei 2.3 14 0.021 25.7 0.0 25 104-128 466-490 (656)
9 AHE46274.1|GH13_13.hmm|1.6e-20 2.0 16 0.024 24.1 0.0 45 143-199 99-143 (393)
10 ACF55060.1|GH13_13.hmm|2.5e-47 1.9 17 0.025 23.2 0.0 22 144-165 74-95 (330)
No 1
>ABJ15796.1|231-344|9.6e-33
Probab=8.16 E-value=2.9 Score=25.22 Aligned_cols=13 Identities=46% Similarity=0.795 Sum_probs=10.2 Templa
te_Neff=3.400
No 2
>lcl|consensus
Probab=5.13 E-value=5.2 Score=17.13 Aligned_cols=14 Identities=29% Similarity=0.437 Sum_probs=10.2 Templa
te_Neff=4.300
Q Uncharacterize 181 DFNKDNRVSLAEAK 194 (430)
Q Consensus 182 dfnkdnrvslaeak 195 (431)
|.|.|++|+-.++-
T Consensus 1 DvN~DG~Vna~D~~ 14 (21)
T lcl|consensus_ 1 DVNGDGKVNALDLA 14 (21)
Confidence 67888888766553
No 3
>ABW08129.1|GT4|GT97||563-891
Probab=4.78 E-value=5.7 Score=26.58 Aligned_cols=46 Identities=20% Similarity=0.367 Sum_probs=28.5 Templa
te_Neff=1.500
No 4
>AEO62162.1|AA13||19-250
Probab=4.61 E-value=6 Score=25.50 Aligned_cols=18 Identities=39% Similarity=0.936 Sum_probs=14.8 Template
_Neff=1.600
No 5
>BAF49076.1|GH5_26.hmm|8.3e-11|182-335
Probab=2.39 E-value=13 Score=21.92 Aligned_cols=12 Identities=33% Similarity=0.720 Sum_probs=9.5 Template
_Neff=1.900
No 6
>BBD44721.1 Hypothetical protein PEIBARAKI_4714 [Petrimonas sp. IBARAKI]
Probab=2.34 E-value=14 Score=25.75 Aligned_cols=81 Identities=23% Similarity=0.240 Sum_probs=46.6 Templat
e_Neff=3.400
No 7
>AAU92474.1|CBM2|2-82|1.9e-23
Probab=2.33 E-value=14 Score=19.07 Aligned_cols=19 Identities=26% Similarity=0.562 Sum_probs=14.3 Templat
e_Neff=6.600
No 8
>BAX82587.1 hypothetical protein ALGA_4297 [Marinifilaceae bacterium SPP2]
Probab=2.28 E-value=14 Score=25.65 Aligned_cols=25 Identities=40% Similarity=0.314 Sum_probs=21.4 Templat
e_Neff=1.500
No 9
>AHE46274.1|GH13_13.hmm|1.6e-201|415-835
Probab=2.04 E-value=16 Score=24.14 Aligned_cols=45 Identities=29% Similarity=0.332 Sum_probs=25.7 Templat
e_Neff=3.900
No 10
>ACF55060.1|GH13_13.hmm|2.5e-47|336-542
Probab=1.94 E-value=17 Score=23.24 Aligned_cols=22 Identities=23% Similarity=0.371 Sum_probs=17.6 Templat
e_Neff=4.500
Two files are output and saved to the dataset directory, results.hhr and results.a3m. results.hhr is the hhsuite results
file, which is a summary of the results. results.a3m is the actual MSA file.
In the hhr file, the 'Prob' column describes the estimated probability of the query sequence being at least partially
homologous to the template. Probabilities of 95% or more are nearly certain, and probabilities of 30% or more call for
closer consideration. The E-value tells you how many chance matches with a score this good or better would be expected
if the searched database were unrelated to the query sequence. These results show that none of the database entries
align well with our query, which is to be expected because the query sequence was chosen essentially at random.
Now let's check the results if we use a sequence that we know will align with something in the dbCAN database. I pulled
this protein from the dockerin.faa file in dbCAN.
dataset_path = 'protein2.fasta'
sequence_utils.hhsearch(dataset_path,database='dbCAN-fam-V9', data_dir=data_dir)
Query dockerin,22,NCBI-Bacteria,gi|125972715|ref|YP_001036625.1|,162-245,0.033
Match_columns 84
No_of_seqs 1 out of 1
Neff 1
Searched_HMMs 683
Date Fri Feb 11 12:48:14 2022
Command hhsearch -i /home/tony/github/deepchem/examples/tutorials/protein2.fasta -d hh/databases/dbCAN-fam
-V9 -oa3m /home/tony/github/deepchem/examples/tutorials/results.a3m -cpu 4 -e 0.001
No Hit Prob E-value P-value Score SS Cols Query HMM Template HMM
1 lcl|consensus 97.0 5.9E-08 8.7E-11 43.5 0.0 21 4-24 1-21 (21)
2 ABN51673.1|GH124|2-334|2.6e-21 92.5 0.00033 4.8E-07 45.5 0.0 68 1-75 21-88 (318)
3 AAK20911.1|PL11|47-657|0 15.7 1.1 0.0017 27.6 0.0 14 1-14 329-342 (606)
4 AGE62576.1|PL11_1.hmm|0|1-596 10.2 2.1 0.0031 26.0 0.0 13 1-13 118-130 (602)
5 AAZ21803.1|GH103|26-328|1.7e-8 9.3 2.4 0.0035 22.4 0.0 10 4-13 175-184 (293)
6 AGE62576.1|PL11_1.hmm|0|1-596 5.5 4.8 0.007 23.9 0.0 12 1-12 329-340 (602)
7 AAK20911.1|PL11|47-657|0 5.5 4.8 0.007 23.8 0.0 13 1-13 118-130 (606)
8 APU21542.1|PL11_2.hmm|1.4e-162 4.9 5.6 0.0082 23.5 0.0 14 2-15 318-331 (579)
9 AAK20911.1|PL11|47-657|0 4.7 5.8 0.0084 23.4 0.0 10 3-12 184-193 (606)
10 AGE62576.1|PL11_1.hmm|0|1-596 4.6 6 0.0088 23.3 0.0 7 4-10 185-191 (602)
No 1
>lcl|consensus
Probab=97.03 E-value=5.9e-08 Score=43.48 Aligned_cols=21 Identities=57% Similarity=1.061 Sum_probs=20.1 T
emplate_Neff=4.300
No 2
>ABN51673.1|GH124|2-334|2.6e-219
Probab=92.52 E-value=0.00033 Score=45.54 Aligned_cols=68 Identities=31% Similarity=0.523 Sum_probs=51.6 T
emplate_Neff=1.400
No 3
>AAK20911.1|PL11|47-657|0
Probab=15.69 E-value=1.1 Score=27.56 Aligned_cols=14 Identities=50% Similarity=0.641 Sum_probs=10.4 Templ
ate_Neff=3.500
No 4
>AGE62576.1|PL11_1.hmm|0|1-596
Probab=10.22 E-value=2.1 Score=26.01 Aligned_cols=13 Identities=46% Similarity=0.772 Sum_probs=10.8 Templ
ate_Neff=3.300
No 5
>AAZ21803.1|GH103|26-328|1.7e-83
Probab=9.26 E-value=2.4 Score=22.41 Aligned_cols=10 Identities=40% Similarity=0.833 Sum_probs=9.2 Templat
e_Neff=5.600
No 6
>AGE62576.1|PL11_1.hmm|0|1-596
Probab=5.50 E-value=4.8 Score=23.90 Aligned_cols=12 Identities=58% Similarity=0.847 Sum_probs=7.5 Templat
e_Neff=3.300
No 7
>AAK20911.1|PL11|47-657|0
Probab=5.47 E-value=4.8 Score=23.84 Aligned_cols=13 Identities=46% Similarity=0.772 Sum_probs=10.6 Templa
te_Neff=3.500
No 8
>APU21542.1|PL11_2.hmm|1.4e-162|44-417
Probab=4.86 E-value=5.6 Score=23.51 Aligned_cols=14 Identities=50% Similarity=0.715 Sum_probs=9.4 Templat
e_Neff=2.600
No 9
>AAK20911.1|PL11|47-657|0
Probab=4.74 E-value=5.8 Score=23.38 Aligned_cols=10 Identities=50% Similarity=0.896 Sum_probs=5.6 Templat
e_Neff=3.500
No 10
>AGE62576.1|PL11_1.hmm|0|1-596
Probab=4.58 E-value=6 Score=23.30 Aligned_cols=7 Identities=71% Similarity=1.426 Sum_probs=0.0 Template_N
eff=3.300
- 12:48:14.084 INFO: 4 sequences belonging to 4 database HMMs found with an E-value < 0.001
- 12:48:14.084 INFO: Number of effective sequences of resulting query HMM: Neff = 1.39047
As you can see, two of the database entries are a good match for our query sequence.
Using hhblits
hhblits works in much the same way as hhsearch, but it is much faster and slightly less sensitive. This makes it better
suited to searching very large databases, or to producing an MSA with multiple sequences instead of just one. Let's make
use of that by using our query sequence to create an MSA. We could then use that MSA, with its family of proteins, to
search a larger database for potential matches. This is much more effective than searching a large database with a
single sequence.
We will use the same dbCAN database. I will pull a glycoside hydrolase protein from UniProt, so it will likely be related
to some proteins in dbCAN, which contains carbohydrate-active enzymes.
The option -oa3m tells hhblits to output an MSA as an a3m file. The -n option specifies the number of iterations. It is
recommended to keep this between 1 and 4; we will try 2.
dataset_path = 'protein3.fasta'
sequence_utils.hhblits(dataset_path,database='dbCAN-fam-V9', data_dir=data_dir)
No Hit Prob E-value P-value Score SS Cols Query HMM Template HMM
1 AAA91086.1|GH48|150-238|4.7e-1 100.0 7E-195 1E-197 1475.1 0.0 608 31-644 1-619 (620)
2 lcl|consensus 91.8 0.00051 7.4E-07 37.5 0.0 20 668-687 1-20 (21)
3 ABN51673.1|GH124|2-334|2.6e-21 52.5 0.096 0.00014 40.1 0.0 66 663-728 19-85 (318)
4 CAR68154.1|GH88|62-388|4.9e-13 10.5 2 0.003 30.7 0.0 43 421-463 181-223 (329)
5 ACY49347.1|GH105|46-385|1.1e-1 6.4 4 0.0058 28.2 0.0 60 324-383 169-228 (329)
6 QGI59602.1|GH16_22|78-291 5.4 4.9 0.0072 27.6 0.0 10 391-400 33-42 (224)
7 QGI59602.1|GH16_22|78-291 5.3 5 0.0073 27.5 0.0 18 581-598 204-221 (224)
8 AQA16748.1|GH5_51.hmm|7.4e-189 4.9 5.5 0.0081 28.6 0.0 37 644-680 253-291 (351)
9 CCF60459.1|GH5_12.hmm|1.2e-238 3.3 9.1 0.013 28.3 0.0 27 357-383 298-324 (541)
10 ACI55886.1|GH25|58-236|2.7e-60 3.0 10 0.015 22.2 0.0 41 594-634 18-61 (174)
No 1
>AAA91086.1|GH48|150-238|4.7e-10
Probab=100.00 E-value=6.7e-195 Score=1475.15 Aligned_cols=608 Identities=60% Similarity=1.105 Sum_probs=60
4.0 Template_Neff=2.700
No 2
>lcl|consensus
Probab=91.79 E-value=0.00051 Score=37.45 Aligned_cols=20 Identities=55% Similarity=0.811 Sum_probs=11.9 T
emplate_Neff=4.300
No 3
>ABN51673.1|GH124|2-334|2.6e-219
Probab=52.47 E-value=0.096 Score=40.07 Aligned_cols=66 Identities=35% Similarity=0.533 Sum_probs=48.5 Tem
plate_Neff=1.400
Q tr|G8M3C3|G8M3 663 DIKLGDINFDGDINSIDYALLKAHLLGINKLSGDAL-KAADVDQNGDVNSIDYAKMKSYLLGISKDF 728 (728)
Q Consensus 663 diklgdinfdgdinsidyallkahllginklsgdal-kaadvdqngdvnsidyakmksyllgiskdf 728 (728)
.+..||.|-||-+|--||.|+|..|.-|.+...+.- -..+++....++.+|-.-+|.|||.+-++|
T Consensus 19 kav~GD~n~dgvv~isd~vl~k~~l~~~a~~~a~~d~w~g~vN~dd~I~D~d~~~~kryll~mir~~ 85 (318)
T ABN51673.1|GH1 19 KAVIGDVNADGVVNISDYVLMKRILRIIADFPADDDMWVGDVNGDDVINDIDCNYLKRYLLHMIREF 85 (318)
Confidence 567899999999999999999997766666543321 123444445577888888999999876553
No 4
>CAR68154.1|GH88|62-388|4.9e-137
Probab=10.51 E-value=2 Score=30.74 Aligned_cols=43 Identities=19% Similarity=0.234 Sum_probs=34.3 Templat
e_Neff=5.400
No 5
>ACY49347.1|GH105|46-385|1.1e-131
Probab=6.37 E-value=4 Score=28.22 Aligned_cols=60 Identities=25% Similarity=0.330 Sum_probs=50.1 Template
_Neff=6.200
No 6
>QGI59602.1|GH16_22|78-291
Probab=5.38 E-value=4.9 Score=27.58 Aligned_cols=10 Identities=30% Similarity=0.245 Sum_probs=6.0 Templat
e_Neff=3.000
No 7
>QGI59602.1|GH16_22|78-291
Probab=5.33 E-value=5 Score=27.55 Aligned_cols=18 Identities=28% Similarity=0.730 Sum_probs=9.8 Template_
Neff=3.000
No 8
>AQA16748.1|GH5_51.hmm|7.4e-189|58-409
Probab=4.89 E-value=5.5 Score=28.57 Aligned_cols=37 Identities=32% Similarity=0.524 Sum_probs=28.0 Templa
te_Neff=2.200
No 9
>CCF60459.1|GH5_12.hmm|1.2e-238|14-567
Probab=3.29 E-value=9.1 Score=28.28 Aligned_cols=27 Identities=26% Similarity=0.540 Sum_probs=17.1 Templa
te_Neff=4.100
No 10
>ACI55886.1|GH25|58-236|2.7e-60
Probab=3.03 E-value=10 Score=22.20 Aligned_cols=41 Identities=20% Similarity=0.114 Sum_probs=30.1 Templat
e_Neff=7.700
- 12:48:16.115 INFO: 4 sequences belonging to 4 database HMMs found with an E-value < 0.001
- 12:48:16.115 INFO: Number of effective sequences of resulting query HMM: Neff = 2.41642
We can see that the exact protein was found in dbCAN as hit 1, and several related proteins also appear among the top
hits. This query.a3m MSA can then be useful if we want to search a larger database like UniProt or Uniclust, because it
includes this more diverse selection of related protein sequences.
hh-suite also includes a number of other useful tools and scripts:
hhfilter: Filter an MSA by maximum sequence identity, coverage, and other criteria
hhmakemodel.py: Generate coarse 3D models from HHsearch or HHblits results and modify cif files so that they are compatible with MODELLER
hhsuitedb.py: Build an HHsuite database with prefiltering, packed MSA/HMM, and index files
renumberpdb.pl: Generate a PDB file with indices renumbered to match input sequence indices
HHPaths.pm: Configuration file with paths to the PDB, BLAST, PSIPRED etc.
mergeali.pl: Merge MSAs in A3M format according to an MSA of their seed sequences
pdb2fasta.pl: Generate a FASTA sequence file from SEQRES records of globbed pdb files
cif2fasta.py: Generate a FASTA sequence from the pdbx_seq_one_letter_code entry of the entity_poly of globbed cif files
References:
[1] Steinegger M, Meier M, Mirdita M, Vöhringer H, Haunsberger S J, and Söding J (2019) HH-suite3 for fast remote
homology detection and deep protein annotation, BMC Bioinformatics, 473. doi: 10.1186/s12859-019-3019-7
[2] Kunzmann, P., Mayer, B.E. & Hamacher, K. Substitution matrix based color schemes for sequence alignment
visualization. BMC Bioinformatics 21, 209 (2020). https://doi.org/10.1186/s12859-020-3526-6
[3] Identifying DNA and protein patterns with statistically significant alignments of multiple sequences.
https://www.researchgate.net/publication/12812078_
[4] https://github.com/soedinglab/hh-suite/wiki#what-are-hmm-hmm-comparisons-and-why-are-they-so-powerful
AnnData was presented alongside Scanpy as a generic class for handling annotated data matrices that can deal with
the sparsity inherent in gene expression data.
This tutorial is largely adapted from the original tutorials which can be found in Scanpy's read the docs and from this
notebook.
Colab
This tutorial and the rest in this sequence can be done in Google colab. If you'd like to open this notebook in colab, you
can use the following link.
Open in Colab
Note: you may need to restart the kernel to use updated packages.
sc.settings.verbosity = 3 # verbosity: errors (0), warnings (1), info (2), hints (3)
results_file = 'write/pbmc3k.h5ad' # the file that will store the analysis results
adata = sc.read_10x_mtx(
'data/filtered_gene_bc_matrices/hg19/', # the directory with the `.mtx` file
var_names='gene_symbols', # use gene symbols for the variable names (variables-axis index)
cache=True) # write a cache file for faster subsequent reading
Pre-processing
Check for highly expressed genes
Show genes that yield the highest fraction of counts in each single cell, across all cells. The
sc.pl.highest_expr_genes command normalizes counts per cell, and plots the genes that are most abundant in each
cell.
sc.pl.highest_expr_genes(adata, n_top=20, )
Note that MALAT1, a non-coding RNA that is known to be extremely abundant in many cells, ranks at the top.
Basic filtering: remove cells and genes with low expression or missing
values.
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
filtered out 19024 genes that are detected in less than 3 cells
Citing from “Simple Single Cell” workflows (Lun, McCarthy & Marioni, 2017):
High proportions are indicative of poor-quality cells (Islam et al. 2014; Ilicic et al. 2016), possibly because of loss of
cytoplasmic RNA from perforated cells. The reasoning is that mitochondria are larger than individual transcript
molecules and less likely to escape through tears in the cell membrane.
print(adata)
# slice the adata object so you only keep genes and cells that pass the QC
adata = adata[adata.obs.n_genes_by_counts < 2500, :]
adata = adata[adata.obs.pct_counts_mt < 5, :]
print(adata)
sc.pp.normalize_total(adata, target_sum=1e4)
Log transform the data for later use in differential gene expression as well as in visualizations. The natural logarithm is
used, and log1p means that a pseudo-count of 1 is added to each entry of the count matrix before taking the logarithm. See here for more
information on why log scale makes more sense for genomic data.
sc.pp.log1p(adata)
Set the .raw attribute of the AnnData object to the normalized and logarithmized raw gene expression for later use in
differential testing and visualizations of gene expression. This simply freezes the state of the AnnData object.
adata.raw = adata
Filter the adata object so that only genes that are highly variable are kept.
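A typical way to do this in Scanpy (the threshold values below are common defaults and may differ from the original notebook):
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
sc.pl.highly_variable_genes(adata)
adata = adata[:, adata.var.highly_variable]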
Correct for the effects of counts per cell and mitochondrial gene
expression
Regress out effects of total counts per cell and the percentage of mitochondrial genes expressed. This can consume
some memory and take some time because the input data is sparse.
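A sketch of that step, assuming the QC metrics total_counts and pct_counts_mt were computed earlier:
sc.pp.regress_out(adata, ['total_counts', 'pct_counts_mt'])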
sc.pp.scale(adata, max_value=10)
sc.tl.pca(adata, svd_solver='arpack')
computing PCA
on highly variable genes
with n_comps=50
finished (0:00:00)
We can make a scatter plot in the PCA coordinates, but we will not use that later on.
sc.pl.pca(adata, color="CST3")
The variance ratio plot lists contributions of individual principal components (PC) to the total variance in the data. This
piece of information helps us to choose an appropriate number of PCs in order to compute the neighborhood
relationships between the cells, for instance, using the clustering method Louvain sc.tl.louvain() or the embedding
method tSNE sc.tl.tsne() for dimension-reduction.
According to the authors of Scanpy, a rough estimate of the number of PCs does fine.
sc.pl.pca_variance_ratio(adata, log=True)
! mkdir -p write
adata.write(results_file)
Note that our adata object has following elements: observations annotation (obs), variables (var), unstructured
annotation (uns), multi-dimensional observations annotation (obsm), and multi-dimensional variables annotation (varm).
The meanings of these parameters are documented in the anndata package, available at anndata documentation.
adata
The authors of Scanpy suggest embedding the graph in two dimensions using UMAP (McInnes et al., 2018). UMAP is
potentially more faithful to the global connectivity of the manifold than tSNE, i.e., it better preserves trajectories.
sc.tl.umap(adata)
As we set the .raw attribute of adata, the previous plots showed the “raw” (normalized, logarithmized, but
uncorrected) gene expression. You can also plot the scaled and corrected gene expression by explicitly stating that you
don’t want to use .raw .
On some occasions, you might still observe disconnected clusters and similar connectivity violations. They can usually be
remedied by running:
sc.tl.paga(adata)
sc.pl.paga(adata, plot=False) # remove `plot=False` if you want to see the coarse-grained graph
sc.tl.umap(adata, init_pos='paga')
sc.tl.leiden(adata)
Plot the clusters using sc.pl.umap . Note that the color parameter accepts both individual genes and the clustering
method (leiden in this case).
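For example (the genes shown here are just illustrative choices):
sc.pl.umap(adata, color=['leiden', 'CST3', 'NKG7'])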
adata.write(results_file)
As an alternative, let us rank genes using logistic regression. For instance, this has been suggested by Natranos et al.
(2018). The essential difference is that here we use a multivariate approach, whereas conventional differential tests are
univariate. Clark et al. (2014) has more details.
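A sketch of how this ranking can be computed in Scanpy:
sc.tl.rank_genes_groups(adata, 'leiden', method='logreg')
sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False)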
adata = sc.read(results_file)
pd.DataFrame(adata.uns['rank_genes_groups']['names']).head(5)
result = adata.uns['rank_genes_groups']
groups = result['names'].dtype.names
pd.DataFrame(
{group + '_' + key[:1]: result[key][group]
for group in groups for key in ['names', 'pvals']}).head(5)
If you want to compare a certain gene across groups, use the following.
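For example, a violin plot of a few (illustrative) genes across the Leiden clusters:
sc.pl.violin(adata, ['CST3', 'NKG7', 'PPBP'], groupby='leiden')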
new_cluster_names = [
'CD4 T', 'CD14 Monocytes',
'B', 'CD8 T',
'NK', 'FCGR3A Monocytes',
'Dendritic', 'Megakaryocytes']
adata.rename_categories('leiden', new_cluster_names)
Now that we annotated the cell types, let us visualize the marker genes.
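One way to do this is with a dot plot; the marker_genes mapping below is a hypothetical example and should be replaced with your own markers:
marker_genes = {  # hypothetical example markers
    'CD14 Monocytes': ['CD14', 'LYZ'],
    'B': ['MS4A1'],
    'NK': ['GNLY', 'NKG7'],
}
sc.pl.dotplot(adata, marker_genes, groupby='leiden')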
During the course of this analysis, the AnnData accumulated the following annotations.
adata
adata.write(results_file, compression='gzip') # `compression='gzip'` saves disk space, but slows down writing and subsequent reading
Get a rough overview of the file using h5ls, which has many options - for more details see here. The file format might
still be subject to further optimization in the future. All reading functions will remain backwards-compatible, though.
If you want to share this file with people who merely want to use it for visualization, a simple way to reduce the file size
is by removing the dense scaled and corrected data matrix. The file still contains the raw data used in the visualizations
in adata.raw .
adata.raw.to_adata().write('./write/pbmc3k_withoutX.h5ad')
scvi-tools (single-cell variational inference tools) is a package for probabilistic modeling and analysis of single-cell omics
data, built on top of PyTorch and AnnData, that aims to address some of the limitations that arise when developing and
implementing probabilistic models. scvi-tools is used in tandem with Scanpy, for which DeepChem also offers a tutorial.
In the broader analysis pipeline, scVI sits downstream of initial quality control (QC)-driven preprocessing and generates
outputs that may be further interpreted via general single-cell analysis tools.
In this introductory tutorial, we go through the different steps of an scvi-tools workflow. While we focus on scVI in this
tutorial, the API is consistent across all models. Please note that this tutorial was largely adapted from the one provided
by scvi-tools and you can head to their page to find more information.
Colab
This tutorial and the rest in this sequence are designed to be done in Google colab. If you'd like to open this notebook in
colab, you can use the following link.
Open in Colab
import scvi
import scanpy as sc
import matplotlib.pyplot as plt
Litviňuková, M., Talavera-López, C., Maatz, H., Reichart, D., Worth, C. L., Lindberg, E. L., ... & Teichmann, S. A.
(2020). Cells of the adult human heart. Nature, 588(7838), 466-472.
Important
All scvi-tools models require AnnData objects as input.
adata = scvi.data.heart_cell_atlas_subsampled()
Now we preprocess the data to remove, for example, genes that are very lowly expressed and other outliers. For these
tasks we prefer the Scanpy preprocessing module.
sc.pp.filter_genes(adata, min_counts=3)
In scRNA-seq analysis, it's popular to normalize the data. These values are not used by scvi-tools, but given their
popularity in other tasks as well as for visualization, we store them in the anndata object separately (via the .raw
attribute).
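A sketch of that bookkeeping, assuming we also keep a copy of the raw counts in a "counts" layer (which is what the setup step below expects):
adata.layers["counts"] = adata.X.copy()  # preserve the raw counts for scvi-tools
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata.raw = adata  # freeze the normalized, log-transformed values for later use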
Important
Unless otherwise specified, scvi-tools models require the raw counts (not log library size normalized).
Finally, we perform feature selection, to reduce the number of features (genes in this case) used as input to the scvi-
tools model. For best practices of how/when to perform feature selection, please refer to the model-specific tutorial. For
scVI, we recommend anywhere from 1,000 to 10,000 HVGs, but it will be context-dependent.
sc.pp.highly_variable_genes(
adata,
n_top_genes=1200,
subset=True,
layer="counts",
flavor="seurat_v3",
batch_key="cell_source"
)
Now it's time to run setup_anndata() , which alerts scvi-tools to the locations of various matrices inside the anndata.
It's important to run this function with the correct arguments so scvi-tools is notified that your dataset has batches,
annotations, etc. For example, if batches are registered with scvi-tools, the subsequent model will correct for batch
effects. See the full documentation for details.
In this dataset, there is a "cell_source" categorical covariate, and within each "cell_source", multiple "donors", "gender"
and "age_group". There are also two continuous covariates we'd like to correct for: "percent_mito" and "percent_ribo".
These covariates can be registered using the categorical_covariate_keys argument. If you only have one
categorical covariate, you can also use the batch_key argument instead.
scvi.model.SCVI.setup_anndata(
adata,
layer="counts",
categorical_covariate_keys=["cell_source", "donor"],
continuous_covariate_keys=["percent_mito", "percent_ribo"]
)
Warning
If the adata is modified after running setup_anndata , please run setup_anndata again, before creating an
instance of a model.
model = scvi.model.SCVI(adata)
model
Important
All scvi-tools models run faster when using a GPU. By default, scvi-tools will use a GPU if one is found to be available.
Please see the installation page for more information about installing scvi-tools when a GPU is available.
model.train()
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Epoch 400/400: 100%|██████████| 400/400 [05:43<00:00, 1.16it/s, loss=284, v_num=1]
# model.save("my_model/")
It's often useful to store the outputs of scvi-tools back into the original anndata, as it permits interoperability with
Scanpy.
latent = model.get_latent_representation()
adata.obsm["X_scVI"] = latent
The model.get...() functions default to using the anndata that was used to initialize the model. It's possible to also
query a subset of the anndata, or even use a completely independent anndata object as long as the anndata is
organized in an equivalent fashion.
adata.layers["scvi_normalized"] = model.get_normalized_expression(
library_size=10e4
)
Warning
We use UMAP to qualitatively assess our low-dimension embeddings of cells. We do not advise using UMAP or any
similar approach quantitatively. We do recommend using the embeddings produced by scVI as a plug-in replacement
of what you would get from PCA, as we show below.
First, we demonstrate the presence of nuisance variation with respect to nuclei/whole cell, age group, and donor by
plotting the UMAP results of the top 30 PCA components for the raw count data.
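A sketch of how such a PCA-based embedding can be computed (the parameter values are typical choices, not necessarily those of the original notebook):
sc.tl.pca(adata, n_comps=30)
sc.pp.neighbors(adata, n_pcs=30, n_neighbors=20)
sc.tl.umap(adata, min_dist=0.3)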
sc.pl.umap(
adata,
color=["cell_type"],
frameon=False,
)
sc.pl.umap(
adata,
color=["donor", "cell_source"],
ncols=2,
frameon=False,
)
We see that while the cell types are generally well separated, nuisance variation plays a large part in the variation of
the data.
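To produce the corrected embedding shown next, the neighbors graph and UMAP are typically recomputed on the scVI latent space, for example:
sc.pp.neighbors(adata, use_rep="X_scVI")
sc.tl.umap(adata, min_dist=0.3)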
sc.pl.umap(
adata,
color=["cell_type"],
frameon=False,
)
sc.pl.umap(
adata,
color=["donor", "cell_source"],
ncols=2,
frameon=False,
)
We can see that scVI was able to correct for nuisance variation due to nuclei/whole cell, age group, and donor, while
maintaining separation of cell types.
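The leiden_scVI clustering plotted below can be obtained by clustering the scVI-based neighbors graph, for example:
sc.tl.leiden(adata, key_added="leiden_scVI", resolution=0.5)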
sc.pl.umap(
adata,
color=["leiden_scVI"],
frameon=False,
)
Differential expression
We can also use many scvi-tools models for differential expression. For further details on the methods underlying these
functions as well as additional options, please see the API docs.
adata.obs.cell_type.head()
AACTCCCCACGAGAGT-1-HCAHeart7844001 Myeloid
ATAACGCAGAGCTGGT-1-HCAHeart7829979 Ventricular_Cardiomyocyte
GTCAAGTCATGCCACG-1-HCAHeart7702879 Fibroblast
GGTGATTCAAATGAGT-1-HCAHeart8102858 Endothelial
AGAGAATTCTTAGCAG-1-HCAHeart8102863 Endothelial
Name: cell_type, dtype: category
Categories (11, object): ['Adipocytes', 'Atrial_Cardiomyocyte', 'Endothelial', 'Fibroblast', ..., 'Neuronal', '
Pericytes', 'Smooth_muscle_cells', 'Ventricular_Cardiomyocyte']
de_df = model.differential_expression(
groupby="cell_type",
group1="Endothelial",
group2="Fibroblast"
)
de_df.head()
SOX17 0.9998 0.0002 8.516943 0.001615 0.000029 0.0 0.25 6.222365 6.216846 1.9675
SLC9A3R2 0.9996 0.0004 7.823621 0.010660 0.000171 0.0 0.25 5.977907 6.049340 1.6721
ABCA10 0.9990 0.0010 6.906745 0.000081 0.006355 0.0 0.25 -8.468659 -9.058912 2.9593
EGFL7 0.9986 0.0014 6.569875 0.008471 0.000392 0.0 0.25 4.751251 4.730982 1.5463
VWF 0.9984 0.0016 6.436144 0.014278 0.000553 0.0 0.25 5.013347 5.029471 1.7587
5 rows × 22 columns
We can also do a 1-vs-all DE test, which compares each cell type with the rest of the dataset:
de_df = model.differential_expression(
groupby="cell_type",
)
de_df.head()
CIDEC 0.9988 0.0012 6.724225 0.002336 0.000031 0.0 0.25 7.082959 7.075700 2.681833
ADIPOQ 0.9988 0.0012 6.724225 0.003627 0.000052 0.0 0.25 7.722131 7.461277 3.332577
GPAM 0.9986 0.0014 6.569875 0.025417 0.000202 0.0 0.25 7.365266 7.381156 2.562121
PLIN1 0.9984 0.0016 6.436144 0.004482 0.000048 0.0 0.25 7.818194 7.579515 2.977385
GPD1 0.9974 0.0026 5.949637 0.002172 0.000044 0.0 0.25 6.543847 6.023436 2.865962
5 rows × 22 columns
We now extract top markers for each cluster using the DE results.
markers = {}
cats = adata.obs.cell_type.cat.categories
for i, c in enumerate(cats):
    cid = "{} vs Rest".format(c)
    cell_type_df = de_df.loc[de_df.comparison == cid]
    markers[c] = cell_type_df.index.tolist()[:3]
sc.tl.dendrogram(adata, groupby="cell_type", use_rep="X_scVI")
sc.pl.dotplot(
adata,
markers,
groupby='cell_type',
dendrogram=True,
color_map="Blues",
swap_axes=True,
use_raw=True,
standard_scale="var",
)
We can also visualize the scVI normalized gene expression values with the layer option.
sc.pl.heatmap(
adata,
markers,
groupby='cell_type',
layer="scvi_normalized",
standard_scale="var",
dendrogram=True,
figsize=(8, 12)
)
Logging information
scvi-tools logs messages at several verbosity levels, and this behaviour can be customized; please refer to the documentation for information about the different
parameters available.
In general, you can use scvi.settings.verbosity to set the verbosity of the scvi package. Note that verbosity
corresponds to the logging levels of the standard python logging module. By default, that verbosity level is set to
INFO (=20). As a reminder the logging levels are:
ERROR 40
WARNING 30
INFO 20
DEBUG 10
NOTSET 0
Reference
If you use scvi-tools in your research, please consider citing
@article{Gayoso2022,
author={Gayoso, Adam and Lopez, Romain and Xing, Galen and Boyeau, Pierre and Valiollah Pour Amiri, Valeh
title={A Python library for probabilistic analysis of single-cell omics data},
journal={Nature Biotechnology},
year={2022},
month={Feb},
day={07},
issn={1546-1696},
doi={10.1038/s41587-021-01206-w},
url={https://doi.org/10.1038/s41587-021-01206-w}
}
@manual{Bioinformatics,
title={Deep Probabilistic Analysis of Single-Cell Omics Data},
organization={DeepChem},
author={Paiz, Paulina},
howpublished = {\url{https://github.com/deepchem/deepchem/blob/master/examples/tutorials/Deep_probabilistic_analysis
year={2022},
}
Cell Counting
Cell counting is a fundamental task found in many biological research and medical diagnostic processes. It underlies
decisions in cell culture, drug development, and disease analysis. However, traditional manual cell counting methods
are often time-consuming and prone to human error. This variability can hinder research progress and lead to
inconsistencies across studies.
Although cell counting machines exist, they are expensive and may not be readily available to all researchers.
Automating cell counting using machine learning offers a powerful solution to this problem. ML-powered cell counters
can quickly and accurately analyze large volumes of cell samples, freeing up researchers' time and minimizing
inconsistencies.
Ready to build your own cell counter and revolutionize your research efficiency? This tutorial equips you with the
knowledge and skills to create a customized tool that streamlines your cell counting needs.
Colab
This tutorial and the rest in this sequence can be done in Google colab. If you'd like to open this notebook in colab, you
can use the following link.
Open in Colab
Setup
To run DeepChem within Colab, you'll need to run the following installation commands. You can of course run this
tutorial locally if you prefer. In that case, don't run these cells since they will download and install DeepChem in your
local machine again.
import deepchem as dc
dc.__version__
import numpy as np
import matplotlib.pyplot as plt
BBBC Datasets
We used the image set BBBC002v1 [Carpenter et al., Genome Biology, 2006] from the Broad Bioimage Benchmark
Collection [Ljosa et al., Nature Methods, 2012] for this tutorial.
The Broad Bioimage Benchmark Collection Dataset 002 (BBBC002) contains images of Drosophila Kc167 cells. The
ground truth labels consist of cell counts. Full details about this dataset are present at
https://bbbc.broadinstitute.org/BBBC002.
For counting cells, our dataset needs to have images as inputs and the corresponding cell counts as the ground truth
labels. We have several BBBC datasets that can be loaded using the deepchem package. These datasets are an
extension to MoleculeNet and can be accessed through dc.molnet .
The BBBC002 dataset consists of 60 images, each 512x512 pixels in size, which are split into train, validation and test
sets in an 80/10/10 split by default.
We also use splitter='random' in order to ensure that these images are randomly split into the train, validation and
test sets in the above-mentioned ratios.
bbbc2_dataset = dc.molnet.load_bbbc002(splitter='random')
tasks, dataset, transforms = bbbc2_dataset
train, val, test = dataset
Now that we've loaded the dataset and randomly split it, let's take a look at the data.
We can confirm that a sample from our dataset is in the form of a 512x512 image. Let's visualize this sample:
train_x, train_y = train.X, train.y  # image arrays and their cell-count labels
i = 2
plt.figure(figsize=(5, 5))
plt.imshow(train_x[i])
plt.title(f"Cell Count: {train_y[i]}")
plt.show()
PyTorch based CNN Models require that images be in the shape of (C, H, W), wherein 'C' is the number of input
channels, 'H' is the height of the image and 'W' is the width of the image. So we will reshape the data.
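A possible way to do this reshaping (assuming single-channel 512x512 images and rebuilding DeepChem datasets with dc.data.NumpyDataset; the original notebook's exact handling may differ):
train = dc.data.NumpyDataset(train.X.reshape(-1, 1, 512, 512), train.y)
val = dc.data.NumpyDataset(val.X.reshape(-1, 1, 512, 512), val.y)
test = dc.data.NumpyDataset(test.X.reshape(-1, 1, 512, 512), test.y)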
For more information on how to use callbacks, refer to this tutorial on Advanced Model Training
We will use the CNN model from the deepchem package. Since cell counting is a relational problem, we will use the
regression mode.
We will use a 2D CNN model with 6 hidden layers of sizes [32, 64, 128, 128, 64, 32] and a kernel size of 3
across all the filters; you can modify both the kernel size and the number of filters per layer. We have also used average
pooling, made residual connections, and added dropout layers between subsequent layers in order to improve
performance. Feel free to experiment with various models.
from deepchem.models import CNN

regression_metric = dc.metrics.Metric(dc.metrics.rms_score)
model = CNN(n_tasks=1, n_features=1, dims=2, layer_filters=[32, 64, 128, 128, 64, 32], kernel_size=3,
            learning_rate=0.0001,  # the learning-rate value was truncated in the original; 1e-4 is assumed here
            mode='regression', padding='same', batch_size=4, residual=True, dropouts=0.1, pool_type='average')
We can see that the model performs fairly well with a test loss of about 14.6. This means that on average, the predicted
number of cells for a sample image is off by 14.6 cells when compared to the ground truth. Although this seems like a
very high value for test loss, we will see that a difference of about 15 cells is actually not bad for this particular task.
test_metric = dc.metrics.Metric(dc.metrics.mean_absolute_error)
test_y = test.y
preds = model.predict(test)
plt.figure(figsize=(4, 4))
plt.title("True vs. Predicted")
plt.plot(test_y, color='red', label='true')
plt.plot(preds, color='blue', label='preds')
plt.legend()
plt.show()
Train loss: 19.05
Val Loss: 22.2
Test Loss: 14.6
Let us print out the mean cell count of our predictions and compare them with the ground truth. We will also print out
the maximum difference between the ground truth and the prediction from the test set.
diff = []
for i in range(len(test_y)):
    diff.append(abs(test_y[i] - preds[i]))
print("Mean true cell count:", np.mean(test_y))
print("Mean predicted cell count:", np.mean(preds))
print("Maximum difference:", np.max(diff))
We can observe that the averages of our predictions and the ground truth are very close with a difference of just 0.20.
Although we see a maximum difference of 31 cells between the prediction and true value, when we take into account
the Test Loss , the close proximity of the means of predictions and the true labels, and the small size of our test set,
we can say that our model performs fairly well.
@manual{Bioinformatics,
title={Cell Counting Tutorial},
organization={DeepChem},
author={Menezes, Aaron},
howpublished = {\url{https://github.com/deepchem/deepchem/blob/master/examples/tutorials/Cell_Counting_Tutorial.ipyn
year={2024},
}
Introduction To Material Science
Table of Contents:
Introduction
Setup
Featurizers
Crystal Featurizers
Compound Featurizers
Datasets
Predicting structural properties of a crystal
Further Reading
Introduction
One of the most exciting applications of machine learning in recent times is its application to the materials science
domain. DeepChem helps in the development and application of machine learning to solid-state systems. As a starting point
for applying machine learning to materials science, DeepChem provides materials science datasets as part of the
MoleculeNet suite of datasets, data featurizers, and implementations of popular machine learning algorithms specific to
the materials science domain. This tutorial serves as an introduction to using DeepChem for machine learning tasks
in materials science.
Traditionally, experimental research was used to find and characterize new materials, but traditional methods are
severely limited by the resources and equipment they require. Materials science is one of the booming areas
where machine learning is making new inroads. The discovery of new material properties holds the key to many problems,
such as climate change and the development of new semiconducting materials. DeepChem acts as a toolbox for using machine
learning in materials science.
This tutorial can also be used in Google colab. If you'd like to open this notebook in colab, you can use the following link.
This notebook is made to run without any GPU support.
Open in Colab
DeepChem for materials science will also require the additional libraries pymatgen and matminer . These two libraries
assist machine learning in materials science. For the graph neural network models used in the backend,
DeepChem requires the dgl library. All of these can be installed using pip . Note that when running locally, install a recent version
of the jupyter notebook (>6.5.5, as on Colab).
import deepchem as dc
dc.__version__
import pymatgen as mg
from pymatgen import core as core
import os
os.environ['DEEPCHEM_DATA_DIR'] = os.getcwd()
Featurizers
Material Structure Featurizers
Crystals are geometric structures which have to be featurized for use in machine learning algorithms. The following
featurizers provided by DeepChem help in featurizing crystals:
The SineCoulombMatrix featurizer featurizes a crystal by calculating the sine Coulomb matrix for the crystal. It can be called using the
dc.feat.SineCoulombMatrix function. [1]
The CGCNNFeaturizer calculates structure graph features of crystals. It can be called using the
dc.feat.CGCNNFeaturizer function. [2]
The LCNNFeaturizer calculates the 2-D surface graph features in 6 different permutations. It can be used using the
utility dc.feat.LCNNFeaturizer . [3]
[1] Faber et al. “Crystal Structure Representations for Machine Learning Models of Formation Energies”, Inter. J.
Quantum Chem. 115, 16, 2015. https://arxiv.org/abs/1503.07406
[2] T. Xie and J. C. Grossman, “Crystal graph convolutional neural networks for an accurate and interpretable prediction
of material properties”, Phys. Rev. Lett. 120, 2018, https://arxiv.org/abs/1710.10324
[3] Jonathan Lym, Geun Ho Gu, Yousung Jung, and Dionisios G. Vlachos, Lattice Convolutional Neural Network Modeling
of Adsorbate Coverage Effects, J. Phys. Chem. C 2019 https://pubs.acs.org/doi/10.1021/acs.jpcc.9b03370
The CsCl crystal is a cubic lattice with the chloride atoms lying upon the lattice points at the corners of the cube, while
the caesium atoms lie in the holes in the center of the cubes. The green colored atoms are the caesium atoms in this
crystal structure and the chloride atoms are the grey ones.
Source: Wikipedia
# Define a cubic lattice with lattice parameter a = 4.2 Angstrom
lattice = mg.core.Lattice.cubic(4.2)
# Atoms in a crystal
atomic_species = ["Cs", "Cl"]
# Coordinates of atoms in a crystal
cs_coords = [0, 0, 0]
cl_coords = [0.5, 0.5, 0.5]
structure = mg.core.Structure(lattice, atomic_species, [cs_coords, cl_coords])
structure
structure
Structure Summary
Lattice
abc : 4.2 4.2 4.2
angles : 90.0 90.0 90.0
volume : 74.08800000000001
A : 4.2 0.0 0.0
B : 0.0 4.2 0.0
C : 0.0 0.0 4.2
pbc : True True True
PeriodicSite: Cs (0.0, 0.0, 0.0) [0.0, 0.0, 0.0]
PeriodicSite: Cl (2.1, 2.1, 2.1) [0.5, 0.5, 0.5]
In above code sample, we first defined a cubic lattice using the cubic lattice parameter a . Then, we created a structure
with atoms in the crystal and their coordinates as features. A nice introduction to crystallographic coordinates can be
found here. Once a structure is defined, it can be featurized using CGCNN Featurizer. Featurization of a crystal using
CGCNNFeaturizer returns a DeepChem GraphData object which can be used for machine learning tasks.
featurizer = dc.feat.CGCNNFeaturizer()
features = featurizer.featurize([structure])
features[0]
The ElementPropertyFingerprint can be used to find the fingerprint of elements based on elemental stoichiometry. It can
be used via a call to dc.feat.ElementPropertyFingerprint . [4]
The ElemNetFeaturizer returns a vector containing fractional compositions of each element in the compound. It can
be used via a call to dc.feat.ElemNetFeaturizer . [5]
[4] Ward, L., Agrawal, A., Choudhary, A. et al. A general-purpose machine learning framework for predicting properties
of inorganic materials. npj Comput Mater 2, 16028 (2016). https://doi.org/10.1038/npjcompumats.2016.28
[5] Jha, D., Ward, L., Paul, A. et al. "ElemNet: Deep Learning the Chemistry of Materials From Only Elemental
Composition", Sci Rep 8, 17593 (2018). https://doi.org/10.1038/s41598-018-35934-y
comp = core.Composition("Fe2O3")
featurizer = dc.feat.ElementPropertyFingerprint()
features = featurizer.featurize([comp])
features[0]
Datasets
DeepChem has the following material properties dataset as part of MoleculeNet suite of datasets. These datasets can be
used for a variety of tasks in material science like predicting structure formation energy, metallicity of a compound etc.
The Band Gap dataset contains 4604 experimentally measured band gaps for inorganic crystal structure
compositions. The dataset can be loaded using dc.molnet.load_bandgap utility.
The Perovskite dataset contains 18928 perovskite structures and their formation energies. It can be loaded using a
call to dc.molnet.load_perovskite .
The Formation Energy dataset contains 132752 calculated formation energies and inorganic crystal structures from
the Materials Project database. It can be loaded using a call to dc.molnet.load_mp_formation_energy .
The Metallicity dataset contains 106113 inorganic crystal structures from the Materials Project database labeled as
metals or nonmetals. It can be loaded using dc.molnet.load_mp_metallicity utility.
In the example below, we will demonstrate loading the perovskite dataset and use it to predict the formation energy of new
crystals. Perovskite structures are structures adopted by many oxides. Ideally it is a cubic structure, but non-cubic
variants also exist. Each datapoint in the perovskite dataset contains the lattice structure as a
pymatgen.core.Structure object and the formation energy of the corresponding structure. It can be loaded for machine
learning tasks by calling the dc.molnet.load_perovskite utility. The utility takes care of loading, featurizing
and splitting the dataset for machine learning tasks.
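A minimal sketch of loading the dataset (the loader also accepts featurizer and splitter arguments, which are left at their defaults here):
tasks, datasets, transformers = dc.molnet.load_perovskite()
train_dataset, valid_dataset, test_dataset = datasets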
train_dataset.get_data_shape
losses = []
model = dc.models.CGCNNModel(mode='regression', batch_size=32, learning_rate=0.001)  # model choice and hyperparameters assumed
for _ in range(10):  # record the average loss after each epoch of training
    losses.append(model.fit(train_dataset, nb_epoch=1))
plt.plot(losses)
Once the model is fit, we evaluate its performance using an error metric, since it is a regression task. For selecting a
metric, the dc.metrics.mean_squared_error or dc.metrics.mean_absolute_error functions can be used, and we evaluate the
model by calling model.evaluate .
metric = dc.metrics.Metric(dc.metrics.mean_absolute_error)
print("Training set score:", model.evaluate(train_dataset, [metric], transformers))
print("Test set score:", model.evaluate(test_dataset, [metric], transformers))
Further Reading
For further reading on getting started on using machine learning for material science, here are two great resources:
Colab
This tutorial and the rest in this sequence can be done in Google Colab (although the visualization at the end doesn't
work correctly on Colab, so you might prefer to run this tutorial locally). If you'd like to open this notebook in colab, you
can use the following link.
Open in Colab
WARNING:tensorflow:From c:\Users\HP\anaconda3\envs\deep\lib\site-packages\tensorflow\python\util\deprecation.py:
588: calling function (from tensorflow.python.eager.polymorphic_function.polymorphic_function) with experimental
_relax_shapes is deprecated and will be removed in a future version.
Instructions for updating:
experimental_relax_shapes is deprecated, use reduce_retracing instead
Skipped loading modules with pytorch-geometric dependency, missing a dependency. No module named 'dgl'
Skipped loading modules with transformers dependency. No module named 'transformers'
cannot import name 'HuggingFaceModel' from 'deepchem.models.torch_models' (c:\users\hp\deepchem_2\deepchem\model
s\torch_models\__init__.py)
Skipped loading modules with pytorch-lightning dependency, missing a dependency. No module named 'lightning'
Skipped loading some Jax models, missing a dependency. No module named 'jax'
'2.8.1.dev'
Reinforcement Learning
Reinforcement learning involves an agent that interacts with an environment. In this case, the environment is the video
game and the agent is the player. By trial and error, the agent learns a policy that it follows to perform some task
(winning the game). As it plays, it receives rewards that give it feedback on how well it is doing. In this case, it receives
a positive reward every time it scores a point and a negative reward every time the other player scores a point.
The first step is to create an Environment that implements this task. Fortunately, OpenAI Gym already provides an
implementation of Pong (and many other tasks appropriate for reinforcement learning). DeepChem's GymEnvironment
class provides an easy way to use environments from OpenAI Gym. We could just use it directly, but in this case we
subclass it and preprocess the screen image a little bit to make learning easier.
import deepchem as dc
import numpy as np
class PongEnv(dc.rl.GymEnvironment):
    def __init__(self):
        super(PongEnv, self).__init__('Pong-v4')
        self._state_shape = (80, 80)

    @property
    def state(self):
        # Crop everything outside the play area, reduce the image size,
        # and convert it to black and white.
        state_array = self._state
        cropped = state_array[34:194, :, :]
        reduced = cropped[0:-1:2, 0:-1:2]
        grayscale = np.sum(reduced, axis=2)
        bw = np.zeros(grayscale.shape, dtype=np.float32)
        bw[grayscale != 233] = 1
        return bw

env = PongEnv()
Next we create a model to implement our policy. This model receives the current state of the environment (the pixels
being displayed on the screen at this moment) as its input. Given that input, it decides what action to perform. In Pong
there are three possible actions at any moment: move the paddle up, move it down, or leave it where it is. The policy
model produces a probability distribution over these actions. It also produces a value output, which is interpreted as an
estimate of how good the current state is. This turns out to be important for efficient learning.
The model begins with two convolutional layers to process the image. That is followed by a dense (fully connected) layer
to provide plenty of capacity for game logic. We also add a small Gated Recurrent Unit (GRU). That gives the network a
little bit of memory, so it can keep track of which way the ball is moving. Just from the screen image, you cannot tell
whether the ball is moving to the left or to the right, so having memory is important.
We concatenate the dense and GRU outputs together, and use them as inputs to two final layers that serve as the
network's outputs. One computes the action probabilities, and the other computes an estimate of the state value
function.
We also provide an input for the initial state of the GRU, and return its final state at the end. This is required by the
learning algorithm.
import torch
import torch.nn as nn
import torch.nn.functional as F
class PongPolicy(dc.rl.Policy):
    def __init__(self):
        super(PongPolicy, self).__init__(['action_prob', 'value', 'rnn_state'], [np.zeros(16, dtype=np.float32)])
We will optimize the policy using the Advantage Actor Critic (A2C) algorithm. There are lots of hyperparameters we
could specify at this point, but the default values for most of them work well on this problem. The only one we need to
customize is the learning rate.
import torch.nn.functional as F
from deepchem.rl.torch_rl.torch_a2c import A2C
Optimize for as long as you have patience to. By 1 million steps you should see clear signs of learning. Around 3 million
steps it should start to occasionally beat the game's built in AI. By 7 million steps it should be winning almost every
time. Running on my laptop, training takes about 20 minutes for every million steps.
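A sketch of setting up and training the algorithm (the 0.0002 learning rate and model_dir below are assumptions, not necessarily the original notebook's values):
from deepchem.models.optimizers import Adam

policy = PongPolicy()
a2c = A2C(env, policy, model_dir='model', optimizer=Adam(learning_rate=0.0002))
a2c.fit(1000000)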
DeepChem makes it very easy to estimate the uncertainty of predicted outputs (at least for the models that support it—
not all of them do). Let's start by seeing an example of how to generate uncertainty estimates. We load a dataset,
create a model, train it on the training set, predict the output on the test set, and then derive some uncertainty
estimates.
Colab
This tutorial and the rest in this sequence are designed to be done in Google colab. If you'd like to open this notebook in
colab, you can use the following link.
Open in Colab
We'll use the Delaney dataset from the MoleculeNet suite to run our experiments in this tutorial. Let's load up our
dataset for our experiments, and then make some uncertainty predictions.
import deepchem as dc
import numpy as np
import matplotlib.pyplot as plot
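A minimal sketch of that workflow, assuming an ECFP featurization and a MultitaskRegressor (the exact model and hyperparameters in the original notebook may differ):
tasks, datasets, transformers = dc.molnet.load_delaney(featurizer='ECFP', splitter='random')
train_dataset, valid_dataset, test_dataset = datasets

# uncertainty=True adds the extra outputs and dropout needed for uncertainty estimation
model = dc.models.MultitaskRegressor(len(tasks), 1024, uncertainty=True)
model.fit(train_dataset, nb_epoch=20)
y_pred, y_std = model.predict_uncertainty(test_dataset)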
All of this looks exactly like any other example, with just two differences. First, we add the option uncertainty=True
when creating the model. This instructs it to add features to the model that are needed for estimating uncertainty.
Second, we call predict_uncertainty() instead of predict() to produce the output. y_pred is the predicted
outputs. y_std is another array of the same shape, where each element is an estimate of the uncertainty (standard
deviation) of the corresponding element in y_pred . And that's all there is to it! Simple, right?
Of course, it isn't really that simple at all. DeepChem is doing a lot of work to come up with those uncertainties. So now
let's pull back the curtain and see what is really happening. (For the full mathematical details of calculating uncertainty,
see https://arxiv.org/abs/1703.04977)
To begin with, what does "uncertainty" mean? Intuitively, it is a measure of how much we can trust the predictions.
More formally, we expect that the true value of whatever we are trying to predict should usually be within a few
standard deviations of the predicted value. But uncertainty comes from many sources, ranging from noisy training data
to bad modelling choices, and different sources behave in different ways. It turns out there are two fundamental types
of uncertainty we need to take into account.
Aleatoric Uncertainty
Consider the following graph. It shows the best fit linear regression to a set of ten data points.
How can we estimate the size of this uncertainty? By training a model to do it, of course! At the same time it is learning
to predict the outputs, it is also learning to predict how accurately each output matches the training data. For every
output of the model, we add a second output that produces the corresponding uncertainty. Then we modify the loss
function to make it learn both outputs at the same time.
Epistemic Uncertainty
Now consider these three curves. They are fit to the same data points as before, but this time we are using 10th degree
polynomials.
# x and y are the ten data points from the earlier example; the values below are
# illustrative assumptions, since the original cell defining them is not shown.
x = np.linspace(0, 5, 10)
y = 0.15 * x + np.random.random(10)

plot.figure(figsize=(12, 3))
line_x = np.linspace(0, 5, 50)
for i in range(3):
    plot.subplot(1, 3, i+1)
    plot.scatter(x, y)
    fit = np.polyfit(np.concatenate([x, [3]]), np.concatenate([y, [i]]), 10)
    plot.plot(line_x, np.poly1d(fit)(line_x))
plot.show()
Each of them perfectly interpolates the data points, yet they clearly are different models. (In fact, there are infinitely
many 10th degree polynomials that exactly interpolate any ten data points.) They make identical predictions for the
data we fit them to, but for any other value of x they produce different predictions. This is called epistemic uncertainty.
It means the data does not fully constrain the model. Given the training data, there are many different models we could
have found, and those models make different predictions.
The ideal way to measure epistemic uncertainty is to train many different models, each time using a different random
seed and possibly varying hyperparameters. Then use all of them for each input and see how much the predictions vary.
This is very expensive to do, since it involves repeating the whole training process many times. Fortunately, we can
approximate the same effect in a less expensive way: by using dropout.
Recall that when you train a model with dropout, you are effectively training a huge ensemble of different models all at
once. Each training sample is evaluated with a different dropout mask, corresponding to a different random subset of
the connections in the full model. Usually we only perform dropout during training and use a single averaged mask for
prediction. But instead, let's use dropout for prediction too. We can compute the output for lots of different dropout
masks, then see how much the predictions vary. This turns out to give a reasonable estimate of the epistemic
uncertainty in the outputs.
Uncertain Uncertainty?
Now we can combine the two types of uncertainty to compute an overall estimate of the error in each output: the variances of the two contributions add, so sigma_total = sqrt(sigma_aleatoric^2 + sigma_epistemic^2).
This is the value DeepChem reports. But how much can you trust it? Remember how I started this tutorial: deep learning
models should not be used as black boxes. We want to know how reliable the outputs are. Adding uncertainty estimates
does not completely eliminate the problem; it just adds a layer of indirection. Now we have estimates of how reliable the
outputs are, but no guarantees that those estimates are themselves reliable.
Let's go back to the example we started with. We trained a model on the Delaney training set, then generated predictions
and uncertainties for the test set. Since we know the correct outputs for all the test samples, we can evaluate how well
we did. Here is a plot of the absolute error in the predicted output versus the predicted uncertainty.
abs_error = np.abs(y_pred.flatten()-test_dataset.y.flatten())
plot.scatter(y_std.flatten(), abs_error)
plot.xlabel('Standard Deviation')
plot.ylabel('Absolute Error')
plot.show()
The first thing we notice is that the axes have similar ranges. The model clearly has learned the overall magnitude of
errors in the predictions. There also is clearly a correlation between the axes. Values with larger uncertainties tend on
average to have larger errors. (Strictly speaking, we expect the absolute error to be less than the predicted uncertainty.
Even a very uncertain number could still happen to be close to the correct value by chance. If the model is working well,
there should be more points below the diagonal than above it.)
Now let's see how well the values satisfy the expected distribution. If the standard deviations are correct, and if the
errors are normally distributed (which is certainly not guaranteed to be true!), we expect 95% of the values to be within
two standard deviations, and 99% to be within three standard deviations. Here is a histogram of errors as measured in
standard deviations.
plot.hist(abs_error/y_std.flatten(), 20)
plot.show()
All the values are in the expected range, and the distribution looks roughly Gaussian although not exactly. Perhaps this
indicates the errors are not normally distributed, but it may also reflect inaccuracies in the uncertainties. This is an
important reminder: the uncertainties are just estimates, not rigorous measurements. Most of them are pretty good, but
you should not put too much confidence in any single value.
import numpy as np
import functools
try:
    import jax
    import jax.numpy as jnp
    import haiku as hk
    import optax
    from deepchem.models import PINNModel, JaxModel
    from deepchem.data import NumpyDataset
    from deepchem.models.optimizers import Adam
    from jax import jacrev
    has_haiku_and_optax = True
except:
    has_haiku_and_optax = False
import matplotlib.pyplot as plt

# Ten noisy supervised data points sampled from cos(x)
give_size = 10
in_given = np.linspace(-2 * np.pi, 2 * np.pi, give_size)
out_given = np.cos(in_given) + 0.1 * np.random.normal(loc=0.0, scale=1, size=give_size)

# A dense grid and the true function values, used for plotting the "Actual data" curve.
# These two arrays are assumed here, since the cell defining them is not shown in full.
test = np.expand_dims(np.linspace(-3 * np.pi, 3 * np.pi, 1000), 1)
out_array = np.cos(test)

plt.figure(figsize=(13, 7))
plt.plot(test, out_array, color='blue', alpha=0.5)
plt.scatter(in_given, out_given, color='green', marker="o")
plt.xlabel("x --> ", fontsize=18)
plt.ylabel("f (x) -->", fontsize=18)
plt.legend(["Actual data", "Supervised Data"], prop={'size': 16}, loc="lower right")
# The forward function defines F, which describes the mathematical operations like matrix & dot products, sigmoid functions, etc.
# W is the init_params
def f(x):
    net = hk.nets.MLP(output_sizes=[256, 128, 1], activation=jax.nn.softplus)
    val = net(x)
    return val
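Before the network can be used, the Haiku function is transformed and its parameters initialized; the sketch below shows one way to do that. The nn_model used next is a DeepChem JaxModel built from forward_fn and params, whose exact constructor arguments are not reproduced here.
model_transformed = hk.transform(f)
rng = jax.random.PRNGKey(500)
# initialize the MLP parameters by tracing it with the supervised inputs
params = model_transformed.init(rng, jnp.array(np.expand_dims(in_given, 1), dtype=jnp.float32))
forward_fn = model_transformed.apply  # called later as forward_fn(params, rng, x)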
dataset_test = NumpyDataset(test)
nn_output = nn_model.predict(dataset_test)
plt.figure(figsize=(13, 7))
plt.plot(test, out_array, color = 'blue', alpha = 0.5)
plt.scatter(in_given, out_given, color = 'green', marker = "o")
plt.plot(test, nn_output, color = 'red', marker = "o", alpha = 0.7)
plt.xlabel("x --> ", fontsize=18)
plt.ylabel("f (x) -->", fontsize=18)
plt.legend(["Actual data", "Vanilla NN", "Supervised Data"], prop={'size': 16}, loc ="lower right")
def create_eval_fn(forward_fn, params):
    # Builds the evaluation function used by PINNModel (the outer signature is assumed from its usage below)
    @jax.jit
    def eval_model(x, rng=None):
        bu = forward_fn(params, rng, x)
        return jnp.squeeze(bu)
    return eval_model
def gradient_fn(forward_fn, loss_outputs, initial_data):
    # Builds the loss combining the supervised data term with the differential-equation
    # residual term (the body of model_loss is not shown here)
    @jax.jit
    def model_loss(params, target, weights, rng, x_train):
        ...
    return model_loss
initial_data = {
'X0': jnp.expand_dims(in_given, 1),
'u0': jnp.expand_dims(out_given, 1)
}
opt = Adam(learning_rate=1e-3)
pinn_model= PINNModel(
forward_fn=forward_fn,
params=params,
initial_data=initial_data,
batch_size=1000,
optimizer=opt,
grad_fn=gradient_fn,
eval_fn=create_eval_fn,
deterministic=True,
log_frequency=1000)
# defining our training data. We feed 1000 points between [-3pi, 3pi] without the labels,
# which will be used for the differential loss (regulariser)
X_f = np.expand_dims(np.linspace(-3 * np.pi, 3 * np.pi, 1000), 1)
dataset = NumpyDataset(X_f)
pinn_model.fit(dataset, nb_epochs=3000)
pinn_output = pinn_model.predict(dataset_test)
plt.figure(figsize=(13, 7))
plt.plot(test, out_array, color = 'blue', alpha = 0.5)
plt.scatter(in_given, out_given, color = 'green', marker = "o")
# plt.plot(test, nn_output, color = 'red', marker = "x", alpha = 0.3)
plt.scatter(test, pinn_output, color = 'red', marker = "o", alpha = 0.7)
Open in Colab
Before getting our hands dirty with code, let us first understand a little bit about what Neural ODEs are. In short, they are another kind of neural network layer; let's see the formal definition as stated by the original paper:
Neural ODEs are a new family of deep neural network models. Instead of specifying a discrete sequence of
hidden layers, we parameterize the derivative of the hidden state using a neural network.
The output of the network is computed using a blackbox differential equation solver. These are
continuous-depth models that have constant memory cost, adapt their evaluation strategy to each input,
and can explicitly trade numerical precision for speed.
In simple words perceive NeuralODEs as yet another type of layer like Linear, Conv2D, MHA...
In this tutorial we will be using torchdiffeq. This library provides ordinary differential equation (ODE) solvers
implemented in PyTorch framework. The library provides a clean API of ODE solvers for usage in deep learning
applications. As the solvers are implemented in PyTorch, algorithms in this repository are fully supported to run on the
GPU.
Installing Libraries
Import Libraries
import torch
import torch.nn as nn
import deepchem as dc
import matplotlib.pyplot as plt
Before diving into the core of this tutorial, let's first acquaint ourselves with the usage of torchdiffeq. Let's solve the
following differential equation: dz/dt = t, with z = 0 when t = 0.
from torchdiffeq import odeint

def f(t, z):
    return t

z0 = torch.Tensor([0])
t = torch.linspace(0, 2, 100)
out = odeint(f, z0, t)
Let's plot our result. It should be a parabola: since dz/dt = t with z(0) = 0, the analytic solution is z(t) = t^2/2.
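For example:
plt.plot(t, out)
plt.xlabel("t")
plt.ylabel("z(t)")
plt.show()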
Reference
The central idea now is to use a differential equation solver as part of a learnt differentiable computation graph (the sort
of computation graph ubiquitous to deep learning). As a simple example, suppose we observe an image
(RGB and 32x32 pixels), and wish to classify it as a picture of a cat or as a picture of a dog.
With torchdiffeq, we can solve even complex higher-order differential equations. The following is a real-world example:
a set of differential equations that models a spring-mass damper system.
class SystemOfEquations:
    x0 = torch.Tensor([1])
    dx0 = torch.Tensor([0])
    ddx0 = torch.Tensor([1])
This is precisely the same procedure as the more general neural ODEs we introduced earlier. At first glance, the NDE
approach of ‘putting a neural network in a differential equation’ may seem unusual, but it is actually in line with
standard practice. All that has happened is to change the parameterisation of the vector field.
Model
Let us have a look at how to embed an ODEsolver in a neural network .
class f(nn.Module):
    def __init__(self, dim):
        super(f, self).__init__()
        self.model = nn.Sequential(
            nn.Linear(dim, 124),
            nn.ReLU(),
            nn.Linear(124, 124),
            nn.ReLU(),
            nn.Linear(124, dim),
            nn.Tanh()
        )

    def forward(self, t, x):
        # torchdiffeq passes (t, state); this vector field ignores t
        return self.model(x)
The function f defined above is the vector field that will be embedded within a neural network. ODEBlock treats the
received input x as the initial value of the differential equation. The integration interval of ODEBlock is fixed at
[0, 1], and it returns the output of the layer at t = 1.
class ODEBlock(nn.Module):
    # This is ODEBlock. Think of it as a wrapper over the ODE solver, so as to easily connect it with our neurons!
    def __init__(self, f):
        super(ODEBlock, self).__init__()
        self.f = f

    def forward(self, x):
        # treat the input x as the initial value and integrate over the fixed interval [0, 1]
        out = odeint(self.f, x, torch.tensor([0.0, 1.0]))
        return out[1]
class ODENet(nn.Module):
    # This is our main neural network that uses ODEBlock within a sequential module.
    # The layer choices in __init__ (BatchNorm1d, Dropout) are assumptions, as the original cell is truncated.
    def __init__(self, in_dim, mid_dim, out_dim):
        super(ODENet, self).__init__()
        self.fc1 = nn.Linear(in_dim, mid_dim)
        self.relu1 = nn.ReLU()
        self.norm1 = nn.BatchNorm1d(mid_dim)
        self.ode_block = ODEBlock(f(mid_dim))
        self.norm2 = nn.BatchNorm1d(mid_dim)
        self.dropout = nn.Dropout(0.4)
        self.fc2 = nn.Linear(mid_dim, out_dim)

    def forward(self, x):
        out = self.fc1(x)
        out = self.relu1(out)
        out = self.norm1(out)
        out = self.ode_block(out)
        out = self.norm2(out)
        out = self.dropout(out)
        out = self.fc2(out)
        return out
As mentioned before, Neural ODE networks act similarly to other neural networks (with some advantages), so we can
solve any problem with them that existing models can. We are going to reuse the training process described in this
DeepChem tutorial.
Rather than demonstrating how to use a Neural ODE model with a generic dataset, we shall use the Delaney solubility
dataset provided by DeepChem. Our model will learn to predict the solubilities of molecules based on their
extended-connectivity fingerprints (ECFPs). For performance metrics we use pearson_r2_score. Here the loss is computed
directly from the model's output.
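A minimal sketch of loading that dataset (the featurizer and splitter choices here are assumptions consistent with the description above):
tasks, datasets, transformers = dc.molnet.load_delaney(featurizer='ECFP', splitter='random')
train_set, valid_set, test_set = datasets
metric = dc.metrics.Metric(dc.metrics.pearson_r2_score)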
Time to Train
We train our model for 50 epochs, with L2 as Loss Function.
# Like mentioned before one can use GPUs with PyTorch and torchdiffeq
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = ODENet(in_dim=1024, mid_dim=1000, out_dim=1).to(device)
model = dc.models.TorchModel(model, dc.models.losses.L2Loss())
model.fit(train_set, nb_epoch=50)
Neural ODEs are invertible neural nets. Invertible neural networks have been a significant thread of research
in the ICML community for several years. Such transformations can offer a range of unique benefits:
They preserve information, allowing perfect reconstruction (up to numerical limits) and obviating the need to store
hidden activations in memory for backpropagation.
They are often designed to track the changes in probability density that applying the transformation induces (as in
normalizing flows).
Like autoregressive models, normalizing flows can be powerful generative models which allow exact likelihood
computations; with the right architecture, they can also allow for much cheaper sampling than autoregressive
models.
While many researchers are aware of these topics and intrigued by several high-profile papers, few are familiar enough
with the technical details to easily follow new developments and contribute. Many may also be unaware of the wide
range of applications of invertible neural networks, beyond generative modelling and variational inference.
Scientific advancement in machine learning hinges on the effective resolution of complex optimization problems. From
material property design to drug discovery, these problems often involve numerous variables and intricate relationships.
Traditional optimization techniques often face hurdles when addressing such challenges, often resulting in slow
convergence or solutions deemed unreliable. We introduce solutions that are differentiable and also seamlessly
integrable into machine learning systems, offering a novel approach to resolving these complexities.
This tutorial introduces DeepChem's comprehensive set of differentiable optimization tools to empower researchers
across the physical sciences. DeepChem addresses limitations of conventional methods by offering a diverse set of
optimization algorithms. These include established techniques like Broyden's first and second methods alongside
cutting-edge advancements, allowing researchers to select the most effective approach for their specific problem.
Along with these optimization algorithms, DeepChem also provides a number of utilities for implementing more
algorithms.
Nonlinear equations are essential across various disciplines, including physics, engineering, economics, biology, and
finance. They describe complex relationships and phenomena that cannot be adequately modeled with linear equations.
From gravitational interactions in celestial bodies to biochemical reactions in living organisms, non-linear equations play
a vital role in understanding and predicting real-world systems, whether it's optimizing structures, analyzing market
dynamics, or designing machine learning algorithms.
sin(x) is a trigonometric function defined for all real numbers. It represents the ratio of the length of the side opposite an
angle in a right triangle to the length of the hypotenuse.
cos(x) is another trigonometric function. It represents the ratio of the length of the adjacent side of a right triangle to the
length of the hypotenuse when x is the measure of an acute angle.
x^2 is a parabola, symmetric around the y-axis, with its vertex at the origin. It represents a mathematical model of
quadratic growth or decay. In physical systems, it often describes phenomena where the rate of change is proportional
to the square of the quantity involved.
plt.tight_layout()
plt.show()
At its core, rootfinding seeks to determine the solutions (roots) of equations, where a function equals zero. This
operation plays a pivotal role in numerous real-world applications, making it indispensable in both theoretical and
practical domains.
Broyden's Method is an extension of the Secant Method for systems of nonlinear equations. It iteratively updates an
approximation to the Jacobian matrix using the information from previous iterations. The algorithm converges to the
solution by updating the variables in the direction that minimizes the norm of the system of equations.
Steps:
References:
[1] "A class of methods for solving nonlinear simultaneous equations" by Charles G. Broyden
import torch
from deepchem.utils.differentiation_utils import rootfinder

def func1(y, A):
    return torch.tanh(A @ y + 0.1) + y / 2.0

A = torch.tensor([[1.1, 0.4], [0.3, 0.8]]).requires_grad_()
y0 = torch.zeros((2, 1))
# Solve func1(y, A) = 0 starting from y0; the choice of "broyden1" here is an assumption
yroot = rootfinder(func1, y0, params=(A,), method="broyden1")
(tensor(2.2752, grad_fn=<ViewBackward0>),
tensor(1.7881e-06, grad_fn=<AddBackward0>))
Steps:
Equilibrium methods are essential in machine learning for optimizing models, ensuring stability and convergence,
regularizing parameters, and analyzing strategic interactions in multi-agent systems. By leveraging equilibrium
principles and techniques, machine learning practitioners can train more robust and generalizable models capable of
addressing a wide range of real-world challenges.
Given a mapping g, compute a fixed point x* such that g(x*) = x*.
Classical Approach:
Steps:
1. Start from an initial guess x_0 and repeatedly apply the fixed-point mapping, x_{k+1} = g(x_k), until successive iterates stop changing.
Anderson Acceleration:
Steps:
1. Start from the fixed-point mapping g.
2. Choose an initial guess x_0 (e.g., x_0 = 1). Select weights over the previous iterations satisfying that the weights sum to one.
3. Form the next iterate as the corresponding weighted combination of the mapped previous iterates.
import torch
import matplotlib.pyplot as plt
from deepchem.utils.differentiation_utils.optimize.equilibrium import anderson_acc
x_value, f_value = [], []
def fcn(x, a):
    x_value.append(x.item())
    f_value.append((a/x + x).item()/2)
    return (a/x + x)/2
a = 2.0
x0 = torch.tensor([1.0], requires_grad=True)
x = anderson_acc(fcn, x0, params=[a], maxiter=16)
print("Root by Anderson Acceleration:", x.item())
print("Function Value at Calculated Root:", fcn(x, a).item())
Minimizer
deepchem.utils.differentiation_utils.optimize.minimizer provides a collection of algorithms for minimizing
functions. These methods are designed to find the minimum of a function efficiently, making them indispensable for a
wide range of applications in mathematics, physics, engineering, and other fields.
Minimization algorithms, including variants of gradient descent like ADAM, are fundamental tools in various fields of
science, engineering, and optimization.
Gradient Descent
Gradient descent is a method for unconstrained mathematical optimization. It is a first-order iterative algorithm for
finding a local minimum of a differentiable multivariate function.
It is used to minimize the cost function in various machine learning and optimization problems. It iteratively updates the
parameters in the direction of the negative gradient of the cost function.
Steps:
1. Compute the gradient of the cost function with respect to the parameters at the current point.
2. Adjust the parameters in the opposite direction of the gradient to minimize the cost function according to the
learning rate: x_{k+1} = x_k - gamma * grad f(x_k).
import torch
from deepchem.utils.differentiation_utils.optimize.minimizer import gd
def fcn(x):
    return 2 * x + (x - 2) ** 2, 2 * (x - 2) + 2
x0 = torch.tensor(0.0, requires_grad=True)
x = gd(fcn, x0, [])
print("Minimum by Gradient Descent:", x.item())
print("Function Value at Calculated Minimum:", fcn(x)[0].item())
Steps:
1. Initialize the first-moment estimate m_0 and the second-moment estimate v_0 to zero.
2. At each iteration of training, the gradient g_t of the loss function with respect to the parameters is computed.
3. m and v are updated using exponential decay, with momentum and RMSProp components respectively:
m_t = β1 * m_{t-1} + (1 - β1) * g_t
v_t = β2 * v_{t-1} + (1 - β2) * g_t²
4. Due to the initialization of the moving averages to zero vectors, there is a bias towards zero, especially during the initial iterations. To correct this bias, ADAM applies a bias-correction step:
m_hat_t = m_t / (1 - β1^t), v_hat_t = v_t / (1 - β2^t)
5. Finally, the parameters (weights and biases) of the model are updated using the corrected moving averages and the learning rate γ:
x_t = x_{t-1} - γ * m_hat_t / (sqrt(v_hat_t) + ε)
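To make these update equations concrete, here is a minimal from-scratch sketch; adam_step is a hypothetical helper, not the DeepChem adam routine, applied to the same test function used in the next cell.
import torch

def adam_step(x, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # One ADAM update implementing the equations above
    m = beta1 * m + (1 - beta1) * grad          # first-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment (RMSProp) estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    x = x - lr * m_hat / (v_hat.sqrt() + eps)   # parameter update
    return x, m, v

x = torch.tensor(10.0)
m = torch.tensor(0.0)
v = torch.tensor(0.0)
for t in range(1, 20001):
    grad = 2 * (x - 2) + 2                      # gradient of f(x) = 2x + (x - 2)^2
    x, m, v = adam_step(x, grad, m, v, t)
print(x)                                        # approaches the analytic minimum x = 1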
import torch
from deepchem.utils.differentiation_utils.optimize.minimizer import adam
def fcn(x):
return 2 * x + (x - 2) ** 2, 2 * (x - 2) + 2
x0 = torch.tensor(10.0, requires_grad=True)
x = adam(fcn, x0, [], maxiter=20000)
print("X at Minimum by Adam:", x.item())
print("Function Value at Calculated Minimum:", fcn(x)[0].item())
Conclusion
Differentiable optimization techniques are essential for many advanced computational experiments, including environment simulations such as DFT and Physics-Informed Neural Networks, and they form part of the mathematical foundation of molecular simulation methods such as Monte Carlo and Molecular Dynamics.
By integrating deep learning into simulations, we can improve efficiency and accuracy by using trainable neural networks to replace costly or less precise components. This holds immense potential for accelerating scientific progress and addressing longstanding questions more effectively.
References
[1] Raissi M, Perdikaris P, Karniadakis GE. Physics-informed neural networks: A deep learning framework for solving
forward and inverse problems involving nonlinear partial differential equations. Journal of Computational physics (2019)
[2] Muhammad F. Kasim, Sam M. Vinko. Learning the exchange-correlation functional from nature with fully
differentiable density functional theory. 2021 American Physical Society
[3] Nathan Argaman, Guy Makov. Density Functional Theory -- an introduction. American Journal of Physics 68 (2000),
69-79
[4] John Ingraham et al. Learning Protein Structure with a Differentiable Simulator. ICLR. 2019.
@manual{Quantum Chemistry,
title={Differentiation Infrastructure in Deepchem},
organization={DeepChem},
author={Singh, Rakshit kr.},
howpublished =
{\url{https://github.com/deepchem/deepchem/blob/master/examples/tutorials/Differentiation_Infrastructure_in_D
year={2024},
}
Ordinary Differential Equations (ODEs) are a cornerstone of mathematical modeling, essential for understanding
dynamic systems in scientific and engineering fields.
ODEs consist of unknown functions and their derivatives, establishing relationships that describe how a quantity
changes over time or space. These equations are fundamental in expressing the dynamics of systems.
The general first-order form is dy/dt = f(t, y). Here, dy/dt is the derivative of y with respect to t, and f is a function of t and y. ODEs are used across many fields:
Physics: To describe the motion of particles, the evolution of wave functions, and more.
Biology: To model population dynamics, the spread of diseases, and biological processes.
Control Systems and Robotics: In control systems and robotics, ODEs are fundamental in describing the dynamics of
systems.
Euler's Method
Mid Point Method
3/8 Method
RK-4 Method
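To make the stepping idea concrete before using the DeepChem solvers, here is a minimal from-scratch Euler integrator; euler_ivp is a hypothetical helper that mirrors the fcn(t, y, params) interface used below.
import torch

def euler_ivp(fcn, y0, t, params):
    # Explicit Euler: y_{n+1} = y_n + h * f(t_n, y_n)
    ys = [y0]
    for n in range(len(t) - 1):
        h = t[n + 1] - t[n]
        ys.append(ys[-1] + h * fcn(t[n], ys[-1], params))
    return torch.stack(ys)

# Example: dy/dt = -a * y, whose exact solution is y(t) = y0 * exp(-a * t)
t = torch.linspace(0, 5, 100)
sol = euler_ivp(lambda t, y, params: -params[0] * y, torch.tensor([1.0]), t, torch.tensor([1.0]))
print(sol[-1].item(), torch.exp(torch.tensor(-5.0)).item())  # Euler estimate vs exact value at t = 5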
A solver call takes a time grid, an initial value, and any parameters of the right-hand side, for example:
t = torch.linspace(0, 20, 5)
y_0 = torch.tensor([1])
a = torch.tensor([1])
Starting from y_0, each explicit method advances the solution one step at a time; Euler's method uses y_{n+1} = y_n + h * f(t_n, y_n) for n = 0, 1, 2, 3, ..., while the midpoint, 3/8, and RK-4 methods combine several slope evaluations per step for higher accuracy.
As a worked example, consider the second-order equation d²y/dt² + dy/dt - a·y = 0 with a = 6, given that y = 5 and dy/dt at t = 0 is -5. Introducing z = dy/dt converts it into the first-order system dy/dt = z, dz/dt = a·y - z, whose analytical solution is y(t) = 2e^{2t} + 3e^{-3t}.
Procedure:
from deepchem.utils.differentiation_utils.integrate.explicit_rk import rk4_ivp
import matplotlib.pyplot as plt
import torch

# Assumed wrapper (its def/return lines are cut off in this excerpt); the state is [y, z] with z = dy/dt
def sode(t, y, params):
    a = params[0]
    y, z = y[0], y[1]
    dydt = z
    dzdt = a * y - z
    return torch.stack([dydt, dzdt])

params = torch.tensor([6])
t = torch.linspace(0, 1, 100)
y0 = torch.tensor([5., -5.])
sol = rk4_ivp(sode, y0, t, params)

yy = 2 * torch.exp(2 * t) + 3 * torch.exp(-3 * t)   # analytical solution for comparison
plt.plot(t, sol[:, 0], label="RK4")                 # assumed comparison plot (not shown in this excerpt)
plt.plot(t, yy, "--", label="Analytical")
plt.legend()
plt.show()
The Lotka–Volterra system of equations is an example of a Kolmogorov model, which is a more general framework that
can model the dynamics of ecological systems with predator–prey interactions, competition, disease, and mutualism.
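The cell that defines the system is not included in this excerpt; a minimal stand-in consistent with the solver call below, assuming the parameter ordering params = [alpha, beta, delta, gamma] and reusing the imports from the previous example:
from deepchem.utils.differentiation_utils.integrate.explicit_rk import rk38_ivp

def lotka_volterra(t, y, params):
    # y = [prey, predator]; d(prey)/dt = alpha*prey - beta*prey*predator,
    # d(predator)/dt = delta*prey*predator - gamma*predator
    alpha, beta, delta, gamma = params
    dprey = alpha * y[0] - beta * y[0] * y[1]
    dpredator = delta * y[0] * y[1] - gamma * y[1]
    return torch.stack([dprey, dpredator])

t = torch.linspace(0, 50, 500)   # assumed time grid; the original grid is not shown in this excerpt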
solver_param = [lotka_volterra,
torch.tensor([10., 1.]),
t,
torch.tensor([1.1, 0.4, 0.1, 0.4])]
sol_rk38 = rk38_ivp(*solver_param)
plt.plot(t, sol_rk38)
plt.show()
Lotka-Volterra (Parameter Estimation)
Parameter estimation is used to estimate the values of the adjustable parameters in the ODE. The parameters describe an underlying physical setting in such a way that their values affect the distribution of the measured data. An estimator attempts to approximate the unknown parameters from the measurements.
import pandas as pd

dataset = pd.read_csv('assets/population_data.csv')
years = torch.tensor(dataset['year'])
fish_pop = torch.tensor(dataset['fish_hundreds'])
bears_pop = torch.tensor(dataset['bears_hundreds'])

y0 = torch.tensor([fish_pop[0], bears_pop[0]])

# Assumed loss function (its signature and the residual line are cut off in this excerpt):
# simulate the populations on the measured years and sum the squared residuals
def loss_function(params):
    output = rk4_ivp(lotka_volterra, y0, years.to(torch.float), torch.tensor(params))
    loss = 0
    for i in range(len(years)):
        data_fish = fish_pop[i]
        model_fish = output[i, 0]
        data_bears = bears_pop[i]
        model_bears = output[i, 1]
        res = (data_fish - model_fish) ** 2 + (data_bears - model_bears) ** 2   # assumed residual
        loss += res
    return loss
import scipy.optimize

# Assumed optimizer call (the original call is cut off in this excerpt); the initial guess is illustrative
minimum = scipy.optimize.fmin(loss_function, [1.1, 0.4, 0.1, 0.4])
alpha_fit = minimum[0]
beta_fit = minimum[1]
delta_fit = minimum[2]
gamma_fit = minimum[3]

# Re-simulate with the fitted parameters and plot against the measured years (assumed)
y0 = torch.tensor([fish_pop[0], bears_pop[0]])
t = years.to(torch.float)
output = rk4_ivp(lotka_volterra, y0, t, torch.tensor([alpha_fit, beta_fit, delta_fit, gamma_fit]))
plt.plot(t, output)
plt.show()
SIR Epidemiology
The SIR model is one of the simplest compartmental models, and many models are derivatives of this basic form. The
model consists of three compartments:
S: The number of susceptible individuals. When a susceptible and an infectious individual come into "infectious
contact", the susceptible individual contracts the disease and transitions to the infectious compartment.
I: The number of infectious individuals. These are individuals who have been infected and are capable of infecting
susceptible individuals.
R: The number of removed (and immune) or deceased individuals. These are individuals who have been infected
and have either recovered from the disease and entered the removed compartment, or died. It is assumed that the
number of deaths is negligible with respect to the total population. This compartment may also be called
"recovered" or "resistant".
import torch
import matplotlib.pyplot as plt
from deepchem.utils.differentiation_utils.integrate.explicit_rk import rk4_ivp

# Assumed wrapper (the defining lines are partly cut off in this excerpt): the SIR
# right-hand side in the solver's fcn(t, y, params) form, with params = [beta, gamma]
def sir_model(t, y, params):
    S, I, R = y[0], y[1], y[2]
    beta, gamma = params
    N = S + I + R
    dSdt = - beta * I * S / N
    dIdt = beta * I * S / N - gamma * I
    dRdt = gamma * I
    return torch.stack([dSdt, dIdt, dRdt])

beta = 0.04
gamma = 0.01
y0 = torch.tensor([100., 1., 0.])                           # floats so the solver state stays continuous
t = torch.linspace(0, 500, 500)                             # assumed time grid
y = rk4_ivp(sir_model, y0, t, torch.tensor([beta, gamma]))  # assumed solver call

plt.plot(t, y)
plt.legend(["Susceptible", "Infectious", "Removed"])
plt.show()
SIS Model
Some infections, for example, those from the common cold and influenza, do not confer any long-lasting immunity. Such
infections may give temporary resistance but do not give long-term immunity upon recovery from infection, and
individuals become susceptible again.
Model: dS/dt = -β·S·I/N + γ·I, dI/dt = β·S·I/N - γ·I
Total Population: N = S + I (constant)
import torch
import matplotlib.pyplot as plt
from deepchem.utils.differentiation_utils.integrate.explicit_rk import rk4_ivp

# Assumed wrapper (partly cut off in this excerpt); params = [beta, gamma]
def sis_model(t, y, params):
    S, I = y[0], y[1]
    beta, gamma = params
    N = S + I
    dSdt = - beta * S * I / N + gamma * I
    dIdt = beta * S * I / N - gamma * I
    return torch.stack([dSdt, dIdt])

beta = 0.04
gamma = 0.01
y0 = torch.tensor([100., 1.])
t = torch.linspace(0, 500, 500)                             # assumed time grid
y = rk4_ivp(sis_model, y0, t, torch.tensor([beta, gamma]))  # assumed solver call
plt.plot(t, y)
plt.legend(["Susceptible", "Infectious"])
plt.show()
References
1. More Computational Biology and Python by Mike Saint-Antoine, https://www.youtube.com/playlist?list=PLWVKUEZ25V97W2qS7faggHrv5gdhPcgjq
@manual{Differential Equation,
title={Differentiation Infrastructure in Deepchem},
organization={DeepChem},
author={Singh, Rakshit kr. and Ramsundar, Bharath},
howpublished =
{\url{https://github.com/deepchem/deepchem/blob/master/examples/tutorials/ODE_Solving.ipynb}},
year={2024},
}
Introduction
In the preceding sections of this tutorial series, we focused on training models using DeepChem for various applications.
However, we haven't yet addressed the important topic of equivariant modeling.
Equivariant modeling ensures that the relationship between input and output remains consistent even when subjected
to symmetry operations. By incorporating equivariant modeling techniques, we can effectively analyze and predict
diverse properties by leveraging the inherent symmetries present in the data. This is particularly valuable in the fields of
cheminformatics, bioinformatics, and material sciences, where understanding the interplay between symmetries and
properties of molecules and materials is critical.
This tutorial aims to explore the concept of equivariance and its significance within the domains of chemistry, biology,
and material sciences. We will delve into the reasons why equivariant modeling is vital for accurately characterizing and
predicting the properties of molecules and materials. By the end, you will have a solid understanding of the importance
of equivariance and how it can significantly enhance our modeling capabilities in these areas.
You can follow this tutorial using Google Colab. If you'd like to open this notebook in Colab, you can use the following
link.
Open in Colab
What is Equivariance
A key aspect of the structure in our data is the presence of certain symmetries. To effectively capture this structure, our
model should incorporate our knowledge of these symmetries. Therefore, our model should retain the symmetries of the
input data in its outputs. In other words, when we apply a symmetry operation (denoted by σ) to the input and pass it
through the model, the result should be the same as applying σ to the output of the model.
f(σ(x)) = σ(f(x))
Here, f represents the function learned by our model. If this equation holds for every symmetry operation in a collection
S, we say that f is equivariant with respect to S.
While a precise definition of equivariance involves group theory and allows for differences between the applied
symmetry operations on the input and output, we'll focus on the case where they are identical to keep things simpler.
Group Equivariant Convolutional Networks exemplify this stricter definition of equivariance.
Interestingly, equivariance shares a similarity with linearity. Just as linear functions are equivariant with respect to
scalar multiplication, equivariant functions allow symmetry operations to be applied inside or outside the function.
To gain a better understanding, let's consider Convolutional Neural Networks (CNNs). The image below demonstrates
how CNNs exhibit equivariance with respect to translation: a shift in the input image directly corresponds to a shift in
the output features.
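Since the figure is not reproduced here, a quick numerical check of the same idea with a 1D convolution; signal, kernel, and conv are illustrative names in this small sketch.
import numpy as np

signal = np.array([0., 1., 2., 3., 0., 0., 0., 0.])
kernel = np.array([0.5, 0.5])

def conv(x):
    return np.convolve(x, kernel, mode="same")   # a translation-equivariant operation

shifted = np.roll(signal, 2)                                  # apply the symmetry operation σ (a shift) to the input
print(np.allclose(conv(shifted), np.roll(conv(signal), 2)))   # True: f(σ(x)) == σ(f(x))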
It is also useful to relate equivariance to the concept of invariance, which is more familiar. If a function f is invariant, its
output remains unchanged when σ is applied to the input. In this case, the equation simplifies to:
f(σ(x)) = f(x)
An equivariant embedding in one layer can be transformed into an invariant embedding in a subsequent layer. The
feasibility and meaningfulness of this transformation depend on the implementation of equivariance. Notably, networks
with multiple convolutional layers followed by a global average pooling layer (GAP) achieve this conversion. In such
cases, everything up to the GAP layer exhibits translation equivariance, while the output of the GAP layer (and the entire
network) becomes invariant to translations of the input.
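Continuing the toy 1D sketch above, pooling the equivariant convolution features gives a shift-invariant readout; gap_model is a hypothetical name.
def gap_model(x):
    return conv(x).mean()   # equivariant convolution followed by global average pooling

print(np.isclose(gap_model(signal), gap_model(np.roll(signal, 2))))   # True: the pooled output is invariant to the shift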
3. Improved Generalization
Equivariant models have the advantage of generalizing well to unseen data. By incorporating the known symmetries
and structures of the domain into the model architecture, equivariance ensures that the model can effectively capture
and utilize these patterns even when presented with novel examples. This leads to improved generalization
performance, making equivariant models valuable in scenarios where extrapolation or prediction on unseen instances is
crucial.
4. Efficient Processing of Graph-Structured Data
Graph-structured data possess rich relational information and symmetries. Equivariant models specifically tailored for
graph data offer a natural and efficient way to model and reason about these complex relationships. By considering the
symmetries of the graph, equivariant models can effectively capture the local and global patterns, enabling tasks such
as node classification, link prediction, and graph generation.
Example
Traditional machine learning (ML) algorithms face challenges when predicting molecular properties due to the
representation of molecules. Typically, molecules are represented as 3D Cartesian arrays with a shape of (points, 3).
However, neural networks (NN) cannot directly process such arrays because each position in the array lacks individual
significance. For instance, a molecule can be represented by one Cartesian array centered at (0, 0, 0) and another
centered at (15, 15, 15), both representing the same molecule but with distinct numerical values. This exemplifies
translational variance. Similarly, rotational variance arises when the molecule is rotated instead of translated.
In these examples, if the different arrays representing the same molecule are inputted into the NN, it would perceive
them as distinct molecules, which is not the case. To address these issues of translational and rotational variance,
considerable efforts have been devoted to devising alternative input representations for molecules. Let's demonstrate with some code how to create functions that obey a set of equivariances. We won't be training these models, because training has no effect on equivariances.
Our example input consists of an array of atomic positions (one 3D coordinate per atom) and an array of per-atom features. The features are encoded as one-hot vectors, where [1, 0] indicates a carbon atom and [0, 1] indicates a hydrogen atom. In this specific example, our focus is on predicting the energy associated with the molecule. It's important to note that we will not be training our models, meaning the predicted energy values will not be accurate.
import numpy as np
np.random.seed(42) # seed for reproducibility
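The cell that constructs the example inputs is not reproduced in this excerpt; a minimal stand-in, assuming a four-atom fragment and using the names R_i and X_i that appear in the later cells. Atoms 0 and 1 are given identical (hydrogen) features so that swapping only their positions, as done below, is a genuine permutation of the molecule.
# Hypothetical toy inputs (the original notebook builds these earlier): random 3D
# positions and one-hot element features ([1, 0] = carbon, [0, 1] = hydrogen)
R_i = np.random.rand(4, 3)                                # positions, shape (num_atoms, 3)
X_i = np.array([[0., 1.], [0., 1.], [0., 1.], [1., 0.]])  # features for H, H, H, C, shape (num_atoms, 2)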
An example of a model that lacks equivariances is a one-hidden layer dense neural network. In this model, we
concatenate the positions and features of our data into a single input tensor, which is then passed through a dense
layer. The dense layer utilizes the hyperbolic tangent (tanh) activation function and has a hidden layer dimension of 16.
The output layer, which performs regression to energy, does not have an activation function. The weights of the model
are always initialized randomly.
def hidden_model(r: np.ndarray, x: np.ndarray, w1: np.ndarray, w2: np.ndarray, b1: np.ndarray, b2: float) -> np.ndarray:
r"""Computes the output of a 1-hidden layer neural network model.
Parameters
----------
r : np.ndarray
Input array for position values.
Shape: (num_atoms, num_positions)
x : np.ndarray
Input array for features.
Shape: (num_atoms, num_features)
w1 : np.ndarray
Weight matrix for the first layer.
Shape: (num_atoms * (num_positions + num_features), hidden_size)
w2 : np.ndarray
Weight matrix for the second layer.
Shape: (hidden_size, output_size)
b1 : np.ndarray
Bias vector for the first layer.
Shape: (hidden_size,)
b2 : float
Bias value for the second layer.
Returns
-------
float
Predicted energy of the molecule
"""
i = np.concatenate((r, x), axis=1).flatten() # Stack inputs into one large input
v = np.tanh(i @ w1 + b1) # Apply activation function to first layer
v = v @ w2 + b2 # Multiply with weights and add bias for the second layer
return v
Although our model is not trained, that is not a concern here, since we only want to see whether its output is affected by permutations, translations, and rotations.
permuted_R_i = np.copy(R_i)
permuted_R_i[0], permuted_R_i[1] = R_i[1], R_i[0] # Swap the rows of R_i
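The comparison itself is not shown in this excerpt; a hypothetical check, creating weights sized for this model's flattened input of length num_atoms * (num_positions + num_features):
num_atoms = R_i.shape[0]
w1 = np.random.normal(size=(num_atoms * 5, 16))   # hypothetical weights for the un-trained dense model
b1 = np.random.normal(size=(16,))
w2 = np.random.normal(size=(16,))
b2 = np.random.normal()

print(hidden_model(R_i, X_i, w1, w2, b1, b2))            # original ordering
print(hidden_model(permuted_R_i, X_i, w1, w2, b1, b2))   # two hydrogens swapped: a different output
print(hidden_model(R_i + 5.0, X_i, w1, w2, b1, b2))      # translated copy: a different output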
As expected, our model is not invariant to any permutations, translations, or rotations. Let's fix them.
Permutational Invariance
In a molecular context, the arrangement or ordering of points in an input tensor holds no significance. Therefore, it is
crucial to be cautious and avoid relying on this ordering. To ensure this, we adopt a strategy of solely performing atom-
wise operations within the network to obtain atomic property predictions. When predicting molecular properties, we
need to cumulatively combine these atomic predictions, such as using summation, to arrive at the desired result. This
approach guarantees that the model does not depend on the arbitrary ordering of atoms within the input tensor.
def hidden_model_perm(r: np.ndarray, x: np.ndarray, w1: np.ndarray, w2: np.ndarray, b1: np.ndarray, b2: float) -> np.ndarray:
r"""Computes the output of a 1-hidden layer neural network model with permutation invariance.
Parameters
----------
r : np.ndarray
Input array for position values.
Shape: (num_atoms, num_positions)
x : np.ndarray
Input array for features.
Shape: (num_atoms, num_features)
w1 : np.ndarray
Weight matrix for the first layer.
Shape: (num_positions + num_features, hidden_size)
w2 : np.ndarray
Weight matrix for the second layer.
Shape: (hidden_size, output_size)
b1 : np.ndarray
Bias vector for the first layer.
Shape: (hidden_size,)
b2 : float
Bias value for the second layer.
Returns
-------
float
Predicted energy of the molecule
"""
i = np.concatenate((r, x), axis=1) # Stack inputs into one large input
v = np.tanh(i @ w1 + b1) # Apply activation function to first layer
v = np.sum(v, axis=0) # Reduce the output by summing across the axis which gives permutational invariance
v = v @ w2 + b2 # Multiply with weights and add bias for the second layer
return v
# Initialize weights
w1 = np.random.normal(size=(5, 16))
b1 = np.random.normal(size=(16,))
w2 = np.random.normal(size=(16,))
b2 = np.random.normal()
In this implementation, the model computes intermediate activations v for each atom separately, one row per atom along axis 0. By summing across axis 0 with np.sum(v, axis=0), the model effectively collapses all the per-atom activations into a single vector, regardless of the order of the input positions.
This reduction operation allows the model to be permutation invariant because the final output is only dependent on the
aggregated information from the intermediate activations and is not affected by the specific order of the input positions.
Therefore, the model produces the same output for different permutations of the input positions, ensuring permutation
invariance.
Now let's see whether these changes affected our model's sensitivity to permutations.
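A hypothetical check along the same lines (the notebook's own comparison cell is not shown in this excerpt):
print(hidden_model_perm(R_i, X_i, w1, w2, b1, b2))            # original ordering
print(hidden_model_perm(permuted_R_i, X_i, w1, w2, b1, b2))   # two hydrogens swapped: the same output
print(hidden_model_perm(R_i + 5.0, X_i, w1, w2, b1, b2))      # translated copy: a different output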
Indeed! As anticipated, our model demonstrates invariance to permutations while remaining sensitive to translations or
rotations.
Translational Invariance
To address the issue of translational variance in modeling molecules, one approach is to compute the distance matrix of
the molecule. This distance matrix provides a representation that is invariant to translation. However, this approach
introduces a challenge: the distance features change from having three features per atom to num_atoms features per atom (one distance to every other atom). Consequently, we have introduced a dependency on the number of atoms in our distance features,
making it easier to inadvertently break permutation invariance. To mitigate this issue, we can simply sum over the
newly added axis, effectively collapsing the information into a single value. This summation ensures that the model
remains invariant to permutations, restoring the desired permutation invariance property.
def hidden_model_permute_translate(r: np.ndarray, x: np.ndarray, w1: np.ndarray, w2: np.ndarray, b1: np.ndarray, b2: float) -> np.ndarray:
r"""Computes the output of a 1-hidden layer neural network model with permutation and translation invariance.
Parameters
----------
r : np.ndarray
Input array for position values.
Shape: (num_atoms, num_positions)
x : np.ndarray
Input array for features.
Shape: (num_atoms, num_features)
w1 : np.ndarray
Weight matrix for the first layer.
Shape: (num_positions + num_features, hidden_size)
w2 : np.ndarray
Weight matrix for the second layer.
Shape: (hidden_size, output_size)
b1 : np.ndarray
Bias vector for the first layer.
Shape: (hidden_size,)
b2 : float
Bias value for the second layer.
Returns
-------
float
Predicted energy of the molecule
"""
d = r - r[:, np.newaxis]  # Compute pairwise displacement vectors using broadcasting, shape (num_atoms, num_atoms, 3)
# Assumed completion (the next steps are cut off in this excerpt): pair each displacement with the atom features
xr = np.broadcast_to(x[np.newaxis], (r.shape[0], x.shape[0], x.shape[1]))
i = np.concatenate((d, xr), axis=-1)  # per-pair input, shape (num_atoms, num_atoms, num_positions + num_features)
v = np.tanh(i @ w1 + b1)  # Apply activation function to first layer
v = np.sum(v, axis=(0, 1))  # Reduce the output over both axes by summing
v = v @ w2 + b2  # Multiply with weights and add bias for the second layer
return v
To achieve translational invariance, the function calculates pairwise distances between the position values in the r
array. This is done by subtracting r from r[:, np.newaxis], which broadcasts r along a new axis, enabling element-wise
subtraction.
The pairwise distance calculation is based on the fact that subtracting the positions r from each other effectively
measures the distance or difference between them. By including the pairwise distances in the input, the model can learn
and capture the relationship between the distances and the features. This allows the model to be invariant to
translations, meaning that shifting the positions within each set while preserving their relative distances will result in the
same output.
Now let's see whether these changes affected our model's sensitivity to permutations and translations.
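Again, a hypothetical check (the notebook's own comparison cell is not shown in this excerpt):
rot90 = np.array([[0., -1., 0.], [1., 0., 0.], [0., 0., 1.]])   # 90-degree rotation about the z-axis
print(hidden_model_permute_translate(R_i, X_i, w1, w2, b1, b2))
print(hidden_model_permute_translate(R_i + 5.0, X_i, w1, w2, b1, b2))      # translated copy: the same output
print(hidden_model_permute_translate(R_i @ rot90.T, X_i, w1, w2, b1, b2))  # rotated copy: generally a different output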
Yes! Our model is invariant to both permutations and translations but not to rotations.
Rotational Invariance
Atom-centered symmetry functions exhibit rotational invariance due to the invariance of the distance matrix. While this
property is suitable for tasks where scalar values, such as energy, are predicted from molecules, it poses a challenge for
problems that depend on directionality. In such cases, achieving rotational equivariance is desired, where the output of
the network rotates in the same manner as the input. Examples of such problems include force prediction and molecular
dynamics.
To address this, we can convert the pairwise vectors into pairwise distances. To simplify the process, we utilize squared
distances. This conversion allows us to incorporate directional information while maintaining simplicity. By considering
the squared distances, we enable the network to capture and process the relevant geometric relationships between
atoms, enabling rotational equivariance and facilitating accurate predictions for direction-dependent tasks.
def hidden_model_permute_translate_rotate(r: np.ndarray, x: np.ndarray, w1: np.ndarray, w2: np.ndarray, b1: np.ndarray, b2: float) -> np.ndarray:
r"""Computes the output of a 1-hidden layer neural network model with permutation, translation, and rotation invariance.
Parameters
----------
r : np.ndarray
Input array for position values.
Shape: (num_atoms, num_positions)
x : np.ndarray
Input array for features.
Shape: (num_atoms, num_features)
w1 : np.ndarray
Weight matrix for the first layer.
Shape: (num_positions, hidden_size)
w2 : np.ndarray
Weight matrix for the second layer.
Shape: (hidden_size, output_size)
b1 : np.ndarray
Bias vector for the first layer.
Shape: (hidden_size,)
b2 : float
Bias value for the second layer.
Returns
-------
float
Predicted energy of the molecule
"""
# Compute pairwise displacement vectors using broadcasting
d = r - r[:, np.newaxis]
# Compute squared distances
d2 = np.sum(d**2, axis=-1, keepdims=True)
# Assumed completion (cut off in this excerpt): pair squared distances with atom features and apply the hidden layer
xr = np.broadcast_to(x[np.newaxis], (r.shape[0], x.shape[0], x.shape[1]))
i = np.concatenate((d2, xr), axis=-1)  # per-pair input, shape (num_atoms, num_atoms, 1 + num_features)
v = np.tanh(i @ w1 + b1)  # Apply activation function to first layer
v = np.sum(v, axis=(0, 1))  # Reduce over both pair axes by summing for permutation invariance
v = v @ w2 + b2  # Multiply with weights and add bias for the second layer
return v
# Initialize weights
w1 = np.random.normal(size=(3, 16))
b1 = np.random.normal(size=(16,))
w2 = np.random.normal(size=(16,))
b2 = np.random.normal()
The hidden_model_permute_translate_rotate function achieves rotational invariance by using pairwise squared distances between atoms instead of the pairwise displacement vectors themselves. By using squared distances, the function is able to summarize the interatomic geometry while still maintaining simplicity in the calculation.
Squared distances inherently encode geometric relationships between atoms, such as their relative positions and
orientations. This information is essential for capturing the directionality of interactions and phenomena in tasks like
force prediction and molecular dynamics, where rotational equivariance is desired.
The conversion from pairwise vectors to pairwise squared distances allows the model to capture and process these
geometric relationships. Since squared distances only consider the magnitudes of vectors, disregarding their directions,
the resulting network output remains invariant under rotations of the input.
Now let's see whether these changes affected our model's sensitivity to rotations.
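A final hypothetical check (the notebook's own comparison cell is not shown in this excerpt), applying a rotation, a translation, and the earlier permutation:
theta = np.pi / 4
rot = np.array([[np.cos(theta), -np.sin(theta), 0.],
                [np.sin(theta),  np.cos(theta), 0.],
                [0.,             0.,            1.]])   # rotation about the z-axis

print(hidden_model_permute_translate_rotate(R_i, X_i, w1, w2, b1, b2))
print(hidden_model_permute_translate_rotate(R_i @ rot.T, X_i, w1, w2, b1, b2))   # rotated copy: the same output
print(hidden_model_permute_translate_rotate(R_i + 5.0, X_i, w1, w2, b1, b2))     # translated copy: the same output
print(hidden_model_permute_translate_rotate(permuted_R_i, X_i, w1, w2, b1, b2))  # swapped hydrogens: the same output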
Yes! Now our model is invariant to permutations, translations, and rotations.
With these new changes, our model exhibits improved representation capacity and generalization while preserving the
symmetry of the molecules.
References
Bronstein, M.M., Bruna, J., Cohen, T., & Veličković, P. (2021). Geometric Deep Learning: Grids, Groups, Graphs,
Geodesics, and Gauges. ArXiv, abs/2104.13478.
White, A.D. (2022). Deep learning for molecules and materials. Living Journal of Computational Molecular Science.
Geiger, M., & Smidt, T.E. (2022). e3nn: Euclidean Neural Networks. ArXiv, abs/2207.09453.