Preview 2022 Python For Chemists Aramis Tanemura
This is a limited PDF preview of the primer. The entire work is available in
ePub3 and includes additional multimedia.
Diego Sierra-Costa
Michigan State University
https://pubs.acs.org/doi/book/10.1021/acsinfocus.7e5030
Preface
Python is a simple and concise scripting language (1). Unlike compiled languages, Python is interpreted; it is known for its ease of
coding rather than for its computational speed. Python code specialized and optimized for a particular application can be shared between projects as
Python packages. This ease of development makes Python an attractive language for chemistry, a broad field that draws on diverse
skill sets yet offers many opportunities for algorithmic solutions.
Learning Python is easy and should not be a barrier. Knowing a handful of Python constructs is enough to begin solving problems, and
the basics can be picked up in one sitting. There are also plenty of online resources for looking up syntax and resolving bugs.
What is less obvious is how to frame chemical problems in Python. Once that framing is in place, programming empowers chemists to apply their domain
knowledge at scales unreachable by manual effort.
In this digital primer, readers will explore practical use cases of Python for chemical data analysis, cheminformatics, machine learning, and
molecular modeling. We aim to help readers develop an intuition for finding algorithmic solutions. To
accomplish this, we explore a broad set of chemical problems and illustrate solutions implemented in Python. We do not expect all the problems
to be directly applicable to all readers. Instead, readers develop the skill to identify problems in their own research for which code can
automate operations and scale to large volumes of data or calculation. This digital primer draws on many functionalities of Python but is not
intended to be a resource for learning particular algorithms or chemistry. Readers are encouraged to supplement such concepts with further reading
from external sources.
In the first chapter, we explore basic Python and introduce relevant packages throughout subsequent chapters, with exposure to Python code
applied to chemical problems. The material is not intended to comprehensively discuss topics and packages developed for chemical applications in
Python, but rather to get you started on using Python in chemistry research fast.
Therefore, we shorten the time from “learning” to “using” Python as much as possible, because the only time you substantively learn Python is
when you use it. There is no point in getting overwhelmed by endless books and tutorials, never to arrive at the actual programming part. Just
start solving problems and turn them into a project. If completing a project in Python was rewarding and you want to delve deeper into writing
software in Python, you can choose to gain training at an intermediate level. If not, you gained foundational Python knowledge to apply later
when it becomes useful.
Maybe the code in your first project turns out messy and difficult to use for the first time. That is fine because it is the first version that can be
incrementally improved. And if you stick with programming, when you come back to the project, you can rewrite the entire thing beautifully and
a lot faster. There will never be such progress unless you start.
The only time you will not encounter a bug is when you are not coding. Do not take a bug as a failure or any inadequacy on your part. Not
only is it normal, but each bug you encounter is something new you learn about your data and code. Be excited about the discovery of the codeʼs
unexpected behavior and make your code even stronger by fixing the bug.
Kiyoto Aramis Tanemura is a Ph.D. student working in the research group of Prof. Kenneth M. Merz in the Department of
Chemistry, Michigan State University. At the interface of computational chemistry and artificial intelligence, his research aims
to develop methodologies to predict spectral properties of small organic molecules for high throughput identification. He
completed his B.A. at Kalamazoo College in Chemistry and Mathematics, with concentrations in Biological Chemistry and
Molecular Biology as well as Biological Physics. He uses Python every day in all aspects of his research.
Diego Sierra-Costa is a doctoral candidate at the Department of Chemistry at Michigan State University under the
supervision of Prof. Kenneth M. Merz. His research in mathematical artificial intelligence and chemistry focuses on developing
new representations of small molecules for the prediction and calculation of physicochemical properties. Diego received his
B.Sc. in Physics from the National Autonomous University of Mexico where he focused on quantum optics and cold atoms.
Photo credit: Delilah Pacheco
Kenneth M. Merz, Jr. is the Joseph Zichis Chair in Chemistry and a University Distinguished Professor at Michigan State
University. He is also the Editor-in-Chief of the ACS Journal of Chemical Information and Modeling. His research interest lies in
the development of theoretical and computational tools and their application to biological problems including structure and
ligand-based drug design, mechanistic enzymology, and methodological verification and validation. He has received several
honors including election as an ACS Fellow, the 2010 ACS Award for Computers in Chemical and Pharmaceutical Research,
election as a fellow of the American Association for the Advancement of Science, and a John Simon Guggenheim Fellowship.
3 Cheminformatics
3.1 Introduction
3.2 The SMILES and SMARTS Languages
3.2.1 SMILES
3.2.2 SMARTS
3.3 RDKit
3.4 Atoms and Bonds
3.5 Reactions
3.6 Inspecting a Database
3.7 Finding Substructures
3.8 Fingerprints
3.9 Molecular Similarity
3.10 That’s a Wrap
3.11 Read These Next
CHAPTER 1
1.3.1 Indexing
1.3.2 Primer Design for Polymerase Chain Reaction
1.3.3 Practice Problems
1.4 Functions
1.6.1 For-Loop
1.6.1.1 List Comprehension
1.6.1.2 Iterables
1.6.2 While-Loop
1.6.3 Continue, Break, and Pass
1.6.4 Practice Problems
1.7 That’s a Wrap
1.1 INTRODUCTION
Programming is increasingly valuable in modern chemistry. Algorithmic approaches allow us to interpret data that are impractical or infeasible
for a domain expert to inspect manually. As scientists, we produce, maintain, interpret, and communicate data. We are also tasked with faithfully
analyzing the data we collect, regardless of their format or volume. Programming lets us navigate larger data sets.
In this chapter, we cover basic Python syntax and follow it with chemically relevant problems. This chapter focuses on base Python with minimal
external packages so that we learn to design programmatic solutions to chemical problems rather than relying on existing solutions. To get
comfortable with coding, try the problems in this chapter as you read along to check your understanding. As you solve the problems, consider where
you might use code like this in your research.
Using existing solutions is crucial in research. While there is value in knowing how software works, it is impossible to know the inner
workings of every tool we use. At a certain level, the software we use is a black box. So, while we use only base Python in this chapter, this is
for educational purposes. The code you actually write and use in research would likely import relevant packages already optimized for the
particular tasks. The intuition we build by solving chemical problems in base Python will help us identify problems in our own
work that have algorithmic solutions.
We can complete many tasks using Python by understanding how to use the following operations:
• Numerical operations
• String operations
• Functions
• Conditional statements
• Loops
Let us become proficient in these commands in the subsequent sections of the chapter. In each section, we introduce the associated statements/
commands and use them on chemically relevant problems.
As scientists who handle quantitative data, we use a scientific calculator for various applications. We can run the same calculations in Python. Let
us illustrate how we can perform some simple arithmetic operations.
# addition
5+7
12
# subtraction
5-7
-2
# multiplication
5*7
35
# division
5/7
0.7142857142857143
# exponentiation
5 ** 7
78125
# modulus
5%7
5
We can assign quantities to variables using the equal sign. The name of the variable can be almost anything; however, it is good practice to use
long, descriptive names for variables so that the quantity saved to the variable is obvious. Generic names like x or y are fine, but it is not always
obvious when a reader sees an operation done on these variables.
five = 5
seven = 7
five + seven
12
We get the expected value when we save the numerical values 5 and 7 to variables five and seven, respectively, and perform a numerical
operation on the variable.
Variables are useful to organize the many numerical values we handle as chemists; they may be known constants, experimental parameters,
controlled variables, and measured data. The calculation is made modular and readable by using Python variables.
To illustrate, let us suppose we calculate the number of phenylalanine molecules in 1.00 g of phenylalanine hydrochloride. We can do this as
follows:
(1.1)
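A minimal sketch of the calculation described above is given here. The molar mass of phenylalanine hydrochloride (201.65 g/mol) and the rounded Avogadro constant are assumptions chosen for illustration, and all variable names other than num_phe (which the print statement that follows relies on) are hypothetical.

```python
# known constants (values assumed for this sketch)
avogadro_number = 6.022e23  # molecules/mol, rounded
phe_hcl_mw = 201.65         # g/mol, phenylalanine hydrochloride (assumed)

# experimental parameter
mass_phe_hcl = 1.00         # g

# convert the mass of the salt to moles
mol_phe_hcl = mass_phe_hcl / phe_hcl_mw

# each formula unit of the salt contains one phenylalanine molecule
num_phe = mol_phe_hcl * avogadro_number
```

Keeping each quantity in its own named variable makes it easy to change the mass or swap in a different salt without rewriting the arithmetic.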
Note that the text that follows “#” in the code creates comments. Comments explain the content of the code to a developer and are ignored by the
Python interpreter. Code must be annotated with comments so that it is coherent and can be used by other users and by yourself later.
# print the result with units using the print function in base Python
print('1.00 g of Phe-HCl contains %.2e Phe molecules' %num_phe)
Tip: When writing code, try writing the comments before the code. In the comments, explain what exactly the code should do. Then, start
writing code where the comments appear. This helps break down the code into many smaller problems and ensures the finished code is well
annotated.
Of course, we can do this calculation in one line on a scientific calculator, but we document our calculation using variables rather than bare
quantities. The calculation below is identical to the former, but without context the code is less transparent about what was calculated.
Additionally, performing the calculation across several lines avoids some common errors, such as missing parentheses.
2.9863625092982893e+21
Modules are external Python codes that can be imported. They are optimized to perform specific tasks. Here, we highlight some mathematical
variables and functions that we can import from the math module. The math module is a built-in module for mathematical functions and
variables not explicitly available in base Python. The functionality of the module can be utilized in a Python script using the import statement. To
access particular functions or variables in the module, we put a period after the module and invoke the object. Below, we compute the factorial of
five.
# We can import entire modules. We can call any functions within the math module.
import math
math.factorial(5)
120
5*4*3*2*1
120
We may want to import only certain items from the module. In this case, we can use the from <module> import <object> syntax. For
instance, we import only the constant π and the natural log function from the math module.
from math import pi, log
pi
3.141592653589793
log(5)
1.6094379124341003
So far, we have computed algebraic expressions using mostly base Python. Functions not available explicitly in base Python can be imported from
modules, such as the math module.
Practice problems are available for download as a Jupyter Notebook file. This notebook documents the code, which can be executed by opening
a session with an interactive kernel. You can open and edit the downloaded notebook by opening it locally in a Jupyter Notebook session or
uploading it to a web service for this task.
Calculate the mass of each reagent/catalyst to add to run the reaction at 1.0 mmol scale.
# molecular weights
benzodioxaborole_mw = 202.06 # g/mol
bromopropene_mw = 135.00 # g/mol
pd_catalyst_mw = 1155.59 # g/mol
B. The Aspergillus niger L-arabinose reductase LarA catalyzes the reduction of L-arabinose to L-arabitol. The assay for LarA activity with
L-arabinose as the substrate follows Michaelis–Menten kinetics with Km of 54 mM (3). An assay with 0.72 mg of enzyme and 10 mM of
L-arabinose exhibited a velocity of 3.4 units. Calculate Vmax and kcat. Some equations are listed below.
Michaelis–Menten equation
$$v = \frac{V_{max}[S]}{K_m + [S]} \tag{1.2}$$
in which v is the velocity of the reaction, Vmax is the velocity at excess substrate, Km is the Michaelis–Menten constant, and [S] is the
concentration of the substrate.
turnover number
$$k_{cat} = \frac{V_{max}}{[E]} \tag{1.3}$$
in which kcat is the turnover number and [E] is the concentration of the enzyme.
C. Suppose we are quantifying chlorophyll a concentration in environmental samples. The specific absorption coefficient ϵ was previously
determined as 84.3 L·g⁻¹·cm⁻¹ at 664 nm for organic extract of chlorophyll a (4). Five milliliters of water sample was extracted in ether to a final
volume of 13 mL and yielded an absorbance of 0.31 units at 664 nm with 1 cm path length. Use the Beer–Lambert law to estimate the
chlorophyll a concentration.
D. Atmospheric carbon dioxide level is estimated to have risen by over 100 ppm since the industrial revolution. Aqueous carbon dioxide is in
equilibrium with aqueous carbonic acid; thus an increase in atmospheric carbon dioxide can lower the pH of water. Acidification of the ocean is a
concern for marine ecosystems because shells and skeletons of marine organisms in coral reefs, which are made of calcium carbonate, can dissolve
at lower pH levels (5). Calculate the change in pH in pure water exposed to the atmosphere due to the increase in the CO2 level from 280 to 380
ppm. Assume that the dissociation from bicarbonate to carbonate is negligible. Relevant equilibrium constants are given (FIGURE 1.2).
Henry’s law
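$$P = K_H [\mathrm{CO_2}] \tag{1.4}$$
where P is the partial pressure of carbon dioxide, [CO2] is the concentration of aqueous carbon dioxide, and KH is Henry’s law constant for carbon dioxide (29.41 atm·M⁻¹).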
You should now be equipped to process quantitative data in Python, much like how chemists may use a scientific calculator in research. However,
we likely handle other types of data in our research as well as numerical data. Let us consider next how to process string data using Python.
Strings are text data, and they can be manipulated with Python operations. This is useful for extracting data from the output files of software.
Biopolymers like proteins or nucleic acids can be described using sequences, which are string data. Chemical structures can be stored in Simplified
Molecular Input Line Entry System (SMILES) representation in databases. The SMILES representation will be thoroughly described in SECTION
3.2; here we focus on operations involving strings and use a SMILES string as an example.
Let us do some simple operations on two strings that simply read “Cellar” and “Door”. The strings are concatenated (joined together) by adding
them together with +.
cellar = 'Cellar'
door = 'Door'
cellar + door
'CellarDoor'
Perhaps we prefer having them merged with a space in between. This can be done with the join method.
' '.join([cellar, door])
'Cellar Door'
1.3.1 Indexing
We use brackets to index the string. Three integers can be placed in brackets, separated by colons (:):
• the start index (inclusive, starting at zero)
• the end index (exclusive)
• the increment
Substrings are accessible by specifying or leaving blank the start and end indices.
'oor'
'Cell'
door[1:3] # (excludes 3)
'oo'
'ella'
The third position indicates the increment of the string. The default is to print every letter (1).
'Cla'
'rooD'
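The slicing expressions behind most of the outputs above did not survive in this preview. The sketch below shows one set of slices on the cellar and door strings that reproduces each output; the specific start, end, and increment values are assumptions inferred from the results, and the variable names holding the slices are hypothetical.

```python
cellar = 'Cellar'
door = 'Door'

# leave the end index blank to run to the end of the string
tail = door[1:]          # 'oor'
# leave the start index blank to begin at index zero
head = cellar[:4]        # 'Cell'
# start is inclusive, end is exclusive
middle = door[1:3]       # 'oo'
inner = cellar[1:5]      # 'ella'
# the third position sets the increment
every_other = cellar[::2]   # 'Cla'
# a negative increment walks the string backwards
backwards = door[::-1]      # 'rooD'

print(tail, head, middle, inner, every_other, backwards)
```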
With some of the basic syntax in mind, let us apply it to processing nucleotide sequence data.
rfp =
'agtttcagccagtgacagggtgagctgccaggtattctaacaagatgagttgttccaagaatgtgatcaaggagttcatgaggttcaaggttcgtatggaaggaa
cggtcaatgggcacgagtttgaaataaaaggcgaaggtgaagggaggccttacgaaggtcactgttccgtaaagcttatggtaaccaagggtggacctttgccatt
tgcttttgatattttgtcaccacaatttcagtatggaagcaaggtatatgtcaaacaccctgccgacataccagactataaaaagctgtcatttcctgagggattta
aatgggaaagggtcatgaactttgaagacggtggcgtggttactgtatcccaagattccagtttgaaagacggctgtttcatctacgaggtcaagttcattggggtg
aactttccttctgatggacctgttatgcagaggaggacacggggctgggaagccagctctgagcgtttgtatcctcgtgatggggtgctgaaaggagacatccatat
ggctctgaggctggaaggaggcggccattacctcgttgaattcaaaagtatttacatggtaaagaagccttcagtgcagttgccaggctactattatgttgactcca
aactggatatgacgagccacaacgaagattacacagtcgttgagcagtatgaaaaaacccagggacgccaccatccgttcattaagcctctgcagtgaactcggctc
agtcatggattagcggtaatggccacaaaaggcacgatgatcgttttttaggaatgcagccaaaaattgaaggttatgacagtagaaatacaagcaacaggctttgc
ttattaaacatgtaattgaaaac'
len(rfp)
876
Suppose we want to express RFP from a gene inserted into a plasmid in a bacterial vector. We use the polymerase chain reaction (PCR) to amplify
this gene from a template DNA (7). We specify the segment of DNA to amplify by designing the appropriate primers, which are short,
synthetically produced single-stranded DNA molecules.
We need the start and end of the gene to define the region to amplify by PCR. Let us print the first and last 30 nucleotides of the RFP gene.
We use the upper() method to capitalize the string. This will be useful to standardize the case of the sequence, as the operations are case-sensitive.
The reverse primer should be in the reverse complement so that it hybridizes to the sense (coding) strand. To reverse a sequence, we take the full
sequence, but with increments of negative one. Inside the brackets, we do not specify the start or end index, only the increment.
'CAAAAGTTAATGTACAAATTATTCGTTTCG'
Let us use a dictionary to obtain the complement of the reverse sequence. Dictionaries are like lists but indexed by keys. They are specified by
{key0: value0, key1: value1}, and so forth.
'T'
We index the dictionary by “A”, so it returns “T”, the value associated with the key.
We can iterate over strings using loops. The string is split by letters. By passing these letters, or bases, to the dictionary, we replace the nucleotide
by its complement. Let us explore this syntax, known as a list comprehension, in more depth in a later section.
['G', 'T', 'T', 'T', 'T', 'C', 'A', 'A', 'T', 'T']
The letter-wise operation returns a list (we show only first 10 letters). Let us join this list back to one string, representing the reverse complement
or the suffix in the 5′ to 3′ direction. We can do this by using the join() method of strings.
rfp_end_rc = ''.join(rfp_end_rc)
rfp_end_rc
'GTTTTCAATTACATGTTTAATAAGCAAAGC'
We joined the list of letters with '' (an empty string with no space). With the two coding sequences in the correct orientation, they can be
trimmed to match the melting temperature, below the elongation temperature of 72 °C. The trimmed sequences are as follows, with calculated
annealing temperature at 62 °C.
rfp_start = 'AGTTTCAGCCAGTGACAG'
rfp_end_rc = 'GTTTTCAATTACATGTTTAATAAGCAAAGC'
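The reversal, dictionary lookup, and join steps walked through above can be collected into one short sketch. To keep it self-contained, the last 30 nucleotides of the gene are copied in as a literal rather than sliced from rfp; the names rfp_end, rfp_end_reversed, and complement are assumptions, since only rfp_end_rc appears in the text.

```python
# last 30 nucleotides of the RFP gene, copied from the end of the sequence
# and converted to upper case so that the dictionary keys match
rfp_end = 'gctttgcttattaaacatgtaattgaaaac'.upper()

# reverse the sequence: no start or end index, increment of negative one
rfp_end_reversed = rfp_end[::-1]

# dictionary mapping each base (key) to its complement (value)
complement = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}

# look up the complement of each base, then join the letters into one string
rfp_end_rc = ''.join([complement[base] for base in rfp_end_reversed])

print(rfp_end_reversed)
print(rfp_end_rc)
```

With the full sequence in memory, the same end segment would normally be obtained by slicing, for example with a negative start index.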
Let us concatenate the subsequences together to get a forward (sense strand) primer with:
• Spacer sequence
• EcoRI site
• NotI site
• XbaI site
• extra G spacer
These are the components of the prefix of a protein coding sequence in a BioBrick construct (8).
spacer1 = 'GTTTCTTC'
EcoRI = 'GAATTC'
NotI = 'GCGGCCGC'
spacer2 = 'T'
XbaI = 'TCTAGA'
spacer3 = 'G'
spacer1 + EcoRI + NotI + spacer2 + XbaI + spacer3 + rfp_start
'GTTTCTTCGAATTCGCGGCCGCTTCTAGAGAGTTTCAGCCAGTGACAG'
Next, let us assemble the flanking region of the reverse primer from:
• Stop codon
• SpeI site
• NotI site
• PstI site
• Spacer
These components are for the suffix of an insert for a BioBrick construct (8). However, we need the sequence to be the reverse complement for
proper amplification. Because we already obtained the coding sequence in reverse complement, we will omit this and append it at the end.
stop_codon = 'TAA'
SpeI = 'ACTAGT'
spacer1 = 'A'
NotI = 'GCGGCCGC'
PstI = 'CTGCAG'
spacer2 = 'AAGAAAC'
'TAAACTAGTAGCGGCCGCCTGCAGAAGAAAC'
Follow the same operation to obtain the reverse complement of the flanking region of the reverse primer.
'GTTTCTTCTGCAGGCGGCCGCTACTAGTTTA'
To the flanking region of the reverse primer, append the reverse complement of the end of the RFP gene.
'GTTTCTTCTGCAGGCGGCCGCTACTAGTTTAGTTTTCAATTACATGTTTAATAAGCAAAGC'
The designed primers can be synthesized and used to amplify the RFP gene for cloning into a BioBrick construct.
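As a compact recap, the sketch below assembles both primers from the subsequences listed above and checks them against the printed results. The helper reverse_complement and the names prefix, suffix, forward_primer, and reverse_primer are hypothetical; the sequences themselves are taken from the text.

```python
complement = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}

def reverse_complement(sequence):
    '''return the reverse complement of an uppercase DNA sequence'''
    return ''.join(complement[base] for base in sequence[::-1])

# gene-specific parts, as given in the text
rfp_start = 'AGTTTCAGCCAGTGACAG'
rfp_end_rc = 'GTTTTCAATTACATGTTTAATAAGCAAAGC'

# prefix: spacer, EcoRI, NotI, spacer, XbaI, extra G spacer
prefix = 'GTTTCTTC' + 'GAATTC' + 'GCGGCCGC' + 'T' + 'TCTAGA' + 'G'
forward_primer = prefix + rfp_start

# suffix: stop codon, SpeI, spacer, NotI, PstI, spacer
suffix = 'TAA' + 'ACTAGT' + 'A' + 'GCGGCCGC' + 'CTGCAG' + 'AAGAAAC'
# the flanking region is reverse complemented before the gene end is appended
reverse_primer = reverse_complement(suffix) + rfp_end_rc

print(forward_primer)
print(reverse_primer)
```

Wrapping the reverse complement in a function means the same operation is written once and reused for any flanking region.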
We have used basic string operations to document how we design primers for the PCR amplification of RFP gene. Working with Python variables
allows us to save and organize sections of nucleotide sequences. We can modify and join these sequences using string methods.
A. The gene sequence of the sense strand (coding side) encoding for RFP was given. Print the antisense strand (reverse complement) of the whole
open reading frame.
B. Suppose you are designing a guide RNA to knockout Arabidopsis thaliana transcription factor TRY (9). The three exons of TRY are given (10).
The SpCas9 recognizes the protospacer adjacent motif (PAM), NGG and CCN (two consecutive guanine or cytosine plus one nucleotide). Find
the index of at least one PAM in each of the exons of the gene of TRY using the find() method of strings.
exon1 =
'ACAAAGTTAGCCTTCAAAATACTTACAAATCCCAATAAAAGACTTCATCTCCATGTGTATTTGAGTGTCAACGACAAGT
CTACACAAAGGGTAAGAGGTCAACAAGACCACACAACACTTCTTACTATTAGTTTTGCAAAGGCCGTTCGTTGGACATTTCCT
TCTCTCTCCTCCCCTCTTCTTCTTCTTGTTCGCTCTATAAACTCTCATCTCTCACGTCTTTTTTTCCTTACATTCTCCAAACT
CAAAATTTCATCACATTAATTTCTCTCTATTTTTCTTTTCTTACTTCAATAGTAATGGATAACACTGACCGTCGTCGCCGTCG
TAAGCAACACAAAATCGCCCTCCATGACTCTGAAG'
exon2 =
'AAGTGAGCAGTATCGAATGGGAGTTTATCAACATGACTGAACAAGAAGAAGATCTCATCTTTCGAATGTACAGACT
TGTCGGTGATAG'
exon3 =
'GTGGGATTTGATAGCAGGAAGAGTTCCTGGAAGACAACCAGAGGAGATAGAGAGATATTGGATAATGAGAAACAGTGAAG
GCTTTGCTGATAAACGACGCCAGCTTCACTCATCTTCCCACAAACATACCAAGCCTCACCGTCCTCGCTTTTCTATCTATCCT
TCCTAGTGTTTTTGTTTTTAAGCCAACGAAAAAAGAAAATAAAAAAATTATAATAGATGTATAGTAGTGGTTCTTGTTAGTTT
GAAGAATTCATCATCTATTGTTTTCTTTTTGTTGTTATTTCATTTATAATTTTTATAGTATAGGTTTCATTTGGTAATCAACT
TTAATCCATGCGGTTAGGTTTTTTTATTTTCTCGTCTACGACTTTTATATCCACAACTAGATTTTAATCCGCGGTATATCGCG
GTATAATTTACTTTTTAAAGTTAATATATATTAAAACTTG'
C. Suppose we have a list of organic compounds to subject to further analysis. We have their SMILES structures, but notice some of the entries
are complexed with smaller ions. We want to retain only the molecule with the longest SMILES structures. Disconnected structures are separated
by periods (.) in SMILES. Use the string method split() to obtain a list of structures for D-glucosamine sulfate, determine the length of each
SMILES string, and print the SMILES with the ion removed.
d_glucosamine_sulfate = 'C([C@H]([C@H]([C@@H]([C@H](C=O)N)O)O)O)O.OS(=O)(=O)O'
# write the rest of the code
D. You are extracting optimized energy values from a geometry optimization calculation using the QUICK program (11, 12). The line containing
the total energy value is given. Extract the total energy value as a float.
You should be familiar with string data and some operations to process them. Try to identify where you handle string data in research, and how it
may be processed using Python string operations. Let us begin organizing these numerical and string operations into Python functions.
1.4 FUNCTIONS
So far, we worked with one-liners or blocks of script. This is great, but what if we want to apply the same script to a new data set? Copying and pasting
the script to a new file and modifying it for each use produces an unnecessary number of scripts and introduces opportunities
for error. Instead, we can begin organizing the operations in a modular fashion by defining functions. Functions take arguments (inputs) of various
types and return an output.
Previous operations may have felt elementary. We were solving chemical problems; however, we used the Python scripting language as a generic
scientific calculator or text editor (which is fine). Writing functions in Python as a chemist is like building your own chemical calculator.
def trivial_function():
    return 'this function is trivial'
trivial_function()
'this function is trivial'
We define the name of the function using the def keyword. The function name is followed by parentheses and a colon. The operations inside the
function are indicated by one level of indentation. The function terminates either after its last indented line or at the first return
statement it reaches.
Let us define a new function that takes an argument. The following function returns the square plus one of a number. The argument val is a local
variable. We do not define it before defining a function. Instead, it is defined when we pass an argument to the function.
def square_plus_one(val):
    return val ** 2 + 1
square_plus_one(3)
10
Functions can have multiple arguments. Let us define a function that calculates electrostatic force using Coulomb’s law:
$$F = k_e \frac{q_1 q_2}{r^2} \tag{1.5}$$
in which F is the electrostatic force, ke is Coulomb’s constant, qi is the charge of particle i, and r is the distance between the particles.
return coulomb_constant * q1 * q2 / r ** 2
The force between particles with charges 1.0e-05 C and -2.0e-05 C at 0.1 m is -1.80e+02 N.
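Only the return statement of this function is visible in the preview. A minimal sketch of the full definition might look as follows; the function name coulomb_force, the argument names, and the rounded value of Coulomb's constant are assumptions.

```python
# Coulomb's constant in N m^2 C^-2 (rounded value, assumed for this sketch)
coulomb_constant = 8.988e9

def coulomb_force(q1, q2, r):
    '''return the electrostatic force in newtons
    q1, q2 (float): charges in coulombs
    r (float): distance between the particles in meters'''
    return coulomb_constant * q1 * q2 / r ** 2

q1, q2, r = 1.0e-5, -2.0e-5, 0.1
force = coulomb_force(q1, q2, r)
print('The force between particles with charges %.1e C and %.1e C at %.1f m is %.2e N.'
      % (q1, q2, r, force))
```

The negative sign of the result reflects opposite charges, that is, an attractive force.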
We can define default values for arguments so that if they are not passed, the function assumes these values. For instance, we can define a function
which returns relative probability based on free energy difference:
$$\frac{p_1}{p_2} = \exp\left(-\frac{F_1 - F_2}{k_B T}\right) \tag{1.6}$$
in which p1/p2 is the probability of state 1 relative to state 2, Fi is the free energy of state i, kB is the Boltzmann constant, and T is temperature.
12.395578077607523
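The definition of this function is not shown in the preview. The sketch below is one way to write it with a default temperature. The energy units (kcal/mol), the value of the Boltzmann constant in those units, the default of 300 K, and the argument names are all assumptions, so the digits it prints need not match the output above exactly.

```python
from math import exp

# Boltzmann constant in kcal mol^-1 K^-1 (assumed units and rounded value)
boltzmann_constant = 0.0019872

def relative_probability(free_energy1, free_energy2, temperature=300.0):
    '''return the probability of state 1 relative to state 2
    free_energy1, free_energy2 (float): free energies in kcal/mol
    temperature (float): temperature in kelvin, defaults to 300 K'''
    energy_difference = free_energy1 - free_energy2
    return exp(-energy_difference / (boltzmann_constant * temperature))

# temperature is omitted, so the default value is used
print(relative_probability(0.0, 1.5))
# temperature is passed explicitly, overriding the default
print(relative_probability(0.0, 1.5, 250))
```

Giving temperature a default keeps the common call short while still allowing the value to be varied when needed.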
Suppose we want to vary the temperature as well. We add temperature as an additional argument.
temp = 250
print('At %d K: %.1f' %(temp, relative_probability(0.0, 1.5, temp)))
temp = 400
print('At %d K: %.1f' %(temp, relative_probability(0.0, 1.5, temp)))
At 250 K: 20.5
At 400 K: 6.6
Functions are not limited to numerical operations. Let us take the SMILES of cortisone and produce the SMILES of its enantiomer. Feel free to
paste the SMILES into software such as ChemDraw to confirm that it does in fact encode the structure of cortisone. We define a function
that inverts all stereogenic centers. The configuration of each stereogenic center is encoded by @ or @@.
cortisone = 'O=C(C=C1CC[C@@]2([H])[C@]3([H])CC[C@@](O)([C@]3(C4)C)C(CO)=O)CC[C@]1(C)[C@@]2([H])C4=O'
def enantiomer(smiles):
    '''return the SMILES of the enantiomer
    smiles (string): input structure SMILES'''
    # first replace all @ with @@. This will cause @@ to become @@@@
    switch1 = smiles.replace('@', '@@')
    # because @@@@ was originally @@, we replace it with @
    return switch1.replace('@@@@', '@')
enantiomer(cortisone)
'O=C(C=C1CC[C@]2([H])[C@@]3([H])CC[C@](O)([C@@]3(C4)C)C(CO)=O)CC[C@@]1(C)[C@]2([H])C4=O'
Of course, we can do the operation in one line as below, but it is less obvious than if we name the function as enantiomer(...). The string
operation replace(...) itself has no chemical sense. We assign chemical meaning to it by organizing it in the function.
cortisone.replace('@', '@@').replace('@@@@', '@')
'O=C(C=C1CC[C@]2([H])[C@@]3([H])CC[C@](O)([C@@]3(C4)C)C(CO)=O)CC[C@@]1(C)[C@]2([H])C4=O'
A lambda function defines a small, unnamed function in one line. Its body is a single expression, whose value is returned. It follows the syntax: lambda input: expression
cube = lambda x: x ** 3
cube(4)
64
While developing, you may write out the names of all the functions you plan to have, then implement them individually. Empty functions will
raise an error, preventing you from testing each function. During development, put a pass statement in functions to avoid errors.
def identity_function(value):
    return value
def inverse(value):
    pass
identity_function(7.5)
7.5
We observe the inverse function is not yet implemented, but we can properly test the identity function during development.
We have taken the numerical and string operations and organized them into functions. Functions allow chemists to assign chemical meaning to
the operations implemented in code. Once defined, we can use the function an arbitrary number of times. Therefore, we do not have to rewrite
the Python code every time we want to perform a particular operation.
A. Write a function that returns the index of a tobacco etch virus (TEV) protease recognition site in a peptide sequence. The TEV site is the
sequence, ENLYFQ. While the sequence, ENLYFQS exhibits the greatest catalytic efficiency, the last position can also be G, A, M, C, or H. Run
the function on the peptide sequence of recombinant sarafotoxin (13).
recombinant_sarafotoxin =
'MKDDAAIQQTLAKMGIKSSDIQPAPVAGMKTVLTNSGVLYITDDGKHIIQGPMYDVSGTAPVNVTNKMLLKQLNALE
KEMIVYKAPQEKHVITVFTDITCGYCHKLHEQMADYNALGITVRYLAFPRQGLDSDAEKEMKAIWCAKDKNKAFDDVMAGKSVAP
ASCDVDIADHYALGVQLGVSGTPAVVLSNGTLVPGYQPPKEMKEFLDEHQKMTSGKGSTSGSGHHHHHHGTMTSLYKKAGLENLYF
QCTCKDMTDKECLYFCHQDIIW'
B. Write two functions: the first function returns the reaction quotient Q, the second is the Nernst equation which returns the reduction
potential.
Nernst equation:
$$\Delta E = \Delta E^\circ - \frac{RT}{nF} \ln Q \tag{1.7}$$
where ΔE is the change in reduction potential, ΔE° is the change in standard reduction potential, R is the ideal gas constant (8.314 J·mol⁻¹·K⁻¹), T is
the temperature, n is the number of electrons involved, F is the Faraday constant (96.49 kJ·V⁻¹·mol⁻¹), and Q is the reaction quotient.
Calculate the reduction potential of the following reaction, in which [Ag+] = 0.04 mM and [Mn2+] = 0.13 mM.
C. The source code of a function is given but is written poorly. The function and variable names are cryptic and there is no annotation.
Retain the behavior of the function but update it in a readable way. Use the updated function to solve the following problem:
3-Hydroxy-1-(naphthalen-1-yl)pent-4-en-1-one (C15H14O2) was synthesized. The exact mass of the sodiated adduct was determined to be
249.0881 via high-resolution mass spectrometry (HRMS). Determine whether the measurement is consistent with calculated exact mass (14).
exact_masses = {'C': 12.000000, 'H': 1.007825, 'O': 15.994915, 'N': 14.003074, 'Na': 22.989770}
D. The radioactive decay of carbon-11 to boron-11 has a half-life of 20.364 min. Radioactive decay follows first order kinetics, for which Ni is
the number of atoms at time i, k is the radioactive decay constant, and t is the time. The occupational value of carbon-11 dioxide derived air
concentration (DAC) is 0.03 nCi/mL. Suppose a measurement of 0.03 nCi/mL is considered acceptable at a workplace; however, 0.17 nCi/mL
was measured. Operations need to be shut down for 51 min to return to 0.03 nCi/mL; however, something is wrong in the code below to
calculate this. Debug the script.
def k_from_halflife(t_half):
    '''Calculate decay constant k from halflife in inverse time'''
    return - log(1/2 / (t_half))
You should now be able to write custom functions to perform operations not explicitly implemented in base Python. The function defined
once can be run on many inputs. However, our data is heterogeneous and not all input data may be processed in one function. Conditional
statements can implement control flow to dynamically respond to exceptions in the input data.
Conditional statements allow us to test a variable against a value and perform one action if the condition is met or another
action if it is not. The values that control this kind of statement are of the Boolean type, which is either True or False.
answer = 240
if answer == 13 * 17 + 19:
    print('Correct')
else:
    print('Incorrect')
Correct
Here two conditional statements are used. The if statement receives a Boolean. If the statement is true, then it prints “Correct”. The statement is:
the answer equals 13 times 17 plus 19.
Another statement used here is the else statement. If the prior if statement is false, the code block of the else statement is executed instead. The
if-block and else-block are mutually exclusive.
Mutually exclusive blocks of code can be strung together using the elif statement, short for “else if ”. Suppose we want to assess whether an integer
is divisible by 16 or 4.
an_integer = 88
# if the mod of the integer by 16 equals zero (if the integer is divisible by 16)
if an_integer % 16 == 0:
    print('%d is divisible by 16.' %an_integer)
# else if the mod of the integer by 4 equals zero
elif an_integer % 4 == 0:
    print('%d is divisible by 4.' %an_integer)
else:
    print('%d is divisible by neither 16 nor 4.' %an_integer)
88 is divisible by 4.
If the integer is divisible by 16, then it is divisible by 4. Thus, it is unnecessary to run the code block specifying the integer is divisible by 4, so if
the first block executes, we want to skip the rest. There is no limit to the number of elif statements to follow an if statement.
# equals
5 == 7
False
# not equal
5 != 7
True
# less than
5 < 7
True
# greater than
5 > 7
False
True
False
sierra = 'Sierra'
nevada = 'Nevada'
sierra == nevada
False
sierra != nevada
True
We can also search whether a subsequence is found within another sequence. Take the SMILES for L-alanine. Let us observe whether it contains a
carboxylic acid.
ala = 'N[C@@H](C)C(O)=O'
acid = 'C(O)=O'
acid in ala
True
False
FIGURE 1.4 Venn Diagrams for AND, OR, and NOT Logic Gates.
# AND gate
True and True
True
False and True
False
False and False
False
# OR gate
True or True
True
False or True
True
False or False
False
# NOT gate
not True
False
not False
True
Other logic gates can be implemented as a combination of these logic gates. For instance, consider the exclusive-or gate XOR, in which A XOR B
is true if exactly one of A or B is true. This can be implemented by (A OR B) AND (NOT (A AND B)).
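The XOR expression above can be checked with a short sketch. The function name xor is chosen here for illustration and is not defined elsewhere in the text.

```python
def xor(a, b):
    '''Exclusive-or built from AND, OR, and NOT: (a OR b) AND (NOT (a AND b))'''
    return (a or b) and not (a and b)

# print the full truth table
for a in (True, False):
    for b in (True, False):
        print(a, b, xor(a, b))
```

Only the two rows with exactly one True input evaluate to True.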
An amino acid contains an amine and a carboxylic acid. An amine is an alkylic derivative of ammonia. Let us take a simplistic definition and
consider all nonaromatic organic nitrogens that are not amides as amines. Let us see whether aniline is an amino acid.
acid = 'C(O)=O'
amide = 'NC=O'
def is_amino_acid(smiles):
    '''returns boolean for whether the provided SMILES is an amino acid'''
    contains_acid = acid in smiles # boolean for whether acid motif is present
    no_amide = smiles.replace(amide, '') # remove amide motif
    contains_amine = 'N' in no_amide # boolean for whether a nonaromatic nitrogen is present after removing amides
    return contains_acid and contains_amine
aniline = 'Nc1ccccc1'
if is_amino_acid(aniline):
    print('Aniline is an amino acid')
else:
    print('Aniline is not an amino acid')
if is_amino_acid(ala):
    print('L-alanine is an amino acid')
else:
    print('L-alanine is not an amino acid')
We defined two Booleans: whether an amine was present and whether a carboxylic acid was present. We passed them through an AND gate and
determined that aniline was not an amino acid. L-alanine, on the other hand, was determined to be an amino acid.
Due to conditional statements, we can implement logic in how we process the data. By specifying the cases and the blocks of code to execute
for each case, we can break problems down into smaller components. Therefore, we do not have to find a solution to work on all input data but
collect individual solutions together through conditional statements.
A. We use the rectified linear unit (ReLU), which is a popular activation function in deep learning. Activation functions introduce nonlinearity to
the model. The ReLU function is shown. Implement the ReLU function which accepts a scalar.
\mathrm{ReLU}(x) = \max(0, x) \qquad (1.8)
B. Chemical shifts measured by proton nuclear magnetic resonance (NMR) spectroscopy are correlated to functional groups. Write a function which returns
possible functional groups based on a chemical shift. Some functional groups and the intervals in which they typically appear are listed.
• aliphatic: 0.5–2.0 ppm
Consider what we have automated so far. Numerical and string operations can be encoded and organized into functions. However,
our data may contain exceptions, so we use conditional statements to define how various inputs can be processed. To execute these automated
operations on a large volume of data, we use loops.
1.6 LOOPS
You might be wondering why you need to go through the trouble of learning Python if you already have a scientific calculator and a text editor
available. Certainly, manually solving problems is a great short-term solution. Programming is a long-term solution. Suppose you spend one hour
writing a script to complete a task that takes 2 min manually. You would have saved time if you repeated the operations more than 30 times, and
now you can consider inputs many orders of magnitude greater in quantity.
The strength of programming is to automate repetitive tasks in loops. Loops are how we run operations at a massive scale and obtain insight from
large data. The two types of loops we will consider are the for-loop and while-loop. A for-loop repeats the operation over an iterable such as a list.
A while-loop repeats the operation until a certain condition is met.
1.6.1 For-Loop
As an example, given a list of single point energy values in Hartrees, let us convert them to units of kcal/mol. We iterate over a list of energy values
in a for-loop, and save it in a new list.
spes_in_kcal_mol
[-278107.34065111657,
-278128.25745077094,
-278111.8461572177,
-278136.95093000156,
-278123.45355199225,
-278124.9084477681,
-278134.8310197654,
-278130.56658397894,
-278103.86927749123,
-278132.4470780154]
In the for-loop, we define a local variable for each item of the iterable, in this case the_energy. The instructions to be repeated are indented. For
each item in the list of energies in Hartree, we convert the units and append them to a new list, using the append(...) method of lists.
[-278107.34065111657,
-278128.25745077094,
-278111.8461572177,
-278136.95093000156,
-278123.45355199225,
-278124.9084477681,
-278134.8310197654,
-278130.56658397894,
-278103.86927749123,
-278132.4470780154]
The syntax inside the list is: <the operation> for <local variable> in <iterable>. We might read it like: Do this for each item of this list.
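The for-loop and the equivalent list comprehension described above can be sketched as follows. This is a minimal illustration under stated assumptions: the three energies in spes_in_hartree are hypothetical placeholders rather than the values used in the text, and the conversion factor of about 627.5095 kcal/mol per Hartree is the commonly quoted value.

```python
# Approximate conversion factor (assumed here): 1 Hartree ~ 627.5095 kcal/mol
HARTREE_TO_KCAL_MOL = 627.5095

# Hypothetical single point energies in Hartree (placeholders for illustration)
spes_in_hartree = [-443.19, -443.22, -443.20]

# For-loop: convert each energy and append it to a new list
spes_in_kcal_mol = []
for the_energy in spes_in_hartree:
    spes_in_kcal_mol.append(the_energy * HARTREE_TO_KCAL_MOL)

# List comprehension: <the operation> for <local variable> in <iterable>
spes_by_comprehension = [the_energy * HARTREE_TO_KCAL_MOL for the_energy in spes_in_hartree]

# Both approaches produce the same list
print(spes_in_kcal_mol == spes_by_comprehension)
```

The comprehension performs the same operation in a single line, which is often easier to read when the loop body is one expression.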
1.6.1.2 Iterables
For-loops can iterate over other iterables, such as sets, dictionaries, and tuples. Sets are nonredundant collections of objects. As an example, let us
look at the sequence of Saccharomyces cerevisiae alcohol dehydrogenase ADH1 (15). We convert the string into a set to observe the unique amino
acids present.
adh1 = (
    'MSIPETQKGVIFYESHGKLEYKDIPVPKPKANELLINVKYSGVCHTDLHAWHGDWPLPVKLPLVGGHEGAGVVVGMGENVKGWKI'
    'GDYAGIKWLNGSCMACEYCELGNESNCPHADLSGYTHDGSFQQYATADAVQAAHIPQGTDLAQVAPILCAGITVYKALKSAN'
    'LMAGHWVAISGAAGGLGSLAVQYAKAMGYRVLGIDGGEGKEELFRSIGGEVFIDFTKEKDIVGAVLKATDGGAHGVINVSVS'
    'EAAIEASTRYVRANGTTVLVGMPAGAKCCSDVFNQVVKSISIVGSYVGNRADTREALDFFARGLVKSPIKVVGLSTLPEIYE'
    'KMEKGQIVGRYVVDTSK'
)
set(adh1)
{'A',
'C',
'D',
'E',
'F',
'G',
'H',
'I',
'K',
'L',
'M',
'N',
'P',
'Q',
'R',
'S',
'T',
'V',
'W',
'Y'}
len(set(adh1))
20
ADH1 contains all 20 standard amino acids. Let us make a set from the SMILES of reduced nicotinamide adenine dinucleotide (NADH).
nadh = (
    'O=C(N)C1CC=C[N](C=1)[C@@H]2O[C@@H]([C@@H](O)[C@H]2O)COP([O-])'
    '(=O)OP(=O)([O-])OC[C@H]5O[C@@H](n4cnc3c(ncnc34)N)[C@H](O)[C@@H]5O'
)
set(nadh)
{'(',
')',
'-',
'1',
'2',
'3',
'4',
'5',
'=',
'@',
'C',
'H',
'N',
'O',
'P',
'[',
']',
'c',
'n'}
Suppose we want to know all the heavy atom elements present. We can save the item if the letter is alphabetical. We iterate through each item of
the set in a for-loop.
# initialize list
heavy_atoms = []
heavy_atoms
NADH contains carbon, phosphorus, nitrogen, and oxygen. The lowercase elements represent atoms in aromatic motifs. So, if we were to
perform energy calculations on NADH, we must confirm the potential handles all these atom types.
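One way the loop over the set could be written is sketched below. It is an illustration rather than the original listing: the variable name character is chosen here, and excluding 'H' explicitly is an assumption made so that the result matches the heavy atoms named above.

```python
# SMILES of NADH as given in the text
nadh = ('O=C(N)C1CC=C[N](C=1)[C@@H]2O[C@@H]([C@@H](O)[C@H]2O)COP([O-])'
        '(=O)OP(=O)([O-])OC[C@H]5O[C@@H](n4cnc3c(ncnc34)N)[C@H](O)[C@@H]5O')

# initialize list
heavy_atoms = []

# iterate through each unique character of the SMILES
for character in set(nadh):
    # keep letters only; 'H' is skipped because hydrogen is not a heavy atom
    # (an assumption of this sketch)
    if character.isalpha() and character != 'H':
        heavy_atoms.append(character)

# sets are unordered, so sort the result for a reproducible display
print(sorted(heavy_atoms))
# ['C', 'N', 'O', 'P', 'c', 'n']
```

Treating each character as one element works for this molecule because it contains no two-letter element symbols such as Cl or Br.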
1.6.2 While-Loop
The for-loop has a defined end to the loop: the number of items in the iterable. Sometimes the number of operations is not defined. We want to
repeat the operation until a task is done. For this, we use the while-loop. The while-loop defines a block of code to repeat until a condition is met.
Let us simulate a one-dimensional Brownian particle. We initialize the particle at 0, then observe the time taken for the particle to reach 10.
steps_taken = 0
print('%d steps were taken for the particle to arrive from 0 to 10.' %steps_taken)
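A minimal sketch of such a simulation is shown below. The name position and the safety cap max_steps are assumptions of this sketch; the cap is added only so that an unusually long random walk cannot run indefinitely.

```python
from random import choice

position = 0          # the particle starts at 0
steps_taken = 0
max_steps = 10 ** 6   # safety cap added for this sketch

# repeat until the particle reaches 10 (or the safety cap is hit)
while position != 10 and steps_taken < max_steps:
    position += choice([-1, 1])  # step left or right with equal probability
    steps_taken += 1

if position == 10:
    print('%d steps were taken for the particle to arrive from 0 to 10.' % steps_taken)
else:
    print('The particle did not reach 10 within %d steps.' % max_steps)
```

Because the steps are random, the number of steps printed differs from run to run.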
Nesting loops should be avoided if possible because it increases the computational complexity. Let us simulate a two-dimensional Brownian
particle. The particle can take steps of (1, 0), (0, 1), (−1, 0), (0, −1) in one iteration. We track the number of steps and observe how many
trajectories reach (2, 1) within 10^3 steps.
from random import sample

# initialize parameter
max_steps = 10 ** 3
# initialize list for results
steps_list = []
# repeat 100 times
for i in range(100):
    # initialize coordinate and count
    coord = [0, 0] # x and y coordinate
    steps_taken = 0
    # repeat until the coordinate is (2, 1)
    while coord[0] != 2 or coord[1] != 1:
        # determine whether the step is in x (0) or y (1) axis
        axis = sample([0, 1], 1)[0]
        # obtain step size
        step = sample([-1, 1], 1)[0]
        # update the coordinate
        coord[axis] += step
        # add one to steps taken
        steps_taken += 1
        # if max steps were reached, exit the while loop
        if steps_taken >= max_steps:
            break
    # once while-loop is complete, append result
    steps_list.append(steps_taken)
steps_list
[117,
1000,
1000,
141,
1000,
1000,
1000,
91,
1000,
11,
9,
13,
1000,
71,
1000,
41,
1000,
931,
21,
1000,
611,
1000,
1000,
7,
73,
1000,
1000,
1000,
65,
1000,
1000,
1000,
1000,
405,
1000,
5,
61,
1000,
1000,
157,
1000,
1000,
3,
341,
7,
1000,
17,
3,
1000,
1000,
1000,
25,
11,
25,
19,
1000,
7,
675,
1000,
1000,
1000,
939,
1000,
237,
7,
1000,
227,
1000,
9,
473,
15,
5,
285,
1000,
293,
369,
1000,
1000,
1000,
1000,
1000,
15,
19,
831,
1000,
9,
1000,
5,
1000,
5,
403,
493,
3,
29,
141,
11,
3,
19,
281,
1000]
The while-loop was nested inside a for-loop to repeat 100 simulations of Brownian motion. Notice how the while-loop is indented
because it is inside the for-loop. We have an if statement with a break. More on this later. Let us count the simulations which
reached (2, 1) within 10^3 steps. We can subset lists by list comprehension. Simply return the item of the list but add a condition at the end. The
condition is that the item is less than the specified maximum number of steps allowed. Because not all items of the list are below 1000 steps, the
resulting list is a subset of the original list.
[117,
141,
91,
11,
9,
13,
71,
41,
931,
21,
611,
7,
73,
65,
405,
5,
61,
157,
3,
341,
7,
17,
3,
25,
11,
25,
19,
7,
675,
939,
237,
7,
227,
9,
473,
15,
5,
285,
293,
369,
15,
19,
831,
9,
5,
5,
403,
493,
3,
29,
141,
11,
3,
19,
281]
55
def mean(alist):
    '''return the mean of a list'''
    return sum(alist) / len(alist)
If the Brownian particle reaches (2,1) in 1000 steps, it takes 165 steps on average.
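The subsetting, counting, and averaging steps described above can be sketched on a short list. The step counts below are hypothetical placeholders, and the name reached is chosen for this illustration.

```python
def mean(alist):
    '''return the mean of a list'''
    return sum(alist) / len(alist)

max_steps = 10 ** 3
# hypothetical step counts; 1000 marks a trajectory that hit the step limit
steps_list = [117, 1000, 141, 1000, 91, 11]

# list comprehension with a condition: keep items below the maximum number of steps
reached = [steps for steps in steps_list if steps < max_steps]

print(reached)        # [117, 141, 91, 11]
print(len(reached))   # 4
print(mean(reached))  # 90.0
```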
# from 0 through 9,
for i in range(10):
    # if the value is divisible by 2
    if i % 2 == 0:
        continue
    print(i)
1
3
5
7
9
Only odd integers were printed because we skipped even integers with a continue before reaching the print statement.
i=0
0
1
2
3
4
5
6
7
8
9
10
The loop will continue indefinitely because the statement passed to the while-loop is always true. However, we exit the loop using break.
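A loop of this kind might look like the sketch below, which prints 0 through 10. The list printed is added here only to record what was displayed; it is not part of the description above.

```python
i = 0
printed = []  # record of the displayed values, kept for illustration
while True:   # the condition is always true
    print(i)
    printed.append(i)
    # exit the loop once 10 has been printed
    if i >= 10:
        break
    i += 1
```

Without the break, the loop would never terminate.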
Similar to functions, pass can fill in loops during development to avoid errors.
for i in range(5):
    pass
We are ready to process large volumes of data by encoding operations for individual data points and executing them in a loop.
A. Write a function which accepts the whole number i, and returns the ith number in the Fibonacci sequence. The Fibonacci sequence is a
sequence of natural numbers in which the ith entry is the sum of the (i − 2)th and (i − 1)th numbers. The first 10 numbers of the Fibonacci
sequence are 1, 1, 2, 3, 5, 8, 13, 21, 34, and 55.
B. Suppose we have analytically determined the diffusion coefficient at 24 °C to find the Stokes–Einstein radius of drug molecules in water (16).
Assume the molecules are spheres. Calculate the Stokes–Einstein radius for all compounds in a for-loop.
r = \frac{k_B T}{6 \pi \eta D} \qquad (1.9)
where r is the Stokes–Einstein radius, kB is the Boltzmann constant (1.38 × 10^-23 J/K), T is the temperature, η is the viscosity of the liquid, and D is
the diffusion coefficient of the molecule.
C. We previously identified the indices of PAM in the exons of A. thaliana (9). Write a function that takes the sequence of the exon in lower case
and capitalizes all candidates for the guide RNA to recognize. The region must satisfy all of the following conditions:
• Located at the 5' end of the PAM (<sequence>-NGG or CCN-<sequence>)
• Is 20 nucleotides long
exon1 = (
    'ACAAAGTTAGCCTTCAAAATACTTACAAATCCCAATAAAAGACTTCATCTCCATGTGTATTTGAGTGTCAACGACAAGTC'
    'TACACAAAGGGTAAGAGGTCAACAAGACCACACAACACTTCTTACTATTAGTTTTGCAAAGGCCGTTCGTTGGACATTTCCT'
    'TCTCTCTCCTCCCCTCTTCTTCTTCTTGTTCGCTCTATAAACTCTCATCTCTCACGTCTTTTTTTCCTTACATTCTCCAAAC'
    'TCAAAATTTCATCACATTAATTTCTCTCTATTTTTCTTTTCTTACTTCAATAGTAATGGATAACACTGACCGTCGTCGCCGT'
    'CGTAAGCAACACAAAATCGCCCTCCATGACTCTGAAG'
)
exon2 = (
    'AAGTGAGCAGTATCGAATGGGAGTTTATCAACATGACTGAACAAGAAGAAGATCTCATCTTTCGAATGTACAGACTTGTCG'
    'GTGATAG'
)
exon3 = (
    'GTGGGATTTGATAGCAGGAAGAGTTCCTGGAAGACAACCAGAGGAGATAGAGAGATATTGGATAATGAGAAAC'
    'AGTGAAGGCTTTGCTGATAAACGACGCCAGCTTCACTCATCTTCCCACAAACATACCAAGCCTCACCGTCCTCGCTTTTCT'
    'ATCTATCCTTCCTAGTGTTTTTGTTTTTAAGCCAACGAAAAAAGAAAATAAAAAAATTATAATAGATGTATAGTAGTGGTT'
    'CTTGTTAGTTTGAAGAATTCATCATCTATTGTTTTCTTTTTGTTGTTATTTCATTTATAATTTTTATAGTATAGGTTTCAT'
    'TTGGTAATCAACTTTAATCCATGCGGTTAGGTTTTTTTATTTTCTCGTCTACGACTTTTATATCCACAACTAGATTTTAAT'
    'CCGCGGTATATCGCGGTATAATTTACTTTTTAAAGTTAATATATATTAAAACTTG'
)
D. You want to prototype the behavior of an automatic pH titrator, including a pH probe, 1 N hydrochloric acid pump, and 1 N sodium
hydroxide pump connected to the same computer. You want an enzymatic reaction to proceed for one hour with pH on the range of (7.0, 7.2).
The pH should be measured in 30 s increments if the pH is determined to be in the interval. If the pH goes out of range, add the appropriate
solution dropwise to return to the pH range, over increments of 1 s. Use the provided functions to draft the behavior. Test the code by speeding
up the time 60-fold. The pH fluctuates by sampling from a Gaussian distribution every 30 s, purely for illustration.
pH = 7.1
hcl_added = 0 # mL
naoh_added = 0 # mL
one_drop = 0.0648524 # mL
def add_hcl():
    '''use by: hcl_added, pH = add_hcl()
    This will update the record of added HCl and update the pH'''
    return hcl_added + one_drop, pH - 0.05
def add_naoh():
    '''use by: naoh_added, pH = add_naoh() '''
    return naoh_added + one_drop, pH + 0.05
times = {
    'reaction': 1 * 60 * 60, # s
    'in_range': 30, # s
    'out_range': 1 # s
}
# expedite the time by 60 fold
speedup = 60
times = {name:t / speedup for (name, t) in times.items()}
With the tools we have learned in base Python, we can now design algorithmic solutions and apply them to data at scales we could not have
reached by manual effort. We cover commands from external Python packages in subsequent chapters, but we structure our scripts in
a similar manner. We build on these solutions in base Python; in practical implementations, however, code should be made more concise and
readable by using packages optimized for the task.
Python is a simple language suitable to begin exploring algorithmic solutions in chemical problems. After trying out a few practice problems, you
should be comfortable opening a Python integrated development environment (IDE) for solving chemical problems in base Python.
• Chemists can use Python as a scientific calculator. The code documents the calculations performed.
• Python can manipulate text using string operations, which is amenable for files, sequences, and chemical structures.
• Booleans (True/False) allow us to process various inputs by running certain blocks of code only under a given condition.
• McKinney, W. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython; O'Reilly Media, Inc., 2012 (17).
• Severance, C. R. Python for Everybody: Exploring Data in Python 3; North Charleston: CreateSpace Independent Publishing Platform,
2016 (18).
Appendix A.
Solutions to Practice Problems
1.2.1 Solutions
A.
# molecular weights
benzodioxaborole_mw = 202.06 # g/mol
bromopropene_mw = 135.00 # g/mol
pd_catalyst_mw = 1155.59 # g/mol
# equivalence
benzodioxaborole_equiv = 1.1
bromopropene_equiv = 1.0
pd_catalyst_equiv = 0.01
B.
C.
absorbance = 0.31 # au
D.
# equilibrium constants
K1 = 1.7 * 10 ** (-3)
K2 = 2.5 * 10 ** (-4)
K1K2 = K1 * K2
# henry's law constant
kh = 29.41
1.3.3 Solutions
A.
B.
exon1 = (
    'ACAAAGTTAGCCTTCAAAATACTTACAAATCCCAATAAAAGACTTCATCTCCATGTGTATTTGAGTGTCAACGACAAG'
    'TCTACACAAAGGGTAAGAGGTCAACAAGACCACACAACACTTCTTACTATTAGTTTTGCAAAGGCCGTTCGTTGGACAT'
    'TTCCTTCTCTCTCCTCCCCTCTTCTTCTTCTTGTTCGCTCTATAAACTCTCATCTCTCACGTCTTTTTTTCCTTACATTC'
    'TCCAAACTCAAAATTTCATCACATTAATTTCTCTCTATTTTTCTTTTCTTACTTCAATAGTAATGGATAACACTGACCGT'
    'CGTCGCCGTCGTAAGCAACACAAAATCGCCCTCCATGACTCTGAAG'
)
exon2 = (
    'AAGTGAGCAGTATCGAATGGGAGTTTATCAACATGACTGAACAAGAAGAAGATCTCATCTTTCGAATGTACAGACTTGT'
    'CGGTGATAG'
)
exon3 = (
    'GTGGGATTTGATAGCAGGAAGAGTTCCTGGAAGACAACCAGAGGAGATAGAGAGATATTGGATAATGAGAAACAGTGAAG'
    'GCTTTGCTGATAAACGACGCCAGCTTCACTCATCTTCCCACAAACATACCAAGCCTCACCGTCCTCGCTTTTCTATCTAT'
    'CCTTCCTAGTGTTTTTGTTTTTAAGCCAACGAAAAAAGAAAATAAAAAAATTATAATAGATGTATAGTAGTGGTTCTTGT'
    'TAGTTTGAAGAATTCATCATCTATTGTTTTCTTTTTGTTGTTATTTCATTTATAATTTTTATAGTATAGGTTTCATTTGG'
    'TAATCAACTTTAATCCATGCGGTTAGGTTTTTTTATTTTCTCGTCTACGACTTTTATATCCACAACTAGATTTTAATCCG'
    'CGGTATATCGCGGTATAATTTACTTTTTAAAGTTAATATATATTAAAACTTG'
)
pam = 'GG'
index1 = exon1.find(pam)
index2 = exon2.find(pam)
index3 = exon3.find(pam)
print('First occurrences of the PAM site NGG in exons 1, 2, and 3 respectively: %d, %d, %d' % (index1, index2, index3))
C.
d_glucosamine_sulfate = 'C([C@H]([C@H]([C@@H]([C@H](C=O)N)O)O)O)O.OS(=O)(=O)O'
structure_list = d_glucosamine_sulfate.split('.')
len(structure_list)
print('lengths of structures:', len(structure_list[0]), len(structure_list[1]))
structure_list[0]
D.
float(saveline.split()[-1])
1.4.1 Solutions
A.
recombinant_sarafotoxin = (
    'MKDDAAIQQTLAKMGIKSSDIQPAPVAGMKTVLTNSGVLYITDDGKHIIQGPMYDVSGTAPVNVTNKMLLKQLNALEKEMIVYKA'
    'PQEKHVITVFTDITCGYCHKLHEQMADYNALGITVRYLAFPRQGLDSDAEKEMKAIWCAKDKNKAFDDVMAGKSVAPASCDVDIAD'
    'HYALGVQLGVSGTPAVVLSNGTLVPGYQPPKEMKEFLDEHQKMTSGKGSTSGSGHHHHHHGTMTSLYKKAGLENLYFQCTCKDMTD'
    'KECLYFCHQDIIW'
)
tev_index(recombinant_sarafotoxin)
B.
R = 0.008314 # kJ / (mol K)
F = 96.49 # kJ / (V mol)
C.
exact_masses = {'C': 12.000000, 'H': 1.007825, 'O': 15.994915, 'N': 14.003074, 'Na': 22.989770}
D.
from math import log

def k_from_halflife(t_half):
    '''Calculate decay constant k from halflife in inverse time'''
    return - log(1/2) / (t_half) # corrected the erroneous parentheses
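As a check on the 51 min quoted in the problem, the shutdown time can be computed as in the sketch below. It assumes first-order decay, N_t = N_0 exp(−kt), and a carbon-11 half-life of about 20.4 min; both the half-life value and the variable names are assumptions of this sketch.

```python
from math import log

def k_from_halflife(t_half):
    '''Calculate decay constant k from the half-life, in inverse time'''
    return -log(1 / 2) / t_half

t_half = 20.4   # min; approximate half-life of carbon-11 (assumed here)
n_start = 0.17  # nCi/mL, measured concentration
n_end = 0.03    # nCi/mL, acceptable concentration

k = k_from_halflife(t_half)
# N_t = N_0 * exp(-k * t), so t = ln(N_0 / N_t) / k
shutdown_time = log(n_start / n_end) / k
print('%.0f min' % shutdown_time)
# 51 min
```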
1.5.3 Solutions
A.
def relu(val):
    if val > 0:
        return val
    else:
        return 0
relu(1.3)
B.
def functional_group(chemical_shift):
    possible_functional_groups = []
    if chemical_shift >= 0.5 and chemical_shift <= 2.0:
        possible_functional_groups.append('aliphatic')
    if chemical_shift >= 1.5 and chemical_shift <= 2.5:
        possible_functional_groups.append('allylic')
    if chemical_shift >= 2.5 and chemical_shift <= 4.5:
        possible_functional_groups.append('CH2-X')
    if chemical_shift >= 0.5 and chemical_shift <= 5.0:
        possible_functional_groups.append('ROH')
    if chemical_shift >= 4.5 and chemical_shift <= 6.5:
        possible_functional_groups.append('vinylic')
    if chemical_shift >= 6.0 and chemical_shift <= 8.5:
        possible_functional_groups.append('aromatic')
    return possible_functional_groups
functional_group(7)
1.6.4 Solutions
A.
def fibonacci(n):
    '''Return the nth entry of the Fibonacci sequence'''
    seq = [1, 1]
    # the first two entries are already in seq, so append n - 2 more
    for i in range(n - 2):
        seq.append(seq[-2] + seq[-1])
    return seq[-1]
fibonacci(10)
B.
drug_se_radius = {}
C.
def grna_candidates(dna):
    '''return candidate regions to hybridize to gRNA using SpCas9'''
    # lower case the sequence
    dna = dna.lower()
    # store indices to capitalize. capitalize at end
    cap_indices = []
    # scan the sequence from 5' to 3'
    for i in range(len(dna) - 23):
        # determine if on index of PAM (CCN)
        is_pam = dna[i] == 'c' and dna[i + 1] == 'c'
        grna_region = dna[i + 3:i + 23]
        # move on if not PAM
        if not is_pam:
            continue
        # skip if polyT is present
        elif 'tttt' in grna_region:
            continue
        else:
            # calculate the GC content
            gc_cont = (grna_region.count('g') + grna_region.count('c')) / 20
            if gc_cont >= 0.3 and gc_cont <= 0.8:
                cap_indices += list(range(i + 3, i + 23))
    # now scan in the opposite direction
    for i in range(len(dna) - 1, 22, -1):
        # determine if on index of PAM (NGG)
        is_pam = dna[i] == 'g' and dna[i - 1] == 'g'
        grna_region = dna[i - 22:i - 2]
        # move on if not PAM
        if not is_pam:
            continue
        # skip if polyT is present
        elif 'tttt' in grna_region:
            continue
        else:
            # calculate the GC content
            gc_cont = (grna_region.count('g') + grna_region.count('c')) / 20
            if gc_cont >= 0.3 and gc_cont <= 0.8:
                cap_indices += list(range(i - 22, i - 2))
    # save dna to list
    dna = list(dna)
    # capitalize at each index determined to be candidate site for gRNA recognition
    for i in cap_indices:
        dna[i] = dna[i].upper()
    return ''.join(dna)
exon1 = (
    'ACAAAGTTAGCCTTCAAAATACTTACAAATCCCAATAAAAGACTTCATCTCCATGTGTATTTGAGTGTCAACGACAAGTCTA'
    'CACAAAGGGTAAGAGGTCAACAAGACCACACAACACTTCTTACTATTAGTTTTGCAAAGGCCGTTCGTTGGACATTTCCTTCT'
    'CTCTCCTCCCCTCTTCTTCTTCTTGTTCGCTCTATAAACTCTCATCTCTCACGTCTTTTTTTCCTTACATTCTCCAAACTCAA'
    'AATTTCATCACATTAATTTCTCTCTATTTTTCTTTTCTTACTTCAATAGTAATGGATAACACTGACCGTCGTCGCCGTCGTAA'
    'GCAACACAAAATCGCCCTCCATGACTCTGAAG'
)
exon2 = (
    'AAGTGAGCAGTATCGAATGGGAGTTTATCAACATGACTGAACAAGAAGAAGATCTCATCTTTCGAATGTACAGACTTGTCGG'
    'TGATAG'
)
exon3 = (
    'GTGGGATTTGATAGCAGGAAGAGTTCCTGGAAGACAACCAGAGGAGATAGAGAGATATTGGATAATGAGAAACAGTGAAGGCT'
    'TTGCTGATAAACGACGCCAGCTTCACTCATCTTCCCACAAACATACCAAGCCTCACCGTCCTCGCTTTTCTATCTATCCTTCCT'
    'AGTGTTTTTGTTTTTAAGCCAACGAAAAAAGAAAATAAAAAAATTATAATAGATGTATAGTAGTGGTTCTTGTTAGTTTGAAGA'
    'ATTCATCATCTATTGTTTTCTTTTTGTTGTTATTTCATTTATAATTTTTATAGTATAGGTTTCATTTGGTAATCAACTTTAATC'
    'CATGCGGTTAGGTTTTTTTATTTTCTCGTCTACGACTTTTATATCCACAACTAGATTTTAATCCGCGGTATATCGCGGTATAAT'
    'TTACTTTTTAAAGTTAATATATATTAAAACTTG'
)
grna_candidates(exon1)
D.
pH = 7.1
hcl_added = 0 # mL
naoh_added = 0 # mL
one_drop = 0.0648524 # mL
def add_hcl():
    '''use by: hcl_added, pH = add_hcl()
    This will update the record of added HCl and update the pH'''
    return hcl_added + one_drop, pH - 0.05
def add_naoh():
    '''use by: naoh_added, pH = add_naoh() '''
    return naoh_added + one_drop, pH + 0.05
times = {
    'reaction': 1 * 60 * 60, # s
    'in_range': 30, # s
    'out_range': 1 # s
}
# expedite the time by 60 fold
speedup = 60
times = {name:t / speedup for (name, t) in times.items()}
sleep(times['out_range'])
# test out the code
if one_second_counter == 30:
one_second_counter = 0
pH += gauss(0, 0.05)
tic = time()
time_elapsed = (tic - toc)/60 * speedup # min
print('%.2f min|pH = %.2f' %(time_elapsed, pH))
Bibliography
1. Van Rossum, G., Drake, F. L. Python 3 Reference Manual; CreateSpace: Scotts Valley, CA, 2009.
2. Miyaura, N.; Yamada, K.; Suzuki, A. A new stereospecific cross-coupling by the palladium-catalyzed reaction of 1-alkenylboranes with 1-alkenyl or 1-alkynyl halides.
Tetrahedron Lett. 1979, 20(36), 3437–3440, 10.1016/S0040-4039(01)95429-2.
3. Mojzita, D.; Penttilä, M.; Richard, P. Identification of an l-arabinose reductase gene in Aspergillus niger and its role in l-arabinose catabolism. J. Biol. Chem. 2010, 285(31),
23622–23628, 10.1074/jbc.M110.113399.
4. Boardman, N. K.; Thorne, S. W. Sensitive fluorescence method for the determination of chlorophyll a/chlorophyll b ratios. Biochim. Biophys. Acta (BBA)-Bioenergetics 1971,
253(1), 222–231, 10.1016/0005-2728(71)90248-9.
5. Eyre, B. D.; Cyronak, T.; Drupp, P.; De Carlo, E. H.; Sachs, J. P.; Andersson, A. J. Coral reefs will transition to net dissolving before end of century. Science 2018, 359(6378),
908–911, 10.1126/science.aao1118.
6. Fradkov, A. F.; Chen, Y.; Ding, L.; Barsova, E. V.; Matz, M. V.; Lukyanov, S. A. Novel fluorescent protein from discosoma coral and its mutants possesses a unique far-red
fluorescence. FEBS Lett. 2000, 479 (3), 127–130, 10.1016/S0014-5793(00)01895-0.
7. Mullis, K., Faloona, F., Scharf, S., Saiki, R., Horn, G., Erlich, H. Specific enzymatic amplification of DNA in vitro: the polymerase chain reaction. In Cold Spring Harbor
Symposia on Quantitative Biology; Cold Spring Harbor Laboratory Press, 1986; vol. 51, pp. 263–273.
8. Knight, T. Idempotent Vector Design for Standard Assembly of Biobricks; MIT Artificial Intelligence Laboratory; MIT Synthetic Biology Working Group, 2003.
9. Grützner, R.; Martin, P.; Horn, C.; Mortensen, S.; Cram, E. J.; Lee-Parsons, C. W. T.; Stuttmann, J.; Marillonnet, S. High-efficiency genome editing in plants mediated by a
cas9 gene containing multiple introns. Plant Commun. 2021, 2(2), 100135, 10.1016/j.xplc.2020.100135.
10. National Center for Biotechnology Information. Gene ID: 835401, Arabidopsis thaliana Homeodomain-like superfamily protein (TRY), Chromosome: 5; National Library of
Medicine (US): Bethesda, MD, 1988. https://www.ncbi.nlm.nih.gov/gene/835401 (accessed 2022-01-08).
11. Manathunga, M.; Miao, Y.; Mu, D.; Götz, A. W.; Merz, K. M., Jr. Parallel implementation of density functional theory methods in the quantum interaction computational
kernel program. J. Chem. Theory Comput. 2020, 16(7), 4315–4326, 10.1021/acs.jctc.0c00290.
12. Manathunga, M.; Jin, C.; Cruzeiro, V. W. D.; Smith, V.; Keipert, K.; Pekurovsky, D.; Mu, D.; Miao, Y.; He, X.; Ayers, K.; Brothers, E.; Götz, A. W.; Merz, K. M.
Quick-21.03. https://github.com/merzlab/QUICK.
13. Sequeira, A. F.; Turchetto, J.; Saez, N. J.; Peysson, F.; Ramond, L.; Duhoo, Y.; Blémont, M.; Fernandes, V.O.; Gama, L. T.; Ferreira, L. M. A.; et al. Gene design, fusion
technology and tev cleavage conditions influence the purification of oxidized disulphide-rich venom peptides in Escherichia coli. Microbial Cell Factories 2017, 16(1), 4,
10.1186/s12934-016-0618-0.
14. Fernandes, R. A.; Gangani, A. J.; Panja, A. Synthesis of 5-vinyl-2-isoxazolines by palladium-catalyzed intramolecular o-allylation of ketoximes. Org. Lett. 2021, 23(16),
6227–6231, 10.1021/acs.orglett.1c01897.
15. Bennetzen, J. L.; Hall, B. D. The primary structure of the Saccharomyces cerevisiae gene for alcohol dehydrogenase. J. Biol. Chem. 1982, 257 (6), 3018–3025, 10.1016/
S0021-9258(19)81067-0.
16. Di Cagno, M. P.; Clarelli, F.; Våbenø, J.; Lesley, C.; Darsim Rahman, S.; Cauzzo, J.; Franceschinis, E.; Realdon, N.; Stein, P. C. Experimental determination of drug diffusion
coefficients in unstirred aqueous environments by temporally resolved concentration measurements. Mol. Pharmaceutics 2018, 15(4), 1488–1494.
17. McKinney, W. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and Ipython; O'Reilly Media, Inc., 2012.
18. Severance, C. R. Python for Everybody: Exploring Data in Python 3; CreateSpace Independent Publishing Platform: North Charleston, 2016.
19. Reilly, S. The role of libraries in supporting data exchange. In 78th IFLA General Conference and Assembly, 2012.
http://conference.ifla.org/sites/default/files/files/papers/wlic2012/116-reilly-en.pdf.
20. Harris, C. R.; Millman, K. J.; van der Walt, S. J.; Gommers, R.; Virtanen, P.; Cournapeau, D.; Wieser, E.; Taylor, J.; Berg, S.; Smith, N. J.; Kern, R.; Picus, M.; Hoyer, S.;
van Kerkwijk, M. H.; Brett, M.; Haldane, A.; del Río, J. F.; Wiebe, M.; Peterson, P.; Gérard-Marchant, P.; Sheppard, K.; Reddy, T.; Weckesser, W.; Abbasi, H.; Gohlke, C.;
Oliphant, T. E. Array programming with NumPy. Nature 2020, 585 (7825), 357–362, 10.1038/s41586-020-2649-2.
21. Kabsch, W. A solution for the best rotation to relate two sets of vectors. Acta Crystallogr. A 1976, 32 (5), 922–923, 10.1107/S0567739476001873.
22. McKinney, W. Data structures for statistical computing in python. In Proceedings of the 9th Python in Science Conference; Austin, TX, 2010; vol. 445, pp. 51–56.
23. Simmons, J. P.; Nelson, L. D.; Simonsohn, U. False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychol.
Sci. 2011, 22 (11), 1359–1366, 10.1177/0956797611417632.
24. Picache, J. A.; Rose, B. S.; Balinski, A.; Leaptrot, K. L.; Sherrod, S. D.; May, J. C.; McLean, J. A. Collision cross section compendium to annotate and predict multi-omic
compound identities. Chem. Sci. 2019, 10 (4), 983–993, 10.1039/C8SC04396E.
25. Wickham, H. Tidy data. J. Stat. Softw. 2014, 59 (10), 1–23, 10.18637/jss.v059.i10.
26. Waskom, M. L. Seaborn: statistical data visualization. J. Open Source Softw. 2021, 6 (60), 3021, 10.21105/joss.03021.
27. Peng, R. The reproducibility crisis in science: a statistical counterattack. Significance 2015, 12 (3), 30–32, 10.1111/j.1740-9713.2015.00827.x.
28. Tetko, I. V.; Engkvist, O.; Koch, U.; Reymond, J.-L.; Chen, H. Bigchem: challenges and opportunities for big data analysis in chemistry. Mol. Inf. 2016, 35 (11–12),
615–621, 10.1002/minf.201600073.
29. Leonelli, S. Scientific research and big data. In The Stanford Encyclopedia of Philosophy, summer 2020 ed.; Zalta, E. N., Ed.; 2020.
https://plato.stanford.edu/archives/sum2020/entries/science-big-data/.
30. Prakash, N.; Gareja, D. A. Cheminformatics. J. Proteomics Bioinform. 2010, 03, 249–252, 10.4172/jpb.1000147.
31. Wishart, D. S. Introduction to cheminformatics. Curr. Protoc. Bioinform. 2016, 53 (1), 14–11, 10.1002/0471250953.bi1401s18.
32. Firdaus Begam, B.; Satheesh, J.; Kumar. A study on cheminformatics and its applications on modern drug discovery. Procedia Eng. 2012, 38, 1264–1275, 10.1016/
j.proeng.2012.06.156.
33. Anderson, E., Veith, G. D., Weininger, D. SMILES, a line notation and computerized interpreter for chemical structures; US Environmental Protection Agency, Environmental
Research Laboratory, 1987.
34. Daylight Chemical Information Systems, Inc. Smarts - a language for describing molecular patterns. Daylight Theory Manual, ver. 4.9, 2011.
https://daylight.com/dayhtml/doc/theory/theory.smarts.html (accessed Jan 22, 2022).
35. James, C.; Weininger, D.; Delany, J. Daylight theory manual, ver. 4.9., 2011. https://daylight.com/dayhtml/doc/theory (accessed Jan 22, 2022).
36. OpenEye Scientific Software, Inc. Oechem toolkit 3.2.0.0. https://docs.eyesopen.com/toolkits/python/oechemtk/index.html (accessed Jan 22, 2022).
37. OpenEye Scientific Software, Inc. Chemaxon–Cheminformatics platforms and desktop applications. https://chemaxon.com (accessed Jan 22, 2022).
38. O'Boyle, N. M.; Banck, M.; James, C. A.; Morley, C.; Vandermeersch, T.; Hutchison, G. R. Open babel: an open chemical toolbox. J. Cheminform. 2011, 3 (1), 1–14,
10.1186/1758-2946-3-33.
39. Landrum, G. RDKit: Open-source cheminformatics. http://www.rdkit.org (accessed Jan 22, 2022).
40. Kim, S.; Chen, J.; Cheng, T.; Gindulyte, A.; He, J.; He, S.; Li, Q.; Shoemaker, B. A.; Thiessen, P. A.; Yu, B.; Zaslavsky, L.; Zhang, J.; Bolton, E. E. Pubchem in 2021: new
data content and improved web interfaces. Nucleic Acids Res. 2021, 49(D1), D1388–D1395, 10.1093/nar/gkaa971.
41. Davies, M.; Nowotka, M.; Papadatos, G.; Dedman, N.; Gaulton, A.; Atkinson, F.; Bellis, L.; Overington, J. P. Chembl web services: streamlining access to drug discovery data
and utilities. Nucleic Acids Res. 2015, 43 (W1), W612–W620, 10.1093/nar/gkv352.
42. Wishart, D. S.; Guo, A. C.; Oler, E.; Wang, F.; Anjum, A.; Peters, H.; Dizon, R.; Sayeeda, Z.; Tian, S.; Lee, B. L.; Berjanskii, M.; Mah, R.; Yamamoto, M.; Jovel, J.;
Torres-Calzada, C.; Hiebert-Giesbrecht, M.; Lui, V. W.; Varshavi, D.; Varshavi, D.; Allen, D.; Arndt, D.; Khetarpal, N.; Sivakumaran, A.; Harford, K.; Sanford, S.; Yee, K.;
Cao, X.; Budinski, Z.; Liigand, J.; Zhang, L.; Zheng, J.; Mandal, R.; Karu, N.; Dambrova, M.; Schiöth, H. B.; Greiner, R.; Gautam, V. Hmdb 5.0: the human metabolome
database for 2022. Nucleic Acids Res. 2022, 50(D1), D622–D631, 10.1093/nar/gkab1062.
43. Bansal, P.; Morgat, A.; Axelsen, K. B.; Muthukrishnan, V.; Coudert, E.; Aimo, L.; Hyka-Nouspikel, N.; Gasteiger, E.; Kerhornou, A.; Neto, T. B.; Pozzato, M.; Blatter, M.-C.;
Ignatchenko, A.; Redaschi, N.; Bridge, A. Rhea, the reaction knowledgebase in 2022. Nucleic Acids Res. 2022, 50(D1), D693–D700, 10.1093/nar/gkab1016.
44. Kearnes, S. M.; Maser, M. R.; Wleklinski, M.; Kast, A.; Doyle, A. G.; Dreher, S. D.; Hawkins, J. M.; Jensen, K. F.; Coley, C. W. The open reaction database. J. Am. Chem. Soc.
2021, 143 (45), 18820–18826, 10.1021/jacs.1c09820.
45. Ruddigkeit, L.; Awale, M.; Reymond, J.-L. Expanding the fragrance chemical space for virtual screening. J. Cheminform. 2014, 6 (1), 1–12, 10.1186/1758-2946-6-27.
46. Ahmed, J.; Preissner, S.; Dunkel, M.; Worth, C. L.; Eckert, A.; Preissner, R. Supersweet—a resource on natural and artificial sweetening agents. Nucleic Acids Res. 2011, 39
(Database issue), D377–D382, 10.1093/nar/gkq917.
47. Wiener, A.; Shudler, M.; Levit, A.; Niv, M. Y. BitterDB: a database of bitter compounds. Nucleic Acids Res. 2012, 40 (Database issue), D413–D419, 10.1093/nar/gkr755.
48. Danishuddin; Khan, A. U. Descriptors and their selection methods in QSAR analysis: paradigm for drug design. Drug Discovery Today 2016, 21 (8), 1291–1302,
10.1016/j.drudis.2016.06.013.
49. Durant, J. L.; Leland, B. A.; Henry, D. R.; Nourse, J. G. Reoptimization of MDL keys for use in drug discovery. J. Chem. Inf. Comput. Sci. 2002, 42 (6), 1273–1280,
10.1021/ci010132r.
50. Morgan, H. L. The generation of a unique machine description for chemical structures-a technique developed at Chemical Abstracts Service. J. Chem. Document. 1965, 5 (2),
107–113, 10.1021/c160017a018.
51. Daylight Chemical Information Systems, Inc. Fingerprints - screening and similarity. Daylight Theory Manual, ver. 4.9, 2011.
https://daylight.com/dayhtml/doc/theory/theory.smarts.html (accessed Jan 22, 2022).
52. MACCS structural keys; Accelrys, Inc.: San Diego, CA, 2011.
53. Tanimoto, T. T. An Elementary Mathematical Theory of Classification and Prediction; International Business Machines Corporation, 1958.
54. Sheridan, R. P.; Kearsley, S. K. Why do we need so many chemical similarity search methods? Drug Discovery Today 2002, 7 (17), 903–911,
10.1016/S1359-6446(02)02411-X.
55. Chen, H.; Kogej, T.; Engkvist, O. Cheminformatics in drug discovery, an industrial perspective. Mol. Inf. 2018, 37 (9–10), e1800041, 10.1002/minf.201800041.
56. Henderson, L. J. Concerning the relationship between the strength of acids and their capacity to preserve neutrality. Am. J. Physiol. Legacy Content 1908, 21 (2), 173–179,
10.1152/ajplegacy.1908.21.2.173.
57. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau,
D.; Brucher, M.; Perrot, M.; Duchesnay, E. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
58. Subramanian, G.; Ramsundar, B.; Pande, V.; Denny, R. A. Computational modeling of β-secretase 1 (BACE-1) inhibitors using ligand based approaches. J. Chem. Inf. Model.
2016, 56 (10), 1936–1949, 10.1021/acs.jcim.6b00290.
59. Venugopal, C.; Demos, C. M.; Jagannatha Rao, K. S.; Pappolla, M. A.; Sambamurti, K. Beta-secretase: structure, function, and evolution. CNS Neurol. Disord. Drug Target.
2008, 7 (3), 278–294, 10.2174/187152708784936626.
60. Breiman, L. Random forests. Mach. Learn. 2001, 45 (1), 5–32, 10.1023/A:1010933404324.
61. Pearson, K. LIII. On lines and planes of closest fit to systems of points in space. Lond. Edinb. Dublin Philos. Mag. J. Sci. 1901, 2 (11), 559–572,
10.1080/14786440109462720.
62. Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605.
63. Rogers, D.; Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 2010, 50 (5), 742–754, 10.1021/ci100050t.
64. Ward, J. H., Jr. Hierarchical grouping to optimize an objective function. J. Am. Stat. Assoc. 1963, 58 (301), 236–244, 10.1080/01621459.1963.10500845.
65. Liu, F. T.; Ting, K. M.; Zhou, Z.-H. Isolation forest. In 2008 Eighth IEEE International Conference on Data Mining; IEEE, 2008; pp. 413–422.
66. Jurs, P. C.; Kowalski, B. R.; Isenhour, T. L.; Reilley, C. N. Computerized learning machines applied to chemical problems. Molecular structure parameters from low resolution
mass spectrometry. Anal. Chem. 1970, 42 (12), 1387–1394, 10.1021/ac60294a015.
67. Artrith, N.; Butler, K. T.; Coudert, F.-X.; Han, S.; Isayev, O.; Jain, A.; Walsh, A. Best practices in machine learning for chemistry. Nat. Chem. 2021, 13 (6), 505–508,
10.1038/s41557-021-00716-z.
68. Dral, P. O. Quantum chemistry in the age of machine learning. J. Phys. Chem. Lett. 2020, 11 (6), 2336–2347, 10.1021/acs.jpclett.9b03664.
69. Panteleev, J.; Gao, H.; Jia, L. Recent applications of machine learning in medicinal chemistry. Bioorg. Med. Chem. Lett. 2018, 28 (17), 2807–2815,
10.1016/j.bmcl.2018.06.046.
70. Strieth-Kalthoff, F.; Sandfort, F.; Segler, M. H. S.; Glorius, F. Machine learning the ropes: principles, applications and directions in synthetic chemistry. Chem. Soc. Rev. 2020,
49 (17), 6154–6168, 10.1039/C9CS00786E.
71. Jumper, J.; Evans, R.; Pritzel, A.; Green, T.; Figurnov, M.; Ronneberger, O.; Tunyasuvunakool, K.; Bates, R.; Žídek, A.; Potapenko, A.; Bridgland, A.; Meyer, C.; Kohl, S. A.
A.; Ballard, A. J.; Cowie, A.; Romera-Paredes, B.; Nikolov, S.; Jain, R.; Adler, J.; Back, T.; Petersen, S.; Reiman, D.; Clancy, E.; Zielinski, M.; Steinegger, M.; Pacholska, M.;
Berghammer, T.; Bodenstein, S.; Silver, D.; Vinyals, O.; Senior, A. W.; Kavukcuoglu, K.; Kohli, P.; Hassabis, D. Highly accurate protein structure prediction with AlphaFold.
Nature 2021, 596 (7873), 583–589, 10.1038/s41586-021-03819-2.
72. Bernstein, F. C.; Koetzle, T. F.; Williams, G. J. B.; Meyer, E. F., Jr.; Brice, M. D.; Rodgers, J. R.; Kennard, O.; Shimanouchi, T.; Tasumi, M. The protein data bank: a
computer-based archival file for macromolecular structures. J. Mol. Biol. 1977, 112 (3), 535–542, 10.1016/S0022-2836(77)80200-3.
73. Xiao, T.; Lu, J.; Zhang, J.; Johnson, R. I.; McKay, L. G. A.; Storm, N.; Lavine, C. L.; Peng, H.; Cai, Y.; Rits-Volloch, S.; Lu, S.; Quinlan, B. D.; Farzan, M.; Seaman,
M. S.; Griffiths, A.; Chen, B. A trimeric human angiotensin-converting enzyme 2 as an anti-SARS-CoV-2 agent. Nat. Struct. Mol. Biol. 2021, 28 (2), 202–209,
10.1038/s41594-020-00549-3.
74. Barca, G. M. J.; Bertoni, C.; Carrington, L.; Datta, D.; de Silva, N.; Deustua, J. E.; Fedorov, D. G.; Gour, J. R.; Gunina, A. O.; Guidez, E.; Harville, T.; Irle, S.; Ivanic,
J.; Kowalski, K.; Leang, S. S.; Li, H.; Li, W.; Lutz, J. J.; Magoulas, I.; Mato, J.; Mironov, V.; Nakata, H.; Pham, B. Q.; Piecuch, P.; Poole, D.; Pruitt, S. R.; Rendell, A. P.;
Roskop, L. B.; Ruedenberg, K.; Sattasathuchana, T.; Schmidt, M. W.; Shen, J.; Slipchenko, L.; Sosonkina, M.; Sundriyal, V.; Tiwari, A.; Galvez Vallejo, J. L.; Westheimer, B.;
Włoch, M.; Xu, P.; Zahariev, F.; Gordon, M. S. Recent developments in the general atomic and molecular electronic structure system. J. Chem. Phys. 2020, 152 (15), 154102,
10.1063/5.0005188.
75. Virtanen, P.; Gommers, R.; Oliphant, T. E.; Haberland, M.; Reddy, T.; Cournapeau, D.; Burovski, E.; Peterson, P.; Weckesser, W.; Bright, J.; van der Walt, S. J.; Brett,
M.; Wilson, J.; Millman, K. J.; Mayorov, N.; Nelson, A. R. J.; Jones, E.; Kern, R.; Larson, E.; Carey, C. J.; Polat, I.; Feng, Y.; Moore, E. W.; VanderPlas, J.; Laxalde, D.;
Perktold, J.; Cimrman, R.; Henriksen, I.; Quintero, E. A.; Harris, C. R.; Archibald, A. M.; Ribeiro, A. H.; Pedregosa, F.; van Mulbregt, P. SciPy 1.0 Contributors. SciPy 1.0:
Fundamental Algorithms for Scientific Computing in Python. Nat. Methods 2020, 17, 261–272, 10.1038/s41592-019-0686-2.
76. Briggs, T. S.; Rauscher, W. C. An oscillating iodine clock. J. Chem. Educ. 1973, 50 (7), 496, 10.1021/ed050p496.
77. Kim, K.-R.; Lee, D. J.; Shin, K. J. A simplified model for the Briggs–Rauscher reaction mechanism. J. Chem. Phys. 2002, 117 (6), 2710–2717, 10.1063/1.1491243.
78. Hjorth Larsen, A.; Jørgen Mortensen, J.; Blomqvist, J.; Castelli, I. E.; Christensen, R.; Dułak, M.; Friis, J.; Groves, M. N.; Hammer, B.; Hargus, C.; Hermes, E. D.; Jennings,
P. C.; Bjerre Jensen, P.; Kermode, J.; Kitchin, J. R.; Leonhard Kolsbjerg, E.; Kubal, J.; Kaasbjerg, K.; Lysgaard, S.; Bergmann Maronsson, J.; Maxson, T.; Olsen, T.; Pastewka,
L.; Peterson, A.; Rostgaard, C.; Schiøtz, J.; Schütt, O.; Strange, M.; Thygesen, K. S.; Vegge, T.; Vilhelmsen, L.; Walter, M.; Zeng, Z.; Jacobsen, K. W. The atomic simulation
environment—a python library for working with atoms. J. Phys.: Condens. Matter 2017, 29 (27), 273002, 10.1088/1361-648X/aa680e.
79. Smith, D. G. A.; Burns, L. A.; Simmonett, A. C.; Parrish, R. M.; Schieber, M. C.; Galvelis, R.; Kraus, P.; Kruse, H.; di Remigio, R.; Alenaizan, A.; James, A. M.; Lehtola, S.;
Misiewicz, J. P.; Scheurer, M.; Shaw, R. A.; Schriber, J. B.; Xie, Y.; Glick, Z. L.; Sirianni, D. A.; O'Brien, J. S.; Waldrop, J. M.; Kumar, A.; Hohenstein, E. G.; Pritchard, B. P.;
Brooks, B. R.; Schaefer, H. F., III; Sokolov, A. Y.; Patkowski, K.; DePrince, A. E., III; Bozkaya, U.; King, R. A.; Evangelista, F. A.; Turney, J. M.; Crawford, T. D.; Sherrill, C.
D. Psi4 1.4: open-source software for high-throughput quantum chemistry. J. Chem. Phys. 2020, 152 (18), 184108, 10.1063/5.0006002.
80. Anthony, N. G.; Johnston, B. F.; Khalaf, A. I.; MacKay, S. P.; Parkinson, J. A.; Suckling, C. J.; Waigh, R. D. Short lexitropsin that recognizes the DNA minor groove at
5′-ACTAGT-3′: Understanding the role of isopropyl-thiazole. J. Am. Chem. Soc. 2004, 126 (36), 11338–11349, 10.1021/ja030658n.
81. Devereux, C.; Smith, J. S.; Huddleston, K. K.; Barros, K.; Zubatyuk, R.; Isayev, O.; Roitberg, A. E. Extending the applicability of the ANI deep learning molecular potential to
sulfur and halogens. J. Chem. Theory Comput. 2020, 16 (7), 4192–4202, 10.1021/acs.jctc.0c00121.
82. Cock, P. J. A.; Antao, T.; Chang, J. T.; Chapman, B. A.; Cox, C. J.; Dalke, A.; Friedberg, I.; Hamelryck, T.; Kauff, F.; Wilczynski, B.; de Hoon, M. J. Biopython: freely
available Python tools for computational molecular biology and bioinformatics. Bioinformatics 2009, 25 (11), 1422–1423, 10.1093/bioinformatics/btp163.
83. Hamelryck, T.; Manderick, B. PDB file parser and structure class implemented in Python. Bioinformatics 2003, 19 (17), 2308–2310, 10.1093/bioinformatics/btg299.
84. Wang, S.; Wacker, D.; Levit, A.; Che, T.; Betz, R. M.; McCorvy, J. D.; Venkatakrishnan, A. J.; Huang, X.-P.; Dror, R. O.; Shoichet, B. K.; Roth, B. L. D4 dopamine receptor
high-resolution structures enable the discovery of selective agonists. Science 2017, 358 (6361), 381–386, 10.1126/science.aan5468.
85. Kundrotas, P. J.; Anishchenko, I.; Dauzhenka, T.; Kotthoff, I.; Mnevets, D.; Copeland, M. M.; Vakser, I. A. Dockground: a comprehensive data resource for modeling of
protein complexes. Protein Sci. 2018, 27 (1), 172–181, 10.1002/pro.3295.
86. She, M.; Decker, C. J.; Svergun, D. I.; Round, A.; Chen, N.; Muhlrad, D.; Parker, R.; Song, H. Structural basis of Dcp2 recognition and activation by Dcp1. Mol. Cell 2008,
29 (3), 337–349, 10.1016/j.molcel.2008.01.002.
87. Houk, K. N.; Liu, F. Holy grails for computational organic chemistry and biochemistry. Acc. Chem. Res. 2017, 50 (3), 539–543, 10.1021/acs.accounts.6b00532.
88. Markowetz, F. All biology is computational biology. PLoS Biol. 2017, 15 (3), e2002050, 10.1371/journal.pbio.2002050.
Glossary
Algorithm: A finite set of instructions carried out to solve a problem.
Application Programming Interface (API): A standardized interface through which two or more pieces of software can interact.
Black box: Any system in which we know the input and output but not the inner workings of how the input is processed to produce the
output (e.g., programs that we know how to use but whose source code we have never seen).
Bug: An error in the source code of a computer program that causes it to behave unexpectedly.
Codon: A sequence of three nucleotides read during protein synthesis as a signal to add a particular amino acid or to start or stop peptide synthesis.
Compiled languages: Programming languages in which the whole source code is converted to machine code prior to execution.
Ion mobility spectrometry (IMS): An analytical technique in which gas-phase ions are separated according to their mobility through an inert gas.
K-fold cross validation: Evaluation of a supervised learning model by partitioning the data set into k folds and using each fold in turn
as the validation set while the model is trained on the remaining folds.
Principal component analysis (PCA): A linear dimensionality reduction technique that projects data onto a small number of orthogonal directions (principal components) chosen to retain as much of the variance as possible.
Restriction site: A particular, short sequence of nucleotides that is recognized and cleaved by the corresponding restriction enzyme (e.g., the
EcoRI restriction site is 5′-GAATTC-3′, which is cleaved by the restriction enzyme EcoRI).
SMILES: Simplified Molecular-Input Line-Entry System (SMILES) is a single-line notation that describes the structure of a chemical species
as a short string.
SMARTS: SMILES arbitrary target specification (SMARTS) is a language used to specify substructural patterns in molecules in a single string.
Syntax: The rules for combining symbols so that a line of code can be interpreted in a particular programming language.
Unsupervised learning: Machine learning algorithms which do not involve a target variable.
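
As an illustration of one of the more procedural entries above, the following is a minimal sketch of how a data set might be partitioned for k-fold cross validation. It uses only the Python standard library; the function name k_fold_indices and the choice to split in order without shuffling are assumptions made for this example and are not taken from the primer. In practice, libraries such as scikit-learn (57) provide equivalent utilities and usually shuffle the data first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, validation) index lists for k-fold cross validation.

    Hypothetical helper for illustration: samples are split in order,
    without shuffling, into k folds whose sizes differ by at most one.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes stay balanced.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        stop = start + base + (1 if fold < extra else 0)
        validation = indices[start:stop]
        # Every sample outside the current fold is used for training.
        train = indices[:start] + indices[stop:]
        yield train, validation
        start = stop


# Each of the 10 samples appears in exactly one validation fold.
folds = list(k_fold_indices(10, 3))
for train, validation in folds:
    print("validate on", validation, "train on", train)
```

Each sample is used for validation exactly once, so averaging the k validation scores gives an estimate of model performance that does not depend on a single train/validation split.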