Preview 2022 Python For Chemists Aramis Tanemura

Uploaded by

Geogri Zhukov
Copyright
© All Rights Reserved

https://pubs.acs.org/doi/book/10.1021/acsinfocus.7e5030

This is a limited PDF preview of the primer. The entire work is available in
ePub3 and includes additional multimedia.

Python for Chemists

Kiyoto Aramis Tanemura


Michigan State University

Diego Sierra-Costa
Michigan State University

Kenneth M. Merz, Jr.


Michigan State University


Preface

Python is a simple and concise scripting language (1). Unlike compiled languages, Python is an interpreted language, known for its ease of
coding rather than its computational speed. Python code specialized and optimized for a particular application is portable between projects as
Python packages. The ease of development makes Python an attractive language to apply in chemistry, a broad field that draws on diverse
skill sets yet offers many opportunities for algorithmic solutions.

Learning Python is easy and should not be a barrier at all. Knowing a handful of Python syntax rules is enough to begin solving problems.
Learning to code in Python can be done in one sitting. Also, there are plenty of resources online to look up the syntax and address bugs.
However, contextualizing chemical problems in Python is not always obvious. Programming in Python empowers chemists to apply their domain
knowledge to scales unreachable by manual effort.

In this digital primer, readers will explore practical use cases of Python for chemical data analysis, cheminformatics, machine learning, and
molecular modeling. We aim to guide readers in developing an intuition for finding algorithmic solutions. To
accomplish this, we explore a broad set of chemical problems and illustrate solutions implemented in Python. We do not expect all the problems
to be directly applicable to all readers. Instead, we want readers to develop the skill to identify problems in their research for which code may
automate operations and scale to large volumes of data or calculation. This digital primer utilizes the many functionalities of Python and is not
intended to be a resource to learn particular algorithms or chemistry. Readers are encouraged to supplement such concepts by reading
external sources.

In the first chapter, we explore basic Python. Subsequent chapters introduce relevant packages, with exposure to Python code
applied to chemical problems. The material is not intended to comprehensively discuss topics and packages developed for chemical applications in
Python, but rather to get you started quickly on using Python in chemistry research.

Therefore, we shorten the time from “learning” to “using” Python as much as possible, because the only time you substantively learn Python is
when you use it. There is no point in getting overwhelmed by endless books and tutorials, never to arrive at the actual programming part. Just
start solving problems and turn them into a project. If completing a project in Python was rewarding and you want to delve deeper into writing
software in Python, you can choose to gain training at an intermediate level. If not, you gained foundational Python knowledge to apply later
when it becomes useful.

Maybe the code in your first project turns out messy and difficult to use. That is fine because it is a first version that can be
incrementally improved. And if you stick with programming, when you come back to the project, you can rewrite the entire thing beautifully and
a lot faster. There will never be such progress unless you start.

The only time you will not encounter a bug is when you are not coding. Do not take a bug as a failure or any inadequacy on your part. Not
only is it normal, but each bug you encounter is something new you learn about your data and code. Be excited about the discovery of the codeʼs
unexpected behavior and make your code even stronger by fixing the bug.

Kiyoto Aramis Tanemura is a Ph.D. student working in the research group of Prof. Kenneth M. Merz in the Department of
Chemistry, Michigan State University. At the interface of computational chemistry and artificial intelligence, his research aims
to develop methodologies to predict spectral properties of small organic molecules for high throughput identification. He
completed his B.A. at Kalamazoo College in Chemistry and Mathematics, with concentrations in Biological Chemistry and
Molecular Biology as well as Biological Physics. He uses Python every day in all aspects of his research.

Diego Sierra-Costa is a doctoral candidate at the Department of Chemistry at Michigan State University under the
supervision of Prof. Kenneth M. Merz. His research in mathematical artificial intelligence and chemistry focuses on developing
new representations of small molecules for the prediction and calculation of physicochemical properties. Diego received his
B.Sc. in Physics from the National Autonomous University of Mexico where he focused on quantum optics and cold atoms.
Photo credit: Delilah Pacheco

© 2022 American Chemical Society 1


Kenneth M. Merz, Jr. is the Joseph Zichis Chair in Chemistry and a University Distinguished Professor at Michigan State
University. He is also the Editor-in-Chief of the ACS Journal of Chemical Information and Modeling. His research interest lies in
the development of theoretical and computational tools and their application to biological problems including structure and
ligand-based drug design, mechanistic enzymology, and methodological verification and validation. He has received several
honors including election as an ACS Fellow, the 2010 ACS Award for Computers in Chemical and Pharmaceutical Research,
election as a fellow of the American Association for the Advancement of Science, and a John Simon Guggenheim Fellowship.

Brief Table of Contents

About the Series


Preface

1 Begin Coding in Base Python

2 Data Analysis in Python

3 Cheminformatics

4 Machine Learning on Chemical Data

5 Modeling Chemical Systems


Appendix A. Solutions to Practice Problems
Bibliography
Glossary
Index

Detailed Table of Contents

About the Series


Preface

1 Begin Coding in Base Python


1.1 Introduction
1.2 Numerical Operations
1.2.1 Practice Problems
1.3 String Operations
1.3.1 Indexing
1.3.2 Primer Design for Polymerase Chain Reaction
1.3.3 Practice Problems
1.4 Functions
1.4.1 Practice Problems
1.5 Conditional Statements
1.5.1 Boolean Variables
1.5.2 Logic Gates
1.5.3 Practice Problems
1.6 Loops
1.6.1 For-Loop
1.6.1.1 List Comprehension
1.6.1.2 Iterables
1.6.2 While-Loop
1.6.3 Continue, Break, and Pass
1.6.4 Practice Problems
1.7 That’s a Wrap
1.8 Read These Next

2 Data Analysis in Python


2.1 Introduction
2.2 Scientific Computing with NumPy
2.2.1 Reshaping
2.2.2 Indexing
2.2.3 Algebra
2.2.4 Application
2.3 Pandas for Data Analysis
2.3.1 Loading the Data
2.3.2 Extraction from Raw Data
2.3.3 Exploratory Data Analysis
2.3.4 Data Manipulation
2.3.4.1 Subsetting
2.3.4.2 Sorting
2.3.4.3 Merging
2.4 Seaborn for Visualization
2.5 That’s a Wrap

2.6 Read These Next

3 Cheminformatics
3.1 Introduction
3.2 The SMILES and SMARTS Languages
3.2.1 SMILES
3.2.2 SMARTS
3.3 RDKit
3.4 Atoms and Bonds
3.5 Reactions
3.6 Inspecting a Database
3.7 Finding Substructures
3.8 Fingerprints
3.9 Molecular Similarity
3.10 That’s a Wrap
3.11 Read These Next

4 Machine Learning on Chemical Data


4.1 Introduction
4.2 Background
4.2.1 Human β-Secretase 1
4.2.2 pIC50
4.3 Supervised Learning
4.3.1 Data Preparation
4.3.1.1 Load Data Set
4.3.1.2 Randomizing the Order of the Instances
4.3.1.3 Data Partitioning
4.3.1.4 Standardizing the Features
4.3.2 Regression of pIC50
4.3.2.1 Training the Model
4.3.2.2 Model Performance
4.3.2.3 Random Forest Regressor
4.3.3 Classification of BACE-1 Inhibitor/Noninhibitor
4.3.3.1 Logistic Regression Classifier
4.3.3.2 Random Forest Classifier
4.3.4 Further Discussion Items for Supervised Learning
4.3.4.1 k-Fold Cross Validation
4.3.4.2 Hyperparameter Selection
4.3.4.3 Saving Your Work
4.3.4.4 Understanding Your Work
4.4 Unsupervised Learning
4.4.1 Dimensionality Reduction
4.4.2 Clustering
4.4.3 Anomaly (Outlier) Detection
4.5 That’s a Wrap
4.6 Read These Next

5 Modeling Chemical Systems


5.1 Introduction
5.2 File Formats
5.3 Dynamic Modeling in SciPy
5.4 Atomic Simulation Environment for Standard Interface
5.4.1 The Atoms Object
5.4.2 Calculators
5.4.3 Geometry Optimization
5.5 Protein Structures with Biopython
5.5.1 File I/O
5.5.2 Navigating Protein Structure
5.5.3 Application
5.6 That’s a Wrap
5.7 Read These Next
Appendix A. Solutions to Practice Problems
Bibliography
Glossary
Index

CHAPTER 1

Begin Coding in Base Python


1.1 Introduction

1.2 Numerical Operations

1.2.1 Practice Problems


1.3 String Operations

1.3.1 Indexing
1.3.2 Primer Design for Polymerase Chain Reaction
1.3.3 Practice Problems
1.4 Functions

1.4.1 Practice Problems


1.5 Conditional Statements

1.5.1 Boolean Variables


1.5.2 Logic Gates
1.5.3 Practice Problems
1.6 Loops

1.6.1 For-Loop
1.6.1.1 List Comprehension
1.6.1.2 Iterables
1.6.2 While-Loop
1.6.3 Continue, Break, and Pass
1.6.4 Practice Problems
1.7 That’s a Wrap

1.8 Read These Next

1.1 INTRODUCTION

Programming is increasingly valuable in modern chemistry. Algorithmic approaches allow us to interpret data that are impractical or infeasible
for a domain expert to inspect manually. As scientists, we produce, maintain, interpret, and communicate data. We are also tasked with faithfully
analyzing the data we collect, regardless of the format or volume of the data. We can navigate larger data sets through programming.

In this chapter, we cover basic Python syntax and follow it with chemically relevant problems. This chapter focuses on base Python with minimal
external packages so that we learn to design programmatic solutions to chemical problems rather than relying on existing solutions. To get
comfortable in coding, try the problems in this chapter as you read along to check your understanding. As you solve the problems, consider where
you might use codes like this in your research.

Using existing solutions is crucial in research. While there is value in knowing how the software works, it is impossible to know the inner
workings of every tool we use. At a certain level, the software we use is a black box. So, while we use only base Python in this chapter, this is
for educational purposes. The codes you may actually write and use in research would likely import relevant packages already optimized for the
particular tasks. The intuition we build by solving chemical problems in base Python will help us identify problems in our own
work that have algorithmic solutions.

We can complete many tasks using Python by understanding how to use the following operations:
• Numerical operations

• String operations

• Functions

• Conditional statements

• Loops

Let us become proficient in these commands in the subsequent sections of the chapter. In each section, we introduce the associated statements/
commands and use them on chemically relevant problems.

1.2 NUMERICAL OPERATIONS

As scientists who handle quantitative data, we use a scientific calculator for various applications. We can run the same calculations in Python. Let
us illustrate how we can perform some simple arithmetic operations.

# addition
5+7

12

# subtraction
5-7

-2

# multiplication
5*7

35

# division
5/7

0.7142857142857143

# exponentiation
5 ** 7

78125

# modulus
5%7

5
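The modulus operator returns the remainder of a division, so 5 % 7 evaluates to 5. It pairs naturally with floor division (//), which returns the integer quotient. A minimal sketch follows; the sample count and plate size are hypothetical values chosen for illustration and do not come from the text.

```python
# Hypothetical example: distributing samples across 96-well plates
num_samples = 250
wells_per_plate = 96

full_plates = num_samples // wells_per_plate      # integer quotient
leftover_samples = num_samples % wells_per_plate  # remainder

print(full_plates, leftover_samples)
```

Here the quotient is the number of completely filled plates and the remainder is the number of samples left for a partially filled plate.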

We can assign quantities to variables using the equal sign. The name of the variable can be almost anything; however, it is good practice to use
long, descriptive names for variables so that the quantity saved to the variable is obvious. Generic names like x or y are allowed, but a reader who
sees an operation on such variables cannot always tell what they represent.

five = 5
seven = 7
five + seven

12

We get the expected value when we save the numerical values 5 and 7 to variables five and seven, respectively, and perform a numerical
operation on the variable.

Variables are useful to organize the many numerical values we handle as chemists; they may be known constants, experimental parameters,
controlled variables, and measured data. The calculation is made modular and readable by using Python variables.

To illustrate, let us suppose we calculate the number of phenylalanine molecules in 1.00 g of phenylalanine hydrochloride. We can do this as
follows:

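$$N_{\mathrm{Phe}} = 1.00\ \mathrm{g} \times \frac{1\ \mathrm{mol}}{201.65\ \mathrm{g}} \times \left(6.022 \times 10^{23}\ \mathrm{mol^{-1}}\right) \tag{1.1}$$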

Note that the text that follows “#” in the code creates comments. Comments explain the content of the code to a developer and are ignored by the
Python interpreter. Code must be annotated with comments so that it is coherent and can be used by other users and by yourself later.

# assign the constants to variables


avogadro_number = 6.022 * 10 ** 23 # avogadro's number in mol^(−1)
phe_hcl_mw = 201.65 # the molecular weight of L-phenylalanine hydrochloride in g/mol

# calculate the number of Phe molecules


num_phe = 1.00 / phe_hcl_mw * avogadro_number

# print the result with units using the print function in base Python
print('1.00 g of Phe-HCl contains %.2e Phe molecules' %num_phe)

1.00 g of Phe-HCl contains 2.99e+21 Phe molecules

Tip: When writing code, try writing the comments before the code. In the comments, explain what exactly the code should do. Then, start
writing code where the comments appear. This helps break down the code into many smaller problems and ensures the finished code is well
annotated.

Of course, we can do this calculation in one line on a scientific calculator, but we document our calculation using variables rather than direct
quantities. The calculation below is identical to the former, but without context the code is less transparent about what was calculated.
Additionally, performing the calculation across several lines avoids common errors such as missing parentheses.

1.00 / 201.65 * 6.022 * 10 ** 23

2.9863625092982893e+21
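As a variation on the documented calculation, Python also accepts scientific notation directly (6.022e23) and formatted string literals (f-strings) for printing. The sketch below reuses the constants from the example above; the variable name sample_mass is introduced here for illustration only.

```python
# constants from the example above, written in scientific notation
avogadro_number = 6.022e23  # mol^(-1)
phe_hcl_mw = 201.65         # g/mol
sample_mass = 1.00          # g (hypothetical variable name)

# calculate the number of Phe molecules
num_phe = sample_mass / phe_hcl_mw * avogadro_number

# an f-string places the formatted values directly in the text
print(f'{sample_mass:.2f} g of Phe-HCl contains {num_phe:.2e} Phe molecules')
```

The format specifiers after the colon (.2f and .2e) play the same role as %.2e in the earlier print statement.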

Modules are external collections of Python code that can be imported. They are optimized to perform specific tasks. Here, we highlight some mathematical
variables and functions that we can import from the math module. The math module is a built-in module for mathematical functions and
variables not explicitly available in base Python. The functionality of the module can be utilized in a Python script using the import statement. To
access particular functions or variables in the module, we put a period after the module and invoke the object. Below, we compute the factorial of
five.

# We can import entire modules. We can call any functions within the math module.
import math

math.factorial(5)

120

5*4*3*2*1

120

We may want to import only certain items from the module. In this case, we can use the from <module> import <object> syntax. For
instance, we import only the constant π and the natural log function from the math module.

# Sometimes we only want some functions from a module.


# We can specify the few functions/variables we import
from math import pi, log # import the Python variable pi and the natural logarithm function

pi

3.141592653589793

log(5) # natural log of five

1.6094379124341003

So far, we have computed algebraic expressions using mostly base Python. Functions not available explicitly in base Python can be imported from
modules, such as the math module.
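As one more illustration of the math module, the base-10 logarithm function log10 can convert a hydronium ion concentration to pH. This is a minimal sketch; the concentration is a hypothetical value, not one taken from the text.

```python
from math import log10  # base-10 logarithm from the built-in math module

# Hypothetical hydronium ion concentration in mol/L
hydronium_conc = 2.5e-5

# pH is the negative base-10 logarithm of the hydronium concentration
pH = -log10(hydronium_conc)

print('pH = %.2f' % pH)
```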

1.2.1 Practice Problems


Test your understanding with practice problems with some routine chemistry questions. Solve the problem on paper or by calculator. Then put it
into code.

Practice problems are available for download as a Jupyter Notebook file. This notebook documents the code, which can be executed by opening
a session with an interactive kernel. You can open and edit the downloaded notebook by opening it locally in a Jupyter Notebook session or
uploading it to a web service for this task.

A. A Suzuki cross-coupling reaction to synthesize (E)-2-methyl-2,4-nonadiene is shown in FIGURE 1.1 (2).

FIGURE 1.1 The Suzuki Cross-Coupling Reaction (2).

Calculate the mass of each reagent/catalyst to add to run the reaction at 1.0 mmol scale.

# molecular weights
benzodioxaborole_mw = 202.06 # g/mol
bromopropene_mw = 135.00 # g/mol
pd_catalyst_mw = 1155.59 # g/mol

# write the rest of the code

B. The Aspergillus niger L-arabinose reductase LarA catalyzes the reduction of L-arabinose to L-arabitol. The assay for LarA activity with
L-arabinose as the substrate follows Michaelis–Menten kinetics with Km of 54 mM (3). An assay with 0.72 mg of enzyme and 10 mM of
L-arabinose exhibited a velocity of 3.4 units. Calculate Vmax and kcat. Some equations are listed below.

Michaelis–Menten equation

$$v = \frac{V_{\mathrm{max}}[S]}{K_m + [S]} \tag{1.2}$$

in which v is the velocity of the reaction, Vmax is the velocity at excess substrate, Km is the Michaelis–Menten constant, and [S] is the
concentration of the substrate.

turnover number

$$k_{\mathrm{cat}} = \frac{V_{\mathrm{max}}}{[E]} \tag{1.3}$$

in which kcat is the turnover number and [E] is the concentration of the enzyme.

# assign known values to variables


# solve for Vmax
# solve for kcat

C. Suppose we are quantifying chlorophyll a concentration in environmental samples. The specific absorption coefficient ϵ was previously
determined as 84.3 L g⁻¹ cm⁻¹ at 664 nm for organic extract of chlorophyll a (4). Five milliliters of water sample was extracted in ether to a final
volume of 13 mL and yielded an absorbance of 0.31 units at 664 nm with 1 cm path length. Use the Beer–Lambert law to estimate the
chlorophyll a concentration.

# assign known values to variables

# after writing out how to solve the problem, add comments


# to describe the steps to arrive at the original chlorophyll a concentration
# then write code under the comments.

D. Atmospheric carbon dioxide level is estimated to have risen by over 100 ppm since the industrial revolution. Aqueous carbon dioxide is in
equilibrium with aqueous carbonic acid; thus an increase in atmospheric carbon dioxide can lower the pH of water. Acidification of the ocean is a
concern for marine ecosystems because shells and skeletons of marine organisms in coral reefs, which are made of calcium carbonate, can dissolve
at lower pH levels (5). Calculate the change in pH in pure water exposed to the atmosphere due to the increase in the CO2 level from 280 to 380
ppm. Assume that the dissociation from bicarbonate to carbonate is negligible. Relevant equilibrium constants are given (FIGURE 1.2).

Henry’s law

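$$P = K_H\,[\mathrm{CO_2(aq)}] \tag{1.4}$$

where P is the pressure, [CO2(aq)] is the concentration of dissolved carbon dioxide, and KH is Henry’s law constant for carbon dioxide (29.41 atm M⁻¹)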

FIGURE 1.2 Equilibria Involving Aqueous Carbon Dioxide.

# write your code

Appendix A: Solutions to Practice Problems

You should now be equipped to process quantitative data in Python, much like how chemists may use a scientific calculator in research. However,
we likely handle other types of data in our research as well as numerical data. Let us consider next how to process string data using Python.

1.3 STRING OPERATIONS

Strings are text data. They can be manipulated with Python operations. This is useful for extracting data from output files of software.
Biopolymers like proteins or nucleotides can be described using sequences, which is string data. Chemical structures can be stored in Simplified
Molecular Input Line Entry System (SMILES) representation in databases. The SMILES representation will be thoroughly described in SECTION
3.2; here we focus on operations involving strings and use a SMILES string as an example.

Let us do some simple operations on two strings that simply read, “Cellar” and “Door”. The strings are concatenated (joined together) by adding
them together with +.

# save strings to variables

cellar = 'Cellar'
door = 'Door'

# concatenate the two strings

cellar + door

'CellarDoor'

Perhaps we prefer having them merged with a space in between. This can be done with the join method.

# concatenate a list of strings with space between each substring

' '.join([cellar, door])

'Cellar Door'

We joined the words stored in a list with a space (' ').
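The join method accepts any separator string, and the split method performs the opposite operation. A minimal sketch, using a hypothetical list of three-letter residue codes that does not come from the text:

```python
# Hypothetical three-letter residue codes for a short peptide
residues = ['Phe', 'Ala', 'Gly']

# join the list with a hyphen between items
peptide = '-'.join(residues)
print(peptide)

# split() breaks the string at each hyphen and returns a list
print(peptide.split('-'))
```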

1.3.1 Indexing
We use brackets to index the string. Three integers can be placed in brackets, separated by colons (:):
• the start index (inclusive, starting at zero)

• the end index (exclusive)

• increments

Thus, the string subset would look like, string[start_index:end_index:increments].

# print first, second, and third letter of 'Cellar'


print(cellar[0], cellar[1], cellar[2])

C e l

The negative indices begin from the end of a string.

# print second to last and last letter of 'Door'


print(door[-2], door[-1])

o r

Substrings are accessible by specifying or leaving blank the start and end indices.

door[1:] # skip first 1 letter

'oor'

cellar[:4] # include up to first 4 letters

'Cell'

door[1:3] # letters at indices 1 and 2 (excludes 3)

'oo'

cellar[1:-1] # skip first and omit last letter

'ella'

The third position indicates the increment of the string. The default is to print every letter (1).

cellar[::2] # print every other letter

'Cla'

door[::-1] # print the string in reverse

'rooD'
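The same indexing rules can be sketched on a short nucleotide sequence. The nine-letter sequence below is hypothetical and serves only to show how the start index, end index, and increment combine.

```python
# Hypothetical nine-nucleotide sequence (three codons)
seq = 'ATGGCCTAA'

first_codon = seq[0:3]   # indices 0, 1, 2
second_codon = seq[3:6]  # indices 3, 4, 5
last_codon = seq[-3:]    # last three letters
codon_starts = seq[::3]  # every third letter, starting at index 0

print(first_codon, second_codon, last_codon, codon_starts)
```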

With some of the basic syntax in mind, let us apply it to processing nucleotide sequence data.

1.3.2 Primer Design for Polymerase Chain Reaction


The red fluorescent protein (RFP) is widely used in molecular biology as a chromophoric marker of gene expression. Its gene sequence is
available (6). Let us save it to the variable rfp and confirm that it is 876 nucleotides long.

# adjacent string literals inside parentheses are joined into one string
rfp = (
    'agtttcagccagtgacagggtgagctgccaggtattctaacaagatgagttgttccaagaatgtgatcaaggagttcatgaggttcaaggttcgtatggaaggaa'
    'cggtcaatgggcacgagtttgaaataaaaggcgaaggtgaagggaggccttacgaaggtcactgttccgtaaagcttatggtaaccaagggtggacctttgccatt'
    'tgcttttgatattttgtcaccacaatttcagtatggaagcaaggtatatgtcaaacaccctgccgacataccagactataaaaagctgtcatttcctgagggattta'
    'aatgggaaagggtcatgaactttgaagacggtggcgtggttactgtatcccaagattccagtttgaaagacggctgtttcatctacgaggtcaagttcattggggtg'
    'aactttccttctgatggacctgttatgcagaggaggacacggggctgggaagccagctctgagcgtttgtatcctcgtgatggggtgctgaaaggagacatccatat'
    'ggctctgaggctggaaggaggcggccattacctcgttgaattcaaaagtatttacatggtaaagaagccttcagtgcagttgccaggctactattatgttgactcca'
    'aactggatatgacgagccacaacgaagattacacagtcgttgagcagtatgaaaaaacccagggacgccaccatccgttcattaagcctctgcagtgaactcggctc'
    'agtcatggattagcggtaatggccacaaaaggcacgatgatcgttttttaggaatgcagccaaaaattgaaggttatgacagtagaaatacaagcaacaggctttgc'
    'ttattaaacatgtaattgaaaac'
)
len(rfp)

876

Suppose we want to express RFP from a gene inserted into a plasmid in a bacterial vector. We use the polymerase chain reaction (PCR) to amplify
this gene from a template DNA (7). We specify the segment of DNA to amplify by designing the appropriate primers, which are short, synthetically
produced single-stranded DNA molecules.

We need the start and end of the gene to define the region to amplify by PCR. Let us print the first and last 30 nucleotides of the RFP gene.

rfp_start = rfp[:30].upper() # first thirty nucleotides


rfp_end = rfp[-30:].upper() # last thirty nucleotides

print('First 30 nucleotides of RFP:')


print(rfp_start)
print('Last 30 nucleotides of RFP:')
print(rfp_end)

First 30 nucleotides of RFP:


AGTTTCAGCCAGTGACAGGGTGAGCTGCCA

Last 30 nucleotides of RFP:


GCTTTGCTTATTAAACATGTAATTGAAAAC

We use the upper() method to convert the string to uppercase. This is useful for standardizing the case of the sequence, as the operations are case-sensitive.
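To see why standardizing the case matters, consider a minimal sketch that compares a lowercase and an uppercase version of the same hypothetical three-letter sequence.

```python
lower_seq = 'atg'
upper_seq = 'ATG'

# string comparison is case-sensitive, so these are not equal
print(lower_seq == upper_seq)

# after converting to uppercase, the two strings match
print(lower_seq.upper() == upper_seq)
```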

The reverse primer should be in the reverse complement so that it hybridizes to the sense (coding) strand. To reverse a sequence, we take the full
sequence, but with increments of negative one. Inside the brackets, we do not specify the start or end index, only the increment.

# get reverse order


rfp_end_r = rfp_end[::-1]
rfp_end_r

'CAAAAGTTAATGTACAAATTATTCGTTTCG'

Let us use a dictionary to obtain the complement of the reverse sequence. Dictionaries are like lists but indexed by keys. They are specified by
{key0: value0, key1: value1}, and so forth.

complement_dict = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}


complement_dict['A']

'T'

We index the dictionary by “A”, so it returns “T”, the value associated with the key.
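Dictionaries are useful well beyond sequences. The sketch below stores standard atomic masses (rounded to three decimals) by element symbol, adds a new entry, tests whether a key is present, and uses the stored values in a calculation. The variable names are chosen here for illustration.

```python
# standard atomic masses in g/mol, rounded to three decimals
atomic_mass = {'H': 1.008, 'C': 12.011, 'O': 15.999}

# add a new key-value pair
atomic_mass['N'] = 14.007

# check whether a key exists before indexing to avoid a KeyError
print('N' in atomic_mass, 'Cl' in atomic_mass)

# molar mass of water from the stored values
water_mw = 2 * atomic_mass['H'] + atomic_mass['O']
print('%.3f g/mol' % water_mw)
```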

We can iterate over strings using loops. The string is split by letters. By passing these letters, or bases, to the dictionary, we replace the nucleotide
by its complement. Let us explore this syntax, known as a list comprehension, in more depth in a later section.

rfp_end_rc = [complement_dict[x] for x in rfp_end_r]


rfp_end_rc[:10]

['G', 'T', 'T', 'T', 'T', 'C', 'A', 'A', 'T', 'T']

The letter-wise operation returns a list (we show only first 10 letters). Let us join this list back to one string, representing the reverse complement
or the suffix in the 5′ to 3′ direction. We can do this by using the join() method of strings.

rfp_end_rc = ''.join(rfp_end_rc)
rfp_end_rc

'GTTTTCAATTACATGTTTAATAAGCAAAGC'

We joined the list of letters with '' (an empty string with no space). With the two coding sequences in the correct orientation, they can be
trimmed to match the melting temperature, below the elongation temperature of 72 °C. The trimmed sequences are as follows, with calculated
annealing temperature at 62 °C.

rfp_start = 'AGTTTCAGCCAGTGACAG'
rfp_end_rc = 'GTTTTCAATTACATGTTTAATAAGCAAAGC'
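The text does not state how the temperatures were calculated, and the sketch below is not that calculation. It only illustrates one sequence property that influences melting temperature, the fraction of G and C bases, using the count() method of strings. The names gc_count and gc_fraction are introduced here for illustration.

```python
rfp_start = 'AGTTTCAGCCAGTGACAG'

# count() returns the number of occurrences of a substring
gc_count = rfp_start.count('G') + rfp_start.count('C')

# fraction of the primer made up of G and C bases
gc_fraction = gc_count / len(rfp_start)

print('GC content: %.0f%%' % (100 * gc_fraction))
```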

Let us concatenate the subsequences together to get a forward (sense strand) primer with:
• Spacer sequence

• EcoRI site

• NotI site

• Extra base spacer

• XbaI site

• Extra G spacer

• The start of the RFP gene, starting with a start codon

These are components for the prefix of a protein coding sequence in BioBrick construct (8).

spacer1 = 'GTTTCTTC'
EcoRI = 'GAATTC'
NotI = 'GCGGCCGC'
spacer2 = 'T'
XbaI = 'TCTAGA'
spacer3 = 'G'

# The forward primer is made by appending the above strings in order


forward = spacer1 + EcoRI + NotI + spacer2 + XbaI + spacer3 + rfp_start
forward

'GTTTCTTCGAATTCGCGGCCGCTTCTAGAGAGTTTCAGCCAGTGACAG'

We append strings together using + between strings.
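One way to check the assembled primer is to search it for the restriction sites. The sketch below uses the in operator and the find() method of strings on the forward primer printed above. The BamHI site (GGATCC) is included only as an example of a sequence that is absent from this primer.

```python
forward = 'GTTTCTTCGAATTCGCGGCCGCTTCTAGAGAGTTTCAGCCAGTGACAG'

# membership test: is the EcoRI site present?
print('GAATTC' in forward)

# find() returns the index of the first match, or -1 if there is none
ecori_index = forward.find('GAATTC')  # EcoRI site
xbai_index = forward.find('TCTAGA')   # XbaI site
bamhi_index = forward.find('GGATCC')  # BamHI site, not part of this primer

print(ecori_index, xbai_index, bamhi_index)
```

The returned indices follow the same zero-based convention as string indexing, so forward[ecori_index:ecori_index + 6] recovers the EcoRI site.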

Let us also design the reverse primer consisting of:


• The last 30 nucleotides of the coding sequence, with stop codon

• An additional stop codon

• SpeI site

• Extra base spacer

• NotI site

• PstI site

• Spacer

These components are for the suffix of an insert for a BioBrick construct (8). However, we need the sequence to be the reverse complement for
proper amplification. Because we already obtained the end of the coding sequence as a reverse complement, we omit it here and append it at the end.

stop_codon = 'TAA'
SpeI = 'ACTAGT'
spacer1 = 'A'
NotI = 'GCGGCCGC'
PstI = 'CTGCAG'
spacer2 = 'AAGAAAC'

reverse = stop_codon + SpeI + spacer1 + NotI + PstI + spacer2


reverse

'TAAACTAGTAGCGGCCGCCTGCAGAAGAAAC'

Follow the same operation to obtain the reverse complement of the flanking region of the reverse primer.

reverse_r = reverse[::-1] # reverse the string


reverse_rc = [complement_dict[x] for x in reverse_r] # get the complement as a list
reverse_rc = ''.join(reverse_rc) # concatenate the letters together in one string
reverse_rc

'GTTTCTTCTGCAGGCGGCCGCTACTAGTTTA'
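As an alternative to the dictionary and list comprehension, the reverse complement can be sketched with the built-in str.maketrans() and translate() methods. This is a different technique from the one used in the text; the names complement_table and reverse_rc_alt are introduced here for illustration.

```python
# flanking region of the reverse primer, copied from the output above
reverse = 'TAAACTAGTAGCGGCCGCCTGCAGAAGAAAC'

# translation table mapping each base to its complement
complement_table = str.maketrans('ATCG', 'TAGC')

# complement every base, then reverse the resulting string
reverse_rc_alt = reverse.translate(complement_table)[::-1]

print(reverse_rc_alt)
```

Taking the complement first and reversing second gives the same string as reversing first and taking the complement second.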

To the flanking region of the reverse primer, append the reverse complement of the end of the RFP gene.

reverse_rc += rfp_end_rc # same as >>> reverse_rc = reverse_rc + rfp_end_rc


reverse_rc

'GTTTCTTCTGCAGGCGGCCGCTACTAGTTTAGTTTTCAATTACATGTTTAATAAGCAAAGC'

The designed primers can be synthesized and used to amplify the RFP gene for cloning into a BioBrick construct.

We have used basic string operations to document how we design primers for the PCR amplification of the RFP gene. Working with Python variables
allows us to save and organize sections of nucleotide sequences. We can modify and join these sequences using string methods.

1.3.3 Practice Problems


Test your understanding through practice problems. If it helps, you can sketch out the operations on paper and then implement them in code.

A. The gene sequence of the sense strand (coding side) encoding RFP was given. Print the antisense strand (reverse complement) of the whole
open reading frame.

# write the rest of the code

B. Suppose you are designing a guide RNA to knock out the Arabidopsis thaliana transcription factor TRY (9). The three exons of TRY are given (10).
SpCas9 recognizes the protospacer adjacent motif (PAM), NGG and CCN (two consecutive guanines or cytosines plus one nucleotide). Find
the index of at least one PAM in each of the exons of the TRY gene using the find() method of strings.

# adjacent string literals inside parentheses are joined into one string
exon1 = (
    'ACAAAGTTAGCCTTCAAAATACTTACAAATCCCAATAAAAGACTTCATCTCCATGTGTATTTGAGTGTCAACGACAAGT'
    'CTACACAAAGGGTAAGAGGTCAACAAGACCACACAACACTTCTTACTATTAGTTTTGCAAAGGCCGTTCGTTGGACATTTCCT'
    'TCTCTCTCCTCCCCTCTTCTTCTTCTTGTTCGCTCTATAAACTCTCATCTCTCACGTCTTTTTTTCCTTACATTCTCCAAACT'
    'CAAAATTTCATCACATTAATTTCTCTCTATTTTTCTTTTCTTACTTCAATAGTAATGGATAACACTGACCGTCGTCGCCGTCG'
    'TAAGCAACACAAAATCGCCCTCCATGACTCTGAAG'
)
exon2 = (
    'AAGTGAGCAGTATCGAATGGGAGTTTATCAACATGACTGAACAAGAAGAAGATCTCATCTTTCGAATGTACAGACT'
    'TGTCGGTGATAG'
)
exon3 = (
    'GTGGGATTTGATAGCAGGAAGAGTTCCTGGAAGACAACCAGAGGAGATAGAGAGATATTGGATAATGAGAAACAGTGAAG'
    'GCTTTGCTGATAAACGACGCCAGCTTCACTCATCTTCCCACAAACATACCAAGCCTCACCGTCCTCGCTTTTCTATCTATCCT'
    'TCCTAGTGTTTTTGTTTTTAAGCCAACGAAAAAAGAAAATAAAAAAATTATAATAGATGTATAGTAGTGGTTCTTGTTAGTTT'
    'GAAGAATTCATCATCTATTGTTTTCTTTTTGTTGTTATTTCATTTATAATTTTTATAGTATAGGTTTCATTTGGTAATCAACT'
    'TTAATCCATGCGGTTAGGTTTTTTTATTTTCTCGTCTACGACTTTTATATCCACAACTAGATTTTAATCCGCGGTATATCGCG'
    'GTATAATTTACTTTTTAAAGTTAATATATATTAAAACTTG'
)

# write the rest of the code

C. Suppose we have a list of organic compounds to subject to further analysis. We have their SMILES structures, but notice some of the entries
are complexed with smaller ions. We want to retain only the molecule with the longest SMILES structures. Disconnected structures are separated
by periods (.) in SMILES. Use the string method split() to obtain a list of structures for D-glucosamine sulfate, determine the length of each
SMILES string, and print the SMILES with the ion removed.

d_glucosamine_sulfate = 'C([C@H]([C@H]([C@@H]([C@H](C=O)N)O)O)O)O.OS(=O)(=O)O'
# write the rest of the code

D. You are extracting optimized energy values from a geometry optimization calculation using the QUICK program (11, 12). The line containing
the total energy value is given. Extract the total energy value as a float.

© 2022 American Chemical Society 16


https://pubs.acs.org/doi/book/10.1021/acsinfocus.7e5030

saveline = ' TOTAL ENERGY = -884.004943174\n'


# write the rest of the code

Appendix A: Solutions to Practice Problems

You should be familiar with string data and some operations to process them. Try to identify where you handle string data in research, and how it
may be processed using Python string operations. Let us begin organizing these numerical and string operations into Python functions.

1.4 FUNCTIONS

So far, we have worked with one-liners or short blocks of script. This is great, but what if we want to apply the same script to a new data set? Copying
and pasting the script to a new file and modifying it for each use produces an unnecessary number of scripts and introduces the possibility
of errors. Instead, we can organize the operations in a modular fashion by defining functions. Functions take arguments (inputs) of various
types and return an output.

Previous operations may have felt elementary. We were solving chemical problems; however, we used the Python scripting language as a generic
scientific calculator or text editor (which is fine). Writing functions in Python as a chemist is like building your own chemical calculator.

Let us take a trivial function and build it up to something more practical.

def trivial_function():
    return 'this function is trivial'

trivial_function()

'this function is trivial'

We define the function using the def keyword followed by the function name. The function name is followed by parentheses and a colon. The
operations inside the function are indicated by one level of indentation. The function terminates either when it reaches the last indented line
or when it reaches the first return statement.
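The termination rule can be checked directly. The sketch below is an added illustration, and both function names are hypothetical: the first function stops at its first return statement, so the second return is never reached; the second function has no return statement at all, in which case Python returns None.

```python
def first_return_wins():
    '''stops at the first return statement'''
    return 'reached the first return'
    return 'never reached'  # this line is never executed

def no_return():
    '''runs to the last indented line without returning a value'''
    unused = 1 + 1

print(first_return_wins())
print(no_return())
```

A function that ends without a return statement is therefore still callable; it simply hands back None.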

Let us define a new function that takes an argument. The following function returns the square plus one of a number. The argument val is a local
variable. We do not define it before defining a function. Instead, it is defined when we pass an argument to the function.

def square_plus_one(val):
    return val ** 2 + 1

square_plus_one(3)

10

Functions can have multiple arguments. Let us define a function that calculates electrostatic force using Coulomb’s law:

$$F = k_e \frac{q_1 q_2}{r^2} \tag{1.5}$$

in which $k_e$ is Coulomb's constant, $q_i$ is the charge of particle i, and $r$ is the distance between the particles.

# define relevant constants
coulomb_constant = 8.988 * 10 ** 9 # N m^2 / C^2

# define the function for Coulomb force
def coulomb_force(q1, q2, r):
    '''q1 (float): charge of particle 1 in Coulombs
    q2 (float): charge of particle 2 in Coulombs
    r (float): distance between particles in meters''' # add a block of text to explain the function, using triple quotes
    return coulomb_constant * q1 * q2 / r ** 2

force = coulomb_force(10**(-5), -2 * 10**(-5), 0.1)

print('The force between particles with charges %.1e C and %.1e C at %.1f m is %.2e N.' % (10**(-5), -2 * 10**(-5), 0.1, force))

The force between particles with charges 1.0e-05 C and -2.0e-05 C at 0.1 m is -1.80e+02 N.
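With several arguments, it is easy to pass values in the wrong order. As a hedged aside that is not part of the original listing, the sketch below repeats the Coulomb function and calls it once by position and once by keyword; the variable names force_positional and force_keyword are made up for this illustration.

```python
coulomb_constant = 8.988 * 10 ** 9 # N m^2 / C^2

def coulomb_force(q1, q2, r):
    '''electrostatic force in N between charges q1 and q2 (C) separated by r (m)'''
    return coulomb_constant * q1 * q2 / r ** 2

# positional call: the order of the values decides which argument receives them
force_positional = coulomb_force(10**(-5), -2 * 10**(-5), 0.1)

# keyword call: the names decide, so the order is free
force_keyword = coulomb_force(r=0.1, q1=10**(-5), q2=-2 * 10**(-5))

print(force_positional == force_keyword)
```

Keyword calls are longer to type, but they make the physical meaning of each number visible at the call site.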

We can define default values for arguments so that if they are not passed, the function assumes these values. For instance, we can define a function
which returns relative probability based on free energy difference:

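$$\frac{p_1}{p_2} = \exp\left(-\frac{F_1 - F_2}{k_B T}\right) \tag{1.6}$$

in which $p_1/p_2$ is the probability of state 1 relative to state 2, $F_i$ is the free energy of state i, $k_B$ is the Boltzmann constant, and $T$ is the temperature.

We may assume a default temperature of 300 K, which is close to room temperature.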

# we need to import the exponential function
from math import exp

# define relevant constants
kB = 1.38 * 10 ** (-23) # J/K
avogadro_number = 6.022 * 10 ** 23 # Avogadro's number in mol^(-1)
# convert kB from J/K to kcal / (mol K)
kB = kB * avogadro_number / 4184.00

def relative_probability(F1, F2, T=300):
    '''Returns probability of state 1 relative to state 2
    F1 (float): free energy of state 1 in kcal/mol
    F2 (float): free energy of state 2 in kcal/mol
    T (float): temperature of system in K'''
    return exp((F2 - F1) / (kB * T))

# suppose state 2 has a relative free energy of 1.5 kcal/mol to state 1
relative_probability(0.0, 1.5)

12.395578077607523

Suppose we want to vary the temperature as well. We pass the temperature as a third argument, which overrides the default value.

temp = 250
print('At %d K: %.1f' %(temp, relative_probability(0.0, 1.5, temp)))

temp = 400
print('At %d K: %.1f' %(temp, relative_probability(0.0, 1.5, temp)))

At 250 K: 20.5
At 400 K: 6.6
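The default value can also be overridden by name. The following sketch is an added illustration under the same constants as the text; the value 310 K is an assumed example (roughly body temperature), and the variable names are hypothetical.

```python
from math import exp

# Boltzmann constant converted from J/K to kcal / (mol K), as in the text
kB = 1.38 * 10 ** (-23) * 6.022 * 10 ** 23 / 4184.00

def relative_probability(F1, F2, T=300):
    '''probability of state 1 relative to state 2 (free energies in kcal/mol, T in K)'''
    return exp((F2 - F1) / (kB * T))

default_call = relative_probability(0.0, 1.5)          # T falls back to 300 K
explicit_call = relative_probability(0.0, 1.5, T=300)  # same value, stated explicitly
warmer_call = relative_probability(0.0, 1.5, T=310)    # 310 K is an assumed example value

print(default_call == explicit_call)
print(warmer_call < default_call)
```

Raising the temperature brings the ratio closer to 1, because thermal energy makes the higher-energy state more accessible.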

Functions are not limited to numerical operations. Let us take the SMILES of cortisone and produce the SMILES of its enantiomer. Feel free to
paste the SMILES into software such as ChemDraw to confirm that it does in fact encode the structure of cortisone. We define a function
that inverts all stereogenic centers. The configuration of stereogenic centers is communicated by @ and @@.

cortisone = 'O=C(C=C1CC[C@@]2([H])[C@]3([H])CC[C@@](O)([C@]3(C4)C)C(CO)=O)CC[C@]1(C)[C@@]2([H])C4=O'

def enantiomer(smiles):
    '''return the SMILES of the enantiomer
    smiles (string): input structure SMILES'''
    # first replace all @ with @@. This will cause @@ to become @@@@
    switch1 = smiles.replace('@', '@@')
    # because @@@@ was originally @@, we replace it with @
    return switch1.replace('@@@@', '@')

enantiomer(cortisone)

'O=C(C=C1CC[C@]2([H])[C@@]3([H])CC[C@](O)([C@@]3(C4)C)C(CO)=O)CC[C@@]1(C)[C@]2([H])C4=O'

Of course, we can do the operation in one line as below, but it is less obvious than if we name the function as enantiomer(...). The string
operation replace(...) itself has no chemical sense. We assign chemical meaning to it by organizing it in the function.

cortisone.replace('@', '@@').replace('@@@@', '@')

'O=C(C=C1CC[C@]2([H])[C@@]3([H])CC[C@](O)([C@@]3(C4)C)C(CO)=O)CC[C@@]1(C)[C@]2([H])C4=O'
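One quick sanity check on such a function is that inverting twice should give back the input. The sketch below is an added illustration rather than part of the original text; it uses the shorter L-alanine SMILES that appears elsewhere in this chapter, and the variable names are hypothetical.

```python
def enantiomer(smiles):
    '''return the SMILES with every @ and @@ swapped'''
    switch1 = smiles.replace('@', '@@')
    return switch1.replace('@@@@', '@')

l_alanine = 'N[C@@H](C)C(O)=O'
d_alanine = enantiomer(l_alanine)

print(d_alanine)
print(enantiomer(d_alanine) == l_alanine)
```

The check only confirms that the string operation is its own inverse; whether the resulting SMILES is chemically sensible still has to be judged by the chemist.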

A lambda function defines a small, unnamed function in one line. Its body is a single expression, whose value is returned. It follows the syntax: lambda input: expression

cube = lambda x: x ** 3

cube(4)

64
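Lambda functions are most useful when a short function is needed only once, for example as the key of a sort. The sketch below is an added illustration: it sorts a small dictionary of exact masses by mass, using the built-in sorted() with a lambda that picks the mass out of each (element, mass) pair. The name by_mass is hypothetical.

```python
exact_masses = {'C': 12.000000, 'H': 1.007825, 'O': 15.994915, 'N': 14.003074}

# sorted() accepts a key function; the lambda returns the mass of each (element, mass) pair
by_mass = sorted(exact_masses.items(), key=lambda pair: pair[1])

print(by_mass)
```

Writing the key as a lambda avoids defining a named helper that would be used in only one place.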

While developing, you may write out the names of all the functions you plan to have, then implement them individually. A function with an empty
body is a syntax error, which prevents you from testing the functions that are already implemented. During development, put a pass statement in
unfinished functions to avoid this error.

def identity_function(value):
    return value

def inverse(value):
    pass

identity_function(7.5)

7.5

We observe the inverse function is not yet implemented, but we can properly test the identity function during development.
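A placeholder written with pass can still be called, and it silently returns None. As a hedged alternative that the original text does not cover, a placeholder can instead raise Python's built-in NotImplementedError so that an accidental call is noticed; the name inverse_strict is hypothetical, and the try/except syntax is used here only to show the error being caught.

```python
def inverse(value):
    pass

def inverse_strict(value):
    # hypothetical alternative: fail loudly if the placeholder is called by mistake
    raise NotImplementedError('inverse_strict is not implemented yet')

print(inverse(7.5))  # a function whose body is only pass returns None

try:
    inverse_strict(7.5)
except NotImplementedError as error:
    print('Caught:', error)
```

Which style to use is a matter of taste: pass keeps the script running, while the explicit error prevents a forgotten placeholder from going unnoticed.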

We have taken the numerical and string operations and organized them into functions. Functions allow chemists to assign chemical meaning to
the operations implemented in code. Once defined, we can use the function an arbitrary number of times. Therefore, we do not have to rewrite
the Python code every time we want to perform a particular operation.

1.4.1 Practice Problems


To test your understanding of how functions are structured in Python, try out some practice problems.

A. Write a function that returns the index of a tobacco etch virus (TEV) protease recognition site in a peptide sequence. The TEV site is the
sequence, ENLYFQ. While the sequence, ENLYFQS exhibits the greatest catalytic efficiency, the last position can also be G, A, M, C, or H. Run
the function on the peptide sequence of recombinant sarafotoxin (13).

recombinant_sarafotoxin = (
    'MKDDAAIQQTLAKMGIKSSDIQPAPVAGMKTVLTNSGVLYITDDGKHIIQGPMYDVSGTAPVNVTNKMLLKQLNALE'
    'KEMIVYKAPQEKHVITVFTDITCGYCHKLHEQMADYNALGITVRYLAFPRQGLDSDAEKEMKAIWCAKDKNKAFDDVMAGKSVAP'
    'ASCDVDIADHYALGVQLGVSGTPAVVLSNGTLVPGYQPPKEMKEFLDEHQKMTSGKGSTSGSGHHHHHHGTMTSLYKKAGLENLYF'
    'QCTCKDMTDKECLYFCHQDIIW'
)

# write the rest of the code

B. Write two functions: the first function returns the reaction quotient Q, the second is the Nernst equation which returns the reduction
potential.

Nernst equation:

$$\Delta E = \Delta E^\circ - \frac{RT}{nF} \ln Q \tag{1.7}$$

where ΔE is the change in reduction potential, ΔE° is the change in standard reduction potential, R is the ideal gas constant (8.314 J·mol⁻¹·K⁻¹), T is
the temperature, n is the number of electrons involved, F is Faraday's constant (96.49 kJ·V⁻¹·mol⁻¹), and Q is the reaction quotient.

Calculate the reduction potential of the following reaction, in which [Ag+] = 0.04 mM and [Mn2+] = 0.13 mM.

FIGURE 1.3 Electron Transfer of Manganese to Silver.

# write the rest of the code

C. The source code of a function is given but is written poorly. The function and variable names are cryptic and there is no annotation.
Retain the behavior of the function but update it in a readable way. Use the updated function to solve the following problem:
3-Hydroxy-1-(naphthalen-1-yl)pent-4-en-1-one (C15H14O2) was synthesized. The exact mass of the sodiated adduct was determined to be
249.0881 via high-resolution mass spectrometry (HRMS). Determine whether the measurement is consistent with calculated exact mass (14).

exact_masses = {'C': 12.000000, 'H': 1.007825, 'O': 15.994915, 'N': 14.003074, 'Na': 22.989770}

def op(w, x, y, z):
    w+x+y+z
    x = 12.000000 * w + 1.007825 * x + 14.003074 * z + 15.994915 * y
    return x

def operation(x, y):
    return x - (y + 22.989770)

D. The radioactive decay of carbon-11 to boron-11 has a half-life of 20.364 min. Radioactive decay follows first-order kinetics, $N_t = N_0 e^{-kt}$, for which $N_i$ is
the number of atoms at time i, k is the radioactive decay constant, and t is the time. The occupational derived air concentration (DAC) of
carbon-11 dioxide is 0.03 nCi/mL. Suppose a measurement of 0.03 nCi/mL is considered acceptable at a workplace; however, 0.17 nCi/mL
was measured. Operations need to be shut down for 51 min to return to 0.03 nCi/mL; however, something is wrong in the code below that
calculates this. Debug the script.

from math import log

def k_from_halflife(t_half):
    '''Calculate decay constant k from halflife in inverse time'''
    return - log(1/2 / (t_half))

def time_relative(fraction, k):
    '''Calculate the time taken for radioactive decay with decay constant k
    to achieve a fraction of the initial amount.'''
    return log(fraction)/ k

halflife = 20.364 # minutes

k = k_from_halflife(t_half):
time_shutdown = time_relative(0.03/0.17, k)
print(time_shutdown, 'minutes')

Appendix A: Solutions to Practice Problems

You should now be able to write custom functions to perform operations not explicitly implemented in base Python. The function defined
once can be run on many inputs. However, our data is heterogeneous and not all input data may be processed in one function. Conditional
statements can implement control flow to dynamically respond to exceptions in the input data.

1.5 CONDITIONAL STATEMENTS

Conditional statements allow us to test a variable against a value and perform one action if the condition is met or another action if it is
not. The variables that control this kind of statement are called Booleans.

1.5.1 Boolean Variables


We encode logic using conditional statements. Conditional statements run a block of code if the statement is true and skip it otherwise. The
Boolean data type represents the true/false binary. There are no partial truths. A statement is either true or false, and nothing else.

Try the following algebra problem.

# Compute the following: 13 * 17 + 19 = x
answer = 240

if answer == 13 * 17 + 19:
    print('Correct')
else:
    print('Incorrect')

Correct

Here two conditional statements are used. The if statement receives a Boolean. If the statement is true, then it prints “Correct”. The statement is:
the answer equals 13 times 17 plus 19.

Another statement used here is the else statement. If the prior if statement is false, the code block of the else statement is executed instead. The
if-block and else-block are mutually exclusive.

Mutually exclusive blocks of code can be strung together using the elif statement, short for “else if ”. Suppose we want to assess whether an integer
is divisible by 16 or 4.

# the integer to test (it could also be obtained from user input and converted to an integer)
an_integer = 88

# if the mod of the integer by 16 equals zero (if the integer is divisible by 16)
if an_integer % 16 == 0:
    print('%d is divisible by 16.' %an_integer)
# else if the mod of the integer by 4 equals zero
elif an_integer % 4 == 0:
    print('%d is divisible by 4.' %an_integer)
else:
    print('%d is divisible by neither 16 nor 4.' %an_integer)

88 is divisible by 4.

If the integer is divisible by 16, then it is divisible by 4. Thus, it is unnecessary to run the code block specifying the integer is divisible by 4, so if
the first block executes, we want to skip the rest. There is no limit to the number of elif statements to follow an if statement.
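Because only the first true branch runs, the order of the tests matters. The sketch below is an added illustration with hypothetical function names: it wraps the divisibility test in two functions that differ only in the order of the branches, and shows that testing the less specific condition first makes the branch for 16 unreachable.

```python
def describe(an_integer):
    '''most specific test first'''
    if an_integer % 16 == 0:
        return 'divisible by 16'
    elif an_integer % 4 == 0:
        return 'divisible by 4'
    else:
        return 'divisible by neither 16 nor 4'

def describe_wrong_order(an_integer):
    '''less specific test first: the branch for 16 can never run'''
    if an_integer % 4 == 0:
        return 'divisible by 4'
    elif an_integer % 16 == 0:
        return 'divisible by 16'
    else:
        return 'divisible by neither 16 nor 4'

print(describe(32))
print(describe_wrong_order(32))
```

When the conditions overlap, place the most restrictive one first.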

Let us list some numerical comparisons.

# equals
5 == 7

False

# not equal
5 != 7

True

# less than
5 < 7

True

# greater than
5 > 7

False

# less than or equal to
5 <= 7

True

# greater than or equal to
5 >= 7

False

Strings can also be compared.

sierra = 'Sierra'
nevada = 'Nevada'

sierra == nevada

False

sierra != nevada

True

We can also search whether a subsequence is found within another sequence. Take the SMILES for L-alanine. Let us observe whether it contains a
carboxylic acid.

ala = 'N[C@@H](C)C(O)=O'
acid = 'C(O)=O'

acid in ala

True

acid not in ala

False

1.5.2 Logic Gates


We handle multiple Booleans using logic gates. Let A and B be two statements. A AND B is true only if both A and B are true. A OR B is true if
at least one of A or B is true. The NOT gate flips true and false, so NOT A is true if A is false (FIGURE 1.4).

FIGURE 1.4 Venn Diagrams for AND, OR, and NOT Logic Gates.

# AND gate
True and True

True

True and False

False

False and False

False

# OR gate
True or True

True

False or True

True

False or False

False

# NOT gate
not True

False

not False

True

Other logic gates can be implemented as a combination of these logic gates. For instance, consider the exclusive-or gate XOR, in which A XOR B
is true if exactly one of A or B is true. This can be implemented by (A OR B) AND (NOT (A AND B)).

def xor(A, B):
    '''exclusive or of booleans A and B'''
    return (A or B) and (not (A and B))

print('xor(True, True):', xor(True, True))
print('xor(True, False):', xor(True, False))
print('xor(False, True):', xor(False, True))
print('xor(False, False):', xor(False, False))

xor(True, True): False
xor(True, False): True
xor(False, True): True
xor(False, False): False
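For two Booleans, exclusive-or gives the same result as the inequality comparison introduced earlier. The sketch below is an added illustration that checks the four cases one by one.

```python
def xor(A, B):
    '''exclusive or of booleans A and B'''
    return (A or B) and (not (A and B))

# for Booleans, A XOR B is true exactly when A and B differ
print(xor(True, True) == (True != True))
print(xor(True, False) == (True != False))
print(xor(False, True) == (False != True))
print(xor(False, False) == (False != False))
```

Writing the gate out with and, or, and not remains a useful exercise, but in practice A != B is the shorter way to express it for Boolean inputs.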

An amino acid contains an amine and a carboxylic acid. An amine is an alkyl derivative of ammonia. Let us take a simplistic definition and
consider all nonaromatic organic nitrogens that are not amides as amines. Let us see whether aniline is an amino acid.

acid = 'C(O)=O'
amide = 'NC=O'

def is_amino_acid(smiles):
    '''returns boolean for whether the provided SMILES is an amino acid'''
    contains_acid = acid in smiles # boolean for whether acid motif is present
    no_amide = smiles.replace(amide, '') # remove amide motif
    contains_amine = 'N' in no_amide # boolean for whether a nonaromatic nitrogen is present after removing amides
    return contains_acid and contains_amine

aniline = 'Nc1ccccc1'

if is_amino_acid(aniline):
    print('Aniline is an amino acid')
else:
    print('Aniline is not an amino acid')

Aniline is not an amino acid

if is_amino_acid(ala):
    print('L-alanine is an amino acid')
else:
    print('L-alanine is not an amino acid')

L-alanine is an amino acid

We defined two Booleans, whether an amine was present and whether a carboxylic acid was present. We passed them through an AND gate, and
determined that aniline was not an amino acid. L-alanine, on the other hand, was determined to be an amino acid.
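A caveat of this simplistic approach is that it matches text, not chemistry. The same molecule can be written as more than one valid SMILES string, and a substring search only recognizes the spelling it was given. The sketch below is an added illustration using two SMILES strings for glycine; the variable names are hypothetical.

```python
acid = 'C(O)=O'
amide = 'NC=O'

def is_amino_acid(smiles):
    '''returns boolean for whether the provided SMILES is an amino acid'''
    contains_acid = acid in smiles
    no_amide = smiles.replace(amide, '')
    contains_amine = 'N' in no_amide
    return contains_acid and contains_amine

glycine_a = 'NCC(O)=O'   # acid written as C(O)=O
glycine_b = 'NCC(=O)O'   # same molecule, acid written as C(=O)O

print(is_amino_acid(glycine_a))
print(is_amino_acid(glycine_b))
```

The second call returns False even though the molecule is unchanged, so string matching of this kind is only reliable when all inputs follow one consistent SMILES convention.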

With conditional statements, we can implement logic in how we process the data. By specifying the cases and the blocks of code to execute
for each case, we can break problems down into smaller components. Therefore, we do not have to find a single solution that works on all input
data; instead, we collect individual solutions together through conditional statements.

1.5.3 Practice Problems


We handled a new type of object, Booleans (True/False). Test your understanding of how to use Booleans through practice problems.

A. We use the rectified linear unit (ReLU), which is a popular activation function in deep learning. Activation functions introduce nonlinearity to
the model. The ReLU function is shown. Implement the ReLU function which accepts a scalar.

$$\mathrm{ReLU}(x) = \max(0, x) \tag{1.8}$$

# write the rest of the code

B. Chemical shifts measured by proton nuclear magnetic resonance (NMR) spectroscopy are correlated to functional groups. Write a function which returns
possible functional groups based on a chemical shift. Some functional groups and the interval in which they typically appear are listed.
• aliphatic: 0.5–2.0 ppm

• allylic: 1.5–2.5 ppm

• CH2-X: 2.5–4.5 ppm

• ROH: 0.5–5.0 ppm

• vinylic: 4.5–6.5 ppm

• aromatic: 6.0–8.5 ppm

# write the rest of the code

Appendix A: Solutions to Practice Problems

Consider what we have accomplished automating so far. Numerical and string operations can be encoded and organized into functions. However,
our data may contain exceptions, so we use conditional statements to define how various inputs can be processed. To execute these automated
operations on a large volume of data, we learn loops.

1.6 LOOPS

You might be wondering why you need to go through the trouble of learning Python if you already have a scientific calculator and a text editor
available. Certainly, manually solving problems is a great short-term solution. Programming is a long-term solution. Suppose you spend one hour
writing a script to complete a task that takes 2 min manually. You would have saved time if you repeated the operations more than 30 times, and
now you can consider inputs many orders of magnitude greater in quantity.

The strength of programming is to automate repetitive tasks in loops. Loops are how we run operations at a massive scale and obtain insight from
large data. The two types of loops we will consider are the for-loop and while-loop. A for-loop repeats the operation over an iterable such as a list.
A while-loop repeats the operation until a certain condition is met.

1.6.1 For-Loop
As an example, given a list of single point energy values in Hartrees, let us convert them to units of kcal/mol. We iterate over the list of energy values
in a for-loop, and save the converted values in a new list.

# single point energy values in Hartree
spes = [-443.19225788, -443.225590922, -443.199437861,
        -443.239444863, -443.217935422, -443.220253946,
        -443.236066571, -443.22927076, -443.186725894, -443.232267518]
kcal_mol_over_hartree = 627.5094740631 # kcal / (mol Hartree)
# initialize the list of energy in kcal/mol
spes_in_kcal_mol = []

for the_energy in spes:
    # use conversion factor to obtain value in kcal/mol
    energy_in_kcal_mol = the_energy * kcal_mol_over_hartree
    # append resulting value in the new list
    spes_in_kcal_mol.append(energy_in_kcal_mol)

spes_in_kcal_mol

[-278107.34065111657,
-278128.25745077094,
-278111.8461572177,
-278136.95093000156,
-278123.45355199225,
-278124.9084477681,
-278134.8310197654,
-278130.56658397894,
-278103.86927749123,
-278132.4470780154]

In the for-loop, we define a local variable for each item of the iterable, in this case the_energy. The instructions to be repeated are indented. For
each item in the list of energies in Hartree, we convert the units and append them to a new list, using the append(...) method of lists.

1.6.1.1 List Comprehension


List comprehension is a concise means to utilize a for-loop. We place the for-loop inside a list to return a new list. Let us perform the same
operation using a list comprehension.

spes_in_kcal_mol = [the_energy * kcal_mol_over_hartree for the_energy in spes]


spes_in_kcal_mol

[-278107.34065111657,
-278128.25745077094,
-278111.8461572177,
-278136.95093000156,
-278123.45355199225,
-278124.9084477681,
-278134.8310197654,
-278130.56658397894,
-278103.86927749123,
-278132.4470780154]

The syntax inside the list is: <the operation> for <local variable> in <iterable>. We might read it like: Do this for each item of this list.
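The operation inside a list comprehension can be any expression. As an added illustration, the sketch below converts the first three energies from the list above into energies relative to the lowest one, in kcal/mol; the names lowest and relative_kcal_mol are hypothetical.

```python
# first three single point energies from the example above, in Hartree
spes = [-443.19225788, -443.225590922, -443.199437861]
kcal_mol_over_hartree = 627.5094740631 # kcal / (mol Hartree)

# the lowest (most negative) energy serves as the reference
lowest = min(spes)

# subtract the reference and convert units for each item of the list
relative_kcal_mol = [(the_energy - lowest) * kcal_mol_over_hartree for the_energy in spes]

print(relative_kcal_mol)
```

Relative energies are often easier to read than absolute ones, because the large common offset cancels.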

1.6.1.2 Iterables
For-loops can iterate over other iterables, such as sets, dictionaries, and tuples. Sets are nonredundant collections of objects. As an example, let us
look at the sequence of Saccharomyces cerevisiae alcohol dehydrogenase ADH1 (15). We convert the string into a set to observe the unique amino
acids present.

adh1 = (
    'MSIPETQKGVIFYESHGKLEYKDIPVPKPKANELLINVKYSGVCHTDLHAWHGDWPLPVKLPLVGGHEGAGVVVGMGENVKGWKI'
    'GDYAGIKWLNGSCMACEYCELGNESNCPHADLSGYTHDGSFQQYATADAVQAAHIPQGTDLAQVAPILCAGITVYKALKSAN'
    'LMAGHWVAISGAAGGLGSLAVQYAKAMGYRVLGIDGGEGKEELFRSIGGEVFIDFTKEKDIVGAVLKATDGGAHGVINVSVS'
    'EAAIEASTRYVRANGTTVLVGMPAGAKCCSDVFNQVVKSISIVGSYVGNRADTREALDFFARGLVKSPIKVVGLSTLPEIYE'
    'KMEKGQIVGRYVVDTSK'
)
set(adh1)

{'A',
'C',
'D',
'E',
'F',
'G',
'H',
'I',
'K',
'L',
'M',
'N',
'P',
'Q',
'R',
'S',
'T',
'V',
'W',
'Y'}

len(set(adh1))

20

ADH1 contains all 20 standard amino acids. Let us make a set from the SMILES of reduced nicotinamide adenine dinucleotide (NADH).

nadh = ('O=C(N)C1CC=C[N](C=1)[C@@H]2O[C@@H]([C@@H](O)[C@H]2O)COP([O-])'
        '(=O)OP(=O)([O-])OC[C@H]5O[C@@H](n4cnc3c(ncnc34)N)[C@H](O)[C@@H]5O')
set(nadh)

{'(',
')',
'-',
'1',
'2',
'3',
'4',
'5',
'=',
'@',
'C',
'H',
'N',
'O',
'P',
'[',
']',
'c',
'n'}

Suppose we want to know all the heavy atom elements present. We can save the item if the letter is alphabetical. We iterate through each item of
the set in a for-loop.

# initialize list
heavy_atoms = []

# iterate through each unique letter in the SMILES of NADH
for letter in set(nadh):
    # if the letter is alphabetical and not a hydrogen
    if letter.isalpha() and letter.lower() != 'h':
        # save the letter to the list
        heavy_atoms.append(letter)

heavy_atoms

['P', 'c', 'O', 'N', 'n', 'C']

NADH contains carbon, phosphorus, nitrogen, and oxygen. The lowercase elements represent atoms in aromatic motifs. So, if we were to
perform energy calculations on NADH, we must confirm the potential handles all these atom types.
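The same loop pattern can tally how often each symbol occurs instead of only recording its presence. The sketch below is an added illustration on the shorter L-alanine SMILES used earlier in the chapter; it counts single characters, so it assumes there are no two-letter element symbols such as Cl or Br in the string. The name counts is hypothetical.

```python
smiles = 'N[C@@H](C)C(O)=O'  # L-alanine

# key: atom symbol; value: number of occurrences
counts = {}

for letter in smiles:
    # keep alphabetical characters that are not hydrogen
    if letter.isalpha() and letter.lower() != 'h':
        # get() returns 0 the first time a symbol is seen
        counts[letter] = counts.get(letter, 0) + 1

print(counts)
```

Here the loop runs over the string itself rather than over a set, because every occurrence has to be visited to obtain a count.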

A for-loop over a dictionary iterates through its keys.

# key: country; value: capital city
capital_cities = {'Kenya': 'Nairobi', 'Colombia': 'Bogotá', 'Japan': 'Tokyo', 'Iran': 'Tehran', 'Sweden': 'Stockholm'}

for country in capital_cities:
    print('The capital city of %s is %s' %(country, capital_cities[country]))

The capital city of Kenya is Nairobi
The capital city of Colombia is Bogotá
The capital city of Japan is Tokyo
The capital city of Iran is Tehran
The capital city of Sweden is Stockholm
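When both the key and the value are needed, the dictionary method items() yields them as pairs, which avoids looking up each value inside the loop. The sketch below is an added illustration on a small dictionary of exact masses; the name lines is hypothetical.

```python
# key: element symbol; value: exact mass
exact_masses = {'C': 12.000000, 'H': 1.007825, 'O': 15.994915}

lines = []
# items() yields (key, value) pairs, unpacked here into two local variables
for element, mass in exact_masses.items():
    lines.append('%s: %.6f' % (element, mass))

print(lines)
```

The result is the same as indexing the dictionary by key in each iteration, but the intent is stated more directly.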

1.6.2 While-Loop
The for-loop has a defined end to the loop: the number of items in the iterable. Sometimes the number of operations is not defined. We want to
repeat the operation until a task is done. For this, we use the while-loop. The while-loop defines a block of code to repeat until a condition is met.

Let us simulate a one-dimensional Brownian particle. We initialize the particle at 0, then observe the time taken for the particle to reach 10.

from random import sample

# initialize coordinate and count
x_coord = 0
steps_taken = 0

# while the coordinate is not 10
while x_coord != 10:
    # randomly pick a step of -1 or +1
    step = sample([-1, 1], 1)[0]
    # apply step to the coordinate and save value to the variable
    x_coord += step # same as x_coord = x_coord + step
    # add 1 to steps_taken
    steps_taken += 1

print('%d steps were taken for the particle to arrive from 0 to 10.' %steps_taken)

20 steps were taken for the particle to arrive from 0 to 10.
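The number of steps differs from run to run, and a single walk can occasionally take a very long time to arrive. The sketch below is an added illustration, not the original listing: it wraps the walk in a hypothetical function steps_to_reach, uses a seeded random.Random generator so that a run can be repeated, and stops after an assumed maximum number of steps.

```python
from random import Random

def steps_to_reach(target, max_steps, seed):
    '''return the number of steps taken to reach target,
    or None if max_steps is reached first'''
    rng = Random(seed)  # private generator so the result is repeatable
    x_coord = 0
    steps_taken = 0
    while x_coord != target:
        # give up once the step budget is spent
        if steps_taken >= max_steps:
            return None
        x_coord += rng.choice([-1, 1])
        steps_taken += 1
    return steps_taken

result = steps_to_reach(10, 10 ** 5, seed=0)
print(result)
```

The step limit turns a loop that might run for a very long time into one that is guaranteed to finish.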

Nesting loops should be avoided where possible because the number of operations multiplies with each level of nesting. Let us simulate a two-dimensional Brownian
particle. The particle can take steps of (1, 0), (0, 1), (−1, 0), (0, −1) in one iteration. We track the number of steps and observe how many
trajectories reach (2, 1) within 10³ steps.

# initialize parameter
max_steps = 10 ** 3
# initialize list for results
steps_list = []
# repeat 100 times
for i in range(100):
    # initialize coordinate and count
    coord = [0, 0] # x and y coordinate
    steps_taken = 0
    # repeat until the coordinate is (2, 1)
    while coord[0] != 2 or coord[1] != 1:
        # determine whether the step is in x (0) or y (1) axis
        axis = sample([0, 1], 1)[0]
        # obtain step size
        step = sample([-1, 1], 1)[0]
        # update the coordinate
        coord[axis] += step
        # add one to steps taken
        steps_taken += 1
        # if max steps were reached, exit the while loop
        if steps_taken >= max_steps:
            break
    # once while-loop is complete, append result
    steps_list.append(steps_taken)
steps_list

[117,
1000,
1000,
141,
1000,
1000,
1000,
91,
1000,
11,
9,
13,
1000,
71,
1000,
41,
1000,
931,
21,
1000,
611,
1000,
1000,
7,
73,
1000,
1000,
1000,
65,
1000,
1000,
1000,
1000,
405,
1000,
5,
61,
1000,
1000,
157,
1000,
1000,
3,
341,
7,
1000,
17,
3,
1000,
1000,
1000,
25,
11,
25,
19,
1000,
7,
675,
1000,
1000,
1000,
939,
1000,
237,
7,
1000,
227,
1000,
9,
473,
15,
5,
285,
1000,
293,
369,
1000,
1000,
1000,
1000,
1000,
15,
19,
831,
1000,
9,
1000,
5,
1000,
5,
403,
493,
3,
29,
141,
11,
3,
19,
281,
1000]

The while-loop was nested inside a for-loop to repeat 100 simulations of Brownian motion. Notice how additional indentation is added to the
while-loop because it is inside the for-loop. We have an if statement with a break. More on this later. Let us count the simulations which
reached (2, 1) within 10³ steps. We can subset lists by list comprehension: return each item of the list but add a condition at the end. Here the
condition is that the item is less than the specified maximum number of steps allowed. Because not all items of the list are below 1000 steps, the
resulting list is a subset of the original list.

[x for x in steps_list if x < max_steps]

[117,
141,
91,
11,
9,
13,
71,
41,
931,
21,
611,
7,
73,
65,
405,
5,
61,
157,
3,
341,
7,
17,
3,
25,
11,
25,
19,
7,
675,
939,
237,
7,
227,
9,
473,
15,
5,
285,
293,
369,
15,
19,
831,
9,
5,
5,
403,
493,
3,
29,
141,
11,
3,
19,
281]

len([x for x in steps_list if x < max_steps])

55

def mean(alist):
    '''return the mean of a list'''
    return sum(alist) / len(alist)

mean_steps = mean([x for x in steps_list if x < max_steps])

print('If the Brownian particle reaches (2,1) in 1000 steps, it takes %d steps on average.' %mean_steps)

If the Brownian particle reaches (2,1) in 1000 steps, it takes 165 steps on average.

1.6.3 Continue, Break, and Pass


There are various statements we can use to skip an iteration or exit a loop. continue skips the rest of the current iteration and moves on to the
next one. It is useful for skipping entries that would otherwise raise an error in the operations of the loop.

# from 0 through 9,
for i in range(10):
    # if the value is divisible by 2
    if i % 2 == 0:
        continue
    print(i)

1
3
5
7
9

Only odd integers were printed because we skipped even integers with a continue before reaching the print statement.
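A more typical use of continue is to skip entries that cannot be processed. The sketch below is an added illustration with made-up data: a list of energy values read as text contains a placeholder and an empty entry, both of which would raise an error if passed to float(). The names raw_values and energies are hypothetical.

```python
# energy values read as text; 'n/a' and '' stand for missing entries
raw_values = ['-443.192', 'n/a', '-443.225', '', '-443.199']
energies = []

for entry in raw_values:
    # skip entries that float() cannot convert
    if entry == 'n/a' or entry == '':
        continue
    energies.append(float(entry))

print(energies)
```

Placing the check at the top of the loop keeps the main operation unindented and easy to read.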

We can also exit a loop entirely using break.

i = 0

while True: # an infinite loop
    print(i)
    if i == 10:
        break
    i += 1

0
1
2
3
4
5
6
7
8
9
10

The loop would continue indefinitely because the statement passed to the while-loop is always true. However, we exit the loop using break.

Similar to functions, pass can fill in loops during development to avoid errors.

for i in range(5):
    pass

We are ready to process large volumes of data by encoding operations for individual data points and executing them in a loop.

1.6.4 Practice Problems


We should now be able to delegate repetitive tasks and calculations to loops using Python. Try out some practice problems to test your
understanding.

A. Write a function which accepts the whole number i, and returns the ith number in the Fibonacci sequence. The Fibonacci sequence is a
sequence of natural numbers in which the ith entry is the sum of the (i − 2)th and (i − 1)th numbers. The first 10 numbers of the Fibonacci
sequence are 1, 1, 2, 3, 5, 8, 13, 21, 34, and 55.

# write the rest of the code

B. Suppose we have analytically determined the diffusion coefficient at 24 °C to find the Stokes–Einstein radius of drug molecules in water (16).
Assume the molecules are spheres. Calculate the Stokes–Einstein radius for all compounds in a for-loop.

Stokes–Einstein equation for spherical system

$$r = \frac{k_B T}{6 \pi \eta D} \tag{1.9}$$

where r is the Stokes–Einstein radius, kB is Boltzmann's constant (1.38 × 10⁻²³ J/K), T is the temperature, η is the viscosity of the liquid, and D is the
diffusion coefficient of the solute.

drug_diffusion_coef = { # D in x10^(-6) cm^2/s
    'caffeine': 9.0,
    'calcein': 3.8,
    'chloramphenicol': 6.6,
    'ketoprofen': 6.4,
    'nitrofurantoin': 7.1,
    'paracetamol': 7.8,
    'penicillin G': 6.5,
    'tetracycline': 5.8,
    'trimethoprim': 5.6,
    'vancomycin': 2.9
}

# write the rest of the code

C. We previously identified the indices of PAM in the exons of A. thaliana (9). Write a function that takes the sequence of the exon in lower case
and capitalizes all candidates for the guide RNA to recognize. The region must satisfy all of the following conditions:
• Located at the 5' end of the PAM (<sequence>-NGG or CCN-<sequence>)

• Is 20 nucleotides long

• Does not contain four or more consecutive Ts

• The GC content is 30-80%

exon1 = (
    'ACAAAGTTAGCCTTCAAAATACTTACAAATCCCAATAAAAGACTTCATCTCCATGTGTATTTGAGTGTCAACGACAAGTC'
    'TACACAAAGGGTAAGAGGTCAACAAGACCACACAACACTTCTTACTATTAGTTTTGCAAAGGCCGTTCGTTGGACATTTCCT'
    'TCTCTCTCCTCCCCTCTTCTTCTTCTTGTTCGCTCTATAAACTCTCATCTCTCACGTCTTTTTTTCCTTACATTCTCCAAAC'
    'TCAAAATTTCATCACATTAATTTCTCTCTATTTTTCTTTTCTTACTTCAATAGTAATGGATAACACTGACCGTCGTCGCCGT'
    'CGTAAGCAACACAAAATCGCCCTCCATGACTCTGAAG'
)
exon2 = (
    'AAGTGAGCAGTATCGAATGGGAGTTTATCAACATGACTGAACAAGAAGAAGATCTCATCTTTCGAATGTACAGACTTGTCG'
    'GTGATAG'
)
exon3 = (
    'GTGGGATTTGATAGCAGGAAGAGTTCCTGGAAGACAACCAGAGGAGATAGAGAGATATTGGATAATGAGAAAC'
    'AGTGAAGGCTTTGCTGATAAACGACGCCAGCTTCACTCATCTTCCCACAAACATACCAAGCCTCACCGTCCTCGCTTTTCT'
    'ATCTATCCTTCCTAGTGTTTTTGTTTTTAAGCCAACGAAAAAAGAAAATAAAAAAATTATAATAGATGTATAGTAGTGGTT'
    'CTTGTTAGTTTGAAGAATTCATCATCTATTGTTTTCTTTTTGTTGTTATTTCATTTATAATTTTTATAGTATAGGTTTCAT'
    'TTGGTAATCAACTTTAATCCATGCGGTTAGGTTTTTTTATTTTCTCGTCTACGACTTTTATATCCACAACTAGATTTTAAT'
    'CCGCGGTATATCGCGGTATAATTTACTTTTTAAAGTTAATATATATTAAAACTTG'
)

# write the rest of the code

D. You want to prototype the behavior of an automatic pH titrator consisting of a pH probe, a 1 N hydrochloric acid pump, and a 1 N sodium
hydroxide pump, all connected to the same computer. You want an enzymatic reaction to proceed for one hour with the pH held in the range (7.0, 7.2).
While the pH is within this range, measure it every 30 s. If the pH goes out of range, add the appropriate
solution dropwise, one drop every 1 s, until the pH returns to the range. Use the provided functions to draft the behavior. Test the code by speeding
up time 60-fold. Purely for illustration, the pH fluctuates every 30 s by a value sampled from a Gaussian distribution.

from time import time, sleep
from random import gauss

pH = 7.1
hcl_added = 0 # mL
naoh_added = 0 # mL
one_drop = 0.0648524 # mL

def add_hcl():
    '''use by: hcl_added, pH = add_hcl()
    This will update the record of added HCl and update the pH'''
    return hcl_added + one_drop, pH - 0.05

def add_naoh():
    '''use by: naoh_added, pH = add_naoh() '''
    return naoh_added + one_drop, pH + 0.05

times = {
    'reaction': 1 * 60 * 60, # s
    'in_range': 30, # s
    'out_range': 1 # s
}
# expedite the time by 60 fold
speedup = 60
times = {name:t / speedup for (name, t) in times.items()}

toc = time() # start time in s
one_second_counter = 0
tic = time() # current time in s

while (tic - toc) < times['reaction']:
    # if the pH is in range, wait 30 s. pH fluctuates
    if pH > 7.0 and pH < 7.2:
        sleep(times['in_range'])
        pH += gauss(0, 0.05)
    # program behavior of the titrator off range from here
    # be sure to add one to one_second_counter if 1 s elapses

    # complete the code

    # once the off-range behavior is implemented, test out the code

    # if 30 counts of the 1 s counter are recorded, fluctuate the pH
    if one_second_counter == 30:
        # reset the counter
        one_second_counter = 0
        # fluctuate pH
        pH += gauss(0, 0.05)
    tic = time()
    time_elapsed = (tic - toc)/60 * speedup # min
    print('%.2f min | pH = %.2f' %(time_elapsed, pH))

print('reaction complete - program off')
print('HCl (1N) added: %.2f mL' %hcl_added)
print('NaOH (1N) added: %.2f mL' %naoh_added)

Appendix A: Solutions to Practice Problems

With what we have learned in base Python, we can now design algorithmic solutions that automate tasks and process data we could not have
handled by manual effort. We cover commands from external Python packages in subsequent chapters; however, we structure our scripts in
a similar manner. The solutions here are built in base Python; in practical implementations, code should be made more concise and readable
by using packages optimized for the task.

1.7 THAT’S A WRAP

Python is a simple language well suited to exploring algorithmic solutions to chemical problems. After trying out a few practice problems, you
should be comfortable opening a Python integrated development environment (IDE) and solving chemical problems in base Python.
• Chemists can use Python as a scientific calculator. The code documents the calculations performed.

• Python can manipulate text using string operations, which suits files, sequences, and chemical structures.

• We can assign chemical meaning to Python code by organizing it into functions.

• Booleans (True/False) allow us to process various inputs by running certain blocks of code only under a given condition.

• For-loops and while-loops allow us to delegate repetitive tasks to the computer.
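
The points above can be combined in a few lines. The sketch below is a minimal illustration with made-up pH readings (the sample names, values, and the function name `classify_ph` are hypothetical, not from the text): a function gives the comparison a chemical meaning, Booleans select the label, and a for-loop repeats the task for each sample.

```python
# Hypothetical pH readings, invented for illustration only
readings = {'buffer A': 6.8, 'buffer B': 7.1, 'buffer C': 7.4}

def classify_ph(ph, low=7.0, high=7.2):
    '''Label a pH value relative to the target window (low, high).'''
    if ph < low:
        return 'low'
    elif ph > high:
        return 'high'
    return 'in range'

labels = {}
for name, ph in readings.items():
    # the loop delegates the repetitive classification to the computer
    labels[name] = classify_ph(ph)
    print('%s: pH %.1f is %s' % (name, ph, labels[name]))
```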

1.8 READ THESE NEXT

• McKinney, W. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython; OʼReilly Media, Inc., 2012 (17).

• Severance, C. R. Python for Everybody: Exploring Data in Python 3; CreateSpace Independent Publishing Platform: North Charleston,
2016 (18).

Appendix A.
Solutions to Practice Problems

1.2.1 Solutions

A.

# molecular weights
benzodioxaborole_mw = 202.06 # g/mol
bromopropene_mw = 135.00 # g/mol
pd_catalyst_mw = 1155.59 # g/mol

scale = 1.0 # mmol

# equivalence
benzodioxaborole_equiv = 1.1
bromopropene_equiv = 1.0
pd_catalyst_equiv = 0.01

# calculate the mass for each


benzodioxaborole_mass = scale * benzodioxaborole_equiv * benzodioxaborole_mw
bromopropene_mass = scale * bromopropene_equiv * bromopropene_mw
pd_catalyst_mass = scale * pd_catalyst_equiv * pd_catalyst_mw

print('(E)-1-hexenyl-1,3,2-benzodioxaborole: %.2f mg' %benzodioxaborole_mass)


print('1-bromo-2-methyl-1-propene: %.2f mg' %bromopropene_mass)
print('tetrakis(triphenylphosphine)-palladium: %.2f mg' %pd_catalyst_mass)

B.

# assign known values to variables


v = 3.4 # units
km = 54 # mM
s = 10 # mM
enz = 0.72 # mg

# solve for Vmax


vmax = v * (km + s) / s
print('LarA exhibited Vmax of %d units' %vmax)

# solve for kcat


kcat = vmax / enz
print('LarA exhibited kcat of %d units/(mg)' %kcat)

C.

# assign known values to variables


epsilon = 84.3 # L / (g cm)
vi = 5 # mL
vf = 13 # mL
b = 1 # cm

absorbance = 0.31 # au

# calculate the chlorophyll a concentration of the organic extract


# using the Beer–Lambert law
cf = absorbance / (b * epsilon) # g / L

# adjust for dilution by change in volume


ci = vf * cf / vi # g / L

# correct for units


ci *= 1000 # mg / L

print('The chlorophyll a concentration of the water sample is %.1f mg/L' %ci)

D.

# import relevant functions from the math module


from math import sqrt, log10

# equilibrium constants
K1 = 1.7* 10**(-3)
K2 = 2.5 * 10 ** (-4)
K1K2 = K1 * K2
# henry's law constant
kh = 29.41

# Calculate the pH for 280 ppm CO2
aq_co2_i = 280 * 10 ** (-6) / kh # convert ppm to atm, then find molar aqueous co2
sq_h_i = K1K2 * aq_co2_i
h_conc_i = sqrt(sq_h_i) # obtain the concentration of protons produced
ph_i = -log10(h_conc_i) # calculate pH

# Calculate the pH for 380 ppm CO2
aq_co2_f = 380 * 10 ** (-6) / kh # convert ppm to atm, then find molar aqueous co2
sq_h_f = K1K2 * aq_co2_f
h_conc_f = sqrt(sq_h_f)
ph_f = -log10(h_conc_f)

ph_diff = ph_f - ph_i
print('The change in pH from 280 ppm CO2 to 380 ppm CO2 is %.3f.' %ph_diff)
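
The two calculations above repeat the same three steps. As a sketch only, the steps could be wrapped in a function so that any CO2 level can be evaluated; the function name `ph_from_co2` is an assumption introduced here, while the constants are those of the solution above.

```python
from math import sqrt, log10

K1 = 1.7e-3   # first equilibrium constant, as in the solution above
K2 = 2.5e-4   # second equilibrium constant, as in the solution above
KH = 29.41    # Henry's law constant, as in the solution above

def ph_from_co2(ppm):
    '''Return the pH for a given atmospheric CO2 level in ppm (hypothetical helper).'''
    aq_co2 = ppm * 1e-6 / KH          # ppm -> atm -> molar aqueous CO2
    h_conc = sqrt(K1 * K2 * aq_co2)   # proton concentration
    return -log10(h_conc)

ph_diff = ph_from_co2(380) - ph_from_co2(280)
print('Change in pH: %.3f' % ph_diff)
```

Because the constants cancel in the difference, the change in pH depends only on the ratio of the two CO2 levels.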


1.3.3 Solutions

A.

# reverse the coding sequence and make it all upper case


# obtain the reverse complement as a list
# then join the individual letters to one string.
antisense_rfp = ''.join([complement_dict[n] for n in rfp[::-1].upper()])
antisense_rfp
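
The one-liner above relies on `complement_dict` and `rfp` defined earlier in the chapter. For a self-contained check, the sketch below assumes a plain four-letter mapping and a short made-up sequence; both are illustrative stand-ins rather than the chapter's actual data.

```python
# Assumed complement mapping for the four DNA bases
complement_dict = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}

def reverse_complement(seq):
    '''Reverse the sequence, upper-case it, and complement each base.'''
    return ''.join(complement_dict[n] for n in seq[::-1].upper())

example = reverse_complement('atgcc')  # short hypothetical sequence
print(example)
```

Applying the function twice returns the original sequence in upper case, which is a quick way to test the mapping.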

B.

exon1 =
'ACAAAGTTAGCCTTCAAAATACTTACAAATCCCAATAAAAGACTTCATCTCCATGTGTATTTGAGTGTCAACGACAAG
TCTACACAAAGGGTAAGAGGTCAACAAGACCACACAACACTTCTTACTATTAGTTTTGCAAAGGCCGTTCGTTGGACAT
TTCCTTCTCTCTCCTCCCCTCTTCTTCTTCTTGTTCGCTCTATAAACTCTCATCTCTCACGTCTTTTTTTCCTTACATTC
TCCAAACTCAAAATTTCATCACATTAATTTCTCTCTATTTTTCTTTTCTTACTTCAATAGTAATGGATAACACTGACCGT
CGTCGCCGTCGTAAGCAACACAAAATCGCCCTCCATGACTCTGAAG'
exon2 =
'AAGTGAGCAGTATCGAATGGGAGTTTATCAACATGACTGAACAAGAAGAAGATCTCATCTTTCGAATGTACAGACTTGT
CGGTGATAG'
exon3 =
'GTGGGATTTGATAGCAGGAAGAGTTCCTGGAAGACAACCAGAGGAGATAGAGAGATATTGGATAATGAGAAACAGTGAAG
GCTTTGCTGATAAACGACGCCAGCTTCACTCATCTTCCCACAAACATACCAAGCCTCACCGTCCTCGCTTTTCTATCTAT
CCTTCCTAGTGTTTTTGTTTTTAAGCCAACGAAAAAAGAAAATAAAAAAATTATAATAGATGTATAGTAGTGGTTCTTGT
TAGTTTGAAGAATTCATCATCTATTGTTTTCTTTTTGTTGTTATTTCATTTATAATTTTTATAGTATAGGTTTCATTTGG
TAATCAACTTTAATCCATGCGGTTAGGTTTTTTTATTTTCTCGTCTACGACTTTTATATCCACAACTAGATTTTAATCCG
CGGTATATCGCGGTATAATTTACTTTTTAAAGTTAATATATATTAAAACTTG'

pam = 'GG'

index1 = exon1.find(pam)
index2 = exon2.find(pam)
index3 = exon3.find(pam)

print('First occurrences of the PAM site NGG in exons 1, 2, and 3 respectively: %d, %d, %d' %(index1, index2, index3))
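
`str.find` reports only the first match. If every occurrence is needed, one possible approach is to call `find` repeatedly with a moving start position. The helper name `find_all` and the short test sequence are assumptions made for this sketch.

```python
def find_all(seq, motif):
    '''Return every start index of motif in seq, allowing overlapping matches.'''
    indices = []
    start = seq.find(motif)
    while start != -1:
        indices.append(start)
        # resume the search one position after the last match
        start = seq.find(motif, start + 1)
    return indices

hits = find_all('AGGGTCCGGA', 'GG')  # hypothetical short sequence
print(hits)
```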

C.

d_glucosamine_sulfate = 'C([C@H]([C@H]([C@@H]([C@H](C=O)N)O)O)O)O.OS(=O)(=O)O'
structure_list = d_glucosamine_sulfate.split('.')
len(structure_list)
print('lengths of structures:', len(structure_list[0]), len(structure_list[1]))

structure_list[0]

D.

saveline = ' TOTAL ENERGY = -884.004943174\n'

float(saveline.split()[-1])
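
The same split-and-convert step extends to a whole output file by looping over its lines. The sketch below uses a made-up three-line text in place of a real output file; only the `TOTAL ENERGY` line format is taken from the solution above.

```python
# Hypothetical output text; only the TOTAL ENERGY line mirrors the example above
output_text = ''' SCF CONVERGED
 TOTAL ENERGY = -884.004943174
 END OF CALCULATION
'''

energies = []
for line in output_text.splitlines():
    # keep only the lines that report the energy
    if 'TOTAL ENERGY' in line:
        energies.append(float(line.split()[-1]))

print(energies)
```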


1.4.1 Solutions

A.

recombinant_sarafotoxin =
'MKDDAAIQQTLAKMGIKSSDIQPAPVAGMKTVLTNSGVLYITDDGKHIIQGPMYDVSGTAPVNVTNKMLLKQLNALEKEMIVYKA
PQEKHVITVFTDITCGYCHKLHEQMADYNALGITVRYLAFPRQGLDSDAEKEMKAIWCAKDKNKAFDDVMAGKSVAPASCDVDIAD
HYALGVQLGVSGTPAVVLSNGTLVPGYQPPKEMKEFLDEHQKMTSGKGSTSGSGHHHHHHGTMTSLYKKAGLENLYFQCTCKDMTD

KECLYFCHQDIIW'

def tev_index(peptide, tev_site = 'ENLYFQ'):
    '''Given a peptide sequence, report the index of the TEV site, if any'''
    return peptide.find(tev_site)

tev_index(recombinant_sarafotoxin)

B.

from math import log

R = 0.008314 # kJ / (mol K)
F = 96.49 # kJ / (V mol)

def reaction_quotient(acceptor_conc, donor_conc, acceptor_coef = 1, donor_coef = 1):
    '''Report the reaction quotient of a redox reaction.'''
    return acceptor_conc ** acceptor_coef / donor_conc ** donor_coef

def reduction_potential(Q, acceptor_E_std, donor_E_std, T = 298, n = 1):
    '''Return the non-equilibrium reduction potential E of a redox reaction'''
    E_std = acceptor_E_std - donor_E_std
    E = E_std - R * T / (n * F) * log(Q)
    return E

ag_plus = 0.04 / 1000 # M
mn_plus = 0.13 / 1000 # M
ag_E_std = 0.7996 # V
mn_E_std = -1.185 # V

Q = reaction_quotient(ag_plus, mn_plus, acceptor_coef = 2)
reduction_potential(Q, ag_E_std * 2, mn_E_std, n = 2)

C.

exact_masses = {'C': 12.000000, 'H': 1.007825, 'O': 15.994915, 'N': 14.003074, 'Na': 22.989770}

def exact_mass(C_count, H_count, O_count, N_count):
    C_mass = exact_masses['C'] * C_count
    H_mass = exact_masses['H'] * H_count
    O_mass = exact_masses['O'] * O_count
    N_mass = exact_masses['N'] * N_count
    return C_mass + H_mass + O_mass + N_mass

def compare_to_sodiated(measured, neutral_calc):
    calc_adduct = neutral_calc + exact_masses['Na']
    return measured - calc_adduct

calc_mass = exact_mass(15, 14, 2, 0)
compare_to_sodiated(249.0881, calc_mass)
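
Mass differences are often easier to judge as a relative error. As a sketch, the difference returned above can be expressed in parts per million of the calculated adduct mass; the helper `ppm_error` is an assumption added here and, like the solution above, it ignores the electron mass.

```python
# Exact masses copied from the solution above
exact_masses = {'C': 12.000000, 'H': 1.007825, 'O': 15.994915, 'Na': 22.989770}

def ppm_error(measured, calculated):
    '''Return the relative mass error in parts per million (hypothetical helper).'''
    return (measured - calculated) / calculated * 1e6

# sodiated adduct of C15H14O2, same composition as in the solution above
calc_adduct = (15 * exact_masses['C'] + 14 * exact_masses['H']
               + 2 * exact_masses['O'] + exact_masses['Na'])
error = ppm_error(249.0881, calc_adduct)
print('%.1f ppm' % error)
```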

D.

from math import log

def k_from_halflife(t_half):
    '''Calculate decay constant k from halflife in inverse time'''
    return - log(1/2) / (t_half) # erroneous parentheses

def time_relative(fraction, k):
    '''Calculate the time taken for radioactive decay with decay constant k
    to achieve a fraction of the initial amount.'''
    return - log(fraction)/ k # forgotten negative sign

halflife = 20.364 # minutes
k = k_from_halflife(halflife)
time_shutdown = time_relative(0.03/0.17, k)
print(time_shutdown, 'minutes')
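
A corrected function can be checked by a round trip: inserting the computed time into the first-order decay law should return the starting fraction. The sketch below assumes first-order decay, N(t)/N0 = exp(-k t); the helper `fraction_remaining` is introduced here for the check and is not part of the original solution.

```python
from math import log, exp

def k_from_halflife(t_half):
    '''Decay constant from the half-life: k = ln(2) / t_half.'''
    return log(2) / t_half

def fraction_remaining(t, k):
    '''Fraction of the initial amount left after time t (first-order decay).'''
    return exp(-k * t)

k = k_from_halflife(20.364)       # per minute
t = -log(0.03 / 0.17) / k         # time to reach the target fraction, minutes
print(fraction_remaining(20.364, k), fraction_remaining(t, k))
```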


1.5.3 Solutions

A.

def relu(val):
    if val > 0:
        return val
    else:
        return 0

relu(1.3)

B.

def functional_group(chemical_shift):
    possible_functional_groups = []
    if chemical_shift >= 0.5 and chemical_shift <= 2.0:
        possible_functional_groups.append('aliphatic')
    if chemical_shift >= 1.5 and chemical_shift <= 2.5:
        possible_functional_groups.append('allylic')
    if chemical_shift >= 2.5 and chemical_shift <= 4.5:
        possible_functional_groups.append('CH2-X')
    if chemical_shift >= 0.5 and chemical_shift <= 5.0:
        possible_functional_groups.append('ROH')
    if chemical_shift >= 4.5 and chemical_shift <= 6.5:
        possible_functional_groups.append('vinylic')
    if chemical_shift >= 6.0 and chemical_shift <= 8.5:
        possible_functional_groups.append('aromatic')
    return possible_functional_groups

functional_group(7)
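
The chain of if-statements can also be written as a lookup table plus a loop, which keeps the ranges in one place. This is an alternative sketch, not the authors' solution; the dictionary name `shift_ranges` is an assumption, and the ranges are copied from the code above.

```python
# Chemical shift ranges in ppm, copied from the if-statements above
shift_ranges = {
    'aliphatic': (0.5, 2.0),
    'allylic': (1.5, 2.5),
    'CH2-X': (2.5, 4.5),
    'ROH': (0.5, 5.0),
    'vinylic': (4.5, 6.5),
    'aromatic': (6.0, 8.5),
}

def functional_group(chemical_shift):
    '''Return every functional group whose range contains the chemical shift.'''
    return [name for name, (low, high) in shift_ranges.items()
            if low <= chemical_shift <= high]

print(functional_group(7))
print(functional_group(2.0))
```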

1.6.4 Solutions

A.

def fibonacci(n):
    '''Return the nth entry of the Fibonacci sequence'''
    seq = [1, 1]
    for i in range(n - 1):
        seq.append(seq[-2] + seq[-1])
    return seq[-1]

fibonacci(10)

B.

from math import pi

kB = 1.38 * 10 ** (-23) # J/K
eta = 0.9107 / 1000 # kg / (m s), same as Pa s
T = 273.15 + 24 # K

drug_diffusion_coef = { # D in x10^(-6) cm^2/s
    'caffeine': 9.0,
    'calcein': 3.8,
    'chloramphenicol': 6.6,
    'ketoprofen': 6.4,
    'nitrofurantoin': 7.1,
    'paracetamol': 7.8,
    'penicillin G': 6.5,
    'tetracycline': 5.8,
    'trimethoprim': 5.6,
    'vancomycin': 2.9
}

drug_se_radius = {}

def stokes_einstein_radius(D, T = T, eta = eta):
    return kB * T / (6 * pi * eta * D)

for drug in drug_diffusion_coef:
    # convert units to m^2/s
    D = drug_diffusion_coef[drug] * 10 ** (-6) * (1 / 100) ** 2 # m^2/s
    # calculate the radius in m
    r = stokes_einstein_radius(D) # radius in m
    # record the radius in angstroms
    drug_se_radius[drug] = r * 10 ** 10 # radius in angstroms
drug_se_radius
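
Unit conversions are the most error-prone part of this calculation, so a spot check on two compounds is worthwhile. The sketch below recomputes the radius for caffeine and vancomycin with the constants from the solution above; the helper name `radius_in_angstrom` is an assumption. Since the radius is inversely proportional to D, the ratio of the two radii should equal the inverse ratio of the diffusion coefficients.

```python
from math import pi

kB = 1.38e-23      # J/K, as given in the problem
eta = 0.9107e-3    # Pa s, viscosity used in the solution above
T = 273.15 + 24    # K

def radius_in_angstrom(D_cm2_per_s):
    '''Stokes-Einstein radius in angstroms from D in cm^2/s (hypothetical helper).'''
    D = D_cm2_per_s * 1e-4                 # cm^2/s -> m^2/s
    r = kB * T / (6 * pi * eta * D)        # radius in m
    return r * 1e10                        # m -> angstrom

caffeine_r = radius_in_angstrom(9.0e-6)
vancomycin_r = radius_in_angstrom(2.9e-6)
print('caffeine: %.2f A, vancomycin: %.2f A' % (caffeine_r, vancomycin_r))
```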

C.

def grna_candidates(dna):
    '''return candidate regions to hybridize to gRNA using SpCas9'''
    # lower case the sequence
    dna = dna.lower()
    # store indices to capitalize. capitalize at end
    cap_indices = []
    # scan the sequence from 5' to 3'
    # the last start index that leaves room for CCN plus 20 nt is len(dna) - 23
    for i in range(len(dna) - 22):
        # determine if on index of PAM (CCN)
        is_pam = dna[i] == 'c' and dna[i + 1] == 'c'
        grna_region = dna[i + 3:i + 23]
        # move on if not PAM
        if not is_pam:
            continue
        # skip if polyT is present
        elif 'tttt' in grna_region:
            continue
        else:
            # calculate the GC content
            gc_cont = (grna_region.count('g') + grna_region.count('c')) / 20
            if gc_cont >= 0.3 and gc_cont <= 0.8:
                cap_indices += list(range(i + 3, i + 23))
    # now scan in the opposite direction
    # the smallest index that leaves room for 20 nt plus NGG is 22
    for i in range(len(dna) - 1, 21, -1):
        # determine if on index of PAM (NGG)
        is_pam = dna[i] == 'g' and dna[i - 1] == 'g'
        grna_region = dna[i - 22: i - 2]
        # move on if not PAM
        if not is_pam:
            continue
        # skip if polyT is present
        elif 'tttt' in grna_region:
            continue
        else:
            # calculate the GC content
            gc_cont = (grna_region.count('g') + grna_region.count('c')) / 20
            if gc_cont >= 0.3 and gc_cont <= 0.8:
                cap_indices += list(range(i - 22, i - 2))
    # save dna to list
    dna = list(dna)
    # capitalize at each index determined to be candidate site for gRNA recognition
    for i in cap_indices:
        dna[i] = dna[i].upper()
    return ''.join(dna)

exon1 =
'ACAAAGTTAGCCTTCAAAATACTTACAAATCCCAATAAAAGACTTCATCTCCATGTGTATTTGAGTGTCAACGACAAGTCTA
CACAAAGGGTAAGAGGTCAACAAGACCACACAACACTTCTTACTATTAGTTTTGCAAAGGCCGTTCGTTGGACATTTCCTTCT
CTCTCCTCCCCTCTTCTTCTTCTTGTTCGCTCTATAAACTCTCATCTCTCACGTCTTTTTTTCCTTACATTCTCCAAACTCAA
AATTTCATCACATTAATTTCTCTCTATTTTTCTTTTCTTACTTCAATAGTAATGGATAACACTGACCGTCGTCGCCGTCGTAA
GCAACACAAAATCGCCCTCCATGACTCTGAAG'
exon2 =
'AAGTGAGCAGTATCGAATGGGAGTTTATCAACATGACTGAACAAGAAGAAGATCTCATCTTTCGAATGTACAGACTTGTCGG
TGATAG'
exon3 =
'GTGGGATTTGATAGCAGGAAGAGTTCCTGGAAGACAACCAGAGGAGATAGAGAGATATTGGATAATGAGAAACAGTGAAGGCT

TTGCTGATAAACGACGCCAGCTTCACTCATCTTCCCACAAACATACCAAGCCTCACCGTCCTCGCTTTTCTATCTATCCTTCCT
AGTGTTTTTGTTTTTAAGCCAACGAAAAAAGAAAATAAAAAAATTATAATAGATGTATAGTAGTGGTTCTTGTTAGTTTGAAGA
ATTCATCATCTATTGTTTTCTTTTTGTTGTTATTTCATTTATAATTTTTATAGTATAGGTTTCATTTGGTAATCAACTTTAATC
CATGCGGTTAGGTTTTTTTATTTTCTCGTCTACGACTTTTATATCCACAACTAGATTTTAATCCGCGGTATATCGCGGTATAAT
TTACTTTTTAAAGTTAATATATATTAAAACTTG'

grna_candidates(exon1)

D.

from time import time, sleep
from random import gauss

pH = 7.1
hcl_added = 0 # mL
naoh_added = 0 # mL
one_drop = 0.0648524 # mL

def add_hcl():
    '''use by: hcl_added, pH = add_hcl()
    This will update the record of added HCl and update the pH'''
    return hcl_added + one_drop, pH - 0.05

def add_naoh():
    '''use by: naoh_added, pH = add_naoh() '''
    return naoh_added + one_drop, pH + 0.05

times = {
    'reaction': 1 * 60 * 60, # s
    'in_range': 30, # s
    'out_range': 1 # s
}
# expedite the time by 60 fold
speedup = 60
times = {name:t / speedup for (name, t) in times.items()}

toc = time() # start time in s
one_second_counter = 0
tic = time() # current time in s

while (tic - toc) < times['reaction']:
    if pH > 7.0 and pH < 7.2:
        sleep(times['in_range'])
        pH += gauss(0, 0.05)
    # program behavior of the titrator off range.
    # the boundaries 7.0 and 7.2 are treated as out of range
    elif pH <= 7.0:
        naoh_added, pH = add_naoh()
        one_second_counter += 1
        sleep(times['out_range'])
    else:
        hcl_added, pH = add_hcl()
        one_second_counter += 1
        sleep(times['out_range'])
    # test out the code
    if one_second_counter == 30:
        one_second_counter = 0
        pH += gauss(0, 0.05)
    tic = time()
    time_elapsed = (tic - toc)/60 * speedup # min
    print('%.2f min | pH = %.2f' %(time_elapsed, pH))

print('reaction complete - program off')
print('HCl (1 N) added: %.2f mL' %hcl_added)
print('NaOH (1 N) added: %.2f mL' %naoh_added)

Bibliography
1. Van Rossum, G.; Drake, F. L. Python 3 Reference Manual; CreateSpace: Scotts Valley, CA, 2009.

2. Miyaura, N.; Yamada, K.; Suzuki, A. A new stereospecific cross-coupling by the palladium-catalyzed reaction of 1-alkenylboranes with 1-alkenyl or 1-alkynyl halides.
Tetrahedron Lett. 1979, 20(36), 3437–3440, 10.1016/S0040-4039(01)95429-2.

3. Mojzita, D.; Penttilä, M.; Richard, P. Identification of an l-arabinose reductase gene in Aspergillus niger and its role in l-arabinose catabolism. J. Biol. Chem. 2010, 285(31),
23622–23628, 10.1074/jbc.M110.113399.

4. Boardman, N. K.; Thorne, S. W. Sensitive fluorescence method for the determination of chlorophyll a/chlorophyll b ratios. Biochim. Biophys. Acta (BBA)-Bioenergetics 1971,
253(1), 222–231, 10.1016/0005-2728(71)90248-9.

5. Eyre, B. D.; Cyronak, T.; Drupp, P.; De Carlo, E. H.; Sachs, J. P.; Andersson, A. J. Coral reefs will transition to net dissolving before end of century. Science 2018, 359(6378),
908–911, 10.1126/science.aao1118.

6. Fradkov, A. F.; Chen, Y.; Ding, L.; Barsova, E. V.; Matz, M. V.; Lukyanov, S. A. Novel fluorescent protein from discosoma coral and its mutants possesses a unique far-red
fluorescence. FEBS Lett. 2000, 479 (3), 127–130, 10.1016/S0014-5793(00)01895-0.

7. Mullis, K.; Faloona, F.; Scharf, S.; Saiki, R.; Horn, G.; Erlich, H. Specific enzymatic amplification of DNA in vitro: the polymerase chain reaction. In Cold Spring Harbor
Symposia on Quantitative Biology; Cold Spring Harbor Laboratory Press, 1986; vol. 51, pp. 263–273.

8. Knight, T. Idempotent Vector Design for Standard Assembly of Biobricks; MIT Artificial Intelligence Laboratory; MIT Synthetic Biology Working Group, 2003.

9. Grützner, R.; Martin, P.; Horn, C.; Mortensen, S.; Cram, E. J.; Lee-Parsons, C. W. T.; Stuttmann, J.; Marillonnet, S. High-efficiency genome editing in plants mediated by a
cas9 gene containing multiple introns. Plant Commun. 2021, 2(2), 100135, 10.1016/j.xplc.2020.100135.

10. National Center for Biotechnology Information. Gene ID: 835401, Arabidopsis thaliana Homeodomain-like superfamily protein (TRY), Chromosome: 5; National Library of
Medicine (US): Bethesda, MD, 1988. https://www.ncbi.nlm.nih.gov/gene/835401 (accessed 2022-01-08).

11. Manathunga, M.; Miao, Y.; Mu, D.; Götz, A. W.; Merz, K. M., Jr. Parallel implementation of density functional theory methods in the quantum interaction computational
kernel program. J. Chem. Theory Comput. 2020, 16(7), 4315–4326, 10.1021/acs.jctc.0c00290.

12. Manathunga, M.; Jin, C.; Cruzeiro, V. W. D.; Smith, V.; Keipert, K.; Pekurovsky, D.; Mu, D.; Miao, Y.; He, X.; Ayers, K.; Brothers, E.; Götz, A. W.; Merz, K. M.
Quick-21.03. https://github.com/merzlab/QUICK.

13. Sequeira, A. F.; Turchetto, J.; Saez, N. J.; Peysson, F.; Ramond, L.; Duhoo, Y.; Blémont, M.; Fernandes, V.O.; Gama, L. T.; Ferreira, L. M. A.; et al. Gene design, fusion
technology and tev cleavage conditions influence the purification of oxidized disulphide-rich venom peptides in Escherichia coli. Microbial Cell Factories 2017, 16(1), 4,
10.1186/s12934-016-0618-0.

14. Fernandes, R. A.; Gangani, A. J.; Panja, A. Synthesis of 5-vinyl-2-isoxazolines by palladium-catalyzed intramolecular o-allylation of ketoximes. Org. Lett. 2021, 23(16),
6227–6231, 10.1021/acs.orglett.1c01897.

15. Bennetzen, J. L.; Hall, B. D. The primary structure of the Saccharomyces cerevisiae gene for alcohol dehydrogenase. J. Biol. Chem. 1982, 257 (6), 3018–3025, 10.1016/
S0021-9258(19)81067-0.

16. Di Cagno, M. P.; Clarelli, F.; Våbenø, J.; Lesley, C.; Darsim Rahman, S.; Cauzzo, J.; Franceschinis, E.; Realdon, N.; Stein, P. C. Experimental determination of drug diffusion
coefficients in unstirred aqueous environments by temporally resolved concentration measurements. Mol. Pharmaceutics 2018, 15(4), 1488–1494.

17. McKinney, W. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython; O'Reilly Media, Inc., 2012.

18. Severance, C. R. Python for Everybody: Exploring Data in Python 3; CreateSpace Independent Publishing Platform: North Charleston, 2016.

19. Reilly, S. The role of libraries in supporting data exchange. In 78th IFLA General Conference and Assembly, 2012. http://conference.ifla.org/sites/default/files/files/papers/wlic2012/116-reilly-en.pdf.

20. Harris, C. R.; Millman, K. J.; van der Walt, S. J.; Gommers, R.; Virtanen, P.; Cournapeau, D.; Wieser, E.; Taylor, J.; Berg, S.; Smith, N. J.; Kern, R.; Picus, M.; Hoyer, S.;
van Kerkwijk, M. H.; Brett, M.; Haldane, A.; del Río, J. F.; Wiebe, M.; Peterson, P.; Gérard-Marchant, P.; Sheppard, K.; Reddy, T.; Weckesser, W.; Abbasi, H.; Gohlke, C.;
Oliphant, T. E. Array programming with NumPy. Nature 2020, 585 (7825), 357–362, 10.1038/s41586-020-2649-2.

21. Kabsch, W. A solution for the best rotation to relate two sets of vectors. Acta Crystallogr. A 1976, 32 (5), 922–923, 10.1107/S0567739476001873.

22. McKinney, W. Data structures for statistical computing in python. In Proceedings of the 9th Python in Science Conference; Austin, TX, 2010; vol. 445, pp. 51–56.

23. Simmons, J. P.; Nelson, L. D.; Simonsohn, U. False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychol.
Sci. 2011, 22 (11), 1359–1366, 10.1177/0956797611417632.

24. Picache, J. A.; Rose, B. S.; Balinski, A.; Leaptrot, K. L.; Sherrod, S. D.; May, J. C.; McLean, J. A. Collision cross section compendium to annotate and predict multi-omic
compound identities. Chem. Sci. 2019, 10 (4), 983–993, 10.1039/C8SC04396E.

25. Wickham, H. Tidy data. J. Stat. Softw. 2014, 59 (10), 1–23, 10.18637/jss.v059.i10.

26. Waskom, M. L. Seaborn: statistical data visualization. J. Open Source Softw. 2021, 6 (60), 3021, 10.21105/joss.03021.

27. Peng, R. The reproducibility crisis in science: a statistical counterattack. Significance 2015, 12 (3), 30–32, 10.1111/j.1740-9713.2015.00827.x.

28. Tetko, I. V.; Engkvist, O.; Koch, U.; Reymond, J.-L.; Chen, H. Bigchem: challenges and opportunities for big data analysis in chemistry. Mol. Inf. 2016, 35 (11–12),
615–621, 10.1002/minf.201600073.

29. Leonelli, S. Scientific research and big data. In The Stanford Encyclopedia of Philosophy, summer 2020 ed.; Zalta, E. N., Ed.; 2020. https://plato.stanford.edu/archives/sum2020/entries/science-big-data/.

30. Prakash, N.; Gareja, D. A. Cheminformatics. J. Proteomics Bioinform. 2010, 03, 249–252, 10.4172/jpb.1000147.

31. Wishart, D. S. Introduction to cheminformatics. Curr. Protoc. Bioinform. 2016, 53 (1), 14–11, 10.1002/0471250953.bi1401s18.

32. Firdaus Begam, B.; Satheesh, J.; Kumar. A study on cheminformatics and its applications on modern drug discovery. Procedia Eng. 2012, 38, 1264–1275, 10.1016/
j.proeng.2012.06.156.

33. Anderson, E.; Veith, G. D.; Weininger, D. SMILES, a line notation and computerized interpreter for chemical structures; US Environmental Protection Agency, Environmental
Research Laboratory, 1987.

34. Daylight Chemical Information Systems, Inc. Smarts - a language for describing molecular patterns. Daylight Theory Manual, ver. 4.9, 2011. https://daylight.com/dayhtml/doc/theory/theory.smarts.html (accessed Jan 22, 2022).

35. James, C.; Weininger, D.; Delany, J. Daylight theory manual, ver. 4.9., 2011. https://daylight.com/dayhtml/doc/theory (accessed Jan 22, 2022).

36. OpenEye Scientific Software, Inc. Oechem toolkit 3.2.0.0. https://docs.eyesopen.com/toolkits/python/oechemtk/index.html (accessed Jan 22, 2022).

37. OpenEye Scientific Software, Inc. Chemaxon–Cheminformatics platforms and desktop applications. https://chemaxon.com (accessed Jan 22, 2022).

38. O'Boyle, N. M.; Banck, M.; James, C. A.; Morley, C.; Vandermeersch, T.; Hutchison, G. R. Open babel: an open chemical toolbox. J. Cheminform. 2011, 3 (1), 1–14,
10.1186/1758-2946-3-33.

39. Landrum, G. RDKit: Open-source cheminformatics. http://www.rdkit.org (accessed Jan 22, 2022).

40. Kim, S.; Chen, J.; Cheng, T.; Gindulyte, A.; He, J.; He, S.; Li, Q.; Shoemaker, B. A.; Thiessen, P. A.; Yu, B.; Zaslavsky, L.; Zhang, J.; Bolton, E. E. Pubchem in 2021: new
data content and improved web interfaces. Nucleic Acids Res. 2021, 49(D1), D1388–D1395, 10.1093/nar/gkaa971.

41. Davies, M.; Nowotka, M.; Papadatos, G.; Dedman, N.; Gaulton, A.; Atkinson, F.; Bellis, L.; Overington, J. P. Chembl web services: streamlining access to drug discovery data
and utilities. Nucleic Acids Res. 2015, 43 (W1), W612–W620, 10.1093/nar/gkv352.

42. Wishart, D. S.; Guo, A. C.; Oler, E.; Wang, F.; Anjum, A.; Peters, H.; Dizon, R.; Sayeeda, Z.; Tian, S.; Lee, B. L.; Berjanskii, M.; Mah, R.; Yamamoto, M.; Jovel, J.;
Torres-Calzada, C.; Hiebert-Giesbrecht, M.; Lui, V. W.; Varshavi, D.; Varshavi, D.; Allen, D.; Arndt, D.; Khetarpal, N.; Sivakumaran, A.; Harford, K.; Sanford, S.; Yee, K.;
Cao, X.; Budinski, Z.; Liigand, J.; Zhang, L.; Zheng, J.; Mandal, R.; Karu, N.; Dambrova, M.; Schiöth, H. B.; Greiner, R.; Gautam, V. Hmdb 5.0: the human metabolome
database for 2022. Nucleic Acids Res. 2022, 50(D1), D622–D631, 10.1093/nar/gkab1062.

43. Bansal, P.; Morgat, A.; Axelsen, K. B.; Muthukrishnan, V.; Coudert, E.; Aimo, L.; Hyka-Nouspikel, N.; Gasteiger, E.; Kerhornou, A.; Neto, T. B.; Pozzato, M.; Blatter, M.-C.;
Ignatchenko, A.; Redaschi, N.; Bridge, A. Rhea, the reaction knowledgebase in 2022. Nucleic Acids Res. 2022, 50(D1), D693–D700, 10.1093/nar/gkab1016.

44. Kearnes, S. M.; Maser, M. R.; Wleklinski, M.; Kast, A.; Doyle, A. G.; Dreher, S. D.; Hawkins, J. M.; Jensen, K. F.; Coley, C. W. The open reaction database. J. Am. Chem. Soc.
2021, 143 (45), 18820–18826, 10.1021/jacs.1c09820.

45. Ruddigkeit, L.; Awale, M.; Reymond, J.-L. Expanding the fragrance chemical space for virtual screening. J. Cheminform. 2014, 6 (1), 1–12, 10.1186/1758-2946-6-27.

46. Ahmed, J.; Preissner, S.; Dunkel, M.; Worth, C. L.; Eckert, A.; Preissner, R. Supersweet—a resource on natural and artificial sweetening agents. Nucleic Acids Res. 2011, 39
(Database issue), D377–D382, 10.1093/nar/gkq917.

47. Wiener, A.; Shudler, M.; Levit, A.; Niv, M. Y. Bitterdb: a database of bitter compounds. Nucleic Acids Res. 2012, 40(Database issue), D413–D419, 10.1093/nar/gkr755.

48. Danishuddin; Khan, A. U. Descriptors and their selection methods in qsar analysis: paradigm for drug design. Drug Discovery Today 2016, 21 (8), 1291–1302, 10.1016/
j.drudis.2016.06.013.

49. Durant, J. L.; Leland, B. A.; Henry, D. R.; Nourse, J. G. Reoptimization of mdl keys for use in drug discovery. J. Chem. Inform. Comput. Sci. 2002, 42(6), 1273–1280.
10.1021/ci010132r.

50. Morgan, H. L. The generation of a unique machine description for chemical structures-a technique developed at Chemical Abstracts Service. J. Chem. Document. 1965, 5 (2),
107–113, 10.1021/c160017a018.

51. Daylight Chemical Information Systems, Inc. Fingerprints - screening and similarity. Daylight Theory Manual, ver. 4.9, 2011. https://daylight.com/dayhtml/doc/theory/theory.smarts.html (accessed Jan 22, 2022).

52. Maccs structural keys; Accelrys, Inc.: San Diego, CA, 2011.

53. Tanimoto, T. T. An Elementary Mathematical Theory of Classification and Prediction; International Business Machines Corporation, 1958.

54. Sheridan, R. P.; Kearsley, S. K. Why do we need so many chemical similarity search methods? Drug Discovery Today 2002, 7 (17), 903–911, 10.1016/
S1359-6446(02)02411-X.

55. Chen, H.; Kogej, T.; Engkvist, O. Cheminformatics in drug discovery, an industrial perspective. Mol. Inf. 2018, 37 (9-10), e1800041, 10.1002/minf.201800041.

56. Henderson, L. J. Concerning the relationship between the strength of acids and their capacity to preserve neutrality. Am. J. Physiol. Legacy Content 1908, 21 (2), 173–179,
10.1152/ajplegacy.1908.21.2.173.

57. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau,
D.; Brucher, M.; Perrot, M.; Duchesnay, E. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.

58. Subramanian, G.; Ramsundar, B.; Pande, V.; Denny, R. A. Computational modeling of β-secretase 1 (bace-1) inhibitors using ligand based approaches. J. Chem. Inf. Model.
2016, 56 (10), 1936–1949, 10.1021/acs.jcim.6b00290.

59. Venugopal, C.; Demos, C. M.; Jagannatha Rao, K. S.; Pappolla, M. A.; Sambamurti, K. Beta-secretase: structure, function, and evolution. CNS Neurol. Disord. Drug Target.
2008, 7 (3), 278–294, 10.2174/187152708784936626.

60. Breiman, L. Random forests. Mach. Learn. 2001, 45 (1), 5–32, 10.1023/A:1010933404324.

61. Pearson, K. LIII. On lines and planes of closest fit to systems of points in space. Lond. Edinb. Dublin Philos. Mag. J. Sci. 1901, 2 (11), 559–572,
10.1080/14786440109462720.

62. Van der Maaten, L.; Hinton, G. Visualizing data using t-sne. J. Mach. Learn. Res. 2008, 9, 2579–2605.

63. Rogers, D.; Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 2010, 50 (5), 742–754, 10.1021/ci100050t.

64. Ward, J. H., Jr. Hierarchical grouping to optimize an objective function. J. Am. Stat. Assoc. 1963, 58 (301), 236–244, 10.1080/01621459.1963.10500845.

65. Liu, F. T.; Ting, K. M.; Zhou, Z.-H. Isolation forest. In 2008 Eighth IEEE International Conference on Data Mining; IEEE, 2008; pp. 413–422.

66. Jurs, P. C.; Kowalski, B. R.; Isenhour, T. L.; Reilley, C. N. Computerized learning machines applied to chemical problems. molecular structure parameters from low resolution
mass spectrometry. Anal. Chem. 1970, 42 (12), 1387–1394, 10.1021/ac60294a015.

67. Artrith, N.; Butler, K. T.; Coudert, F.-X.; Han, S.; Isayev, O.; Jain, A.; Walsh, A. Best practices in machine learning for chemistry. Nat. Chem. 2021, 13 (6), 505–508,
10.1038/s41557-021-00716-z.

68. Dral, P. O. Quantum chemistry in the age of machine learning. J. Phys. Chem. Lett. 2020, 11 (6), 2336–2347, 10.1021/acs.jpclett.9b03664.

69. Panteleev, J.; Gao, H.; Jia, L. Recent applications of machine learning in medicinal chemistry. Bioorg. Med. Chem. Lett. 2018, 28 (17), 2807–2815, 10.1016/
j.bmcl.2018.06.046.

70. Strieth-Kalthoff, F.; Sandfort, F.; Segler, M. H. S.; Glorius, F. Machine learning the ropes: principles, applications and directions in synthetic chemistry. Chem. Soc. Rev. 2020,
49 (17), 6154–6168, 10.1039/C9CS00786E.

71. Jumper, J.; Evans, R.; Pritzel, A.; Green, T.; Figurnov, M.; Ronneberger, O.; Tunyasuvunakool, K.; Bates, R.; Žídek, A.; Potapenko, A.; Bridgland, A.; Meyer, C.; Kohl, S. A.
A.; Ballard, A. J.; Cowie, A.; Romera-Paredes, B.; Nikolov, S.; Jain, R.; Adler, J.; Back, T.; Petersen, S.; Reiman, D.; Clancy, E.; Zielinski, M.; Steinegger, M.; Pacholska, M.;
Berghammer, T.; Bodenstein, S.; Silver, D.; Vinyals, O.; Senior, A. W.; Kavukcuoglu, K.; Kohli, P.; Hassabis, D. Highly accurate protein structure prediction with AlphaFold.
Nature 2021, 596 (7873), 583–589, 10.1038/s41586-021-03819-2.

72. Bernstein, F. C.; Koetzle, T. F.; Williams, G. J. B.; Meyer, E. F., Jr.; Brice, M. D.; Rodgers, J. R.; Kennard, O.; Shimanouchi, T.; Tasumi, M. The Protein Data Bank: a
computer-based archival file for macromolecular structures. J. Mol. Biol. 1977, 112 (3), 535–542, 10.1016/S0022-2836(77)80200-3.

73. Xiao, T.; Lu, J.; Zhang, J.; Johnson, R. I.; McKay, L. G. A.; Storm, N.; Lavine, C. L.; Peng, H.; Cai, Y.; Rits-Volloch, S.; Lu, S.; Quinlan, B. D.; Farzan, M.; Seaman,
M. S.; Griffiths, A.; Chen, B. A trimeric human angiotensin-converting enzyme 2 as an anti-SARS-CoV-2 agent. Nat. Struct. Mol. Biol. 2021, 28 (2), 202–209,
10.1038/s41594-020-00549-3.

74. Barca, G. M. J.; Bertoni, C.; Carrington, L.; Datta, D.; de Silva, N.; Deustua, J. E.; Fedorov, D. G.; Gour, J. R.; Gunina, A. O.; Guidez, E.; Harville, T.; Irle, S.; Ivanic,
J.; Kowalski, K.; Leang, S. S.; Li, H.; Li, W.; Lutz, J. J.; Magoulas, I.; Mato, J.; Mironov, V.; Nakata, H.; Pham, B. Q.; Piecuch, P.; Poole, D.; Pruitt, S. R.; Rendell, A. P.;
Roskop, L. B.; Ruedenberg, K.; Sattasathuchana, T.; Schmidt, M. W.; Shen, J.; Slipchenko, L.; Sosonkina, M.; Sundriyal, V.; Tiwari, A.; Galvez Vallejo, J. L.; Westheimer, B.;
Włoch, M.; Xu, P.; Zahariev, F.; Gordon, M. S. Recent developments in the general atomic and molecular electronic structure system. J. Chem. Phys. 2020, 152 (15), 154102,
10.1063/5.0005188.

75. Virtanen, P.; Gommers, R.; Oliphant, T. E.; Haberland, M.; Reddy, T.; Cournapeau, D.; Burovski, E.; Peterson, P.; Weckesser, W.; Bright, J.; van der Walt, S. J.; Brett,
M.; Wilson, J.; Millman, K. J.; Mayorov, N.; Nelson, A. R. J.; Jones, E.; Kern, R.; Larson, E.; Carey, C. J.; Polat, I.; Feng, Y.; Moore, E. W.; VanderPlas, J.; Laxalde, D.;
Perktold, J.; Cimrman, R.; Henriksen, I.; Quintero, E. A.; Harris, C. R.; Archibald, A. M.; Ribeiro, A. H.; Pedregosa, F.; van Mulbregt, P.; SciPy 1.0 Contributors. SciPy 1.0:
Fundamental Algorithms for Scientific Computing in Python. Nat. Methods 2020, 17, 261–272, 10.1038/s41592-019-0686-2.

76. Briggs, T. S.; Rauscher, W. C. An oscillating iodine clock. J. Chem. Educ. 1973, 50 (7), 496, 10.1021/ed050p496.

77. Kim, K.-R.; Lee, D. J.; Shin, K. J. A simplified model for the Briggs–Rauscher reaction mechanism. J. Chem. Phys. 2002, 117 (6), 2710–2717, 10.1063/1.1491243.

78. Hjorth Larsen, A.; Jørgen Mortensen, J.; Blomqvist, J.; Castelli, I. E.; Christensen, R.; Dułak, M.; Friis, J.; Groves, M. N.; Hammer, B.; Hargus, C.; Hermes, E. D.; Jennings,
P. C.; Bjerre Jensen, P.; Kermode, J.; Kitchin, J. R.; Leonhard Kolsbjerg, E.; Kubal, J.; Kaasbjerg, K.; Lysgaard, S.; Bergmann Maronsson, J.; Maxson, T.; Olsen, T.; Pastewka,
L.; Peterson, A.; Rostgaard, C.; Schiøtz, J.; Schütt, O.; Strange, M.; Thygesen, K. S.; Vegge, T.; Vilhelmsen, L.; Walter, M.; Zeng, Z.; Jacobsen, K. W. The atomic simulation
environment—a python library for working with atoms. J. Phys.: Condens. Matter 2017, 29 (27), 273002, 10.1088/1361-648X/aa680e.

79. Smith, D. G. A.; Burns, L. A.; Simmonett, A. C.; Parrish, R. M.; Schieber, M. C.; Galvelis, R.; Kraus, P.; Kruse, H.; di Remigio, R.; Alenaizan, A.; James, A. M.; Lehtola, S.;
Misiewicz, J. P.; Scheurer, M.; Shaw, R. A.; Schriber, J. B.; Xie, Y.; Glick, Z. L.; Sirianni, D. A.; O'Brien, J. S.; Waldrop, J. M.; Kumar, A.; Hohenstein, E. G.; Pritchard, B. P.;
Brooks, B. R.; Schaefer, H. F., III; Sokolov, A. Y.; Patkowski, K.; DePrince, A. E., III; Bozkaya, U.; King, R. A.; Evangelista, F. A.; Turney, J. M.; Crawford, T. D.; Sherrill, C.
D. Psi4 1.4: open-source software for high-throughput quantum chemistry. J. Chem. Phys. 2020, 152 (18), 184108, 10.1063/5.0006002.

80. Anthony, N. G.; Johnston, B. F.; Khalaf, A. I.; MacKay, S. P.; Parkinson, J. A.; Suckling, C. J.; Waigh, R. D. Short lexitropsin that recognizes the DNA minor groove at
5′-ACTAGT-3′: Understanding the role of isopropyl-thiazole. J. Am. Chem. Soc. 2004, 126 (36), 11338–11349, 10.1021/ja030658n.

81. Devereux, C.; Smith, J. S.; Huddleston, K. K.; Barros, K.; Zubatyuk, R.; Isayev, O.; Roitberg, A. E. Extending the applicability of the ANI deep learning molecular potential to
sulfur and halogens. J. Chem. Theory Comput. 2020, 16 (7), 4192–4202, 10.1021/acs.jctc.0c00121.

82. Cock, P. J. A.; Antao, T.; Chang, J. T.; Chapman, B. A.; Cox, C. J.; Dalke, A.; Friedberg, I.; Hamelryck, T.; Kauff, F.; Wilczynski, B.; de Hoon, M. J. Biopython: freely
available Python tools for computational molecular biology and bioinformatics. Bioinformatics 2009, 25 (11), 1422–1423, 10.1093/bioinformatics/btp163.

83. Hamelryck, T.; Manderick, B. PDB file parser and structure class implemented in Python. Bioinformatics 2003, 19 (17), 2308–2310, 10.1093/bioinformatics/btg299.

84. Wang, S.; Wacker, D.; Levit, A.; Che, T.; Betz, R. M.; McCorvy, J. D.; Venkatakrishnan, A. J.; Huang, X.-P.; Dror, R. O.; Shoichet, B. K.; Roth, B. L. D4 dopamine receptor
high-resolution structures enable the discovery of selective agonists. Science 2017, 358 (6361), 381–386, 10.1126/science.aan5468.

85. Kundrotas, P. J.; Anishchenko, I.; Dauzhenka, T.; Kotthoff, I.; Mnevets, D.; Copeland, M. M.; Vakser, I. A. Dockground: a comprehensive data resource for modeling of
protein complexes. Protein Sci. 2018, 27 (1), 172–181, 10.1002/pro.3295.

86. She, M.; Decker, C. J.; Svergun, D. I.; Round, A.; Chen, N.; Muhlrad, D.; Parker, R.; Song, H. Structural basis of Dcp2 recognition and activation by Dcp1. Mol. Cell 2008,
29 (3), 337–349, 10.1016/j.molcel.2008.01.002.

87. Houk, K. N.; Liu, F. Holy grails for computational organic chemistry and biochemistry. Acc. Chem. Res. 2017, 50 (3), 539–543, 10.1021/acs.accounts.6b00532.

88. Markowetz, F. All biology is computational biology. PLoS Biol. 2017, 15 (3), e2002050, 10.1371/journal.pbio.2002050.

Glossary
Algorithm: A finite sequence of instructions carried out to solve a problem.

Application Programming Interface (API): A standardized interface through which two or more pieces of software can interact.

Black box: Any system for which we know the input and output but not the inner workings by which the input is processed to produce the
output (e.g., programs that we know how to use but whose source code we have never seen).

Boolean: A data type with two possible values, True or False.

Bug: An error in the source code of a computer program that causes unexpected behavior.

Codon: A sequence of three nucleotides read during protein synthesis as a signal to add a particular amino acid or to start or stop peptide synthesis.

Compiled languages: Programming languages in which the whole source code is converted to machine code prior to execution.

Conditional statements: Statements that execute a block of code only if a Boolean expression evaluates to True.

Faceting: Visualizing various subsets of the data in a grid.

Feature: A measurable independent variable used to quantitatively predict a value.

Ion mobility spectrometry (IMS): An analytical technique by which gas phase ions are separated according to their mobility through an inert gas.

K-fold cross validation: Evaluation of a supervised learning model by partitioning the data set into k folds and using each fold in turn
as the validation set while the remaining folds are used for training.

Principal component (PC): An eigenvector of the covariance matrix of the data set; the corresponding eigenvalue gives the variance along that direction.

Principal component analysis (PCA): A linear dimensionality reduction technique which projects the data onto the principal components that retain the most variance.

Python class: A template for constructing objects in Python, which defines their attributes and methods.

Python object: An instance of a Python class.

Restriction site: A particular, short sequence of nucleotides that is recognized and cleaved by the corresponding restriction enzyme (e.g., the
EcoRI restriction site is 5′-GAATTC-3′, and it is cleaved by the restriction enzyme EcoRI).

SMILES: Simplified Molecular-Input Line-Entry System (SMILES) is a single-line notation that describes the structure of a chemical species
as a short text string.

SMARTS: SMILES Arbitrary Target Specification (SMARTS) is a language used to specify substructural patterns in molecules as a single string.

Syntax: The rules for combining symbols so that a line of code can be interpreted in a particular programming language.

Unsupervised learning: Machine learning algorithms which do not involve a target variable.
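
As an illustration of the k-fold cross validation entry above, the partitioning step might be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: samples are referred to by their integer index, no shuffling is applied, and the function name k_fold_indices is hypothetical rather than taken from the primer. In practice a library routine such as scikit-learn's KFold would normally be preferred.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, validation) index lists for k-fold cross validation.

    Hypothetical helper for illustration only; indices are not shuffled.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so that fold sizes
    # differ by at most one sample.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        validation = indices[start:start + size]
        # Every sample outside the current fold is used for training.
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


# Example: 10 samples split into 3 folds.
folds = list(k_fold_indices(10, 3))
for train, validation in folds:
    print("validation:", validation, "train:", train)
```

Each sample appears in exactly one validation fold, so a model trained and scored once per fold is evaluated on every sample exactly once.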
