
Biopython Tutorial and Cookbook

The document summarizes the structure and hierarchy of biomolecular structure data represented using the Structure, Model, Chain, Residue, Atom (SMCRA) data structure in Biopython. It describes how each entity in the hierarchy contains child entities and can be accessed. Key entities include Structure at the top containing Models, Models containing Chains, Chains containing Residues, and Residues containing Atoms. Disordered entities are also represented. Methods to extract, identify, and navigate between entities in the hierarchy are provided.
Copyright
© Attribution Non-Commercial (BY-NC)
http://biopython.org/DIST/docs/tutorial/Tutorial.html

Biopython also allows you to explore the extensive realm of macromolecular structure. Biopython comes with a PDBParser class that produces a Structure object. The Structure object can be used to access the atomic data in the file in a convenient manner.

A macromolecular structure is represented using a structure, model, chain, residue, atom (or SMCRA) hierarchy. The figure below shows a UML class diagram of the SMCRA data structure. Such a data structure is not necessarily best suited for the representation of the macromolecular content of a structure, but it is absolutely necessary for a good interpretation of the data present in a file that describes the structure (typically a PDB or mmCIF file). If this hierarchy cannot represent the contents of a structure file, it is fairly certain that the file contains an error or at least does not describe the structure unambiguously. Parsing a PDB file can thus be used to detect likely problems. We will give several examples of this in section 10.5.1.

Structure, Model, Chain and Residue are all subclasses of the Entity base class. The Atom class only (partly) implements the Entity interface (because an Atom does not have children).
. 1 8 08.03.2012 13:26


For each Entity subclass, you can extract a child by using a unique id for that child as a key (e.g. you can extract an Atom object from a Residue object by using an atom name string as a key, you can extract a Chain object from a Model object by using its chain identifier as a key). Disordered atoms and residues are represented by DisorderedAtom and DisorderedResidue classes, which are both subclasses of the DisorderedEntityWrapper base class. They hide the complexity associated with disorder and behave exactly as Atom and Residue objects. In general, a child Entity object (i.e. Atom, Residue, Chain, Model) can be extracted from its parent (i.e. Residue, Chain, Model, Structure, respectively) by using an id as a key.
child_entity=parent_entity[child_id]

You can also get a list of all child Entities of a parent Entity object. Note that this list is sorted in a specific way (e.g. according to chain identifier for Chain objects in a Model object).
child_list=parent_entity.get_list()

You can also get the parent from a child.


parent_entity=child_entity.get_parent()

At all levels of the SMCRA hierarchy, you can also extract a full id. The full id is a tuple containing all ids starting from the top object (Structure) down to the current object. A full id for a Residue object looks something like this:
full_id=residue.get_full_id()
print(full_id)
("1abc", 0, "A", (" ", 10, "A"))

This corresponds to:
- The Structure with id "1abc"
- The Model with id 0
- The Chain with id "A"
- The Residue with id (" ", 10, "A")

The Residue id indicates that the residue is not a hetero-residue (nor a water) because it has a blank hetero field, that its sequence identifier is 10 and that its insertion code is "A".

Some other useful methods:
# get the entity's id
entity.get_id()
# check if there is a child with a given id
entity.has_id(entity_id)
# get number of children
nr_children=len(entity)
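The access pattern described above (keyed child lookup, child lists, parent links and full ids) can be illustrated with a small stand-alone sketch. The `Entity` class below is a toy written for this illustration, not Biopython's own implementation; it keeps children in insertion order rather than in the sorted order the real classes use, and it performs no sanity checks when adding children.

```python
class Entity:
    """Toy stand-in for an SMCRA entity (hypothetical, not Biopython's class)."""

    def __init__(self, entity_id):
        self.id = entity_id
        self.parent = None
        self.child_list = []   # children in insertion order
        self.child_dict = {}   # id -> child, used for keyed lookup

    def add(self, child):
        # No integrity checking here: a duplicate id silently replaces
        # the dictionary entry, mirroring the "raw interface" caveat.
        child.parent = self
        self.child_list.append(child)
        self.child_dict[child.id] = child

    def __getitem__(self, child_id):
        return self.child_dict[child_id]

    def __len__(self):
        return len(self.child_list)

    def get_id(self):
        return self.id

    def has_id(self, child_id):
        return child_id in self.child_dict

    def get_list(self):
        return list(self.child_list)

    def get_parent(self):
        return self.parent

    def get_full_id(self):
        # Walk from this entity up to the root, then reverse the ids.
        ids = []
        entity = self
        while entity is not None:
            ids.append(entity.id)
            entity = entity.parent
        return tuple(reversed(ids))


# Build Structure -> Model -> Chain -> Residue by hand.
structure = Entity("1abc")
model = Entity(0)
chain = Entity("A")
residue = Entity((" ", 10, "A"))
structure.add(model)
model.add(chain)
chain.add(residue)

full_id = structure[0]["A"][(" ", 10, "A")].get_full_id()
print(full_id)
```

The same chained indexing, `structure[0]["A"][...]`, is what the real Structure, Model and Chain objects support.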

It is possible to delete, rename, add, etc. child entities from a parent entity, but this does not include any sanity checks (e.g. it is possible to add two residues with the same id to one chain). This really should be done via a nice Decorator class that includes integrity checking, but you can take a look at the code (Entity.py) if you want to use the raw interface.

10.1.1 Structure
The Structure object is at the top of the hierarchy. Its id is a user given string. The Structure contains a number of Model children. Most crystal structures (but not all) contain a single model, while NMR structures typically consist of several models. Disorder in crystal structures of large parts of molecules can also result in several models.

10.1.1.1 Constructing a Structure object
A Structure object is produced by a PDBParser object:

from Bio.PDB.PDBParser import PDBParser
p=PDBParser(PERMISSIVE=1)
structure_id="1fat"
filename="pdb1fat.ent"
s=p.get_structure(structure_id, filename)

The PERMISSIVE flag indicates that a number of common problems (see 10.5.1) associated with PDB files will be ignored (but note that some atoms and/or residues will be missing). If the flag is not present a PDBConstructionException will be generated during the parse operation.

10.1.1.2 Header and trailer
You can extract the header and trailer (simple lists of strings) of the PDB file from the PDBParser object with the get_header and get_trailer methods.

10.1.2 Model
The id of the Model object is an integer, which is derived from the position of the model in the parsed file (they are automatically numbered starting from 0). The Model object stores a list of Chain children.

10.1.2.1 Example
Get the first model from a Structure object.
first_model=structure[0]

10.1.3 Chain
The id of a Chain object is derived from the chain identifier in the structure file, and can be any string. Each Chain in a Model object has a unique id. The Chain object stores a list of Residue children.

10.1.3.1 Example
Get the Chain object with identifier "A" from a Model object.
chain_A=model["A"]

10.1.4 Residue
Unsurprisingly, a Residue object stores a set of Atom children. In addition, it also contains a string that specifies the residue name (e.g. "ASN") and the segment identifier of the residue (well known to X-PLOR users, but not used in the construction of the SMCRA data structure).

The id of a Residue object is composed of three parts: the hetero field (hetfield), the sequence identifier (resseq) and the insertion code (icode).

The hetero field is a string: it is "W" for waters, "H_" followed by the residue name (e.g. "H_FUC") for other hetero residues and blank for standard amino and nucleic acids. This scheme is adopted for reasons described in section 10.3.1.

The second field in the Residue id is the sequence identifier, an integer describing the position of the residue in the chain.

The third field is a string, consisting of the insertion code. The insertion code is sometimes used to preserve a certain desirable residue numbering scheme. A Ser 80 insertion mutant (inserted e.g. between a Thr 80 and an Asn 81 residue) could for example have sequence identifiers and insertion codes as follows: Thr 80 A, Ser 80 B, Asn 81. In this way the residue numbering scheme stays in tune with that of the wild type structure.

Let's give some examples. Asn 10 with a blank insertion code would have residue id (" ", 10, " "). Water 10 would have residue id ("W", 10, " "). A glucose molecule (a hetero residue with residue name GLC) with sequence identifier 10 would have residue id ("H_GLC", 10, " "). In this way, the three residues (with the same insertion code and sequence identifier) can be part of the same chain because their residue ids are distinct.

In most cases, the hetfield and insertion code fields will be blank, e.g. (" ", 10, " "). In these cases, the sequence identifier can be used as a shortcut for the full id:
# use full id
res10=chain[(" ", 10, " ")]
# use shortcut
res10=chain[10]
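One way to picture the shortcut is as a key translation step that runs before the lookup. The sketch below is a stand-alone illustration using a plain dictionary; the helper names `translate_residue_key` and `lookup` are made up for this example and are not part of Biopython's public interface.

```python
def translate_residue_key(key):
    """Expand a bare sequence number into a full (hetfield, resseq, icode) id."""
    if isinstance(key, int):
        # Blank hetfield and blank insertion code are assumed.
        return (" ", key, " ")
    return key


# Three residues sharing sequence identifier 10, kept apart by their hetfield.
residues = {
    (" ", 10, " "): "ASN",
    ("W", 10, " "): "HOH",
    ("H_GLC", 10, " "): "GLC",
}


def lookup(key):
    """Look up a residue name by full id or by integer shortcut."""
    return residues[translate_residue_key(key)]


print(lookup(10))
print(lookup(("W", 10, " ")))
```

Note that the integer shortcut can only ever reach the residue with a blank hetfield and a blank insertion code; waters and other hetero residues need the full tuple.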

Each Residue object in a Chain object should have a unique id. However, disordered residues are dealt with in a special way, as described in section 10.2.3.2. A Residue object has a number of additional methods:
r.get_resname() # return residue name, e.g. "ASN"
r.get_segid() # return the SEGID, e.g. "CHN1"

10.1.5 Atom
The Atom object stores the data associated with an atom, and has no children. The id of an atom is its atom name (e.g. "OG" for the side chain oxygen of a Ser residue). An Atom id needs to be unique in a Residue. Again, an exception is made for disordered atoms, as described in section 10.2.2.

In a PDB file, an atom name consists of 4 chars, typically with leading and trailing spaces. Often these spaces can be removed for ease of use (e.g. an amino acid Cα atom is labeled ".CA." in a PDB file, where the dots represent spaces). To generate an atom name (and thus an atom id) the spaces are removed, unless this would result in a name collision in a Residue (i.e. two Atom objects with the same atom name and id). In the latter case, the atom name including spaces is tried. This situation can happen, for example, when one residue contains atoms with names ".CA." and "CA..", although this is not very likely.

The atomic data stored includes the atom name, the atomic coordinates (including standard deviation if present), the B factor (including anisotropic B factors and standard deviation if present), the altloc specifier and the full atom name including spaces. Less used items like the atom element number or the atomic charge sometimes specified in a PDB file are not stored.

An Atom object has the following additional methods:
a.get_name() # atom name (spaces stripped, e.g. "CA")
a.get_id() # id (equals atom name)
a.get_coord() # atomic coordinates
a.get_bfactor() # B factor
a.get_occupancy() # occupancy
a.get_altloc() # alternative location specifier
a.get_sigatm() # std. dev. of atomic parameters
a.get_siguij() # std. dev. of anisotropic B factor
a.get_anisou() # anisotropic B factor
a.get_fullname() # atom name (with spaces, e.g. ".CA.")

To represent the atom coordinates, siguij, anisotropic B factor and sigatm, NumPy arrays are used.
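The rule for turning a 4-character atom name into an atom id (strip the spaces, fall back to the full name on a collision) can be sketched as follows. This is a minimal illustration of the rule as described in this section, not the parser's actual code; the function name `assign_atom_ids` and the choice to raise `ValueError` when even the full name collides are assumptions made for the example.

```python
def assign_atom_ids(full_names):
    """Map each 4-character atom name in one residue to its atom id.

    The stripped name is used unless it is already taken, in which
    case the full name (including spaces) is tried instead.
    """
    ids = {}
    used = set()
    for full_name in full_names:
        if full_name in ids:
            raise ValueError(f"atom name {full_name!r} occurs twice")
        atom_id = full_name.strip()
        if atom_id in used:
            atom_id = full_name  # collision: keep the spaces
        if atom_id in used:
            raise ValueError(f"cannot build a unique id for {full_name!r}")
        used.add(atom_id)
        ids[full_name] = atom_id
    return ids


# " CA " and "CA  " both strip to "CA", so the second keeps its spaces.
ids = assign_atom_ids([" N  ", " CA ", "CA  "])
print(ids)
```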

10.2.1 General approach


Disorder should be dealt with from two points of view: the atom and the residue points of view. In general, we have tried to encapsulate all the complexity that arises from disorder. If you just want to loop over all Cα atoms, you do not care that some residues have a disordered side chain. On the other hand it should also be possible to represent disorder completely in the data structure. Therefore, disordered atoms or residues are stored in special objects that behave as if there is no disorder. This is done by only representing a subset of the disordered atoms or residues. Which subset is picked (e.g. which of the two disordered OG side chain atom positions of a Ser residue is used) can be specified by the user.

10.2.2 Disordered atoms


Disordered atoms are represented by ordinary Atom objects, but all Atom objects that represent the same physical atom are stored in a DisorderedAtom object. Each Atom object in a DisorderedAtom object can be uniquely indexed using its altloc specifier. The DisorderedAtom object forwards all uncaught method calls to the selected Atom object, by default the one that represents the atom with the highest occupancy. The user can of course change the selected Atom object, making use of its altloc specifier. In this way atom disorder is represented correctly without much additional complexity. In other words, if you are not interested in atom disorder, you will not be bothered by it.

Each disordered atom has a characteristic altloc identifier. You can specify that a DisorderedAtom object should behave like the Atom object associated with a specific altloc identifier:
atom.disordered_select("A") # select altloc A atom
print(atom.get_altloc())
"A"
atom.disordered_select("B") # select altloc B atom
print(atom.get_altloc())
"B"
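The forwarding behaviour described above can be illustrated with a small wrapper that relies on Python's `__getattr__` hook. The classes `ToyAtom` and `ToyDisorderedAtom` below are hypothetical stand-ins written for this sketch; only the method names `disordered_select`, `disordered_has_id`, `get_altloc` and `get_occupancy` are taken from this tutorial, and `disordered_add` is an assumed name for the method that registers a child.

```python
class ToyAtom:
    """Minimal atom carrying only what the sketch needs."""

    def __init__(self, name, altloc, occupancy):
        self.name = name
        self.altloc = altloc
        self.occupancy = occupancy

    def get_altloc(self):
        return self.altloc

    def get_occupancy(self):
        return self.occupancy


class ToyDisorderedAtom:
    """Holds all alternative positions of one physical atom."""

    def __init__(self, name):
        self.name = name
        self.children = {}    # altloc -> ToyAtom
        self.selected = None  # the atom that method calls are forwarded to

    def disordered_add(self, atom):
        self.children[atom.get_altloc()] = atom
        # Default selection: the highest occupancy seen so far.
        if (self.selected is None
                or atom.get_occupancy() > self.selected.get_occupancy()):
            self.selected = atom

    def disordered_has_id(self, altloc):
        return altloc in self.children

    def disordered_select(self, altloc):
        self.selected = self.children[altloc]

    def __getattr__(self, attr):
        # Called only when normal lookup fails ("uncaught" attributes).
        if attr in ("children", "selected"):
            raise AttributeError(attr)  # avoid recursion before __init__ ran
        return getattr(self.selected, attr)


og = ToyDisorderedAtom("OG")
og.disordered_add(ToyAtom("OG", "A", 0.4))
og.disordered_add(ToyAtom("OG", "B", 0.6))

default_altloc = og.get_altloc()  # forwarded to the altloc B atom
og.disordered_select("A")
selected_altloc = og.get_altloc()  # now forwarded to the altloc A atom
print(default_altloc, selected_altloc)
```

Code that only calls ordinary atom methods on `og` never needs to know that more than one position exists.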

10.2.3 Disordered residues


10.2.3.1 Common case
The most common case is a residue that contains one or more disordered atoms. This is solved by using DisorderedAtom objects to represent the disordered atoms, and storing the DisorderedAtom object in a Residue object just like ordinary Atom objects. The DisorderedAtom will behave exactly like an ordinary atom (in fact the atom with the highest occupancy) by forwarding all uncaught method calls to one of the Atom objects (the selected Atom object) it contains.

10.2.3.2 Point mutations
A special case arises when disorder is due to a point mutation, i.e. when two or more point mutants of a polypeptide are present in the crystal. An example of this can be found in PDB structure 1EN2.

Since these residues belong to a different residue type (e.g. let's say Ser 60 and Cys 60) they should not be stored in a single Residue object as in the common case. In this case, each residue is represented by one Residue object, and both Residue objects are stored in a DisorderedResidue object.

The DisorderedResidue object forwards all uncaught methods to the selected Residue object (by default the last Residue object added), and thus behaves like an ordinary residue. Each Residue object in a DisorderedResidue object can be uniquely identified by its residue name. In the above example, residue Ser 60 would have id "SER" in the DisorderedResidue object, while residue Cys 60 would have id "CYS". The user can select the active Residue object in a DisorderedResidue object via this id.

10.3.1 Associated problems


A common problem with hetero residues is that several hetero and non-hetero residues present in the same chain share the same sequence identifier (and insertion code). Therefore, to generate a unique id for each hetero residue, waters and other hetero residues are treated in a different way.

Remember that Residue objects have the tuple (hetfield, resseq, icode) as id. The hetfield is blank (" ") for amino and nucleic acids, and a string for waters and other hetero residues. The content of the hetfield is explained below.

10.3.2 Water residues


The hetfield string of a water residue consists of the letter "W". So a typical residue id for a water is ("W", 1, " ").

10.3.3 Other hetero residues


The hetfield string for other hetero residues starts with "H_" followed by the residue name. A glucose molecule with residue name "GLC", for example, would have hetfield "H_GLC". Its residue id could be ("H_GLC", 1, " ").
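The three hetfield cases (blank, "W", and "H_" plus residue name) can be summarized in a short helper. This is a sketch under stated assumptions: the function name `make_residue_id` is made up for the illustration, the `hetero` flag stands for "this residue came from a hetero record", and recognizing waters by the residue names HOH and WAT is an assumption rather than something this section specifies.

```python
WATER_NAMES = ("HOH", "WAT")  # assumed water residue names


def make_residue_id(resname, resseq, icode=" ", hetero=False):
    """Build a (hetfield, resseq, icode) residue id."""
    if not hetero:
        hetfield = " "             # standard amino or nucleic acid
    elif resname in WATER_NAMES:
        hetfield = "W"             # water
    else:
        hetfield = "H_" + resname  # any other hetero residue
    return (hetfield, resseq, icode)


# Three residues with the same sequence identifier get distinct ids.
asn_id = make_residue_id("ASN", 10)
water_id = make_residue_id("HOH", 10, hetero=True)
glc_id = make_residue_id("GLC", 10, hetero=True)
print(asn_id, water_id, glc_id)
```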

Parse a PDB file, and extract some Model, Chain, Residue and Atom objects.
from Bio.PDB.PDBParser import PDBParser
parser=PDBParser()
structure=parser.get_structure("test", "1fat.pdb")
model=structure[0]
chain=model["A"]
residue=chain[1]
atom=residue["CA"]

Extract a hetero residue from a chain (e.g. a glucose (GLC) moiety with resseq 10).
residue_id=("H_GLC", 10, " ")
residue=chain[residue_id]

Print all hetero residues in chain.


for residue in chain.get_list():
    residue_id=residue.get_id()
    hetfield=residue_id[0]
    if hetfield[0]=="H":
        print(residue_id)

Print out the coordinates of all CA atoms in a structure with B factor greater than 50.
for model in structure.get_list():
    for chain in model.get_list():
        for residue in chain.get_list():
            if residue.has_id("CA"):
                ca=residue["CA"]
                if ca.get_bfactor()>50.0:
                    print(ca.get_coord())

Print out all the residues that contain disordered atoms.


for model in structure.get_list():
    for chain in model.get_list():
        for residue in chain.get_list():
            if residue.is_disordered():
                resseq=residue.get_id()[1]
                resname=residue.get_resname()
                model_id=model.get_id()
                chain_id=chain.get_id()
                print(model_id, chain_id, resname, resseq)

Loop over all disordered atoms, and select all atoms with altloc A (if present). This will make sure that the SMCRA data structure will behave as if only the atoms with altloc A are present.
for model in structure.get_list():
    for chain in model.get_list():
        for residue in chain.get_list():
            if residue.is_disordered():
                for atom in residue.get_list():
                    if atom.is_disordered():
                        if atom.disordered_has_id("A"):
                            atom.disordered_select("A")

Suppose that a chain has a point mutation at position 10, consisting of a Ser and a Cys residue. Make sure that residue 10 of this chain behaves as the Cys residue.
residue=chain[10]
residue.disordered_select("CYS")

10.5.1 Examples
The PDBParser/Structure class was tested on about 800 structures (each belonging to a unique SCOP superfamily). This takes about 20 minutes, or on average 1.5 seconds per structure. Parsing the structure of the large ribosomal subunit (1FFK), which contains about 64000 atoms, takes 10 seconds on a 1000 MHz PC.

Three exceptions were generated in cases where an unambiguous data structure could not be built. In all three cases, the likely cause is an error in the PDB file that should be corrected. Generating an exception in these cases is much better than running the chance of incorrectly describing the structure in a data structure.

10.5.1.1 Duplicate residues
One structure contains two amino acid residues in one chain with the same sequence identifier (resseq 3) and icode. Upon inspection it was found that this chain contains the residues Thr A3, ..., Gly A202, Leu A3, Glu A204. Clearly, Leu A3 should be Leu A203. A couple of similar situations exist for structure 1FFK (which e.g. contains Gly B64, Met B65, Glu B65, Thr B67, i.e. residue Glu B65 should be Glu B66).

10.5.1.2 Duplicate atoms
Structure 1EJG contains a Ser/Pro point mutation in chain A at position 22. In turn, Ser 22 contains some disordered atoms. As expected, all atoms belonging to Ser 22 have a non-blank altloc specifier (B or C). All atoms of Pro 22 have altloc A, except the N atom which has a blank altloc. This generates an exception, because all atoms belonging to two residues at a point mutation should have non-blank altloc. It turns out that this atom is probably shared by Ser and Pro 22, as Ser 22 misses the N atom. Again, this points to a problem in the file: the N atom should be present in both the Ser and the Pro residue, in both cases associated with a suitable altloc identifier.

10.5.2 Automatic correction


Some errors are quite common and can be easily corrected without much risk of making a wrong interpretation. These cases are listed below.

10.5.2.1 A blank altloc for a disordered atom
Normally each disordered atom should have a non-blank altloc identifier. However, there are many structures that do not follow this convention, and have a blank and a non-blank identifier for two disordered positions of the same atom. This is automatically interpreted in the right way.

10.5.2.2 Broken chains
Sometimes a structure contains a list of residues belonging to chain A, followed by residues belonging to chain B, and again followed by residues belonging to chain A, i.e. the chains are broken. This is correctly interpreted.
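The broken-chain case comes down to collecting residues by chain identifier instead of by their position in the file. The following stand-alone sketch shows that idea on plain tuples; the function name `group_by_chain` and the record layout are assumptions made for the illustration and say nothing about how the parser is actually written.

```python
def group_by_chain(records):
    """Collect (chain_id, resseq) records into one residue list per chain."""
    chains = {}
    for chain_id, resseq in records:
        # A chain that reappears later simply continues its existing list.
        chains.setdefault(chain_id, []).append(resseq)
    return chains


# Chain A is interrupted by chain B and then resumes.
records = [("A", 1), ("A", 2), ("B", 1), ("B", 2), ("A", 3)]
chains = group_by_chain(records)
print(chains)
```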

10.5.3 Fatal errors


Sometimes a PDB file cannot be unambiguously interpreted. Rather than guessing and risking a mistake, an exception is generated, and the user is expected to correct the PDB file. These cases are listed below.

10.5.3.1 Duplicate residues
All residues in a chain should have a unique id. This id is generated based on:
- The sequence identifier (resseq).
- The insertion code (icode).
- The hetfield string ("W" for waters and "H_" followed by the residue name for other hetero residues).
- The residue names of the residues in the case of point mutations (to store the Residue objects in a DisorderedResidue object).

If this does not lead to a unique id something is quite likely wrong, and an exception is generated.

10.5.3.2 Duplicate atoms
All atoms in a residue should have a unique id. This id is generated based on:
- The atom name (without spaces, or with spaces if a problem arises).
- The altloc specifier.

If this does not lead to a unique id something is quite likely wrong, and an exception is generated.
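The duplicate-residue check can be pictured as a uniqueness test performed while a chain is being filled. The sketch below is a simplified illustration: `DuplicateResidueError` and `build_chain` are hypothetical names, the point-mutation case (which would route same-id residues into a DisorderedResidue) is deliberately left out, and the input reuses the Thr 3 / Leu 3 numbering from section 10.5.1.1.

```python
class DuplicateResidueError(Exception):
    """Raised when two residues in one chain end up with the same id."""


def build_chain(records):
    """Store (resname, residue_id) records, refusing duplicate ids."""
    chain = {}
    for resname, residue_id in records:
        if residue_id in chain:
            raise DuplicateResidueError(
                f"{resname} {residue_id} clashes with {chain[residue_id]}"
            )
        chain[residue_id] = resname
    return chain


records = [
    ("THR", (" ", 3, " ")),
    ("GLY", (" ", 202, " ")),
    ("LEU", (" ", 3, " ")),   # should have been 203
    ("GLU", (" ", 204, " ")),
]

try:
    build_chain(records)
    error = None
except DuplicateResidueError as exc:
    error = str(exc)
print(error)
```

Raising at the first clash, rather than overwriting or renumbering, matches the policy stated above: the file is left for the user to correct.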

There are also some tools to analyze a crystal structure. Tools exist to superimpose two coordinate sets (SVDSuperimposer), to extract polypeptides from a structure (Polypeptide), to perform neighbor lookup (NeighborSearch) and to write out PDB files (PDBIO). The neighbor lookup is done using a KD tree module written in C++. It is very fast and also includes a fast method to find all point pairs within a certain distance of each other. A Polypeptide object is simply a UserList of Residue objects. You can construct a list of Polypeptide objects from a Structure object as follows:
model_nr=1
polypeptide_list=build_peptides(structure, model_nr)
for polypeptide in polypeptide_list:
    print(polypeptide)

The Polypeptide objects are always created from a single Model (in this case model 1).
