Simulating Biomolecules with Python
Category: Science
Keywords: Data Visualization, Biology, Computational Chemistry
Title: Simulating Biomolecules with Python
Author: Konrad Hinsen
Date: 2005-04-20
Website: http://dirac.cnrs-orleans.fr/MMTK/
Summary: Python and C serve as the basis for a molecular modeling toolkit.
Background
The Molecular Modeling Toolkit (MMTK) is an open source Python library for molecular
modeling and simulation with a focus on biomolecular systems, written in a mixture of
Python and C. It provides standard techniques such as Molecular Dynamics or normal
mode calculations in a ready-to-use form, but also provides a basis of low-level
operations on top of which new techniques can easily be implemented.
I started developing MMTK in 1996. I had some experience with mainstream simulation
packages for biomolecules that were written in Fortran and had their origins in the 1970s.
Those packages were too cumbersome to use and in particular to modify and extend.
Since my research work is focused on the development of new simulation techniques,
modifiability was a particularly important criterion.
Dynamic deformation of the chaperone protein GroEL, obtained with the MMTK-based
interactive DomainFinder
Characteristic features of biomolecular simulations that had to be taken into account are
the long execution times of some simulation techniques (several weeks are not
uncommon) and the complexity of the data structures describing biomolecules.
Choice of languages
The choice of Python plus C was made after an evaluation of various languages. I was
rapidly convinced that only a mixture of a high-level interpreted language and a CPU-
efficient compiled language could meet my seemingly conflicting requirements of rapid
development and efficient execution.
For the high-level part, Tcl was ruled out because it could not handle the complex data
structures required by the project. Perl was ruled out because of its unpleasant syntax
(this was of course a subjective choice), and because of its badly integrated OO
mechanism. Python scored high in readability, OO support, library support, and
integration with compiled languages. Moreover, Numerical Python had just been released
and was an important building block for my developments.
For the low-level part, Fortran 77 was eliminated because of its archaic character, lack of
memory management, and portability issues in C-Fortran interfacing. C++ was a
candidate, but ultimately not chosen because portability between compilers was still an
issue in 1996, and because I considered the benefits of C++ for the small amount of
compiled code in the project insufficient to compensate for the complexity of the
language.
Library architecture
The architecture of MMTK is clearly Python-driven. To the user, it presents itself as a
pure Python library. The C code in MMTK was written from scratch in the form of
Python extension modules that only handle the few time-critical aspects: evaluation of
interaction energies, and long-running iterative algorithms such as energy minimization
and Molecular Dynamics, which run without any Python-related overhead. Extensive use
is made of Numerical Python, LAPACK, and the netCDF library. MMTK provides multi-
threading support for shared memory parallel machines, and MPI-based parallelization
for distributed memory machines.
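To make concrete what a "long-running iterative algorithm" looks like, here is a minimal pure-Python sketch of a velocity Verlet Molecular Dynamics loop for a single particle in a one-dimensional harmonic potential. It is not MMTK code: the function names (`harmonic_force`, `total_energy`, `velocity_verlet`) and all parameter values are assumptions chosen for illustration. Every iteration of such a loop pays interpreter overhead when written in Python, which is the kind of cost MMTK avoids by running these loops in C extension modules.

```python
def harmonic_force(x, k=1.0):
    """Force of a one-dimensional harmonic spring, F = -k x."""
    return -k * x


def total_energy(x, v, mass=1.0, k=1.0):
    """Kinetic plus potential energy of the oscillator."""
    return 0.5 * mass * v * v + 0.5 * k * x * x


def velocity_verlet(x, v, steps, dt=0.01, mass=1.0, k=1.0):
    """Integrate the equation of motion with the velocity Verlet scheme."""
    f = harmonic_force(x, k)
    for _ in range(steps):
        v += 0.5 * dt * f / mass      # first half-step velocity update
        x += dt * v                   # full-step position update
        f = harmonic_force(x, k)      # force at the new position
        v += 0.5 * dt * f / mass      # second half-step velocity update
    return x, v


if __name__ == "__main__":
    x0, v0 = 1.0, 0.0
    x1, v1 = velocity_verlet(x0, v0, steps=10000)
    drift = abs(total_energy(x1, v1) - total_energy(x0, v0))
    print("energy drift after 10000 steps:", drift)
```

A real simulation repeats this update for many thousands of atoms over millions of steps, so the inner loop is a natural candidate for compiled code while setup and analysis stay in Python.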
The biggest part of MMTK is a set of classes that describe atoms and molecules and
manage a database of molecules and fragments. Biomolecules (proteins, DNA, and RNA)
are handled by subclasses of the generic Molecule class. Another important subset of
MMTK implements schemes for calculating interaction energies (called somewhat
incorrectly "force fields" in the simulation community). I/O-related code is the third pillar
of MMTK. It reads and writes a few popular file formats plus its own trajectory format
that is based on the netCDF format. Contrary to other trajectory file formats, MMTK's
netCDF files are both binary (and thus compact) and portable between platforms,
and moreover permit efficient access to nearly arbitrary subsets.
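The idea of a generic molecule class with biomolecule subclasses can be sketched as follows. This is a toy model, not MMTK's actual classes or API: the class names, attributes, and methods below (`Atom`, `Molecule`, `Protein`, `center_of_mass`, `sequence`) are assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Atom:
    """A point mass with an element symbol and a Cartesian position."""
    element: str
    mass: float        # atomic mass units
    position: tuple    # (x, y, z)


@dataclass
class Molecule:
    """Generic molecule: a named collection of atoms."""
    name: str
    atoms: list = field(default_factory=list)

    def mass(self):
        """Total mass of all atoms."""
        return sum(atom.mass for atom in self.atoms)

    def center_of_mass(self):
        """Mass-weighted average position as an (x, y, z) tuple."""
        total = self.mass()
        return tuple(
            sum(atom.mass * atom.position[i] for atom in self.atoms) / total
            for i in range(3)
        )


@dataclass
class Protein(Molecule):
    """Biomolecule specialization: adds residue-level bookkeeping."""
    residues: list = field(default_factory=list)  # residue names in chain order

    def sequence(self):
        """Residue names joined into a readable sequence string."""
        return "-".join(self.residues)


if __name__ == "__main__":
    peptide = Protein(
        "dipeptide",
        [Atom("C", 12.011, (0.0, 0.0, 0.0)), Atom("N", 14.007, (0.15, 0.0, 0.0))],
        ["ALA", "GLY"],
    )
    # Generic operations are inherited; only protein-specific ones are added.
    print(peptide.sequence(), peptide.mass(), peptide.center_of_mass())
```

Because `Protein` inherits from `Molecule`, any code written against the generic class also works on biomolecules without modification.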
Snapshot from a Molecular Dynamics simulation of lysozyme in water, run with MMTK.
Modularity and extensibility were important design criteria. Algorithms, energy terms,
and specializations of the data types can be added without having to modify the MMTK
code. The design of MMTK as a library, rather than a closed program, is essential for
many applications.
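One way to get this kind of extensibility is to treat the total energy as a sum of independent term objects that share a small interface. The sketch below illustrates that design under stated assumptions: it is not MMTK's force field API, and every name in it (`EnergyTerm`, `HarmonicBond`, `ForceField`, `PositionRestraint`) is hypothetical.

```python
import math


class EnergyTerm:
    """Interface: an energy term maps atom positions to an energy value."""

    def energy(self, positions):
        raise NotImplementedError


class HarmonicBond(EnergyTerm):
    """Harmonic bond between atoms i and j: E = 1/2 k (r - r0)^2."""

    def __init__(self, i, j, r0, k):
        self.i, self.j, self.r0, self.k = i, j, r0, k

    def energy(self, positions):
        r = math.dist(positions[self.i], positions[self.j])
        return 0.5 * self.k * (r - self.r0) ** 2


class ForceField:
    """Sum of independent energy terms; it knows nothing about their internals."""

    def __init__(self, terms):
        self.terms = list(terms)

    def energy(self, positions):
        return sum(term.energy(positions) for term in self.terms)


# A user-defined term, added without modifying any of the classes above.
class PositionRestraint(EnergyTerm):
    """Harmonic restraint tethering atom i to a fixed reference point."""

    def __init__(self, i, reference, k):
        self.i, self.reference, self.k = i, reference, k

    def energy(self, positions):
        d = math.dist(positions[self.i], self.reference)
        return 0.5 * self.k * d * d


if __name__ == "__main__":
    positions = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
    force_field = ForceField([
        HarmonicBond(0, 1, r0=1.0, k=100.0),
        PositionRestraint(0, (0.0, 0.0, 0.0), k=10.0),
    ])
    print("total energy:", force_field.energy(positions))
```

The `ForceField` class only relies on the `energy` method, so a script can define a new term and pass it in alongside the built-in ones.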
Most MMTK users access the library from simple Python scripts, but MMTK has also
been used as a basis for end-user programs with graphical user interfaces, such as
nMOLDYN and DomainFinder.
MMTK currently consists of about 18,000 lines of Python code, 12,000 lines of hand-
written C code, and some machine-generated C code. The majority of the code was
developed by one person over eight years as part of a research activity. Two modules,
some functions, and many ideas were contributed by the user community.
Practical experience
MMTK and other Python libraries have been the basis for all my research projects for ten
years. Many of these projects would not have been possible without the rapid prototyping
that is characteristic of Python. In methodological work, development and testing time is
essential: an idea that can be tried out in an afternoon will be tried out, whereas an idea
that requires a week of work for evaluation is often put aside.
As with all open source projects, the size of the MMTK user community can only be
estimated indirectly. The mailing list for MMTK users currently has 175 members, and
the scientific publication that describes MMTK to computational chemists has been cited
30 times.
About the author