CENG3300 Lecture 4

This document provides an introduction to cheminformatics and bioinformatics. It discusses common data types used in these fields like SMILES and InChI for representing chemical structures as well as molecular fingerprints for comparing molecular similarity. It also introduces key bioinformatics databases and topics studied in the field, which include analyzing genomic, proteomic, and other biological data as well as modeling biological pathways and structures.

Uploaded by

huichloemail
Data Science for Molecular Engineering
Lecture 5
Introduction to cheminformatics and bioinformatics
• Intended learning outcomes (ILOs)
• Know the common data types in cheminformatics and bioinformatics;
• Digitize molecular structure data;
• Calculate simple molecular features;
• Know the tools for processing biological data.
Cheminformatics
• Chemoinformatics, chemical informatics, chemioinformatics
• Computer and chemistry
• Scope and Applications
• Chemical Data Management
• Molecular Descriptors and Property Prediction
• Chemical Similarity and Clustering
• Quantitative Structure-Activity Relationship

• Widely applied in drug discovery, materials science, and chemical engineering


Chemical data

What is the important information about a molecule?


Molecules
Simplified Molecular Input Line Entry System (SMILES)
• Atoms
• Specified by their atomic symbols inside brackets: [Au], [Fe], [Zn], etc.
• No brackets needed for organic subset: B, C, N, O, P, S, F, Cl, Br, and I
• Aromatic atoms are lower case: c1ccccc1
• Bonds
• Single -
• Double =
• Triple #
• Aromatic :
• Single and aromatic bonds can be omitted.
Q: What is the SMILES for ethane? Water?
SMILES (cont.)
• Branches
• Parentheses denote branches and can be nested.
• Example: SC(N)CO

• Cycles
• Break a bond in the cycle and use a digit to label the break.
Q: What is the SMILES for Cyclohexane? Benzene? Toluene?
SMILES (cont.)
• Disconnections
• A period “.” separates nonbonded molecules.

• Isomeric SMILES
• Slashes ( / \ ) denote configuration around double bonds: F/C=C/F, F\C=C\F
• At ( @ ) denotes configuration around chiral centers.
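The building blocks listed so far (atoms, bonds, branches, ring-closure digits, the "." separator and the stereo slashes) can be pulled out of a SMILES string with one regular expression. The following is only a minimal sketch for illustration: the token names and the `tokenize_smiles` helper are hypothetical, it covers a simplified subset of the grammar, and it is not how RDKit or Open Babel parse SMILES.

```python
import re

# Token pattern for a simplified subset of SMILES (an assumption of this sketch).
SMILES_TOKEN = re.compile(
    r"(?P<bracket_atom>\[[^\]]+\])"   # anything in brackets: [Au], [Fe], [C@H]
    r"|(?P<atom>Cl|Br|[BCNOPSFI])"    # organic subset, two-letter symbols first
    r"|(?P<aromatic_atom>[bcnops])"   # aromatic atoms are lower case
    r"|(?P<bond>[-=#:/\\])"           # single, double, triple, aromatic, stereo slashes
    r"|(?P<branch>[()])"              # parentheses open and close branches
    r"|(?P<ring_closure>%\d{2}|\d)"   # digit labels for a broken ring bond
    r"|(?P<dot>\.)"                   # separates nonbonded molecules
)


def tokenize_smiles(smiles):
    """Split a SMILES string into (kind, text) tokens, left to right."""
    tokens = []
    pos = 0
    while pos < len(smiles):
        match = SMILES_TOKEN.match(smiles, pos)
        if match is None:
            raise ValueError(f"unexpected character {smiles[pos]!r} at position {pos}")
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


if __name__ == "__main__":
    for example in ["SC(N)CO", "c1ccccc1", "F/C=C/F", "[Au]"]:
        print(example, tokenize_smiles(example))
```

The tokenizer only checks that each character belongs to the grammar; it does not verify that parentheses balance or that every ring-closure digit appears twice.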
SMILES (cont.)
• SMILES is not natural for humans (readable, but not ideal)
• SMILES is friendly to computers (text string)

• Computer packages to work with SMILES
• RDKit
• Open Babel
• Pybel
SMILES (cont.)
• Look at these three SMILES. What do they represent?
CCO, OCC and C(O)C

• (Many) different SMILES can refer to the same molecule
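To see why CCO, OCC and C(O)C describe the same molecule, it helps to turn each string into a graph of atoms and bonds and compare the graphs instead of the text. The sketch below is a toy under loud assumptions: the helpers `parse_acyclic_smiles` and `canonical_form` are hypothetical, and they only handle organic-subset atoms, implicit single bonds and parentheses (no rings, charges, explicit bonds or stereo). Real toolkits such as RDKit solve the general problem by generating a canonical SMILES.

```python
ORGANIC_SUBSET = ("B", "C", "N", "O", "P", "S", "F", "I", "Cl", "Br")


def parse_acyclic_smiles(smiles):
    """Return (atoms, neighbours) for a ring-free SMILES with implicit single bonds."""
    atoms, neighbours = [], []
    stack = []        # atoms to return to when a branch closes
    previous = None   # index of the atom the next atom bonds to
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "(":
            stack.append(previous)
        elif ch == ")":
            previous = stack.pop()
        else:
            two = smiles[i:i + 2]
            symbol = two if two in ("Cl", "Br") else ch
            if symbol not in ORGANIC_SUBSET:
                raise ValueError(f"unsupported token {symbol!r}")
            atoms.append(symbol)
            neighbours.append([])
            current = len(atoms) - 1
            if previous is not None:
                neighbours[previous].append(current)
                neighbours[current].append(previous)
            previous = current
            i += len(symbol) - 1
        i += 1
    return atoms, neighbours


def canonical_form(smiles):
    """Order-independent text for the heavy-atom tree described by the SMILES."""
    atoms, neighbours = parse_acyclic_smiles(smiles)

    def encode(atom, parent):
        # Encode each subtree, then sort so the writing order no longer matters.
        children = sorted(encode(n, atom) for n in neighbours[atom] if n != parent)
        return atoms[atom] + "(" + "".join(children) + ")"

    # Try every atom as the root and keep the smallest encoding.
    return min(encode(root, None) for root in range(len(atoms)))


if __name__ == "__main__":
    for s in ["CCO", "OCC", "C(O)C", "COC"]:
        print(s, canonical_form(s))
```

The three spellings of ethanol give the same canonical text, while COC (dimethyl ether) gives a different one.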


InChI (International Chemical Identifier)
• Ethane
SMILES: CC
InChI: InChI=1S/C2H6/c1-2/h1-2H3

• Compared to SMILES:
• InChI is standardized and unique
• Lengthy representation, not human readable
• Limited stereochemistry
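An InChI string is a sequence of layers separated by "/": after the `InChI=` prefix comes the version, then the chemical formula, then layers introduced by a one-letter prefix (in the ethane example, `c` for connectivity and `h` for hydrogens). The sketch below splits a string on those separators. `split_inchi` is a hypothetical helper, not part of any InChI library, and it assumes a formula is present and that each layer prefix occurs only once, which does not hold for every InChI.

```python
def split_inchi(inchi):
    """Split an InChI string into version, formula and prefixed layers."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    version, formula, *layers = inchi[len(prefix):].split("/")
    return {
        "version": version,
        "formula": formula,
        # Key each remaining layer by its one-letter prefix (simplification).
        "layers": {layer[0]: layer[1:] for layer in layers},
    }


if __name__ == "__main__":
    print(split_inchi("InChI=1S/C2H6/c1-2/h1-2H3"))
```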
SMILES arbitrary target specification (SMARTS)
• Regular expressions for molecules.
• All SMILES are SMARTS (exact matches). Additionally, SMARTS supports:
• wild cards
• C~*~C any atom can be between two carbons using any (~) bond
• a1aaaaa1 any aromatic 6 atom ring
• property testing
• [R] atom in a ring
• [#6] atomic number is 6 (matches aromatic or aliphatic)
• [D3] atom with three explicit bonds (degree)
SMARTS (cont.)
• Additionally, SMARTS supports:
• logical operators (not: !, and: & or ;, or: ,)
• [!C&R] not aliphatic carbon and in ring
• [F,Cl,Br,I] one of the first four halogens
• matching an atomic environment ('recursive' SMARTS)
• [$(*O);$(*C)] this matches one atom that is bound to both C and O
Molecular fingerprint

Differentiate
Molecular fingerprint
• Naïve fingerprint

CHHCHH-----
Molecular fingerprint
• Implementation
Molecular fingerprint
• Extended connectivity molecular fingerprint
Molecular fingerprint (cont.)
• Similarity calculation

• Use as input for machine learning models
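The slides do not say which similarity measure is meant. One common choice for bit-vector fingerprints is the Tanimoto (Jaccard) coefficient, T(A, B) = |A ∩ B| / |A ∪ B|, where A and B are the sets of bits switched on in each fingerprint. The sketch below assumes a fingerprint is given as a collection of on-bit indices; the `tanimoto` helper and the example bit indices are hypothetical and do not come from real molecules.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints given as on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen in this sketch for two empty fingerprints
    return len(a & b) / len(union)


if __name__ == "__main__":
    fp_1 = {1, 4, 7, 9}   # hypothetical on-bits of molecule 1
    fp_2 = {1, 4, 8}      # hypothetical on-bits of molecule 2
    print(tanimoto(fp_1, fp_2))  # 2 shared bits / 5 distinct bits = 0.4
```

The value ranges from 0 (no shared bits) to 1 (identical bit sets), so it can be used directly to rank molecules by similarity to a query.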

Read more details at
https://www.rdkit.org/docs/GettingStartedInPython.html
Bioinformatics
• What is bioinformatics?
• Computer and biology
• “Bioinformatics is the application of computers to the collection, archiving, organization, and analysis of biological data.”
Common data types in bioinformatics
Key Online Bioinformatics Resources: NCBI & EBI
http://www.ncbi.nlm.nih.gov
https://www.ebi.ac.uk
Bioinformatics databases
AATDB, AceDb, ACUTS, ADB, AFDB, AGIS, AMSdb, ARR, AsDb, BBDB, BCGD, Beanref, BioImage, BioMagResBank, BIOMDB, BLOCKS, BovGBASE, BOVMAP, BSORF, BTKbase, CANSITE, CarbBank, CARBHYD, CATH, CAZY, CCDC, CD40Lbase, CGAP, ChickGBASE, Colibri, COPE, CottonDB, CSNDB, CUTG, CyanoBase, dbCFC, dbEST, dbSTS, DDBJ, DGP, DictyDb, Dicty_cDB, DIP, DOGS, DOMO, DPD, DPInteract, ECDC, ECGC, ECO2DBASE, EcoCyc, EcoGene, EMBL, EMD db, ENZYME, EPD, EpoDB, ESTHER, FlyBase, FlyView, GCRDB, GDB, GENATLAS, GenBank, GeneCards, Genlilesne, GenLink, GENOTK, GenProtEC, GIFTS, GPCRDB, GRAP, GRBase, gRNAsdb, GRR, GSDB, HAEMB, HAMSTERS, HEART-2DPAGE, HEXAdb, HGMD, HIDB, HIDC, HIVdb, HotMolecBase, HOVERGEN, HPDB, HSC-2DPAGE, ICN, ICTVDB, IL2RGbase, IMGT, Kabat, KDNA, KEGG, Klotho, LGIC, MAD, MaizeDb, MDB, Medline, Mendel, MEROPS, MGDB, MGI, MHCPEP, Micado, MitoDat, MITOMAP, MJDB, MmtDB, Mol-R-Us, MPDB, MRR, MutBase, MycDB, NDB, NRSub, O-GlycBase, OMIA, OMIM, OPD, ORDB, OWL, PAHdb, PatBase, PDB, PDD, Pfam, PhosphoBase, PigBASE, PIR, PKR, PMD, PPDB, PRESAGE, PRINTS,
GenBank
Bioinformatics topics
Include but are not limited to:
• Organization, classification, dissemination and analysis of biological and biomedical data (particularly '-omics' data).
• Biological sequence analysis and phylogenetics.
• Genome organization and evolution.
• Regulation of gene expression and epigenetics.
• Biological pathways and networks in healthy & disease states.
• Protein structure prediction from sequence.
• Modeling and prediction of the biophysical properties of biomolecules for binding prediction and drug design.
• Design of biomolecular structure and function.
References
• http://mscbio2025.csb.pitt.edu/notes/cheminformatics.slides.html#/
• https://bioboot.github.io/bggn213_f17/lectures/#1