Bioinformatics
Biot2083
Yordanos Sewalem
Content
1 Introduction (Xiong, 2006)
  What is bioinformatics
  Historical highlight
  Significance of bioinformatics: overview
2 Database and databank search tools: overview of biological databases (Attwood, 2004)
  Classification of databases
  Sequence databases
  DNA databases: GenBank, EMBL, DDBJ
  Protein databases
3 Data search and mining tools (Mani and Vijaraj, 2003)
4 Nucleic acid sequence analysis (Rastogi et al., 2001)
  Gene searching and sequence retrieval
  Sequence alignment: BLAST, MSA (ClustalW/X), etc.
5 Restriction analysis (Baxevanis and Ouellette, 2001)
  Primer designing
6 Protein sequence and structure analysis (Mani and Vijaraj, 2003)
  Protein sequence retrieval
  Protein sequence alignment
7 Protein structure prediction (primary, secondary, and tertiary structure) (Baxevanis and Ouellette, 2001)
  Protein structure visualization with RasMol etc.
8 Homology modeling (Mani and Vijaraj, 2003)
9 Application of bioinformatics (Baxevanis and Ouellette, 2001)
  Genome analysis
  Transcriptome analysis: microarray
10 Protein analysis (Mani and Vijaraj, 2003)
11 Metabolomics (Baxevanis and Ouellette, 2001)
12 Introduction to system biology (Mani and Vijaraj, 2003)
Laboratory activity
  Web-based exercise: study of public biological databases (Databases)
  Web-based exercise: primary sequences (NCBI)
  Web-based exercise: dot matrix plot (NCBI and EBI)
  Web-based exercise: nucleotide translation (NCBI)
  Web-based searching for and identifying proteins (Web browsing)
  Software exercise: PSI-BLAST (PSI-BLAST)
  Alignment as a search tool (NCBI and EBI)
  Multiple sequence alignment (All webs)
  Gene finding (Any web, including ordinary web sites)
Introduction
• Computers and specialized software are an essential part of the biologist’s toolkit.
• Bioinformatics began about 70 years ago.
• It started with protein sequencing, not DNA sequencing.
• In the 1950s, Edman degradation was introduced as a protein sequencing method
o Sequencing starts from the N-terminus
o Large proteins are first cleaved into small pieces, which are then sequenced
o Assembling these fragment sequences was the task of the first bioinformatics software
o COMPROTEIN, by Margaret Dayhoff and Robert Ledley
Amino acids
• From three-letter to one-letter amino acid codes
o 1965: Dayhoff and Eck’s “Atlas of Protein Sequence and Structure”
o Dayhoff was called “the mother and father of bioinformatics” by an NCBI director
o Five volumes focusing on the structure
Any region of the DNA sequence can, in principle, code for six different amino acid
sequences, because any one of three different reading frames can be used to interpret each of
the two strands.
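The six possible readings can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard genetic code and an uppercase DNA alphabet of A, C, G, T; the function names are hypothetical and unknown codons are shown as `X`.

```python
# Standard genetic code, with codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; '*' marks a stop codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


def six_frame_translations(dna):
    """Three reading frames on each of the two strands."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frame_translations("ATGGTGCTGTCTCCTGCC").items():
        print(name, protein)
```

Only one of the six frames normally corresponds to the real protein; finding it is part of the gene-finding problem.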
A human hemoglobin cDNA sequence
• >gi|14456711|ref|NM_000558.3| Homo sapiens hemoglobin, alpha 1 (HBA1),
mRNA
• ACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACCATGGTGCTGTCTCCTGCCGACAAGACCAACGTCAA
GGCCGCCTGGGGTAAGGTCGGCGCGCACGCTGGCGAGTATGGTGCGGAGGCCCTGGAGAGGATGTTCCTGTC
CTTCCCCACCACCAAGACCTACTTCCCGCACTTCGACCTGAGCCACGGCTCTGCCCAGGTTAAGGGCCACGG
CAAGAAGGTGGCCGACGCGCTGACCAACGCCGTGGCGCACGTGGACGACATGCCCAACGCGCTGTCCGCCCT
GAGCGACCTGCACGCGCACAAGCTTCGGGTGGACCCGGTCAACTTCAAGCTCCTAAGCCACTGCCTGCTGGT
GACCCTGGCCGCCCACCTCCCCGCCGAGTTCACCCCTGCGGTGCACGCCTCCCTGGACAAGTTCCTGGCTTC
TGTGAGCACCGTGCTGACCTCCAAATACCGTTAAGCTGGAGCCTCGGTGGCCATGCTTCTTGCCCCTTGGGC
CTCCCCCCAGCCCCTCCTCCCCTTCCTGCACCCGTACCCCCGTGGTCTTTGAATAAAGTCTGAGTGGGCGGC
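The record above is in FASTA format: a header line starting with `>` followed by sequence lines. A minimal parsing sketch follows, assuming the older NCBI header layout `gi|<id>|ref|<accession>| description` seen here; the function names are hypothetical, and the example uses only the first two sequence lines of the record.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_ncbi_header(header):
    """Split a header of the form 'gi|<id>|ref|<accession>| description'."""
    parts = header.split("|")
    return {"gi": parts[1], "accession": parts[3], "description": parts[4].strip()}


# Truncated example: header plus the first two sequence lines only.
example = """>gi|14456711|ref|NM_000558.3| Homo sapiens hemoglobin, alpha 1 (HBA1), mRNA
ACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACCATGGTGCTGTCTCCTGCCGACAAGACCAACGTCAA
GGCCGCCTGGGGTAAGGTCGGCGCGCACGCTGGCGAGTATGGTGCGGAGGCCCTGGAGAGGATGTTCCTGTC
"""

header, sequence = parse_fasta(example)[0]
info = split_ncbi_header(header)
print(info["accession"], len(sequence))
```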
A cDNA sequence (reading frame)
A protein sequence
And, a whole genome…
ACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACCATGGTGCTGTCTCCTGCCGACAAGACCAACGTCAAGGCCGCCTGG
GGTAAGGTCGGCGCGCACGCTGGCGAGTATGGTGCGGAGGCCCTGGAGAGGATGTTCCTGTCCTTCCCCACCACCAAGACCT
ACTTCCCGCACTTCGACCTGAGCCACGGCTCTGCCCAGGTTAAGGGCCACGGCAAGAAGGTGGCCGACGCGCTGACCAACGC
CGTGGCGCACGTGGACGACATGCCCAACGCGCTGTCCGCCCTGAGCGACCTGCACGCGCACAAGCTTCGGGTGGACCCGGTC
AACTTCAAGCTCCTAAGCCACTGCCTGCTGGTGACCCTGGCCGCCCACCTCCCCGCCGAGTTCACCCCTGCGGTGCACGCCT
CCCTGGACAAGTTCCTGGCTTCTGTGAGCACCGTGCTGACCTCCAAATACCGTTAAGCTGGAGCCTCGGTGGCCATGCTTCT
TGCCCCTTGGGCCTCCCCCCAGCCCCTCCTCCCCTTCCTGCACCCGTACCCCCGTGGTCTTTGAATAAAGTCTGAGTGGGCG
GCACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACCATGGTGCTGTCTCCTGCCGACAAGACCAACGTCAAGGCCGCCT
GGGGTAAGGTCGGCGCGCACGCTGGCGAGTATGGTGCGGAGGCCCTGGAGAGGATGTTCCTGTCCTTCCCCACCACCAAGAC
CTACTTCCCGCACTTCGACCTGAGCCACGGCTCTGCCCAGGTTAAGGGCCACGGCAAGAAGGTGGCCGACGCGCTGACCAAC
GCCGTGGCGCACGTGGACGACATGCCCAACGCGCTGTCCGCCCTGAGCGACCTGCACGCGCACAAGCTTCGGGTGGACCCGG
TCAACTTCAAGCTCCTAAGCCACTGCCTGCTGGTGACCCTGGCCGCCCACCTCCCCGCCGAGTTCACCCCTGCGGTGCACGC
CTCCCTGGACAAGTTCCTGGCTTCTGTGAGCACCGTGCTGACCTCCAAATACCGTTAAGCTGGAGCCTCGGTGGCCATGCTT
CTTGCCCCTTGGGCCTCCCCCCAGCCCCTCCTCCCCTTCCTGCACCCGTACCCCCGTGGTCTTTGAATAAAGTCTGAGTGGG
CGGCACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACCATGGTGCTGTCTCCTGCCGACAAGACCAACGTCAAGGCCGC
CTGGGGTAAGGTCGGCGCGCACGCTGGCGAGTATGGTGCGGAGGCCCTGGAGAGGATGTTCCTGTCCTTCCCCACCACCAAG
ACCTACTTCCCGCACTTCGACCTGAGCCACGGCTCTGCCCAGGTTAAGGGCCACGGCAAGAAGGTGGCCGACGCGCTGACCA
ACGCCGTGGCGCACGTGGACGACATGCCCAACGCGCTGTCCGCCCTGAGCGACCTGCACGCGCACAAGCTTCGGGTGGACCC
GGTCAACTTCAAGCTCCTAAGCCACTGCCTGCTGGTGACCCTGGCCGCCCACCTCCCCGCCGAGTTCACCCCTGCGGTGCAC
GCCTCCCTGGACAAGTTCCTGGCTTCTGTGAGCACCGTGCTGACCTCCAAATACCGTTAAGCTGGAGCCTCGGTGGCCATGC
TTCTTGCCCCTTGGGCCTCCCCCCAGCCCCTCCTCCCCTTCCTGCACCCGTACCCCCGTGGTCTTTGAATAAAGTCTGAGTG
GGCGGCGCCGTGGCGCACGTGGACGACATGCCCAACGCGCTGTCCGCCCTGAGCGACCTGCACGCGCACAAGCTTCGGGTGG
ACCCGGTCAACTTCAAGCTCCTAAGCCACTGCCTGCTGGTGACCCTGGCCGCCCACCTCCCCGCCGAGTTCACCCCTGCGGT
GCACGCCTCCCTGGACAAGTTCCTGGCTTCTGTGAGCACCGTGCTGACCTCCAAATACCGTTAAGCTGGAGCCTCGGTGGCC
ATGCTTCTTGCCCCTTGGGCCTCCCCCCAGCCCCTCCTCCCCTTCCTGCACCCGTACCCCCGTGGTCTTTGAATAAAGTCTG
AGTGGGCGGCACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACCATGGTGCTGTCTCCTGCCGACAAGACCAACGTCAA
GGCCGCCTGGGGTAAGGTCGGCGCGCACGCTGGCGAGTATGGTGCGGAGGCCCTGGAGAGGATGTTCCTGTCCTTCCCCACC
ACCAAGACCTACTTCCCGCACTTCGACCTGAGCCACGGCTCTGCCCAGGTTAAGGGCCACGGCAAGAAGGTGGCCG...
• Using bioinformatics
o Sequence assembly
o Genome annotation
o Molecular evolution
o Analysis of gene expression
o Analysis of gene regulation
o Protein structure prediction and docking
History of bioinformatics
• 1953- Watson & Crick proposed the double helix model
• 1954- Perutz’s group developed the heavy atom method for improved
protein crystallography
• 1955- bovine insulin was sequenced (1st protein)
• 1969- the ARPANET is created at Stanford and UCLA
• 1970- Needleman-Wunsch algorithm published
• 1972- the first recombinant DNA is created by Paul Berg and his group
• Cohen, S., Chang, A., and Boyer, H. produced the first recombinant
DNA organism
• 1973- The Brookhaven Protein Data Bank is announced
• Joseph Sambrook et al. refined DNA electrophoresis using agarose gel
• Herbert Boyer and Stanley Cohen invented DNA cloning.
• 1974- Vint Cerf and Robert Kahn develop the concept of connecting
networks of computers into an “internet” and develop the Transmission
Control Protocol (TCP)
• 1975- Microsoft Corporation is founded by Bill Gates and Paul Allen;
• Two-dimensional electrophoresis by P. H. O’Farrell
• 1977- methods for sequencing DNA; the first genetic engineering
company, Genentech, was founded
• 1988- the National Center for Biotechnology Information (NCBI) is
established at the National Library of Medicine (NLM);
• the Human Genome Initiative is started;
• The FASTA algorithm for sequence comparison published by Pearson
and Lipman;
• 1989- the first complete genome map of Haemophilus influenzae was
published.
• 1990- the BLAST program is implemented (Altschul et al.)
• Look and SegMod software for molecular modeling and
protein design by Molecular Applications Group
• Software by InfoMax for sequence analysis, database and data
management, searching, publication graphics, clone
construction, mapping and primer design
• 1991- CERN in Geneva announces protocol for World Wide
Web
• ESTs (expressed sequence tags) were created and used
• 1994- The PRINTS database of protein motifs is published by
Attwood and Beck
• 1995- the Haemophilus influenzae genome (1.8 Mb) sequenced
• Mycoplasma genitalium genome sequenced
• 1996- final version of the human genetic map was published by
Genethon
• Saccharomyces cerevisiae (12.1 Mb) is sequenced
• Prosite database is reported by Bairoch et al
• Affymetrix produced the first commercial DNA chips
• 1997- E. coli genome (4.7 Mb) published
• 1998- genomes of Caenorhabditis elegans and baker’s yeast
published
• 2000- genome of Pseudomonas aeruginosa (6.3 Mb) was published;
A. thaliana genome (100 Mb) is sequenced; D. melanogaster genome
(180 Mb) sequenced
• 2001- the human genome (3,000 Mb) is published
Origin of bioinformatics and
biological databases:
The first protein sequence reported was that of
bovine insulin in 1956, consisting of 51
residues.
In 1965, Dayhoff gathered all the available
sequence data to create the first bioinformatic
database (Atlas of Protein Sequence and
Structure).
Eukaryotes 37
Prokaryotes 1708
Total 1745
Open reading frames
Functional sites
Annotation
Structure, function
CCTGACAAATTCGACGTGCGGCATTGCATGCAGACGTGCATG
CGTGCAAATAATCAATGTGGACTTTTCTGCGATTATGGAAGAA
CTTTGTTACGCGTTTTTGTCATGGCTTTGGTCCCGCTTTGTTC
AGAATGCTTTTAATAAGCGGGGTTACCGGTTTGGTTAGCGAGA
AGAGCCAGTAAAAGACGCAGTGACGGAGATGTCTGATG CAA
TAT GGA CAA TTG GTT TCT TCT CTG AAT ......
.............. TGAAAAACGTA
[Figure: the same sequence annotated with the promoter, TF binding site, and transcription start site]
Comparative genomics
• Identifying orthologs: inferences on structure and function
• Comparing functional sites: inferences on regulatory networks
Alignment preproinsulin
Xenopus MALWMQCLP-LVLVLLFSTPNTEALANQHL
Bos MALWTRLRPLLALLALWPPPPARAFVNQHL
**** : * *.*: *:..* :. *:****
Xenopus CGSHLVEALYLVCGDRGFFYYPKIKRDIEQ
Bos CGSHLVEALYLVCGERGFFYTPKARREVEG
***************:***** ** :*::*
Xenopus AQVNGPQDNELDG-MQFQPQEYQKMKRGIV
Bos PQVG---ALELAGGPGAGGLEGPPQKRGIV
.**. ** * * *****
Xenopus EQCCHSTCSLFQLENYCN
Bos EQCCASVCSLYQLENYCN
**** *.***:*******
Significance of
bioinformatics: overview
• Accelerate research in the area of biotechnology
o Automatic genome sequencing
o Gene identification
o Prediction of gene function
o Prediction of protein structure
o Phylogeny
o Drug designing and development
o Identification of organisms
o Vaccine designing
o Understanding the gene and genome complexity
o Understanding protein structure
o Functionality and folding
• Genomics
o Generates vast amounts of data; bioinformatics manages these data and generates information from them
• Proteomics
o Data from protein-protein interactions, protein profiles, protein activity patterns and organelle
compositions, image analysis of 2D gels, peptide mass fingerprinting and peptide
fragmentation fingerprinting.
• Transcriptomics
o Microarrays and RNA sequencing generate large amounts of data
• Cheminformatics
o Identify and structurally modify a natural product, design a compound with desired properties, and
assess its therapeutic effect theoretically.
• Drug discovery
o Used to predict, analyze and interpret clinical and preclinical data, particularly in
computer-aided drug design, and to provide drug-related databases and software
• Evolutionary study/phylogenetics
o Using sequence alignment and various algorithms
• Crop improvement
• Veterinary science
o Understanding livestock species and providing accurate predictions
• Forensic science
o Identification and relatedness of individuals
• Biodefense
o Restore biosecurity against biological threats or infectious diseases
• Waste cleanup
o Explore the microbial potential for biodegradation and improve that potential
• Bioenergy/biofuels
o Biofuel production pathways
Database and databank search tools: overview of biological databases
• A biological database is a large, organized body of
persistent data, usually associated with computerized
software designed to update, query, and retrieve
components of the data stored within the system.
• A simple database might be a single file containing many
records, each of which includes the same set of
information.
• For example, a record in a nucleotide sequence
database typically contains information such as a contact
name; the input sequence with a description of the type of
molecule; the scientific name of the source organism from
which it was isolated; and, often, literature citations
associated with the sequence.
• To benefit from the data stored in a database,
two requirements must be met:
o Easy access to the information and
o Method for extracting only that information needed to answer a
specific biological question.
• Databases allow knowledge discovery, which
refers to the identification of connections between
pieces of information that were not known when
the information was first entered.
• They allow the indexing of data.
• They help to remove redundancy of data.
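The ideas above (records with the same set of fields, indexing, removing redundancy, and extracting only what a question needs) can be sketched with plain Python dictionaries. This is a toy illustration only: the accession numbers and sequences are made up, and real biological databases use dedicated database software.

```python
# Toy records: every record carries the same set of fields.
# Accessions and sequences are invented for illustration.
records = [
    {"accession": "X0001", "organism": "Homo sapiens", "molecule": "mRNA", "sequence": "ATGGTG"},
    {"accession": "X0002", "organism": "Bos taurus", "molecule": "DNA", "sequence": "ATGGCC"},
    {"accession": "X0001", "organism": "Homo sapiens", "molecule": "mRNA", "sequence": "ATGGTG"},
]


def build_index(rows, key):
    """Index rows by one field; a repeated key keeps only the first row."""
    index = {}
    for row in rows:
        index.setdefault(row[key], row)
    return index


def query(rows, **criteria):
    """Return only the rows whose fields match every given criterion."""
    return [row for row in rows if all(row.get(k) == v for k, v in criteria.items())]


by_accession = build_index(records, "accession")   # redundancy removed
unique_rows = list(by_accession.values())
bovine = query(unique_rows, organism="Bos taurus")
print(len(unique_rows), bovine[0]["accession"])
```

Looking up `by_accession["X0002"]` is a direct dictionary access, which is the point of an index: the whole file does not need to be scanned for each question.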
Classification of databases
• Public repositories of gene data
o GenBank; DDBJ; EMBL
• Private databases
o Research groups databases
o Biotech companies databases
• Based on the information
o Primary
• Databases containing raw information (original form)
• Also called archival databases
• ENA, GenBank, EMBL, DDBJ (nucleotide sequence)
• SWISS-PROT, PIR, PDB (Protein Data Bank)
• ArrayExpress Archive and GEO (functional genomics data)
• They are populated with experimentally derived data such as nucleotide
sequence, protein sequence or macromolecular structure.
• Experimental results are submitted directly into the database by researchers,
and the data are essentially archival in nature.
• Once given a database accession number, the data in primary databases are
never changed
o Secondary
• Comprise data derived from analyzed/curated primary data.
• More relevant and useful information for specific requirements
• PROSITE, PRINTS, BLOCKS, Pfam, InterPro (protein families, motifs and
domains), UniProt (sequence and functional information on proteins), Ensembl
(variation, function, regulation and more layered onto whole genome
sequences)
Sequence databases
• RNA and DNA store the hereditary information
about an organism which can be analyzed with the
help of bioinformatics tools and databases
• The most popular databases are
o GenBank from NCBI
o SwissProt from the Swiss Institute of Bioinformatics and
o PIR from the Protein Information Resource
• The principal requirements on public data services
are:
o Data quality- the quality of the data is primarily the responsibility of the
submitter
o Supporting data- users may be interested in the primary experimental data,
either held in the database or reached by cross-references back to network-accessible
laboratory databases
o Deep annotation- deep, consistent annotation comprising supporting and
ancillary information should be attached
o Timeliness- basic data should be available on an internet-accessible server within
days or hours of publication or submission
o Integration- each data object in the database should be cross-referenced to
representations of the same or related biological entities in other databases.
DNA database
• The main function of DNA databases is to store and
compare DNA sequence and protein sequence data
o GenBank
o EMBL
o DDBJ
GenBank
• The Genetic Sequence Databank (GenBank) is one of the fastest
growing repositories of known genetic sequences.
• An NIH genetic sequence database, an annotated collection
of all publicly available DNA sequences
• It has a flat file structure (ASCII text file, readable by both
humans and computers)
• In addition to sequence data it contains accession
numbers and gene names, phylogenetic classification and
references to published literature
• 216 million sequences and 399 billion bases in traditional
GenBank
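Because a flat file is plain text, a record can be read with ordinary string handling. The sketch below assumes the general layout of a GenBank-style record (keyword lines starting in the first column, an `ORIGIN` block of numbered sequence lines, and `//` as the terminator). The record itself is a made-up toy, not a real entry, and the parser ignores continuation lines and feature tables.

```python
# Toy record for illustration only; not a real GenBank entry.
TOY_RECORD = """LOCUS       TOY0001                   24 bp    mRNA    linear
DEFINITION  Toy record for illustration only.
ACCESSION   TOY0001
ORIGIN
        1 atggtgctgt ctcctgccga caag
//
"""


def parse_flat_record(text):
    """Collect top-level keyword lines and the sequence of one record."""
    fields = {}
    sequence_parts = []
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):          # end of record
            break
        if in_origin:
            # Sequence lines: a position number, then blocks of bases.
            sequence_parts.extend(line.split()[1:])
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line and not line.startswith(" "):
            keyword, _, value = line.partition(" ")
            fields[keyword] = value.strip()
    fields["sequence"] = "".join(sequence_parts).upper()
    return fields


record = parse_flat_record(TOY_RECORD)
print(record["ACCESSION"], record["sequence"])
```

For real work an established parser (for example one of the Bio* toolkits) is a safer choice, since real records contain many more field types than this sketch handles.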
EMBL
• The EMBL Nucleotide Sequence Database is a comprehensive
database of DNA and RNA sequences collected from the scientific literature
and patent applications and submitted directly by
researchers and sequencing groups
• Data collection is done in collaboration with GenBank (USA)
and the DNA Data Bank of Japan (DDBJ)
• The current number of bases and sequences can be found…
• www.ebi.ac.uk
DDBJ
• www.ddbj.nig.ac.jp/services-e.html
• Collection of nucleotide sequence data. In addition, it has
protein sequences and protein structures.
• Maintenance and development are organized by the Center for
Information Biology and DNA Data Bank of Japan (CIB-DDBJ)
of the National Institute of Genetics
(http://www.cib.nig.ac.jp/)
(http://ww.nig.ac.jp/english/index.html).
Other databases
• Genethon Genome database (physical map; genetic
map; GENEXPRESS (cDNA))
• 21 Bdb: LBL’s Human chromosome 21 database
• MGD: the mouse genome database
• ACeDB: a Caenorhabditis elegans database
• MEDLINE is NLM’s premier bibliographic database covering
medical sciences; its citations are searchable using NLM’s
controlled vocabulary (MeSH, Medical Subject Headings)
Protein databases
• Protein sequence databases are classified as primary,
secondary and composite
• Primary databases contain protein sequences as ‘raw’ data
o PIR and SwissProt
• The number of protein sequences is increasing
• Used for
o Biological variation
o Evolutionary pattern
o Protein family characterization
• Protein databases
o SwissProt
o PIR
o PRF
o PDB
o Nrdb
• To overcome redundant sequences
o Non-redundant protein databases emerged
• KIND (Karolinska Institutet Nonredundant Database)
o Compiled from (GenPept, gpcu), (Swissprot and Swissnew), PIR and
TrEMBL
o Sequence IDs are retained in the priority order Swissprot, PIR, GenPept, TrEMBL
o ftp://ftp.mbb.ki.se/pub/KIND
PIR-PSD
• Protein Information Resource (PIR) produces and distributes
the PIR-International Protein Sequence Database (PSD)
• Most comprehensive and expertly annotated protein sequence
database
• On-line access, off-line sequence identification service
SwissProt
• It provides a high level of integration with other databases and
has a low level of redundancy (fewer identical sequences)
PROSITE
• PROSITE is a dictionary of sites and
patterns in proteins
• Prepared by Amos Bairoch
EC-ENZYME
• ‘ENZYME’ data bank contains information about enzymes
with EC number such as
o EC number, recommended name, alternative names, catalytic activity, cofactors, pointers
to the SwissProt entries that correspond to the enzyme, pointers to diseases associated with
deficiency of the enzyme
PDB
• X-ray crystallography Protein Data Bank (PDB)
• www.rcsb.org
Other databases
• GDB, the human Genome Data Base: storage and
dissemination of data about genes and other DNA markers,
map location, genetic disease and locus information, and
bibliographic information.
• OMIM – the Mendelian Inheritance in Man data bank (MIM)
• PIR-PSD-
• Molecular Modeling Database (MMDB)
o http://www.ncbi.nlm.nih.gov/Structure/index.shtml
Data search and mining tools
• The knowledge discovery from databases process is defined in five
major steps: data preparation, data preprocessing, DM,
evaluation and interpretation, and implementation.
o “the nontrivial process of identifying valid, novel, potentially useful and ultimately
understandable patterns in data” – Fayyad et al., 1996
o Data preparation: determining the data to be used for analysis, which will then be
collected (files, databases, data warehouses or data marts);
• If the data set is large, take a representative sample
• Selecting suitable data from the available data (data farming)
• Standard representation (table format, space delimited, comma delimited…)
o Data preprocessing: to increase data quality (incompleteness, inconsistency,
transformation…)
• Data cleaning- filling missing values (e.g. using the most probable data value, the attribute
mean, a global mean etc.), smoothing out noise (e.g. binning, clustering, regression
etc.), handling outliers (e.g. robust regression), detecting and removing redundant
data (e.g. correlation analysis).
• Data transformation- data are transformed into appropriate forms. ANNs and
various clustering methods perform better if the data are scaled to a specified range
(normalized: min-max, z-score and decimal scaling); other transformations are smoothing, aggregation and
generalization
• Data reduction- the size of the data is reduced. Generalization and aggregation are
forms of reduction. Dimension reduction/feature selection eliminates unnecessary
attributes using best-first search, beam search etc.; regression techniques; ANOVA;
DT induction, correlation analysis, ANNs, SVMs, rule induction by rough set theory,
partial least squares, subjective evaluation, genetic algorithms, PCA.
o Discretization:
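A few of the preprocessing steps named above can be sketched with the Python standard library. This is a minimal illustration under simple assumptions: a single numeric attribute, missing values marked as `None`, and equal-width binning as one possible discretization method. The function names are hypothetical.

```python
from statistics import mean, pstdev


def fill_missing(values):
    """Data cleaning: replace missing values (None) with the attribute mean."""
    attribute_mean = mean(v for v in values if v is not None)
    return [attribute_mean if v is None else v for v in values]


def min_max(values, low=0.0, high=1.0):
    """Data transformation: rescale values linearly into [low, high]."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        return [low for _ in values]
    return [low + (v - vmin) * (high - low) / (vmax - vmin) for v in values]


def z_score(values):
    """Data transformation: centre on the mean, scale by the standard deviation."""
    m, s = mean(values), pstdev(values)
    if s == 0:
        return [0.0 for _ in values]
    return [(v - m) / s for v in values]


def equal_width_bins(values, n_bins):
    """Discretization: map each value to one of n_bins equal-width intervals."""
    vmin, vmax = min(values), max(values)
    width = (vmax - vmin) / n_bins
    if width == 0:
        return [0 for _ in values]
    return [min(int((v - vmin) / width), n_bins - 1) for v in values]


if __name__ == "__main__":
    raw = [2.0, None, 6.0, 4.0]
    cleaned = fill_missing(raw)
    print(cleaned, min_max(cleaned), equal_width_bins(cleaned, 2))
```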
• DM (data mining) is further divided into descriptive and
predictive mining categories, each with its own functions and
methods.
o Historically, DM has evolved from various disciplines, such as databases (relational
databases, data warehousing, on-line analytical processing, OLAP), information
retrieval (IR) (similarity measures, clustering), statistics (Bayes theorem, regression,
maximum likelihood estimation, resampling) and artificial intelligence (AI) (ANNs,
machine learning, genetic algorithms, decision trees)
o The functions are used to specify the types of patterns to be mined, including
description, clustering, association, classification, prediction, trend analysis,
similarity analysis and pattern detection.
o DM can also be described by the kinds of knowledge mined (DM functionalities), such as
association, classification, clustering and others.
• Descriptive data mining
o This includes summarization, clustering, association rule generation and sequence discovery.
o Summarization is presentation of general properties of the data set studied. Use statistical
methods.
o Clustering
o The association function tries to identify groups of items that occur together.
• Predictive Modeling
o Classification is a DM function accomplished in two steps: constructing a model that
describes the important classes in a given labeled data set by using methods like DTs, ANNs,… and
then categorizing a new data set (testing sample), whose classes are not known, according to
the model built.
• S-based algorithms for modeling
• Decision tree-based (DT-based) algorithms (ID3, C4.5, CHAID, CART, ID5R, SLIQ,
SPRINT)
• ANN-based systems use optimization methods: Levenberg-Marquardt, quasi-Newton
and conjugate-gradient are local optimization methods, while simulated annealing and GA are global
ones. The perceptron is the simplest form of an ANN and is used for classification into two
groups. The multi-layer perceptron (MLP) combines perceptrons into a network structure.
• Learning algorithms such as back propagation (BP) use the gradient descent (GD)
optimization technique.
• Radial basis function (RBF); competitive ANN (CompetNN); SOM; learning
vector quantization (LVQ); ARTMAP; Fuzzy ARTMAP; probabilistic NN
(PNN); Bayesian NN (BNN)
• A disadvantage of ANNs compared with DTs is that they give no if-then type of rules; the
rectangular basis function network (RecBFN) addresses this
o Other classification systems: k-nearest-neighbors (KNN), GA, rough set theory
(RST), fuzzy set theory (FST), SVM, entropy network (EN) and association rules-
based algorithms, PRISM, attribute decomposition approach (ADA), modified
breadth-first search of an interest graph (MIG), genetic programming (GP) and
breadth-oblivious-wrapper (BOW) hill-climbing search procedure.
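The two-step idea of classification (learn from labeled data, then assign classes to new samples) can be sketched with k-nearest-neighbors, one of the methods listed above. KNN is chosen here only because it is short to write; its "model" is simply the stored labeled data. The feature vectors and labels below are made up for illustration.

```python
import math
from collections import Counter


def knn_classify(training, sample, k=3):
    """Assign the majority label among the k training points nearest to sample.

    training: list of (feature_vector, label) pairs, i.e. the labeled data set.
    """
    nearest = sorted(training, key=lambda item: math.dist(item[0], sample))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Step 1: a labeled data set (invented two-feature measurements).
training = [
    ((1.0, 1.0), "A"), ((1.2, 0.9), "A"), ((0.8, 1.1), "A"),
    ((5.0, 5.0), "B"), ((5.1, 4.8), "B"), ((4.9, 5.2), "B"),
]

# Step 2: categorize new samples whose classes are not known.
print(knn_classify(training, (1.1, 1.0)))
print(knn_classify(training, (5.0, 4.9)))
```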
NW http://www.bioinf.org.uk/software/nw/
Stretcher https://galaxy.pasteur.fr/?form=stretcher
SABERTOOTH http://www.fkp.tu-darmstadt.de/sabertooth_source
Gene searching and
sequence retrieval
• Sequence retrieval system (SRS)
• BioPerl, BioPython, BioJava, BioRuby, BioSQL
Sequence alignment: BLAST, MSA (ClustalW/X), etc.
• BLAST (Basic Local Alignment Search Tool)
o Homology and similarity tool
o Available as a web service and as standalone programs for Windows and other platforms
o For protein or DNA
o QBLAST for user-friendly retrieval of results
• Different BLAST tools
o blastp compares an amino acid query sequence against a protein sequence database
o blastn compares a nucleotide query sequence against a nucleotide sequence database
o blastx compares a nucleotide query sequence translated in all reading frames against a
protein sequence database
o tblastn compares a protein query sequence against a nucleotide sequence database dynamically
translated in all reading frames
o tblastx compares the six-frame translations of a nucleotide query sequence against the six-frame
translations of a nucleotide sequence database.
• ClustalW is a fully automated sequence alignment tool for DNA
and protein sequences
• It returns the best match over the total length of the input sequences
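The choice among the five BLAST programs described above depends only on the type of the query, the type of the database, and whether both are translated. The lookup below is a small sketch of that decision; the function name and argument names are hypothetical and are not part of any BLAST interface.

```python
# (query type, database type) -> program, following the descriptions above.
BLAST_PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("nucleotide", "nucleotide"): "blastn",
    ("nucleotide", "protein"): "blastx",    # query translated in all frames
    ("protein", "nucleotide"): "tblastn",   # database translated in all frames
}


def choose_blast_program(query_type, database_type, translate_both=False):
    """Pick the BLAST program name for a query/database combination."""
    if translate_both:
        # tblastx compares six-frame translations of both sides.
        if (query_type, database_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and a nucleotide database")
        return "tblastx"
    return BLAST_PROGRAMS[(query_type, database_type)]


print(choose_blast_program("nucleotide", "protein"))
```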
EMBOSS, Staden package
• EMBOSS (European Molecular Biology Open Software Suite)
o It is open source software with a range of libraries that give users flexibility to extend it
THREADER, PHD
• GenomeThreader from genomethreader.org
• THREADER3: a threading algorithm and tool for alignment and fold
recognition.
o http://bioinf.cs.ucl.ac.uk/psipred?genthreader=1
RasMol, WHATIF
• RasMol- to display the structure of DNA, proteins and smaller
molecules
o Derivatives such as Protein Explorer are easy to use
Restriction analysis
• Restriction analysis identifies restriction sites in DNA sequences
(restriction mapping) using appropriate enzyme sets and enzyme
filtering criteria as per specific experimental requirements.
• Restriction Analyzer
o It has options such as enzymes, criteria-based selection or user-provided lists.
o Type of DNA: linear/circular
o http://molbiotools.com/restrictionanalyzer.html
• GenScript
o Online tool
o Restriction Enzyme Map Analysis tools
• Molecular Biology Tools
• Peptide Tools
• Protein Tools
• NEBcutter
o http://tools.neb.com/NEBcutter/index.php3
o to find restriction digestion map of a DNA sequence. Find the large, non-overlapping
open reading frames.
o GenBank number or plain/FASTA format DNA sequence can be submitted.
o NEB enzymes/ All commercially available enzymes
o Type of sequence linear or Circular
o Minimum ORF length to display xxxx amino acids.
o It is from New England Biolabs Inc.
• Restriction endonuclease digestion
o Webcutter 2.0 (U.S.A)
o WatCut (Michael Palmer, University of Waterloo, Canada) – provides restriction analysis
coupled with where the sites are located within genes.
o Restriction Site Analysis – (University of Massachusetts Medical School, U.S.A.) uses H.
Mangalam’s TACG2 program. Provides one with considerable choice of enzymes and
output format, including pseudo gel maps.
o Restriction Enzyme Picker – finds sets of 4 commercially available restriction
endonucleases which together uniquely differentiate designated sequence groups from a
supplied FASTA format sequence file for use in T-RFLP
o NEBcutter- provides opportunities to upload local files, choose from common vector
sequences or enter GenBank accession numbers. Also includes ability to map sites in
genes. After you have the restriction map for this sequence you might want to consult the
New England Biolabs.
o Restriction Analyzer- carry out in silico restriction analysis online. Quickly find absent and
unique sites. Tabular and graphical output. Analyze restriction fragments. Simulate a gel
electrophoresis.
o Restriction Comparator –carry out parallel in silico restriction analysis online. Compare two
sequences side by side. Find distinguishing restriction sites. Visualize restriction patterns.
o WebDSV – is a basic molecular biology app to create, edit and analyze DNA sequences,
mark and visualize sequence features, and generate plasmid maps. With WebDSV you can
analyze restriction sites, perform in silico molecular cloning, and design PCR primers.
o In silico restriction digest of complete genomes – allows in silico digestion of over 300
prokaryotic genomes and simulated pulsed field gel electrophoretic separation of the
fragments.
o Computation of size of DNA and Protein Fragments from Their Electrophoretic Mobility?
o Sequence Extractor – generates a clickable restriction map and PCR primer map of a DNA
sequence (accepted formats are: raw, GenBank, EMBL, and FASTA) offering a great deal of
control on output. Protein translations and intron/exon boundaries are also shown. Use
sequence Extractor to build DNA constructs in silico.
o Promega Restriction Enzyme Tool –
• Worldwide.promega.com/resources/tools/retol/
• A sequence is submitted; enzymes are selected by setting overhang or blunt cut as
needed.
o SimVector (online)
• 1000 restriction enzymes.
• Complete sequences or fragment sequences
• Linear or circular DNA, which is uploaded from a file, pasted or retrieved via an
NCBI link, is analyzed with respect to parameters set by the user.
• www.premierbiosoft.com/plasmid_maps/featuressv/cloning.html
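The core of an in silico restriction analysis, as performed by the tools above, is a search for recognition sequences followed by a calculation of fragment sizes. The sketch below is a minimal illustration and does not reproduce any listed tool. It assumes three well-known enzymes with palindromic sites (so only one strand needs to be searched), a cut one base into the site on the top strand, and 0-based positions; the helper names are hypothetical.

```python
# Recognition site and top-strand cut offset within the site.
ENZYMES = {
    "EcoRI": ("GAATTC", 1),    # G^AATTC
    "BamHI": ("GGATCC", 1),    # G^GATCC
    "HindIII": ("AAGCTT", 1),  # A^AGCTT
}


def find_sites(sequence, site, circular=False):
    """Return 0-based start positions of site; wrap around if circular."""
    seq = sequence.upper()
    searchable = seq + seq[:len(site) - 1] if circular else seq
    positions = []
    start = searchable.find(site)
    while start != -1 and start < len(seq):
        positions.append(start)
        start = searchable.find(site, start + 1)
    return positions


def fragment_lengths(seq_len, cut_positions, circular=False):
    """Fragment sizes after cutting at the given positions."""
    cuts = sorted(set(cut_positions))
    if not cuts:
        return [seq_len]
    if circular:
        # n cuts in a circle give n fragments; one cut just linearizes it.
        return [(cuts[(i + 1) % len(cuts)] - c) % seq_len or seq_len
                for i, c in enumerate(cuts)]
    bounds = [0] + cuts + [seq_len]
    return [b - a for a, b in zip(bounds, bounds[1:])]


def digest(sequence, enzyme_names, circular=False):
    """Fragment lengths for a digest with one or more enzymes."""
    cuts = []
    for name in enzyme_names:
        site, offset = ENZYMES[name]
        for start in find_sites(sequence, site, circular):
            cut = start + offset
            cuts.append(cut % len(sequence) if circular else cut)
    return fragment_lengths(len(sequence), cuts, circular)


print(digest("GGAATTCCAAGCTTGG", ["EcoRI", "HindIII"]))
```

The linear/circular switch matters because a circular molecule can carry a site that spans the arbitrary start of the sequence, and because n cuts give n fragments in a circle but n + 1 in a linear molecule.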
Primer designing
• Oligonucleotide synthesis provides researchers the ability to
construct short fragments of DNA with sequences of their own
choice.
• These can be used in polymerase chain reactions (PCR) to amplify
existing DNA sequences or to modify sequences; this requires
o Ready access to a collected pool of sequence information and
o A way to extract from this pool only the sequences of interest.
• Primer3 program
• In-silico PCR, Reverse ePCR- used to find amplification targets of
primers.
• Autoprimer; QuanPrime; PRIMEGENES (sequence-specific
primer design tools)
• Primer-BLAST designs target-specific primers. Global
alignment;
• Primer3; Web Primer; GeneFisher; Primer3Plus; BiSearch;
MFEPrimer; Primer Design and Search Tool; PrimerDesign-M;
RF cloning; primers4clades; TaxMan;
• Oligonucleotide physicochemical parameters
o NetPrime; dnaMATE; OligoCalc; Oligo analyzer 3.1; Mongo Oligo Mass Calculator
v2.06; OligoEvaluator; OligoCalculation tool.
Primer3
• It generates candidate primers
• It has options such as primer list or sequencing primers.
• Species will be selected; primer failure rate cutoff values;
primer size; primer Tm; product Tm; primer GC%;number of
primer option to return; max 3’ stability; maximum library
mispriming …..
• Primer3.ut.ee
Web primer
• Primer-BLAST
o Given a sequence (target sequence)
o Generate candidate pairs of primers.
o With target specificity of the primers.
• Real-time PCR primer design
o www.genscript.com/tools/real-time-pcr-taqman-primer-design-tool
o Use a GenBank accession or the DNA sequence; the number of primer sets to be
output; the amplicon size range; melting temperature minimum, optimum and maximum;
for the probe, minimum, optimum and maximum.
• PRIMO,
o To design primers for large scale DNA sequencing projects.
o Can be downloaded from Chang bioscience (
www.changbioscience.com/primo/primo.html)
• PDA (primer design assistant),
o Web based (dbb.nhri.org.tw/primer/index.html)
• Analysis tools for primers
o OligoCalc (biotools.nubic.northwestern.edu/OligoCalc.html) and Oligo (
http://www.operon.com/tools/oligo-analysis-tool.aspx/)
o To calculate molecular weight, GC content, melting temperature, intermolecular self-
hybridization, and intramolecular hairpin loop formation of oligomers or primers.
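Two of the quantities listed above, GC content and melting temperature, are simple enough to sketch directly. The melting temperature here uses the Wallace rule, Tm = 2(A+T) + 4(G+C) in degrees Celsius, which is only a rough estimate for short oligonucleotides; this is an assumption made for illustration, and the tools named above may use more refined models. The function names are hypothetical.

```python
def gc_content(oligo):
    """Percentage of G and C bases in the oligonucleotide."""
    seq = oligo.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(oligo):
    """Rough melting temperature in degrees C: 2*(A+T) + 4*(G+C)."""
    seq = oligo.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


if __name__ == "__main__":
    primer = "ATGGTGCTGTCTCCTGCC"  # first 18 bases of the HBA1 coding sequence
    print(f"GC = {gc_content(primer):.1f}%, Tm = {wallace_tm(primer)} C")
```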
Protein sequence and structure analysis
Protein sequence retrieval
• From UniProt
• Using Retrieve/ID mapping: enter the list of identifiers. The type of
protein identifier can be specified and changed by setting the "from" and
"to" options.
• Using the peptide search window: if we have a segment of a protein sequence
more than two amino acids long, paste the sequence and press the
peptide search button.
• From the NCBI (EMBL/DDBJ)
o Protein sequences are identified by unique accession numbers and the name of the
protein
Protein sequence alignment
• Local alignment is mainly used for those sequences which
differ in sequence length. The method finds local matches
within the sequence stretch instead of looking at the entire
sequence
• The sequence similarities can be represented as
o Dot matrix method
o Dynamic programming
• Protein function analysis is done by comparing the protein sequence to the secondary (derived) protein databases, which contain information on motifs, signatures and protein domains.
• Highly significant hits against these pattern databases allow the biochemical function of the query protein to be approximated.
• Sequence analysis also includes evolutionary analysis and the identification of mutations, hydropathy regions, CpG islands and compositional biases.
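As an illustration of how hydropathy regions are located, the sketch below computes a sliding-window average of the Kyte-Doolittle hydropathy scale along a protein sequence. The window size of 9 is a common choice for an overview but is an assumption here; stretches with a high positive average are candidate hydrophobic regions.

```python
from typing import List

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence: str, window: int = 9) -> List[float]:
    """Return the mean hydropathy of every window along the sequence.

    An empty list is returned when the sequence is shorter than the
    window. Non-standard residue letters raise a KeyError.
    """
    seq = sequence.upper()
    if window < 1 or window > len(seq):
        return []
    values = [KYTE_DOOLITTLE[aa] for aa in seq]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(seq) - window + 1)
    ]


if __name__ == "__main__":
    # Start of the Bos preproinsulin sequence shown in these slides.
    profile = hydropathy_profile("MALWTRLRPLLALLALWPPPPARAFVNQHL")
    for position, score in enumerate(profile, start=1):
        print(position, round(score, 2))
```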
Alignment of preproinsulin
Xenopus MALWMQCLP-LVLVLLFSTPNTEALANQHL
Bos MALWTRLRPLLALLALWPPPPARAFVNQHL
**** : * *.*: *:..* :. *:****
Xenopus CGSHLVEALYLVCGDRGFFYYPKIKRDIEQ
Bos CGSHLVEALYLVCGERGFFYTPKARREVEG
***************:***** ** :*::*
Xenopus AQVNGPQDNELDG-MQFQPQEYQKMKRGIV
Bos PQVG---ALELAGGPGAGGLEGPPQKRGIV
.**. ** * * *****
Xenopus EQCCHSTCSLFQLENYCN
Bos EQCCASVCSLYQLENYCN
**** *.***:*******
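The alignment above can be summarised as a percent identity. The sketch below uses the two aligned sequences exactly as shown and counts identical residues over the columns where neither sequence has a gap; other definitions of the denominator (for example the full alignment length) are also in use, so this choice is an assumption.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of identical residues over columns without gaps."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    identical = 0
    compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip columns that contain a gap
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def identity_line(aligned_a: str, aligned_b: str) -> str:
    """Mark identical columns with '*', all other columns with a space."""
    return "".join(
        "*" if a == b and a != "-" else " "
        for a, b in zip(aligned_a, aligned_b)
    )


# Aligned preproinsulin sequences from the slide, block by block.
XENOPUS = "".join([
    "MALWMQCLP-LVLVLLFSTPNTEALANQHL",
    "CGSHLVEALYLVCGDRGFFYYPKIKRDIEQ",
    "AQVNGPQDNELDG-MQFQPQEYQKMKRGIV",
    "EQCCHSTCSLFQLENYCN",
])
BOS = "".join([
    "MALWTRLRPLLALLALWPPPPARAFVNQHL",
    "CGSHLVEALYLVCGERGFFYTPKARREVEG",
    "PQVG---ALELAGGPGAGGLEGPPQKRGIV",
    "EQCCASVCSLYQLENYCN",
])

if __name__ == "__main__":
    print("Percent identity:", round(percent_identity(XENOPUS, BOS), 1))
```

`identity_line` reproduces only the `*` marks of the consensus row; the `:` and `.` marks for conservative substitutions depend on residue groupings that are not modelled here.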
FASTA
• FASTA stands for FAST-All: a fast homology search for all sequence types.
• Alignment program for protein sequences created by Pearson and Lipman in 1988
• Uses heuristic algorithms to speed up sequence comparison
• BLAST finds regions of similarity between biological sequences.
• It compares nucleotide or protein sequences to sequences in databases.
• BLAST specialized searches: SmartBLAST, Primer-BLAST, Global Align, CD-search, IgBLAST, VecScreen, CDART, Multiple Alignment, and MOLE-BLAST.
• Global Align compares two sequences across their entire span using the Needleman-Wunsch algorithm.
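A minimal sketch of the Needleman-Wunsch global alignment mentioned above. It assumes a simple scoring scheme (match +1, mismatch -1, linear gap penalty -1) rather than the substitution matrices and affine gap penalties used by production tools, and returns one optimal alignment.

```python
from typing import Tuple


def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1,
                     gap: int = -1) -> Tuple[int, str, str]:
    """Globally align a and b; return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i: int, j: int) -> int:
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover the alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    total, top, bottom = needleman_wunsch("EQCCHSTCSLFQ", "EQCCASVCSLYQ")
    print(total)
    print(top)
    print(bottom)
```

A local (Smith-Waterman) alignment differs mainly in that cell scores are not allowed to drop below zero and the traceback starts from the highest-scoring cell.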
Figure panels: global alignment software; local alignment software
Protein structure prediction (primary, secondary, and tertiary structure)
• Comparison of a protein with databases of known structures.
• The function of a protein is more directly a consequence of its structure than of its sequence.
• Structurally homologous proteins tend to share functions.
• Determination of a protein's 2D/3D structure is therefore crucial.
• The 3D structure of macromolecules is determined by four fundamental techniques: X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, cryo-electron microscopy (Cryo-EM), and neutron diffraction.
• Although these techniques are invaluable, they cannot build an atomic structure model from scratch without prior knowledge of the protein's chemical and physical properties and its primary sequence.
X-ray crystallography NMR Cryo-EM
• http://www2.ebi.ac.uk/dali/fssp/
PROSPECT
• PROSPECT (PROtein Structure Prediction and Evaluation
Computer ToolKit)
o A protein structure prediction system
o Uses the protein threading computational technique to construct a protein's 3D model
COPIA
• COPIA (Consensus Pattern Identification and Analysis)
• Protein structure analysis tool for discovering motifs
(conserved regions) in a family of protein sequences
• Used to identify whether new protein sequences belong to the family, and to predict the secondary and tertiary structure and the function of proteins
• Used to study the evolutionary history of the sequences
• Structural databases are storage platforms that are devoted to
the three-dimensional structural information of
macromolecules.
• Protein
o Primary
o Secondary
o Tertiary and
o Quaternary
• DNA
PDBsum
Database | Use/Description | Link
PDBj | Protein Data Bank Japan; archives macromolecular structures and provides integrated tools | https://pdbj.org/
BMRB | Biological Magnetic Resonance Data Bank, a repository for data from NMR spectroscopy on proteins, peptides, nucleic acids, and other biomolecules | http://www.bmrb.wisc.edu/
PDBe | Protein Data Bank in Europe; archives biological macromolecular structures | http://www.ebi.ac.uk/pdbe/
RCSB PDB | Research Collaboratory for Structural Bioinformatics Protein Data Bank; archives information about the 3D shapes of proteins, nucleic acids, and complex assemblies | https://www.rcsb.org/
PDBsum | Pictorial analysis of macromolecular structures | www.ebi.ac.uk/pdbsum
Table: Primary structural data centers and other browsers
CATH | Domain classification of structures | http://www.cathdb.info/
SCOP | SCOP2, structural and evolutionary classification | http://scop2.mrc-lmb.cam.ac.uk/
Table: Structure classification databases
NDB | Nucleic acid database | http://ndbserver.rutgers.edu/
RNA FRABASE | 3D structure of RNA fragments | http://rnafrabase.cs.put.poznan.pl/
NPIDB | 3D structures of nucleic acid-protein complexes | http://npidb.belozersky.msu.ru/
Table: Nucleic acid databases
MemProtMD | MemProtMD, database of membrane protein | http://sbcb.bioch.ox.ac.uk/memprotmd/
Protein structure visualization with RasMol, CHIME, PyMOL
• Before computer visualization software was developed, molecular structures were presented as physical models made of metal wires, rods, and spheres. With the development of computer hardware and software technology, computer graphics programs were developed to visualize and manipulate three-dimensional structures. Computer graphics help to analyze and compare protein structures in order to understand protein function.
• Molecular visualization helps scientists to bioengineer protein molecules. User-friendly graphical interfaces make this area of bioinformatics a fulfilling scientific thrill.
RasMol (stand-alone)
• RasMol is a molecular graphics program intended for the visualization
of proteins, nucleic acids and small molecules. The program is aimed at
display, teaching and generation of publication quality images.
• The program reads in a molecule coordinate file and interactively
displays the molecule on the screen in a variety of colour schemes and
molecule representations. Currently available representations include
depth-cued wireframes, 'Dreiding' sticks, spacefilling (CPK) spheres,
ball and stick, solid and strand biomolecular ribbons, atom labels and
dot surfaces.
• Supported input file formats include Protein Data Bank (PDB), Tripos
Associates' Alchemy and Sybyl Mol2 formats, Molecular Design
Limited's (MDL) Mol file format, Minnesota Supercomputer Center's
(MSC) XYZ (XMol) format, CHARMm format, CIF format and
mmCIF format files.
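To show what a molecule coordinate file contains, the sketch below reads the fixed-column ATOM/HETATM records of the PDB format, which is what a viewer such as RasMol does before drawing anything. The helper `format_atom_line` and the coordinates in the demo are synthetic and serve only to produce a correctly laid-out record; only the fields needed for display are parsed.

```python
from typing import Dict, List


def parse_atom_line(line: str) -> Dict[str, object]:
    """Extract selected fields from one PDB ATOM/HETATM record.

    Column positions follow the fixed-width PDB layout
    (1-based columns given in the comments).
    """
    return {
        "serial": int(line[6:11]),       # columns 7-11
        "name": line[12:16].strip(),     # columns 13-16
        "res_name": line[17:20].strip(), # columns 18-20
        "chain_id": line[21],            # column 22
        "res_seq": int(line[22:26]),     # columns 23-26
        "x": float(line[30:38]),         # columns 31-38
        "y": float(line[38:46]),         # columns 39-46
        "z": float(line[46:54]),         # columns 47-54
    }


def read_atoms(pdb_text: str) -> List[Dict[str, object]]:
    """Parse every ATOM and HETATM record in a PDB-format string."""
    return [
        parse_atom_line(line)
        for line in pdb_text.splitlines()
        if line.startswith(("ATOM  ", "HETATM"))
    ]


def format_atom_line(serial: int, name: str, res_name: str, chain_id: str,
                     res_seq: int, x: float, y: float, z: float) -> str:
    """Build a synthetic ATOM record (atom names of up to 3 characters)."""
    return (
        f"ATOM  {serial:5d} "
        f" {name:<3s} {res_name:>3s} {chain_id}{res_seq:4d}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}"
    )


if __name__ == "__main__":
    # Synthetic records, not taken from a real structure.
    text = "\n".join([
        "HEADER    SYNTHETIC EXAMPLE",
        format_atom_line(1, "N", "VAL", "A", 1, 10.0, 20.0, 30.0),
        format_atom_line(2, "CA", "VAL", "A", 1, 11.5, 20.5, 30.5),
        "END",
    ])
    for atom in read_atoms(text):
        print(atom["serial"], atom["name"], atom["x"], atom["y"], atom["z"])
```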
CHIME
• CHIME is derived from Chemical MIME
o A free program to show molecular structures in three dimensions.
• PyMOL
o Download and install it on your computer.
o pymol.org
Proteopedia
• Proteopedia.org/wiki/fgij/
• Submit a PDB ID, for example 1A3N (hemoglobin)
NCBI Structure
• Or iCn3D
Homology modeling
• Homologous sequences are sequences that are related by
divergence from a common ancestor.
• The degree of similarity between two sequences can be measured, whereas homology is either true or false.
• Homology modeling is useful when the target protein (with a known sequence and an unknown structure) is related to at least one other protein with both a known sequence and a known structure.
• The quality of the structure predicted by homology modeling depends on the degree of similarity between the target and template sequences.
• The 3D structure of a protein is obtained through homology modeling with the following steps:
o Target sequence search, identifying a suitable template using BLAST
o Sequence alignment
o Alignment corrections to ensure that the conserved or functionally important residues are aligned
o Backbone generation
o Loop modeling
o Side chain modeling using rotamer libraries
o Optimizing the model using energy minimization
o Validating the model by stereochemical evaluation
• MODELLER
o The user provides an alignment of the sequence to be modeled with known related structures
o Downloadable
• MaxMod
o A graphical user interface to the MODELLER program
PyMod
• PyMod is an open-source PyMOL plugin, designed to act as an interface between PyMOL and several bioinformatics tools (for example: BLAST+, HMMER, Clustal Omega, MUSCLE, PSIPRED and MODELLER).
PRIMO
• PRotein Interactive MOdeling (PRIMO)
o Template identification
o Target-template sequence alignment
o Modeling and model evaluation
SWISS-MODEL
• SWISS-MODEL is a fully automated protein structure homology-modelling server, accessible via the Expasy web server or from the program DeepView (Swiss Pdb-Viewer).
• The purpose of this server is to make protein modelling accessible to all
life science researchers worldwide.
• FoldX is an empirical force field that was developed for the rapid evaluation of the effect of mutations on the stability, folding and dynamics of proteins and nucleic acids. Its core functionality is the calculation of the free energy of a macromolecule based on its high-resolution 3D structure.
Application of bioinformatics
Introduction to system biology
Cheminformatics
• It includes
o Synthesis planning; reaction and structure retrieval; 3D structure retrieval; modeling; computational chemistry; visualization tools and utilities
• It focuses on storing, indexing, searching, retrieving and applying information about chemical compounds. It involves organizing chemical data in a logical form to facilitate the retrieval of chemical properties, structures and their relationships.
• It is possible to
Bioinformatics projects
• BioJava
• BioPerl
• BioXML
• BioCORBA
• Ensembl
• BioPerl-db
• Biopython and BioJava
• Information from images includes
o Magnetic resonance imaging (MRI)
o Computed tomography (CT)
o Positron emission tomography (PET)
o Single-photon emission computed tomography (SPECT)
o Functional magnetic resonance imaging (fMRI)
o Electroencephalography (EEG)
o Magnetoencephalography (MEG)
o Cryosectioning (2D images)
• Image data is dynamic
o It changes from moment to moment and from day to day, over time scales from milliseconds and minutes to hours, days, weeks or longer.
• Four elements of Databases
o Biological objects (sequence, proteins, cell, organism)
o Relationship among objects
o Classifiers to relate objects to one another
o Metadata or data about the data
• Genome annotation
o The analysis and management of genome data to predict and archive various kinds of
biological features, particularly genes, biologic signals, sequence characteristics, and
gene products.
• Blocks database: consists of ungapped multiple alignments of short regions called 'blocks'. The database was constructed from the sequences of protein families using a fully automated method.
• The database takes the protein families from PROSITE. The blocks representing a protein family are generated using a two-step system called PROTOMAT.
o 1. A motif finder finds triplets of amino acids that are common to multiple sequences. These become the candidate blocks for the group.
o 2. The second step assembles the best blocks, those that are consistently found in most sequences.
Bioinformatics resources for proteomics data
• Proteomics has applications such as
o Identification of peptides and proteins
o The study of post translational modification
o The quantification of protein levels
o The characterization of protein structure and
o Identification of protein interaction with other biomolecules
• ExPASy
o Contains a listing of free proteomics databases and tools
• Bio.tools
o Listing of free proteomics tools
• omicX
o Selective, manually curated listings of tens to hundreds of proteomics and protein analysis tools in various categories. Commercial.
• Ms-utils.org
o Free MS data tools
• OBRC (Online Bioinformatics Resources Collection)
o Part of the University of Pittsburgh's Health Sciences Library System. Lists databases and tools