Bioinformatics

Biot2083
Yordanos Sewalem

Content (reference in parentheses)
1 Introduction (Xiong, 2006)
   What is bioinformatics
   Historical highlight
   Significance of bioinformatics: overview
2 Databases and databank search tools: overview of biological databases (Attwood, 2004)
   Classification of databases
   Sequence databases
   DNA databases: GenBank, EMBL, DDBJ
   Protein databases
3 Data search and mining tools (Mani and Vijaraj, 2003)
4 Nucleic acid sequence analysis (Rastogi et al., 2001)
   Gene searching and sequence retrieval
   Sequence alignment: BLAST, MSA, CLUSTALW/X etc.
5 Restriction analysis (Baxevanis and Ouellette, 2001)
   Primer designing
6 Protein sequence and structure analysis (Mani and Vijaraj, 2003)
   Protein sequence retrieval
   Protein sequence alignment
7 Protein structure prediction (primary, secondary, and tertiary structure) (Baxevanis and Ouellette, 2001)
   Protein structure visualization with RasMol etc.
8 Homology modeling (Mani and Vijaraj, 2003)
9 Application of bioinformatics (Baxevanis and Ouellette, 2001)
   Genome analysis
   Transcriptome analysis: microarray
10 Protein analysis (Mani and Vijaraj, 2003)
11 Metabolomics (Baxevanis and Ouellette, 2001)
12 Introduction to systems biology (Mani and Vijaraj, 2003)

Laboratory activity (tool or resource in parentheses)
Web-based exercise: study of public biological databases (Databases)
Web-based exercise: primary sequences (NCBI)
Web-based exercise: dot matrix plot (NCBI and EBI)
Web-based exercise: nucleotide translation (NCBI)
Web-based searching for and identifying proteins (Web browsing)
Software exercise: PSI-BLAST (PSI-BLAST)
Alignment as a search tool (NCBI and EBI)
Multiple sequence alignment (All webs)
Gene finding (Any web, including ordinary web sites)

INTRODUCTION
Introduction
• Computers and specialized software are an essential part of the biologist’s toolkit.
• Bioinformatics began roughly 70 years ago
• It started with protein sequencing, not DNA sequencing
• In the 1950s, Edman degradation was introduced as a protein sequencing method
o Sequencing starts from the N-terminus
o Large proteins are first cleaved into small pieces, which are then sequenced
o Assembling these sequences was the task of the first bioinformatics software
o COMPROTEIN, by Margaret Dayhoff and Robert Ledley
Amino acids
• From three-letter to one-letter amino acid codes
o 1965: Dayhoff and Eck’s “Atlas of Protein Sequence and Structure”
o Dayhoff was called “the mother and father of bioinformatics” by an NCBI director
o Five volumes focusing on the structure
• From structure to information
o Emile Zuckerkandl and Linus Pauling
o Biomolecules as carriers of information
o “Letter arrangements will tell us the evolutionary history”
o Orthologous proteins
o The number of differences between orthologs is proportional to the evolutionary divergence between species
o The hemoglobin sequence from human is more similar to that of
• Chimpanzee (Pan troglodytes) than to that of
• Mouse (Mus musculus)
o Based on the fossil record (divergence data)
o “Ancestral sequences” of proteins such as hemoglobin can be reconstructed

• What was the problem?
o What is the “evolutionary value” of substituting one amino acid for another?
o Comparisons had to be made with distant ancestors or between sequences of unequal length
• The problem was solved by
o Needleman and Wunsch in 1970 (the first dynamic programming algorithm)
• For pairwise protein sequence alignment
• Extending it directly to many sequences is impractical: the running time grows as L^N
• L is the length of the sequences and
• N is the number of sequences to be compared
o Multiple sequence alignment (MSA) algorithms appeared in the 1980s
• Da-Fei Feng and Russell F. Doolittle (1987)
• “Progressive sequence alignment”
o Needleman-Wunsch alignment of every pair of sequences (N x N)
o Extracting a pairwise similarity score from each pairwise alignment
o Using those scores to build a guide tree
o Aligning the two most similar sequences, then the next most similar sequence, and so on
o The popular MSA software CLUSTAL was developed in 1988
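
The dynamic programming idea behind Needleman-Wunsch can be sketched in a few lines of Python. This is a minimal illustration only: the scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions chosen for the example, not the substitution scores used in the 1970 paper or in any particular tool.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global pairwise alignment; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def sub(i, j):
        # Assumed toy substitution score for residues a[i-1], b[j-1]
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the table: each cell extends the best of three neighbours
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one alignment
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACGT", "AGT"))
```

The table has (len(a)+1) x (len(b)+1) cells, which is why the pairwise case is cheap while a table with one dimension per sequence quickly becomes impractical.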
What is bioinformatics
• Bioinformatics is the application of computational tools to molecular data, including the means to acquire, analyze, or visualize such data.
• The field of science in which biology, computer science and information technology meet.
• The use of computers to store, retrieve, analyze or predict the composition or structure of biomolecules.
• Application of computational techniques and information technology to the organisation and management of biological data.
• Biologists collect molecular data: DNA and protein sequences, gene expression, etc.
• Bioinformaticians study biological questions by analyzing molecular data.
• Computer scientists (plus mathematicians, statisticians, etc.) develop tools, software and algorithms to store and analyze the data.
• The three sub-disciplines of bioinformatics
o The development of new algorithms and statistics with which to assess relationships among members of large data sets
o The analysis and interpretation of various types of data including nucleotide and amino acid sequences, protein domains, and protein structures; and
o The development and implementation of tools that enable efficient access to and management of different types of information
• Bioinformaticist vs. bioinformatician
o A bioinformaticist is an expert who not only knows how to use bioinformatics tools but also knows how to write interfaces for effective use of the tools
o A bioinformatician is a trained individual who knows how to use bioinformatics tools but lacks a deeper understanding of them.
Amino acids - The protein building blocks

Any region of the DNA sequence can, in principle, code for six different amino acid
sequences, because any one of three different reading frames can be used to interpret each of
the two strands.
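
As a rough illustration of the six reading frames, the sketch below translates a DNA string in three offsets on each strand. It assumes the standard genetic code; the function names and the frame labels ("+1" to "-3") are invented for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...)
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X', stops become '*'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Map frame label ('+1'..'+3', '-1'..'-3') to its translation."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for label, peptide in six_frame_translation("ATGGTGCTGTCTCCTGCC").items():
    print(label, peptide)
```

Only one of the six frames usually corresponds to the real protein; the others tend to hit stop codons quickly in longer sequences.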

Human hemoglobin: a cDNA sequence
>gi|14456711|ref|NM_000558.3| Homo sapiens hemoglobin, alpha 1 (HBA1), mRNA
ACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACCATGGTGCTGTCTCCTGCCGACAAGACCAACGTCAA
GGCCGCCTGGGGTAAGGTCGGCGCGCACGCTGGCGAGTATGGTGCGGAGGCCCTGGAGAGGATGTTCCTGTC
CTTCCCCACCACCAAGACCTACTTCCCGCACTTCGACCTGAGCCACGGCTCTGCCCAGGTTAAGGGCCACGG
CAAGAAGGTGGCCGACGCGCTGACCAACGCCGTGGCGCACGTGGACGACATGCCCAACGCGCTGTCCGCCCT
GAGCGACCTGCACGCGCACAAGCTTCGGGTGGACCCGGTCAACTTCAAGCTCCTAAGCCACTGCCTGCTGGT
GACCCTGGCCGCCCACCTCCCCGCCGAGTTCACCCCTGCGGTGCACGCCTCCCTGGACAAGTTCCTGGCTTC
TGTGAGCACCGTGCTGACCTCCAAATACCGTTAAGCTGGAGCCTCGGTGGCCATGCTTCTTGCCCCTTGGGC
CTCCCCCCAGCCCCTCCTCCCCTTCCTGCACCCGTACCCCCGTGGTCTTTGAATAAAGTCTGAGTGGGCGGC
A cDNA sequence (reading frame)

>gi|14456711|ref|NM_000558.3| Homo sapiens hemoglobin, alpha 1 (HBA1), mRNA


ACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACCATGGTGCTGTCTCCTGCCGACAAGACCAACGTCAAGGCC
GCCTGGGGTAAGGTCGGCGCGCACGCTGGCGAGTATGGTGCGGAGGCCCTGGAGAGGATGTTCCTGTCCTTCCCCAC
CACCAAGACCTACTTCCCGCACTTCGACCTGAGCCACGGCTCTGCCCAGGTTAAGGGCCACGGCAAGAAGGTGGCCG
ACGCGCTGACCAACGCCGTGGCGCACGTGGACGACATGCCCAACGCGCTGTCCGCCCTGAGCGACCTGCACGCGCAC
AAGCTTCGGGTGGACCCGGTCAACTTCAAGCTCCTAAGCCACTGCCTGCTGGTGACCCTGGCCGCCCACCTCCCCGC
CGAGTTCACCCCTGCGGTGCACGCCTCCCTGGACAAGTTCCTGGCTTCTGTGAGCACCGTGCTGACCTCCAAATACC
GTTAAGCTGGAGCCTCGGTGGCCATGCTTCTTGCCCCTTGGGCCTCCCCCCAGCCCCTCCTCCCCTTCCTGCACCC
GTACCCCCGTGGTCTTTGAATAAAGTCTGAGTGGGCGGC

A protein sequence

>gi|4504347|ref|NP_000549.1| alpha 1 globin [Homo sapiens]


MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAH
VDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
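
The relationship between the cDNA and the protein shown above can be checked with a short sketch: find the first ATG and translate codon by codon until a stop codon or the end of the input. The function name is hypothetical, the standard genetic code is assumed, and only the first 72 bases of the HBA1 record are used, so the output is just the start of the protein.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_from_first_atg(cdna):
    """Translate from the first ATG up to (not including) the first stop codon."""
    cdna = cdna.upper()
    start = cdna.find("ATG")
    if start == -1:
        return ""  # no start codon found
    protein = []
    for i in range(start, len(cdna) - 2, 3):
        residue = CODON_TABLE.get(cdna[i:i + 3], "X")
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


# First 72 bases of NM_000558.3 as listed above (5' UTR followed by the start of the CDS)
HBA1_PREFIX = (
    "ACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACC"
    "ATGGTGCTGTCTCCTGCCGACAAGACCAACGTCAA"
)

print(translate_from_first_atg(HBA1_PREFIX))  # MVLSPADKTNV
```

The printed residues match the beginning of the NP_000549.1 protein sequence. Taking the first ATG is a simplification; real gene prediction also weighs sequence context and ORF length.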

And, a whole genome…
ACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACCATGGTGCTGTCTCCTGCCGACAAGACCAACGTCAAGGCCGCCTGG
GGTAAGGTCGGCGCGCACGCTGGCGAGTATGGTGCGGAGGCCCTGGAGAGGATGTTCCTGTCCTTCCCCACCACCAAGACCT
ACTTCCCGCACTTCGACCTGAGCCACGGCTCTGCCCAGGTTAAGGGCCACGGCAAGAAGGTGGCCGACGCGCTGACCAACGC
CGTGGCGCACGTGGACGACATGCCCAACGCGCTGTCCGCCCTGAGCGACCTGCACGCGCACAAGCTTCGGGTGGACCCGGTC
AACTTCAAGCTCCTAAGCCACTGCCTGCTGGTGACCCTGGCCGCCCACCTCCCCGCCGAGTTCACCCCTGCGGTGCACGCCT
CCCTGGACAAGTTCCTGGCTTCTGTGAGCACCGTGCTGACCTCCAAATACCGTTAAGCTGGAGCCTCGGTGGCCATGCTTCT
TGCCCCTTGGGCCTCCCCCCAGCCCCTCCTCCCCTTCCTGCACCCGTACCCCCGTGGTCTTTGAATAAAGTCTGAGTGGGCG
GCACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACCATGGTGCTGTCTCCTGCCGACAAGACCAACGTCAAGGCCGCCT
GGGGTAAGGTCGGCGCGCACGCTGGCGAGTATGGTGCGGAGGCCCTGGAGAGGATGTTCCTGTCCTTCCCCACCACCAAGAC
CTACTTCCCGCACTTCGACCTGAGCCACGGCTCTGCCCAGGTTAAGGGCCACGGCAAGAAGGTGGCCGACGCGCTGACCAAC
GCCGTGGCGCACGTGGACGACATGCCCAACGCGCTGTCCGCCCTGAGCGACCTGCACGCGCACAAGCTTCGGGTGGACCCGG
TCAACTTCAAGCTCCTAAGCCACTGCCTGCTGGTGACCCTGGCCGCCCACCTCCCCGCCGAGTTCACCCCTGCGGTGCACGC
CTCCCTGGACAAGTTCCTGGCTTCTGTGAGCACCGTGCTGACCTCCAAATACCGTTAAGCTGGAGCCTCGGTGGCCATGCTT
CTTGCCCCTTGGGCCTCCCCCCAGCCCCTCCTCCCCTTCCTGCACCCGTACCCCCGTGGTCTTTGAATAAAGTCTGAGTGGG
CGGCACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACCATGGTGCTGTCTCCTGCCGACAAGACCAACGTCAAGGCCGC
CTGGGGTAAGGTCGGCGCGCACGCTGGCGAGTATGGTGCGGAGGCCCTGGAGAGGATGTTCCTGTCCTTCCCCACCACCAAG
ACCTACTTCCCGCACTTCGACCTGAGCCACGGCTCTGCCCAGGTTAAGGGCCACGGCAAGAAGGTGGCCGACGCGCTGACCA
ACGCCGTGGCGCACGTGGACGACATGCCCAACGCGCTGTCCGCCCTGAGCGACCTGCACGCGCACAAGCTTCGGGTGGACCC
GGTCAACTTCAAGCTCCTAAGCCACTGCCTGCTGGTGACCCTGGCCGCCCACCTCCCCGCCGAGTTCACCCCTGCGGTGCAC
GCCTCCCTGGACAAGTTCCTGGCTTCTGTGAGCACCGTGCTGACCTCCAAATACCGTTAAGCTGGAGCCTCGGTGGCCATGC
TTCTTGCCCCTTGGGCCTCCCCCCAGCCCCTCCTCCCCTTCCTGCACCCGTACCCCCGTGGTCTTTGAATAAAGTCTGAGTG
GGCGGCGCCGTGGCGCACGTGGACGACATGCCCAACGCGCTGTCCGCCCTGAGCGACCTGCACGCGCACAAGCTTCGGGTGG
ACCCGGTCAACTTCAAGCTCCTAAGCCACTGCCTGCTGGTGACCCTGGCCGCCCACCTCCCCGCCGAGTTCACCCCTGCGGT
GCACGCCTCCCTGGACAAGTTCCTGGCTTCTGTGAGCACCGTGCTGACCTCCAAATACCGTTAAGCTGGAGCCTCGGTGGCC
ATGCTTCTTGCCCCTTGGGCCTCCCCCCAGCCCCTCCTCCCCTTCCTGCACCCGTACCCCCGTGGTCTTTGAATAAAGTCTG
AGTGGGCGGCACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACCATGGTGCTGTCTCCTGCCGACAAGACCAACGTCAA
GGCCGCCTGGGGTAAGGTCGGCGCGCACGCTGGCGAGTATGGTGCGGAGGCCCTGGAGAGGATGTTCCTGTCCTTCCCCACC
ACCAAGACCTACTTCCCGCACTTCGACCTGAGCCACGGCTCTGCCCAGGTTAAGGGCCACGGCAAGAAGGTGGCCG...
• Using bioinformatics
o Sequence assembly
o Genome annotation
o Molecular evolution
o Analysis of gene expression
o Analysis of gene regulation
o Protein structure prediction and docking
History of bioinformatics
• 1953- Watson & Crick proposed the double helix model
• 1954- Perutz’s group developed the heavy atom method for improved protein crystallography
• 1955- bovine insulin was sequenced (1st protein)
• 1969- the ARPANET is created at Stanford and UCLA
• 1970- Needleman-Wunsch algorithm published
• 1972- the first recombinant DNA is created by Paul Berg and his group
• Cohen, S., Chang, A., and Boyer, H. produced the first recombinant DNA organism
• 1973- The Brookhaven Protein DataBank is announced
• Joseph Sambrook et al. refined DNA electrophoresis using agarose gel
• Herbert Boyer and Stanley Cohen invented DNA cloning.
• 1974- Vint Cerf and Robert Kahn develop the concept of connecting networks of computers into an “internet” and develop the Transmission Control Protocol (TCP)
• 1975- Microsoft Corporation is founded by Bill Gates and Paul Allen;
• Two-dimensional electrophoresis by P. H. O’Farrell
• 1977- method for sequencing DNA; the first genetic engineering company, Genentech, was founded
• 1988- the National Center for Biotechnology Information (NCBI) is established at the National Library of Medicine (NLM);
• the Human Genome Initiative is started;
• The FASTA algorithm for sequence comparison is published by Pearson and Lipman;
• 1989- the first complete genome map of Haemophilus influenzae was published.
• 1990- the BLAST program is implemented (Altschul et al.)
• Look and SegMod software for molecular modeling and protein design by Molecular Applications Group
• Software by InforMax for sequence analysis, database and data management, searching, publication graphics, clone construction, mapping and primer design
• 1991- CERN in Geneva announces the protocol for the World Wide Web
• ESTs (expressed sequence tags) were created and used
• 1994- The PRINTS database of protein motifs is published by Attwood and Beck
• 1995- the Haemophilus influenzae genome (1.8 Mb) is sequenced
• Mycoplasma genitalium genome sequenced
• 1996- the final version of the human genetic map was published by Genethon
• Saccharomyces cerevisiae (12.1 Mb) is sequenced
• The Prosite database is reported by Bairoch et al.
• Affymetrix produced the first commercial DNA chips
• 1997- E. coli genome (4.7 Mb) published
• 1998- genomes of Caenorhabditis elegans and baker’s yeast published
• 2000- genome of Pseudomonas aeruginosa (6.3 Mb) was published; A. thaliana genome (100 Mb) is sequenced; D. melanogaster genome (180 Mb) sequenced
• 2001- the human genome (3,000 Mb) is published
Origin of bioinformatics and
biological databases:
The first protein sequence reported was that of bovine insulin in 1956, consisting of 51 residues.

Nearly a decade later, the first nucleic acid sequence was reported, that of yeast alanine tRNA with 77 bases.

In 1965, Dayhoff gathered all the available sequence data to create the first bioinformatic database (Atlas of Protein Sequence and Structure).

The Protein DataBank followed in 1972 with a collection of ten X-ray crystallographic protein structures. The SWISSPROT protein sequence database began in 1987.
Complete Genomes
as of August 2011:

Eukaryotes 37
Prokaryotes 1708
Total 1745

Annotation
• Open reading frames
• Functional sites
• Structure, function

CCTGACAAATTCGACGTGCGGCATTGCATGCAGACGTGCATG
CGTGCAAATAATCAATGTGGACTTTTCTGCGATTATGGAAGAA
CTTTGTTACGCGTTTTTGTCATGGCTTTGGTCCCGCTTTGTTC
AGAATGCTTTTAATAAGCGGGGTTACCGGTTTGGTTAGCGAGA
AGAGCCAGTAAAAGACGCAGTGACGGAGATGTCTGATG CAA
TAT GGA CAA TTG GTT TCT TCT CTG AAT ......
.............. TGAAAAACGTA

Annotated sequence (figure labels: promoter, TF binding site, transcription start site, ribosome binding site)
CCTGACAAATTCGACGTGCGGCATTGCATGCAGACGTGCATG
CGTGCAAATAATCAATGTGGACTTTTCTGCGATTATGGAAGAA
CTTTGTTACGCGTTTTTGTCATGGCTTTGGTCCCGCTTTGTTC
AGAATGCTTTTAATAAGCGGGGTTACCGGTTTGGTTAGCGAGA
AGAGCCAGTAAAAGACGCAGTGACGGAGATGTCTGATG CAA
TAT GGA CAA TTG GTT TCT TCT CTG AAT ......
.............. TGAAAAACGTA

ORF = Open Reading Frame
CDS = Coding Sequence
Comparative genomics
• Comparing ORFs
o Identifying orthologs
o Inferences on structure and function
• Comparing functional sites
o Inferences on regulatory networks

Alignment preproinsulin
Xenopus MALWMQCLP-LVLVLLFSTPNTEALANQHL
Bos MALWTRLRPLLALLALWPPPPARAFVNQHL
**** : * *.*: *:..* :. *:****

Xenopus CGSHLVEALYLVCGDRGFFYYPKIKRDIEQ
Bos CGSHLVEALYLVCGERGFFYTPKARREVEG
***************:***** ** :*::*

Xenopus AQVNGPQDNELDG-MQFQPQEYQKMKRGIV
Bos PQVG---ALELAGGPGAGGLEGPPQKRGIV
.**. ** * * *****

Xenopus EQCCHSTCSLFQLENYCN
Bos EQCCASVCSLYQLENYCN
**** *.***:*******
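
A simple way to quantify such an alignment is percent identity: the fraction of aligned columns in which both sequences carry the same residue. The sketch below is illustrative only; the function name is made up, and dividing by the alignment length (gaps included) is one assumed convention among several in use.

```python
def percent_identity(aligned_a, aligned_b):
    """Return (identical_columns, percent) for two equal-length aligned strings."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    # A column counts as identical only if both residues match and are not gaps
    identical = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-"
    )
    return identical, 100.0 * identical / len(aligned_a)


# Last block of the preproinsulin alignment shown above
xenopus = "EQCCHSTCSLFQLENYCN"
bos = "EQCCASVCSLYQLENYCN"
count, pct = percent_identity(xenopus, bos)
print(count, round(pct, 1))  # 15 83.3
```

The 15 identical columns correspond to the 15 asterisks in the consensus line under that block.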
Significance of
bioinformatics: overview
• Accelerate research in the area of biotechnology
o Automatic genome sequencing
o Gene identification
o Prediction of gene function
o Prediction of protein structure
o Phylogeny
o Drug designing and development
o Identification of organisms
o Vaccine designing
o Understanding the gene and genome complexity
o Understanding protein structure, functionality and folding
• Genomics
o Generates vast amounts of data; bioinformatics is needed to manage these data and to generate information from them
• Proteomics
o Data from protein-protein interactions, protein profiles, protein activity patterns and organelle compositions, image analysis of 2D gels, peptide mass fingerprinting and peptide fragmentation fingerprinting
• Transcriptomics
o Microarrays and RNA sequencing generate large amounts of data
• Cheminformatics
o Identify and structurally modify a natural product, design a compound with desired properties, and assess its therapeutic effect theoretically
• Drug discovery
o Used to predict, analyze and interpret clinical and preclinical data, particularly in computer-aided drug design, and to provide drug-related databases and software
• Evolutionary study/phylogenetics
o Using sequence alignment and various algorithms
• Crop improvement
• Veterinary science
o Understanding livestock species and providing accurate predictions
• Forensic science
o Identification and relatedness of individuals
• Biodefense
o Restoring biosecurity against biological threats or infectious diseases
• Waste cleanup
o Exploring and improving the microbial potential for biodegradation
• Climate change studies
o Searching for microbes that reduce greenhouse gases
• Bioenergy/biofuels
o Biofuel production pathways
DATABASE AND DATABANK SEARCH TOOLS: OVERVIEW OF BIOLOGICAL DATABASES
• A biological database is a large, organized body of
persistent data, usually associated with computerized
software designed to update, query, and retrieve
components of the data stored within the system.
• A simple database might be a single file containing many
records, each of which includes the same set of
information.
• For example, a record in a nucleotide sequence database typically contains information such as a contact name; the input sequence with a description of the type of molecule; the scientific name of the source organism from which it was isolated; and, often, literature citations associated with the sequence.
• To benefit from the data stored in a database, two requirements must be met:
o Easy access to the information and
o A method for extracting only the information needed to answer a specific biological question.
• Databases allow knowledge discovery, which refers to the identification of connections between pieces of information that were not known when the information was first entered.
• They allow the indexing of data
• They help to remove redundancy of data.
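
The ideas of records, indexing, targeted queries and redundancy control can be tried out with Python's built-in sqlite3 module. This is a toy sketch: the table name, column names and helper function are invented for illustration and do not reflect the schema of any real biological database. The two records reuse accessions that appear earlier in this document.

```python
import sqlite3

# In-memory database; every record has the same set of fields
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE record ("
    " accession TEXT PRIMARY KEY,"  # unique key prevents redundant entries
    " organism TEXT, molecule TEXT, description TEXT)"
)
# An index speeds up lookups by organism
conn.execute("CREATE INDEX idx_record_organism ON record (organism)")

conn.executemany(
    "INSERT INTO record VALUES (?, ?, ?, ?)",
    [
        ("NM_000558.3", "Homo sapiens", "mRNA", "hemoglobin, alpha 1 (HBA1)"),
        ("NP_000549.1", "Homo sapiens", "protein", "alpha 1 globin"),
    ],
)


def find_accessions(connection, organism, molecule):
    """Extract only the accessions that answer one specific question."""
    rows = connection.execute(
        "SELECT accession FROM record"
        " WHERE organism = ? AND molecule = ? ORDER BY accession",
        (organism, molecule),
    ).fetchall()
    return [row[0] for row in rows]


print(find_accessions(conn, "Homo sapiens", "mRNA"))
```

Trying to insert a second record with an existing accession raises sqlite3.IntegrityError, which is one simple way a database guards against redundancy.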
Classification of databases
• Public repositories of gene data
o GenBank; DDBJ; EMBL
• Public repositories of protein data
o Protein DataBank (PDB)
• Private databases
o Research group databases
o Biotech company databases
• Based on the information
o Primary
• Databases containing raw information (in its original form)
• Also called archival databases
• ENA, GenBank, EMBL, DDBJ (nucleotide sequence)
• SWISS-PROT, PIR, PDB (protein data bank)
• ArrayExpress Archive and GEO (functional genomics data)
• They are populated with experimentally derived data such as nucleotide sequences, protein sequences or macromolecular structures.
• Experimental results are submitted directly into the database by researchers, and the data are essentially archival in nature.
• Once given a database accession number, the data in primary databases are never changed
o Secondary
• Comprise data derived from analyzed/curated primary data.
• Provide more relevant and useful information for specific requirements
• PROSITE, PRINTS, BLOCKS, Pfam, InterPro (protein families, motifs and domains), UniProt (sequence and functional information on proteins), Ensembl (variation, function, regulation and more, layered onto whole genome sequences)
Sequence databases
• RNA and DNA store the hereditary information
about an organism which can be analyzed with the
help of bioinformatics tools and databases
• The most popular databases are
o GenBank from NCBI
o SwissProt from the Swiss Institute of Bioinformatics and
o PIR from the Protein Information Resource
• The principal requirements on public data services are:
o Data quality- the quality of the data is mainly the responsibility of the submitter
o Supporting data- users may be interested in the primary experimental data, either stored in the database or reached by cross-references back to network-accessible laboratory databases
o Deep annotation- deep, consistent annotation comprising supporting and ancillary information should be attached
o Timeliness- basic data should be available on an internet-accessible server within days (or hours) of publication or submission
o Integration- each data object in the database should be cross-referenced to representations of the same or related biological entities in other databases.
DNA database
• The main function of DNA databases is to store and compare DNA sequence and protein sequence data
o GenBank
o EMBL
o DDBJ
GenBank
• The Genetic Sequence Databank (GenBank) is one of the fastest growing repositories of known genetic sequences.
• An NIH genetic sequence database, an annotated collection of all publicly available DNA sequences
• It has a flat file structure (ASCII text file, readable by both humans and computers)
• In addition to sequence data it contains accession numbers and gene names, phylogenetic classification and references to published literature
• 216 million sequences and 399 billion bases in traditional GenBank
EMBL
• The EMBL Nucleotide Sequence Database is a comprehensive database of DNA and RNA sequences collected from the scientific literature and patent applications and submitted directly by researchers and sequencing groups
• Data collection is done in collaboration with GenBank (USA) and the DNA Database of Japan (DDBJ)
• The current number of bases and sequences can be found…
• www.ebi.ac.uk
DDBJ
• www.ddbj.nig.ac.jp/services-e.html
• Collection of nucleotide sequence data. In addition it has protein sequences and protein structures.
• Maintenance and development is organized by the Center for Information Biology and DNA Data Bank of Japan (CIB-DDBJ) of the National Institute of Genetics (http://www.cib.nig.ac.jp/) (http://www.nig.ac.jp/english/index.html).
Other databases
• Genethon Genome database (PHYSICAL MAP; GENETIC MAP; GENEXPRESS (cDNA))
• 21Bdb: LBL’s human chromosome 21 database
• MGD: the mouse genome database
• ACeDB: a Caenorhabditis elegans database
• MEDLINE is NLM’s premier bibliographic database covering the medical sciences; its citations are searchable using NLM’s controlled vocabulary (MeSH, Medical Subject Headings)
Protein databases
• Protein sequence databases are classified as primary, secondary and composite
• Primary databases contain protein sequences as ‘raw’ data
o PIR and SwissProt
• Secondary databases contain information derived from protein sequences
o Prosite
• Composite databases combine and filter primary databases to form a non-redundant database
• The number of protein sequences is increasing
• Used for
o Biological variation
o Evolutionary pattern
o Protein family characterization
• Protein databases
o SwissProt
o PIR
o PRF
o PDB
o Nrdb
• Translated open reading frames
o GenPept
o TrEMBL
• To overcome redundant sequences
o Non-redundant protein databases emerged
• KIND (Karolinska Institutet Nonredundant Database)
o Compiled from (GenPept, gpcu), (Swissprot and Swissnew), PIR and TrEMBL
o Sequence ID retention in priority order: Swissprot, PIR, GenPept and TrEMBL
o ftp://ftp.mbb.ki.se/pub/KIND
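
The priority rule for retaining sequence IDs can be pictured with a small sketch. It only illustrates the idea of collapsing identical sequences and keeping the ID from the highest-priority source; it is not KIND's actual procedure, and the record IDs and sequences below are hypothetical placeholders.

```python
# Source priority as listed above: earlier entries win
PRIORITY = ["SwissProt", "PIR", "GenPept", "TrEMBL"]


def make_nonredundant(records):
    """Collapse identical sequences, keeping the (source, id) of highest priority.

    records: iterable of (source, seq_id, sequence) tuples.
    Returns a dict mapping each distinct sequence to its retained (source, id).
    """
    rank = {name: position for position, name in enumerate(PRIORITY)}
    retained = {}
    for source, seq_id, sequence in records:
        current = retained.get(sequence)
        # Keep the new record if the sequence is unseen or its source ranks higher
        if current is None or rank[source] < rank[current[0]]:
            retained[sequence] = (source, seq_id)
    return retained


# Hypothetical records: two pairs of identical sequences from different sources
records = [
    ("TrEMBL", "T1", "MVLSPADK"),
    ("SwissProt", "S1", "MVLSPADK"),
    ("PIR", "P1", "MKTAYIAK"),
    ("GenPept", "G1", "MKTAYIAK"),
]
print(make_nonredundant(records))
```

Here exact string identity defines redundancy; real non-redundant databases may also merge near-identical or fragment sequences.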
PIR-PSD
• The Protein Information Resource (PIR) produces and distributes the PIR-International Protein Sequence Database (PSD)
• Most comprehensive and expertly annotated protein sequence database
• On-line access, off-line sequence identification service
SwissProt
• It provides a high level of integration with other databases and has a low level of redundancy (fewer identical sequences)
PROSITE
• PROSITE is a dictionary of sites and patterns in proteins
• Prepared by Amos Bairoch
EC-ENZYME
• The ‘ENZYME’ data bank contains information about enzymes with an EC number, such as
o EC number, recommended name, alternative names, catalytic activity, cofactors, pointers to the SwissProt entries that correspond to the enzyme, pointers to diseases associated with deficiency of the enzyme

PDB
• Protein Data Bank (PDB) of X-ray crystallography structures
• www.rcsb.org
Other databases
• GDB, the human Genome Data Base: storage and dissemination of data about genes and other DNA markers, map location, genetic disease and locus information, and bibliographic information.
• OMIM: the Online Mendelian Inheritance in Man data bank (MIM)
• PIR-PSD
• Molecular Modeling Database (MMDB)
o http://www.ncbi.nlm.nih.gov/Structure/index.shtml
DATA SEARCH AND MINING TOOLS
• The knowledge discovery from databases (KDD) process is defined in five major steps: data preparation, data preprocessing, DM, evaluation and interpretation, and implementation.
o KDD is “the nontrivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data” (Fayyad et al., 1996)
o Data preparation: determining the data to be used for analysis, which will then be collected (from files, databases, data warehouses or data marts);
• If the data set is large, take a representative sample
• Selecting suitable data from the available data (data farming)
• Standard representation (table format, space delimited, comma delimited…)
o Data preprocessing: to increase data quality (handling incompleteness, inconsistency, transformation…)
• Data cleaning- filling missing values (e.g. using the most probable data value, the attribute mean, a global mean etc.), smoothing out noise (e.g. binning, clustering, regression etc.), handling outliers (e.g. robust regression), detecting and removing redundant data (e.g. correlation analysis).
• Data transformation- data is transformed into appropriate forms. ANNs and various clustering methods perform better if the data is scaled to a specified range (normalization: min-max, z-score and decimal scaling); other transformations are smoothing, aggregation and generalization
• Data reduction- the size of the data is reduced. Generalization and aggregation are forms of reduction. Dimension reduction/feature selection eliminates unnecessary attributes using best-first search, beam search etc.; regression techniques; ANOVA; DT induction, correlation analysis, ANNs, SVMs, rule induction by rough set theory, partial least squares, subjective evaluation, genetic algorithm, PCA.
o Discretization:
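
Two of the preprocessing steps named above, filling missing values with the attribute mean and normalizing by min-max or z-score, can be sketched with the standard library. The function names are illustrative, missing values are assumed to be encoded as None, and the z-score here uses the population standard deviation, which is one of several possible conventions.

```python
from statistics import mean, pstdev


def fill_missing_with_mean(values):
    """Data cleaning: replace None by the mean of the known values."""
    known = [v for v in values if v is not None]
    attribute_mean = mean(known)
    return [attribute_mean if v is None else v for v in values]


def min_max(values, low=0.0, high=1.0):
    """Data transformation: rescale linearly into [low, high]."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        return [low for _ in values]  # constant attribute: no spread to rescale
    return [low + (v - vmin) * (high - low) / (vmax - vmin) for v in values]


def z_score(values):
    """Data transformation: zero mean, unit (population) standard deviation."""
    center, spread = mean(values), pstdev(values)
    if spread == 0:
        return [0.0 for _ in values]
    return [(v - center) / spread for v in values]


raw = [2.0, None, 4.0]
clean = fill_missing_with_mean(raw)
print(clean)
print(min_max(clean))
print(z_score([2.0, 4.0]))
```

Scaling matters mostly for distance-based and gradient-based methods, where an attribute with a large numeric range would otherwise dominate the others.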
• DM (data mining) is further explained under descriptive and predictive mining categories, with their functions and methods.
o Historically, DM has evolved from various disciplines, such as databases (relational databases, data warehousing, on-line analytical processing (OLAP)), information retrieval (IR) (similarity measures, clustering), statistics (Bayes theorem, regression, maximum likelihood estimation, resampling) and artificial intelligence (AI) (ANNs, machine learning, genetic algorithms, decision trees)
o The functions are used to specify the types of patterns to be mined, including description, clustering, association, classification, prediction, trend analysis, similarity analysis and pattern detection.
o DM can also be discussed in terms of the kinds of knowledge mined (DM functionalities), such as association, classification, clustering and others.
• Descriptive data mining
o This includes summarization, clustering, association rule generation and sequence discovery.
o Summarization is the presentation of general properties of the data set studied, using statistical methods.
o Clustering
o The association function tries to identify groups of items that occur together.
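
The association function can be illustrated by counting how often pairs of items occur together. The sketch below computes the support of each item pair, that is, the fraction of transactions containing both items; the function name and the item labels A, B, C are hypothetical, and full association rule mining (for example Apriori) adds pruning and confidence measures on top of this.

```python
from collections import Counter
from itertools import combinations


def pair_support(transactions):
    """Return {(item1, item2): support} for every pair that co-occurs at least once."""
    counts = Counter()
    for transaction in transactions:
        # Sort so that each pair has one canonical order
        for pair in combinations(sorted(set(transaction)), 2):
            counts[pair] += 1
    total = len(transactions)
    return {pair: count / total for pair, count in counts.items()}


# Hypothetical transactions, e.g. sets of features observed together
transactions = [{"A", "B"}, {"A", "B", "C"}, {"A", "C"}, {"B"}]
for pair, support in sorted(pair_support(transactions).items()):
    print(pair, support)
```

Pairs with support above a chosen threshold are the candidates from which association rules would be generated.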
• Predictive Modeling
o Classification is a DM function accomplished in two steps: constructing a model that describes the important classes in a given labeled data set, using methods like DTs, ANNs,…, and then categorizing a new data set (testing sample), whose classes are not known, according to the model built.
• S-based algorithms for modeling
• Decision tree-based (DT-based) algorithms (ID3, C4.5, CHAID, CART, ID5R, SLIQ, SPRINT)
• ANN-based systems use optimization methods: Levenberg-Marquardt, quasi-Newton and conjugate-gradient are local optimization methods, while simulated annealing and GA are global ones. The perceptron is the simplest form of an ANN and is used for classification into two groups. The multi-layer perceptron (MLP) combines perceptrons into a network structure.
• Learning algorithms such as back propagation (BP) use the gradient descent (GD) optimization technique.
• Radial basis function (RBF); competitive ANN (CompetNN); SOM; learning vector quantization (LVQ); ARTMAP; Fuzzy ARTMAP; probabilistic NN (PNN); Bayesian NN (BNN);
• A disadvantage of ANNs compared with DTs is that they do not produce if-then rules; the rectangular basis function network (RecBFN) was proposed to overcome this
o Other classification systems: k-nearest-neighbors (KNN), GA, rough set theory (RST), fuzzy set theory (FST), SVM, entropy network (EN) and association rules-based algorithms; PRISM, attribute decomposition approach (ADA), modified breadth-first search of an interest graph (MIG), genetic programming (GP) and breadth-oblivious-wrapper (BOW) hill-climbing search procedure.
o Combining techniques: combining the results of different classification techniques with a weighted linear combination. This is called combination of multiple classifiers (CMC); boosting and bagging (bootstrap aggregation, BANN) are two examples. They decrease the expected error by reducing the overall variance.
o Mixed techniques: FST is used in combination with RBF NN; RST and linear programming (LP) are mixed; DT and ANN are mixed; the Taguchi method (TM) is applied to select the parameters of an ANN; a fuzzy decision support system (FDSS) is developed based on GA learning; FST is used in combination with RST; SVM and DT algorithms are integrated for the purpose of classification.
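
Since the perceptron is described above as the simplest ANN for classification into two groups, a minimal sketch of its training rule may help. The function names, learning rate and epoch limit are assumptions for illustration, and the tiny data set (the logical AND function) is a stand-in for a labeled training set.

```python
def predict(weights, bias, x):
    """Classify x as 1 if the weighted sum exceeds zero, else 0."""
    activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if activation > 0 else 0


def train_perceptron(samples, labels, epochs=20, learning_rate=1.0):
    """Perceptron learning rule: nudge weights toward misclassified samples."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        errors = 0
        for x, y in zip(samples, labels):
            delta = learning_rate * (y - predict(weights, bias, x))
            if delta != 0:
                weights = [w + delta * xi for w, xi in zip(weights, x)]
                bias += delta
                errors += 1
        if errors == 0:  # every training sample is classified correctly
            break
    return weights, bias


# Step 1: build the model from a labeled data set (logical AND as a toy example)
samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 0, 1]
weights, bias = train_perceptron(samples, labels)

# Step 2: use the model to categorize samples
print([predict(weights, bias, x) for x in samples])
```

A single perceptron can only separate classes with a straight line (or hyperplane), which is the limitation that multi-layer perceptrons address.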
• Evaluation and Interpretation
o The above tools provide a better understanding of the relationships in the data. The next step is implementing the results in the decision making process. Some frameworks support more structured and transparent decision making: decision tables, decision maps, atlases and libraries.
o A decision table is a collection of the knowledge needed to make decisions in a particular area; several decision tables are combined to form a decision map, and decision maps are then combined in an atlas. Implementation is the stage that distinguishes DM from any other kind of data analysis. Typical output of a DM analysis cannot be readily understood by its users.
• Mining advanced data types
o The algorithms listed above are used for mining structured data (flat files, relational databases, transactional databases, data warehouses or data marts). Other types of data include object-oriented databases, object-relational databases and application-oriented databases (spatial/temporal databases)
o Mining such databases involves sequence discovery, trend analysis, and similarity analysis.
• Mining spatial data:
o Spatial data refers to objects that occupy space. Spatial data contains location information (address, latitude/longitude, coordinates). Spatial data is more complicated than non-spatial data. Besides, spatial databases contain both spatial and non-spatial data. Spatial DM utilizes database technologies such as querying, reporting and OLAP operations as well as DM techniques.
o Spatial queries include region query, range query, nearest neighbor query and distance scan query, as well as spatial selection, aggregation and join. OLAP operations such as drill down or roll up can be used for analytical processing of the data. Thus descriptive DM on spatial data can be applied effectively by constructing spatial data cubes and by using spatial OLAP techniques.
o STING is a kind of hierarchical clustering technique that partitions the area studied into rectangular grids.
o A common approach is to represent a spatial object by its minimum bounding rectangle (MBR), which is the smallest rectangle enclosing the object.
o Data structures used to store spatial data include the R-tree, quad tree and K-D tree.
o DM functions applied to spatial data include: spatial association (distance information, topological relations, spatial orientations), spatial clustering (CLARANS extensions, SD(CLARANS), DBCLASD, BANG, WaveCluster), spatial classification, and spatial trend analysis
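
The minimum bounding rectangle mentioned above is easy to compute for an object given as a list of points. The sketch assumes axis-aligned rectangles stored as (xmin, ymin, xmax, ymax); the function names and sample coordinates are illustrative.

```python
def minimum_bounding_rectangle(points):
    """Return (xmin, ymin, xmax, ymax) of the smallest axis-aligned rectangle
    that contains every (x, y) point of the object."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


def mbrs_intersect(a, b):
    """Cheap pre-filter for spatial joins: do two MBRs overlap or touch?"""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


# Hypothetical object outlines
object_1 = [(1, 2), (4, 0), (3, 5)]
object_2 = [(3, 4), (6, 6)]
object_3 = [(10, 10), (12, 11)]

mbr_1 = minimum_bounding_rectangle(object_1)
print(mbr_1)
print(mbrs_intersect(mbr_1, minimum_bounding_rectangle(object_2)))
print(mbrs_intersect(mbr_1, minimum_bounding_rectangle(object_3)))
```

Index structures such as the R-tree store these rectangles so that most objects can be ruled out without examining their exact geometry.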
• Mining temporal data
o Sequence data involves attributes related to ordered activities. Temporal (time-varying)
data is a kind of sequence data consisting of attribute values that change with time.
The main difference between spatial data and time-varying data is that spatial attributes
are static while time is dynamic. Time can be represented in one dimension, while space is
represented in at least two dimensions. Temporal databases can be categorized according
to how time is represented: snapshot, transaction time, valid time, or bitemporal.
o Techniques used include finite state recognizers (FSRs), Markov models (MMs), hidden
Markov models (HMMs), and recurrent neural networks (RNNs). They are applied through trend analysis,
similarity analysis, pattern detection, periodicity analysis, and querying (intersection,
inclusion, and containment).
o Trend analysis separates time series data into trend, cyclic, seasonal, and random components
(smoothing techniques, least squares method).
o Prediction forecasts future values: autoregression, autoregressive moving average
(ARMA), autoregressive integrated moving average (ARIMA).
o Similarity analysis matches patterns using similarity measures such as Euclidean
distance. Usually a data-independent transformation is applied first.
o Such transformations include the discrete Fourier transform (DFT) and the discrete wavelet
transform (DWT).
o Pattern detection finds a given pattern that occurs in a sequence. It can be considered a
classification function.
• String matching algorithms: Knuth-Morris-Pratt (KMP), Boyer-Moore (BM).
• Sequential pattern mining finds frequently occurring patterns related to time or other
sequences. Methods include AprioriAll, sequential pattern discovery using equivalence
classes (SPADE), and generalized sequential patterns (GSP).
o Periodicity analysis: a special kind of sequential pattern analysis.
o Temporal association rules: obtained after the data is clustered by time.
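As an illustration of the string matching step mentioned under pattern detection, the following is a minimal sketch of Knuth-Morris-Pratt search in Python. The function names and the example DNA string are made up for this note; the algorithm itself is the standard KMP prefix-table method.

```python
from typing import List


def failure_table(pattern: str) -> List[int]:
    """table[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    table = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table


def kmp_search(text: str, pattern: str) -> List[int]:
    """Return the 0-based start of every match, overlapping ones included."""
    if not pattern:
        return []
    table = failure_table(pattern)
    hits = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = table[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = table[k - 1]  # keep going to allow overlapping matches
    return hits


# Hypothetical sequence, for illustration only.
print(kmp_search("GATATATGCATATACTT", "ATAT"))
```

The text is scanned once from left to right and the prefix table tells the search how far it can fall back after a mismatch, so no text character is compared from scratch twice.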
Nucleic acid sequence analysis
• The entire genomic sequence of many organisms is now
available.
o Human (Homo sapiens), near completion
o E. coli
• The expression level of mRNA or other RNA molecules can
be monitored.
• Gene sequences can be obtained from repositories or from
material that is labeled, processed, and examined in an electric field (electrophoresis).
• The sequence can then be analyzed in many ways:
• Assembly of short sequenced strands
• Sequence mapping (physical location in chromosomes,
intron/exon structure, physical/genetic distance from other
genes)
• Sequence alignment (match/mismatch; homology analysis)
• Primer design
• Determination of gene function
• Methods of sequence alignment
o There are two major classes: global and local alignment.
o Global alignment
o Used when closely related sequences of roughly the same length are aligned; the alignment
runs from start to end, searching for the best possible alignment over the full length.
o Some global alignment software:
Software URL
GGSEARCH http://nebc.nerc.ac.uk/bioinformatics/docs/ggsearch.html
Part of the FASTA3 package in the Bio-Linux bioinformatics
workstation, which requires Ubuntu Linux 14.04 LTS.
HMMER http://hmmer.org/ Searches sequence databases for
sequence homologs and makes sequence alignments. It
uses profile hidden Markov models (profile HMMs).
G-PAS http://gpualign.cs.put.poznan.pl/project-gpu-pairAlign.html
EMBOSS Needle https://www.ebi.ac.uk/Tools/psa/emboss_needle/
NW http://www.bioinf.org.uk/software/nw/
Stretcher https://galaxy.pasteur.fr/?form=stretcher
SABERTOOTH http://www.fkp.tu-darmstadt.de/sabertooth_source
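Global alignment by dynamic programming can be sketched as follows. This is a minimal Needleman-Wunsch implementation with a linear gap penalty; the scoring values (match +1, mismatch -1, gap -2) and the function name are assumptions chosen for illustration, not the defaults of any tool listed above.

```python
from typing import Tuple


def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1,
                     gap: int = -2) -> Tuple[int, str, str]:
    """Return (optimal global score, aligned a, aligned b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    # First row and column: aligning a prefix against nothing costs gaps.
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        else:
            diag = None
        if diag is not None and score[i][j] == diag:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


# Hypothetical sequences, for illustration only.
nw_score, nw_top, nw_bottom = needleman_wunsch("GATTACA", "GATACA")
print(nw_score)
print(nw_top)
print(nw_bottom)
```

Because every position of both sequences must appear in the result, the method suits sequences that are related over their whole length; production tools add substitution matrices and affine gap penalties on top of this scheme.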
Gene searching and sequence retrieval
• Sequence Retrieval System (SRS)
• BioPerl, Biopython, BioJava, BioRuby, BioSQL
Sequence alignment: BLAST, MSA (ClustalW/ClustalX), etc.
• BLAST (Basic Local Alignment Search Tool)
o Homology and similarity search tool
o Available as a web service and as standalone programs for several platforms, including Windows
o For protein or DNA
o QBLAST for user-friendly retrieval of results
• Different BLAST tools
o blastp compares an amino acid query sequence against a protein sequence database
o blastn compares a nucleotide query sequence against a nucleotide sequence database
o blastx compares a nucleotide query sequence translated in all reading frames against a
protein sequence database
o tblastn compares a protein query sequence against a nucleotide sequence database dynamically
translated in all reading frames
o tblastx compares the six-frame translations of a nucleotide query sequence against the six-frame
translations of a nucleotide sequence database
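The "translated in all reading frames" step used by blastx, tblastn, and tblastx can be illustrated with a short Python sketch. It assumes the standard genetic code (NCBI translation table 1) and an unambiguous DNA alphabet; the function names and the example sequence are hypothetical, and this is not how BLAST itself is implemented.

```python
from itertools import product
from typing import Dict

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons only; unknown codons become 'X', stops '*'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna: str) -> Dict[str, str]:
    """Return the three forward (+1..+3) and three reverse (-1..-3) frames."""
    frames = {}
    reverse = reverse_complement(dna)
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


# Hypothetical sequence, for illustration only.
frames = six_frame_translation("ATGGCCTAA")
for name in ("+1", "+2", "+3", "-1", "-2", "-3"):
    print(name, frames[name])
```

Each of the six protein strings can then be compared against a protein database, which is why translated searches cost several times more than a plain blastn or blastp run.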
• ClustalW is a fully automated sequence alignment tool for DNA
and protein sequences.
• It returns the best match over the total length of the input sequences.
EMBOSS, Staden package
• EMBOSS (European Molecular Biology Open Software Suite)
o It is open-source software with a range of libraries that users can extend, giving flexibility.
THREADER, PHD
• GenomeThreader from genomethreader.org
• THREADER3: a threading algorithm and tool for alignment and fold
recognition.
o http://bioinf.cs.ucl.ac.uk/psipred?genthreader=1
RasMol, WHATIF
• RasMol: displays the structure of DNA, proteins, and smaller
molecules
o Derivatives such as Protein Explorer are easy to use
Restriction analysis
• Restriction analysis identifies restriction sites in
DNA sequences using appropriate enzyme sets and enzyme
filtering criteria chosen for the specific experimental requirements.
• Restriction Analyzer
o It has options such as enzyme selection by criteria or from provided lists.
o Type of DNA: linear/circular
o http://molbiotools.com/restrictionanalyzer.html
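What an in silico restriction analysis does can be sketched in Python. The three enzymes and their recognition sites (EcoRI G^AATTC, BamHI G^GATCC, HindIII A^AGCTT) are standard; the function names, the tiny enzyme table, and the test sequence are assumptions for illustration and do not reflect how the Restriction Analyzer is implemented.

```python
from typing import Dict, List, Tuple

# name -> (recognition site, cut offset within the site on the top strand)
ENZYMES: Dict[str, Tuple[str, int]] = {
    "EcoRI": ("GAATTC", 1),
    "BamHI": ("GGATCC", 1),
    "HindIII": ("AAGCTT", 1),
}


def find_cut_positions(seq: str, site: str, cut_offset: int,
                       circular: bool = False) -> List[int]:
    """Return sorted cut positions (index of the first base after the cut)."""
    seq = seq.upper()
    n = len(seq)
    # For circular DNA, let sites span the origin by appending the start.
    search = seq + seq[:len(site) - 1] if circular else seq
    cuts = set()
    start = search.find(site)
    while start != -1:
        pos = start + cut_offset
        cuts.add(pos % n if circular else pos)
        start = search.find(site, start + 1)
    return sorted(cuts)


def fragment_lengths(seq_len: int, cuts: List[int],
                     circular: bool = False) -> List[int]:
    """Fragment sizes after complete digestion at the given cut positions."""
    if not cuts:
        return [seq_len]
    if circular:
        return [((cuts[(i + 1) % len(cuts)] - cuts[i]) % seq_len) or seq_len
                for i in range(len(cuts))]
    bounds = [0] + cuts + [seq_len]
    return [b - a for a, b in zip(bounds, bounds[1:])]


# Hypothetical 28-bp sequence with two EcoRI sites and one BamHI site.
dna = "AAGAATTCTTTTGGATCCAAGAATTCAA"
for enzyme, (site, offset) in ENZYMES.items():
    cuts = find_cut_positions(dna, site, offset)
    print(enzyme, cuts, fragment_lengths(len(dna), cuts))
```

Only the top strand is searched, which is sufficient here because all three sites are palindromic; enzymes with asymmetric sites would also need a reverse-complement search.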

• GenScript
o Online tool
o Restriction Enzyme Map Analysis tools
• Molecular Biology Tools
• Peptide Tools
• Protein Tools
• NEBcutter
o http://tools.neb.com/NEBcutter/index.php3
o Finds the restriction digestion map of a DNA sequence and the large, non-overlapping
open reading frames.
o A GenBank number or a plain/FASTA format DNA sequence can be submitted.
o Enzyme set: NEB enzymes or all commercially available enzymes
o Type of sequence: linear or circular
o Minimum ORF length to display: xxxx amino acids.
o It is from New England Biolabs Inc.
• Restriction endonuclease digestion
o Webcutter 2.0 (U.S.A)
o WatCut (Michael Palmer, University of Waterloo, Canada) – provides restriction analysis
coupled with where the sites are located within genes.
o Restriction Site Analysis – (University of Massachusetts Medical School, U.S.A.) uses H.
Mangalam’s TACG2 program. Provides one with considerable choice of enzymes and
output format, including pseudo gel maps.
o Restriction Enzyme Picker – finds sets of 4 commercially available restriction
endonucleases which together uniquely differentiate designated sequence groups from a
supplied FASTA format sequence file for use in T-RFLP
o NEBcutter- provides opportunities to upload local files, choose from common vector
sequences or enter GenBank accession numbers. Also includes ability to map sites in
genes. After you have the restriction map for this sequence you might want to consult the
New England Biolabs.
o Restriction Analyzer – carry out in silico restriction analysis online. Quickly find absent and
unique sites. Tabular and graphical output. Analyze restriction fragments. Simulate a gel
electrophoresis.
o Restriction Comparator –carry out parallel in silico restriction analysis online. Compare two
sequences side by side. Find distinguishing restriction sites. Visualize restriction patterns.
o WebDSV – is a basic molecular biology app to create, edit and analyze DNA sequences,
mark and visualize sequence features, and generate plasmid maps. With WebDSV you can
analyze restriction sites, perform in silico molecular cloning, and design PCR primers.
o In silico restriction digest of complete genomes – allows in silico digestion of over 300
prokaryotic genomes and simulated pulsed field gel electrophoretic separation of the
fragments.
o Computation of size of DNA and Protein Fragments from Their Electrophoretic Mobility?
o Sequence Extractor – generates a clickable restriction map and PCR primer map of a DNA
sequence (accepted formats are: raw, GenBank, EMBL, and FASTA) offering a great deal of
control on output. Protein translations and intron/exon boundaries are also shown. Use
sequence Extractor to build DNA constructs in silico.
o Promega Restriction Enzyme Tool
• Worldwide.promega.com/resources/tools/retol/
• A sequence is submitted; enzymes are selected by setting the overhang or blunt cut
type.
o SimVector (online)
• 1000 restriction enzymes.
• Complete sequences or fragment sequences
• Linear or circular DNA, which can be uploaded from a file, pasted, or retrieved from
an NCBI link, is analyzed with respect to parameters set by the user.
• www.premierbiosoft.com/plasmid_maps/featuressv/cloning.html
Primer designing
• Oligonucleotide synthesis gives researchers the ability to
construct short fragments of DNA with sequences of their own
choice.
• These oligonucleotides can be used in polymerase chain reactions (PCR) to amplify
existing DNA sequences or to modify sequences. Designing them requires
o Ready access to the collected pool of sequence information and
o A way to extract from this pool only the sequences of interest.
• Primer3 program
• In-silico PCR, Reverse ePCR: used to find the amplification targets of
primers.
• AutoPrime; QuantPrime; PRIMEGENS (sequence-specific
primer design tools).
• Primer-BLAST: designs target-specific primers (global
alignment).
• Primer3; Web Primer; GeneFisher; Primer3Plus; BiSearch;
MFEprimer; Primer Design and Search Tool; PrimerDesign-M;
RF-Cloning; primers4clades; TaxMan.
• Oligonucleotide physicochemical parameters
o NetPrimer; dnaMATE; OligoCalc; OligoAnalyzer 3.1; Mongo Oligo Mass Calculator
v2.06; OligoEvaluator; Oligo Calculation Tool.
Primer3
• It generates candidate primers.
• It offers options such as a primer list or sequencing primers.
• Settings include the species; primer failure rate cutoff values;
primer size; primer Tm; product Tm; primer GC%; number of
primers to return; max 3’ stability; maximum library
mispriming, and others.
• primer3.ut.ee
Web primer
• Primer-BLAST
o Given a sequence (target sequence)
o Generate candidate pairs of primers.
o With target specificity of the primers.
• Real-time PCR primer design
o www.genscript.com/tools/real-time-pcr-taqman-primer-design-tool
o Inputs: a GenBank accession or the DNA sequence; the number of primer sets to
output; the amplicon size range; the minimum, optimum, and maximum melting temperature
for the primers; and the minimum, optimum, and maximum for the probe.

• PRIMO,
o To design primers for large scale DNA sequencing projects.
o Can be downloaded from Chang bioscience (
www.changbioscience.com/primo/primo.html)
• PDA (Primer Design Assistant)
o Web based (dbb.nhri.org.tw/primer/index.html)
• Analysis tools for primers
o OligoCalc (biotools.nubic.northwestern.edu/OligoCalc.html) and Oligo (
http://www.operon.com/tools/oligo-analysis-tool.aspx/)
o Used to calculate molecular weight, GC content, melting temperature, intermolecular
self-hybridization, and intramolecular hairpin loop formation of oligomers or primers.
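Two of the primer properties listed above are easy to compute by hand. The sketch below uses the Wallace rule, Tm = 2(A+T) + 4(G+C) in degrees Celsius, which is only a rough estimate for short oligonucleotides; tools such as OligoCalc also offer salt-adjusted and nearest-neighbor formulas that are not reproduced here. The function names and the 17-nt primer are hypothetical, and the self-complementarity check is deliberately crude (it only detects a fully palindromic primer).

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def gc_content(primer: str) -> float:
    """Percentage of G and C bases in the primer."""
    primer = primer.upper()
    if not primer:
        raise ValueError("empty primer")
    return 100.0 * (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer: str) -> float:
    """Rough melting temperature in deg C: 2 per A/T plus 4 per G/C."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2.0 * at + 4.0 * gc


def reverse_complement(seq: str) -> str:
    return seq.upper().translate(COMPLEMENT)[::-1]


def is_self_complementary(primer: str) -> bool:
    """True if the primer equals its own reverse complement (can pair with itself)."""
    return primer.upper() == reverse_complement(primer)


# Hypothetical 17-nt primer, for illustration only.
primer = "GTAAAACGACGGCCAGT"
print(f"GC% = {gc_content(primer):.1f}")
print(f"Tm (Wallace) = {wallace_tm(primer):.0f} C")
print("self-complementary:", is_self_complementary(primer))
```

For a primer pair, the two Tm estimates would normally be compared so that both primers anneal at a similar temperature.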
Protein sequence and structure analysis
Protein sequence retrieval
• From UniProt
• By entering a list of identifiers. The identifier type can be
changed using the second option, which maps from one identifier type
to another. This is available under Retrieve/ID mapping.
• Alternatively, a segment of a protein sequence
more than two amino acids long can be searched by entering
the sequence and pressing the peptide search button. This is
available from the Peptide search window.
• From the NCBI (EMBL/DDBJ)
o Protein sequences are identified by unique accession numbers and by the name of the
protein
Protein sequence alignment
• Local alignment is mainly used for sequences that
differ in length or share only regions of similarity. The method finds
the best-matching local regions instead of aligning the entire
sequences.
• Sequence similarity can be found and represented by the
o Dot matrix method
o Dynamic programming
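Local alignment by dynamic programming can be sketched with the Smith-Waterman recurrence. The scoring values (match +2, mismatch -1, gap -2), the function name, and the example sequences are assumptions chosen for illustration; real tools use substitution matrices such as BLOSUM62 and affine gap penalties.

```python
from typing import Tuple


def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1,
                   gap: int = -2) -> Tuple[int, str, str]:
    """Return (best local score, aligned part of a, aligned part of b)."""
    n, m = len(a), len(b)
    h = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = h[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Scores never drop below zero, so an alignment can restart anywhere.
            h[i][j] = max(0, diag, h[i - 1][j] + gap, h[i][j - 1] + gap)
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)

    # Trace back from the best cell until a zero cell is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and h[i][j] > 0:
        diag = h[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if h[i][j] == diag:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif h[i][j] == h[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Hypothetical sequences sharing one conserved region, for illustration only.
sw_score, sw_a, sw_b = smith_waterman("CCCCGATTACACCCC", "TTGATTACATT")
print(sw_score, sw_a, sw_b)
```

Unlike a global alignment, the unrelated flanking residues are simply left out of the result, which is what makes the method suitable for sequences of different length.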
• Protein function analysis is done by comparing the protein
sequence to the secondary/derived protein databases that
contain information on motifs, signatures, and protein domains.
• Highly significant hits against these different pattern databases
allow the biochemical function of the query protein to be approximated.
• Sequence analysis also includes evolutionary analysis, identification of mutations,
hydropathy regions, CpG islands, and compositional biases.
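Comparing a sequence against a motif database can be illustrated with a PROSITE-style pattern. The sketch below converts a pattern to a Python regular expression and scans a sequence with it. It handles only the common syntax elements (x, [..], {..}, repeat counts, and the < and > anchors). The example pattern N-{P}-[ST]-{P} is the PROSITE N-glycosylation site pattern; the function names and the test sequence are hypothetical.

```python
import re
from typing import List, Tuple


def prosite_to_regex(pattern: str) -> str:
    """Convert a PROSITE-style pattern such as 'N-{P}-[ST]-{P}' to a regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # {ABC} means "any residue except A, B, C".
        element = element.replace("{", "[^").replace("}", "]")
        # (n) or (n,m) is a repeat count.
        element = element.replace("(", "{").replace(")", "}")
        # Lower-case x means any residue; < and > anchor the sequence ends.
        element = element.replace("x", ".")
        element = element.replace("<", "^").replace(">", "$")
        parts.append(element)
    return "".join(parts)


def find_motif(sequence: str, pattern: str) -> List[Tuple[int, str]]:
    """Return (0-based start, matched text) for every hit, overlaps included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(sequence.upper())]


# Hypothetical protein fragment, for illustration only.
hits = find_motif("MKNGSAQNPTLNVTKNPS", "N-{P}-[ST]-{P}")
print(prosite_to_regex("N-{P}-[ST]-{P}"), hits)
```

Short patterns like this one match by chance quite often, which is why the notes above stress highly significant hits before a function is inferred.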
Alignment preproinsulin
Xenopus MALWMQCLP-LVLVLLFSTPNTEALANQHL
Bos MALWTRLRPLLALLALWPPPPARAFVNQHL
**** : * *.*: *:..* :. *:****

Xenopus CGSHLVEALYLVCGDRGFFYYPKIKRDIEQ
Bos CGSHLVEALYLVCGERGFFYTPKARREVEG
***************:***** ** :*::*

Xenopus AQVNGPQDNELDG-MQFQPQEYQKMKRGIV
Bos PQVG---ALELAGGPGAGGLEGPPQKRGIV
.**. ** * * *****

Xenopus EQCCHSTCSLFQLENYCN
Bos EQCCASVCSLYQLENYCN
**** *.***:*******
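The markers under each block flag conserved columns, with `*` marking identical residues. A percent identity can be computed from any aligned pair as in the sketch below. The function name is hypothetical, and the denominator used here (columns where neither sequence has a gap) is only one of several conventions in use. The example uses the last block of the preproinsulin alignment above.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of identical residues over columns without a gap in either row."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b)
               if x != "-" and y != "-"]
    if not columns:
        return 0.0
    identical = sum(1 for x, y in columns if x == y)
    return 100.0 * identical / len(columns)


# Last alignment block from the slide (Xenopus vs Bos preproinsulin).
xenopus_block = "EQCCHSTCSLFQLENYCN"
bos_block = "EQCCASVCSLYQLENYCN"
print(f"{percent_identity(xenopus_block, bos_block):.1f}% identity")
```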
FASTA
• FASTA (FAST-All): a fast homology search for all sequence types.
• Alignment program for protein and DNA sequences created by Pearson
and Lipman in 1988.
• Uses heuristic algorithms to speed up sequence comparison.
• BLAST finds regions of similarity between biological sequences.
• It compares nucleotide or protein sequences to sequences in databases.
• BLAST specialized searches: SmartBLAST, Primer-BLAST, Global Align, CD-search,
IgBLAST, VecScreen, CDART, Multiple Alignment, and MOLE-BLAST.
• Global Align compares two sequences across their entire
span using the Needleman-Wunsch algorithm.
Global alignment software
GGSEARCH http://nebc.nerc.ac.uk/bioinformatics/docs/ggsearch.html
HMMER http://hmmer.org/
G-PAS http://gpualign.cs.put.poznan.pl/project-gpu-pairAlign.html
EMBOSS Needle https://www.ebi.ac.uk/Tools/psa/emboss_needle/
NW-align https://zhanglab.ccmb.med.umich.edu/NW-align/
MUMmer http://mummer.sourceforge.net/
MCALIGN2 http://www.homepages.ed.ac.uk/pkeightlmcalign/mcinstructions.html
NW http://www.bioinf.org.uk/software/nw/
Stretcher https://galaxy.pasteur.fr/?form=stretcher
SABERTOOTH http://www.fkp.tu-darmstadt.de/sabertooth_source/
Local alignment software
BLAST https://blast.ncbi.nlm.nih.gov/Blast.cgi
HMMER http://hmmer.org/
PSI-BLAST https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMCD=Web&PAGE=Proteins&blastp&RUN_PSIBLAST=on
FASTA https://www.ebi.ac.uk/Tools/sss/fasta/
EMBOSS water https://www.ebi.ac.uk/Tools/psa/emboss_water/
Matcher https://ebi.ac.uk/Tools/psa/emboss_matcher/
SAM https://web.archive.org/web/20080509161215/http://www.cse.ucsc.edu/research/compbio/sam.html
SWIMM https://github.com/enzorucci/SWIMM
ALLALIGN http://www.allalign.com/
SWIPE http://dna.uio.no/swipe/

Protein structure prediction (primary, secondary, and tertiary structure)
• Comparison of a protein with databases of known structures.
• The function of a protein is more directly a consequence of its
structure than of its sequence.
• Structurally homologous proteins tend to share functions.
• Determining a protein’s 2D/3D structure is therefore crucial.
• The 3D structure of macromolecules is determined by four
fundamental techniques: X-ray crystallography, nuclear
magnetic resonance (NMR) spectroscopy, cryo-electron
microscopy (Cryo-EM), and neutron diffraction.
• Although these techniques are viable and invaluable, they
cannot build an atomic structure model from scratch without
prior knowledge of the protein’s chemical and physical
properties and primary sequence.
Comparison of X-ray crystallography, NMR, and Cryo-EM
Experimental steps
X-ray crystallography:
1. X-rays are scattered by electrons in the atoms of the crystal.
2. The scattering is recorded on a detector, e.g., CCDs.
3. Phase estimation and calculation of the electron density map.
4. Fit the primary sequence to the electron density map (model).
5. Model refinement.
6. Deposition in PDB.
NMR:
1. Molecules absorb radiofrequency radiation while held in a strong magnetic field.
2. Resonance frequency detection, influenced by the chemical environment.
3. Collection of conformational interatomic distance constraints.
4. Calculation of the 3D structure.
5. Deposition in PDB.
Cryo-EM:
1. The sample is vitrified at liquid nitrogen temperature.
2. A high-energy electron beam passes through it under high vacuum.
3. An image is produced when transmitted electrons are projected onto a detector.
4. Structure determination.
Specimen
X-ray crystallography: crystals. NMR: solution. Cryo-EM: vitrified solution.
Protein size
X-ray crystallography: wide range. NMR: below 40-50 kDa. Cryo-EM: >150 kDa.
Contribution
X-ray crystallography: >89% of PDB entries. NMR: >9%. Cryo-EM: >1%.
Resolution
X-ray crystallography: higher resolution. NMR: high resolution. Cryo-EM: significantly lower (>3.5 Å).
Advantages
X-ray crystallography: well developed, accurate, easy for model building.
NMR: provides dynamic information.
Cryo-EM: easy sample preparation; samples in their native environment.
Disadvantages
X-ray crystallography: crystallization step; slow process.
NMR: high-purity sample is required; less precise than X-ray.
Cryo-EM: cost; mainly for large molecules and assemblies; intensive computational simulations.
• Structural databases
o Protein Data Bank wwPDB (www.wwpdb.org)
• RCSB PDB (http://www.rcsb.org): US partner
• PDBe (http://www.ebi.ac.uk/pdbe/): European partner
• PDBj (https://pdbj.org/): Japanese partner
• BMRB (http://www.bmrb.wisc.edu/): Biological Magnetic Resonance Data Bank
• NCBI structure resource (https://www.ncbi.nlm.nih.gov/structure/)
o PDBsum (https://www.ebi.ac.uk/pdbsum): atlas of proteins and web server
• EC-PDB (enzyme structure database)
• DrugPort
• ProFunc server
• SAS
o sc-PDB (3D database of ligandable binding sites) (http://bioinfopharmaa.u-strasbg.fr/scPDB/)
• Metal ions are not included, and the ligands are classified as:
• Nucleotides of size <4 bases
• Peptides <9 amino acids
• Cofactors and
• Organic compounds
o PDBTM: Protein Data Bank of Transmembrane Proteins (http://pdbtm.enzim.hu)
• Uses the TMDET algorithm to find transmembrane proteins in the PDB
o CATH (Class, Architecture, Topology, Homology) database
• Classifies protein domains according to amino acid sequence and structural
and functional properties
• Has the hierarchy class (C), architecture (A), topology (T), and homologous
superfamily (H) = CATH
• C level: four groups based on secondary structure content (mainly alpha, mainly beta,
alpha-beta, and few secondary structures)
• A level: general orientation of secondary structures
• T level: connectivity of secondary structures
• H level: combination of sequence similarity and structural similarity.
• http://www.biochem.ucl.ac.uk/bsm/cath_new/index.html
o SCOP (Structural Classification of Proteins) database
• Focuses on the structural and evolutionary classification of proteins (
http://scop.mrc-lmb.cam.ac.uk/scop/) and has a hierarchical scheme
• Family -> superfamily -> common fold -> class -> multi domain
• Updated as SCOP2 (http://scop2.mrc-lmb.cam.ac.uk/)
o VAST (Vector Alignment Search Tool)
• http://www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml
o CE (combinatorial extension of the optimal path):
• http://cl.sdsc.edu
o FSSP (fold classification based on structure-structure alignment of proteins)
• http://www2.ebi.ac.uk/dali/fssp/
PROSPECT
• PROSPECT (PROtein Structure Prediction and Evaluation
Computer ToolKit)
o Protein structure prediction system
o Uses the protein threading technique to construct a protein’s 3D model
COPIA
• COPIA (Consensus Pattern Identification and Analysis)
• Protein structure analysis tool for discovering motifs
(conserved regions) in a family of protein sequences
• Used to identify membership to the family for new protein
sequences, predict secondary and tertiary structure and
function of proteins
• Used to study the evolutionary history of the sequences
• Structural databases are storage platforms that are devoted to
the three-dimensional structural information of
macromolecules.
• Protein
o Primary
o Secondary
o Tertiary and
o Quaternary
• DNA
Database (link): use/description
PDBj (https://pdbj.org/): Protein Data Bank Japan; archives macromolecular structures and provides integrated tools
BMRB (http://www.bmrb.wisc.edu/): Biological Magnetic Resonance Data Bank (NMR), a repository for data from NMR spectroscopy on proteins, peptides, nucleic acids, and other biomolecules
PDBe (http://www.ebi.ac.uk/pdbe/): Protein Data Bank in Europe; archives biological macromolecular structures
RCSB PDB (https://www.rcsb.org/): Research Collaboratory for Structural Bioinformatics Protein Data Bank; archives information about the 3D shapes of proteins, nucleic acids, and complex assemblies
PDBsum (www.ebi.ac.uk/pdbsum): pictorial analysis of macromolecular structures
Table: Primary structural data centers and other browsers
CATH (http://www.cathdb.info/): domain classification of structures
SCOP (http://scop2.mrc-lmb.cam.ac.uk/): SCOP2, structural and evolutionary classification
Table: Structure classification databases
NDB (http://ndbserver.rutgers.edu/): nucleic acid database
RNA FRABASE (http://rnafrabase.cs.put.poznan.pl/): 3D structure of RNA fragments
NPIDB (http://npidb.belozersky.msu.ru/): 3D structures of nucleic acid-protein complexes
Table: Nucleic acid databases
MemProtMD (http://sbcb.bioch.ox.ac.uk/memprotmd/): database of membrane proteins
PeptiSite (http://peptisite.ucsd.edu/): a comprehensive and reliable database of biologically and structurally characterized peptide-binding sites that can be identified experimentally from co-crystal structures in the Protein Data Bank
ComSin (http://antares.protres.ru/comsin/): database of protein structures in bound (complex) and unbound (single) states in relation to their intrinsic disorder
MetalPDB (http://metalweb.cerm.unifi.it/): collects and allows easy access to the knowledge on metal sites in biological macromolecules
Pocketome (http://www.pocketome.org/): an encyclopedia of conformational ensembles of druggable binding sites that can be identified experimentally from co-crystal structures in the wwPDB
MIPS (http://dicsoft2.physics.iisc.ernet.in/cgi-bin/mips/query.pl): a database of all the metal-containing proteins available in the Protein Data Bank
DALI (http://ekhidna2.biocenter.helsinki.fi/dali/): the Dali server is a service for comparing protein 3D structures
VAST+ (https://structure.ncbi.nlm.nih.gov/Structure/VAST/vastsearch.html): Vector Alignment Search Tool, a web-based tool for comparing a 3D structure against all structures in the Molecular Modeling Database (MMDB), NCBI
CE (http://source.rcsb.org/ceHome.jsp): a method for comparing and aligning protein structures
PTM-SD (http://www.dsimb.inserm.fr/dsimb_tools/PTM-SD/): posttranslational modification database
PED3 (http://pedb.vib.be/): Protein Ensemble Database, the database of conformational ensembles describing flexible proteins
GFDB (http://www.glycanstructure.org/): Glycan Fragment Database, identifying PDB structures with biologically relevant carbohydrate moieties and classifying PDB glycan structures based on their primary sequence and glycosidic linkage
ChEBI (https://www.ebi.ac.uk/chebi/): Chemical Entities of Biological Interest, a database focused on “small” chemical compounds
ChEMBL (https://www.ebi.ac.uk/chembl/): a database of bioactive drug-like small molecules
• Structure comparison servers
o Protein structures are more conserved than protein sequences.
• Structure superposition is the spatial fitting of two structures that already have
known equivalent points (usually C-alpha atoms); it aims to find the best match between the structures.
• Structure alignment does not require prior information; it finds similarities between two
or more 3D structures based on the 3D information alone.
• Tools include Combinatorial Extension (CE), PDBeFold, VAST+, and DALI/DaliLite.
• Structural alignment can be done in different ways.
o Pairwise structure alignment (www.rcsb.org/3d-view): aligns your primary sequence with a
structure in the PDB database.
o TM-align: aligns/compares two structures independently of their sequences. It generates an
optimized residue-to-residue alignment based on structural similarity and a TM-score
in the range (0,1]. Zhanglab.med.umich.edu/TM-align/
o Dali protein structure comparison server. It can perform four types of structure
comparisons:
• Heuristic PDB search – compares one query structure against those in the Protein
Data Bank.
• Exhaustive PDB25 search – compares one query structure against a representative
subset of the Protein Data Bank.
• Pairwise structure comparison – compares one query structure against those
specified by the user.
• All against all structure comparison – returns a structural similarity dendrogram for
a set of structures specified by the user.
• ekhidna.biocenter.helsinki.fi/dali/
o VAST (Vector Alignment Search Tool): identifies similar protein 3D structures by
geometric criteria (homologs)
• Takes a PDB ID or MMDB ID
• Generates a list of similar protein structures with aligned residues, sequence
identity, and other information
o SCALI: Structural Core ALIgnment of proteins
• Aligns two PDB files using sequential or non-sequential alignment methods.
o LGA: Local-Global Alignment. A method for finding 3D similarities in two protein
structures.
• There are a number of input options:
• Protein PDB codes as A and B (1sip_A 1cpi_B)
• PDB codes in separate boxes or uploaded files
• An uploaded file containing the data for two structures; each molecule starts with
MOLECULE and ends with END
• Proteinmodel.org/AS2TS/LGA/lga.html
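Two scores that these servers report can be sketched in Python. RMSD is the root-mean-square distance between equivalent atoms; the TM-score formula below uses the length-dependent scale d0 = 1.24 (L - 15)^(1/3) - 1.8. The sketch assumes the two structures are already superposed and that the residue pairing is known, so it evaluates a single superposition rather than searching for the best one as TM-align does. The function names, the lower bound of 0.5 on d0 for short chains, and the coordinates are assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def pair_distances(a: Sequence[Coord], b: Sequence[Coord]) -> List[float]:
    """Distances between equivalent (already paired and superposed) atoms."""
    if len(a) != len(b):
        raise ValueError("coordinate lists must be paired one-to-one")
    return [math.dist(p, q) for p, q in zip(a, b)]


def rmsd(a: Sequence[Coord], b: Sequence[Coord]) -> float:
    """Root-mean-square deviation over the paired atoms."""
    distances = pair_distances(a, b)
    if not distances:
        raise ValueError("no atom pairs")
    return math.sqrt(sum(d * d for d in distances) / len(distances))


def tm_d0(target_length: int) -> float:
    """Distance scale of the TM-score; clamped to 0.5 for short chains (assumption)."""
    if target_length <= 15:
        return 0.5
    return max(0.5, 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8)


def tm_score(distances: Sequence[float], target_length: int) -> float:
    """TM-score of one given superposition, normalized by the target length."""
    d0 = tm_d0(target_length)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length


# Hypothetical C-alpha coordinates, for illustration only.
ca_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
ca_b = [(0.0, 0.5, 0.0), (3.8, 0.0, 0.5), (7.6, 0.0, 0.0)]
print(f"RMSD = {rmsd(ca_a, ca_b):.3f}")
print(f"TM-score (target length 100) = {tm_score(pair_distances(ca_a, ca_b), 100):.4f}")
```

RMSD grows with every badly fitting residue, while each residue contributes at most 1/L to the TM-score, which is why the latter stays within (0,1] and is less sensitive to a few outliers.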
Protein structure visualization with RasMol, etc.
• Before computer visualization software was developed,
molecular structures were presented as physical models made of
metal wires, rods, and spheres. With the development of
computer hardware and software technology, computer
graphics programs were developed for visualizing and
manipulating three-dimensional structures. Computer
graphics help to analyze and compare protein structures to gain insight into
the function of the protein.
• Molecular visualization helps scientists to bioengineer
protein molecules. User-friendly graphical interfaces make this
area of bioinformatics scientifically rewarding.
RasMol(stand alone)
• RasMol is a molecular graphics program intended for the visualization
of proteins, nucleic acids and small molecules. The program is aimed at
display, teaching and generation of publication quality images.
• The program reads in a molecule coordinate file and interactively
displays the molecule on the screen in a variety of colour schemes and
molecule representations. Currently available representations include
depth-cued wireframes, 'Dreiding' sticks, spacefilling (CPK) spheres,
ball and stick, solid and strand biomolecular ribbons, atom labels and
dot surfaces.
• Supported input file formats include Protein Data Bank (PDB), Tripos
Associates' Alchemy and Sybyl Mol2 formats, Molecular Design
Limited's (MDL) Mol file format, Minnesota Supercomputer Center's
(MSC) XYZ (XMol) format, CHARMm format, CIF format and
mmCIF format files.
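What "reads in a molecule coordinate file" means for the PDB format can be sketched in Python. ATOM and HETATM records are fixed-width lines, so fields are taken by column position (serial 7-11, atom name 13-16, residue name 18-20, chain 22, residue number 23-26, x/y/z 31-54). The helper that builds demo lines, the function names, and the coordinates are hypothetical; this is not RasMol's own parser.

```python
from typing import Iterable, List, NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain: str
    res_seq: int
    x: float
    y: float
    z: float


def parse_atom_line(line: str) -> Atom:
    """Slice one fixed-width ATOM/HETATM record (0-based Python slices)."""
    return Atom(
        serial=int(line[6:11]),
        name=line[12:16].strip(),
        res_name=line[17:20].strip(),
        chain=line[21],
        res_seq=int(line[22:26]),
        x=float(line[30:38]),
        y=float(line[38:46]),
        z=float(line[46:54]),
    )


def read_atoms(lines: Iterable[str]) -> List[Atom]:
    """Keep only coordinate records; headers, TER and END lines are skipped."""
    return [parse_atom_line(l) for l in lines if l.startswith(("ATOM  ", "HETATM"))]


def format_atom_line(serial: int, name: str, res_name: str, chain: str,
                     res_seq: int, x: float, y: float, z: float) -> str:
    """Hypothetical helper that builds a demo record with the right column widths."""
    occupancy, b_factor = 1.00, 0.00
    return (f"ATOM  {serial:>5} {name:<4} {res_name:>3} {chain}{res_seq:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{occupancy:6.2f}{b_factor:6.2f}")


# Made-up coordinates, for illustration only.
demo_lines = [
    "REMARK demo record",
    format_atom_line(1, "N", "MET", "A", 1, 27.340, 24.430, 2.614),
    format_atom_line(2, "CA", "MET", "A", 1, 26.266, 25.413, 2.842),
    "TER",
    "END",
]
atoms = read_atoms(demo_lines)
for atom in atoms:
    print(atom)
```

A viewer then only needs the coordinates and atom names to draw wireframe, ball-and-stick, or spacefilling representations.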
CHIME
• CHIME is derived from Chemical MIME
o A free program that shows molecular structures in three dimensions.
• PyMOL
o Download and install it on your computer.
o Pymol.org
Proteopedia
• Proteopedia.org/wiki/fgij/
• A structure is viewed by submitting its PDB ID, for example hemoglobin 1A3N
NCBI structure
• Or the iCn3D viewer
Homology modeling
• Homologous sequences are sequences that are related by
divergence from a common ancestor.
• Degree of similarity between two sequences can be measured
while their homology is a case of being either true or false.
• Homology modeling is useful when the target protein (with
a known sequence and an unknown structure) is related to at
least one other protein with both a known sequence and a
known structure.
• The quality of the structure predicted by homology modeling
depends on the degree of similarity between the target and
template sequences.
• The 3D structure of a protein is obtained through homology modeling
with the following steps:
o Template search: identifying a suitable template for the target sequence using BLAST
o Sequence alignment
o Alignment correction to ensure that conserved or functionally important residues are
aligned
o Backbone generation
o Loop modeling
o Side chain modeling using rotamer libraries
o Optimizing the model using energy minimization
o Validating the model by stereochemical evaluation


• MODELLER
o The user provides an alignment of the sequence to be modeled with known related structures
o Downloadable
• MaxMod
o It is a graphical user interface to the MODELLER program
PyMod
• PyMod is an open source PyMOL plugin, designed to act as an
interface between PyMOL and several bioinformatics tools
(for example: BLAST+, HMMER, Clustal Omega, MUSCLE,
PSIPRED and MODELLER).
PRIMO
• PRotein Interactive MOdeling (PRIMO)
o Template identification
o Target-template sequence alignment
o Modeling and model evaluation
SWISS-MODEL
• is a fully automated protein structure homology-modelling server,
accessible via the Expasy web server, or from the program DeepView
(Swiss Pdb-Viewer).
• The purpose of this server is to make protein modelling accessible to all
life science researchers worldwide.
• FoldX is an empirical force field that was developed for the rapid
evaluation of the effect of mutations on the stability, folding and dynamics
of proteins and nucleic acids. The core functionality of FoldX is the
calculation of the free energy of a macromolecule based on its
high-resolution 3D structure.
Application of bioinformatics
Introduction to systems biology
Cheminformatics
• It includes
o Synthesis planning; reaction and structure retrieval; 3D structure retrieval; Modeling;
computational chemistry; visualization tools and Utilities ;
• It focuses on storing, indexing, searching, retrieving, and
applying information about chemical compounds. It involves
organizing chemical data in a logical form to facilitate the
retrieval of chemical properties, structures, and their
relationships.
• It is possbilet
Bioinformatics projects
• BioJava
• BioPerl
• BioXML
• Biocorba
• Ensembl
• Bioperl-db
• Biopython and biojava
• Information from images includes
o Magnetic resonance imaging (MRI)
o Computed tomography (CT)
o Positron emission tomography (PET)
o Single-photon emission computed tomography (SPECT)
o Functional magnetic resonance imaging (fMRI)
o Electroencephalography (EEG)
o Magnetoencephalography (MEG)
o Cryosectioning (2D images)
• Image data is dynamic
o It changes from moment to moment, over milliseconds, minutes, hours, days,
weeks, or longer.
• Four elements of databases
o Biological objects (sequences, proteins, cells, organisms)
o Relationships among objects
o Classifiers to relate objects to one another
o Metadata, or data about the data
• Genome annotation
o The analysis and management of genome data to predict and archive various kinds of
biological features, particularly genes, biologic signals, sequence characteristics, and
gene products.
• Blocks database: consists of ungapped multiple alignments of
short regions called ‘blocks’. The database was constructed
from sequences of protein families using a fully automated
method.
• The database takes the protein families from PROSITE. The
blocks representing a protein family are generated using a
two-step system called PROTOMAT.
o 1. A motif finder finds triplets of amino acids that are common to multiple sequences.
These serve as candidate blocks for the group.
o 2. The second step assembles the best blocks that are consistently found in most sequences.
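The first step, finding triplets shared by several sequences, can be approximated with a short Python sketch. This is a deliberate simplification: it counts contiguous three-residue words only, and the PROTOMAT motif finder itself is not reproduced here. The function names and the three toy sequences are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Set


def kmers(sequence: str, k: int = 3) -> Set[str]:
    """All distinct contiguous words of length k in one sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def shared_triplets(sequences: List[str],
                    min_count: Optional[int] = None) -> List[str]:
    """Triplets present in at least min_count sequences (default: all of them)."""
    needed = len(sequences) if min_count is None else min_count
    counts: Counter = Counter()
    for seq in sequences:
        counts.update(kmers(seq.upper()))  # each sequence counts a word once
    return sorted(word for word, c in counts.items() if c >= needed)


# Hypothetical protein family, for illustration only.
family = ["MKVLAAGIC", "TTVLAAGPP", "QVLAAGW"]
print(shared_triplets(family))
print(shared_triplets(family, min_count=2))
```

Overlapping shared words (here VLA, LAA, AAG) hint at one longer conserved region, which is the kind of candidate an assembly step would merge into a block.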
Bioinformatics resources
for proteomics data
• Proteomics has applications such as
o Identification of peptides and proteins
o The study of post-translational modifications
o The quantification of protein levels
o The characterization of protein structure and
o Identification of protein interactions with other biomolecules
• ExPASy
o Contains a listing of free proteomics databases and tools
• Bio.tools
o Listing of free proteomics tools
• omicX
o Selective, manually curated listings of tens to hundreds of proteomics and protein analysis tools
in various categories. Commercial.
• ms-utils.org
o Free MS data tools
• OBRC (Online Bioinformatics Resources Collection)
o Part of the University of Pittsburgh’s Health Sciences Library System. Lists databases and tools.
• OReFiL (Online Resource Finder for Lifesciences)
o Users can search for peer-reviewed literature related to online bioinformatics resources, mined
from peer-reviewed papers in MEDLINE and PubMed.
• European Bioinformatics Institute and National Center for
Biotechnology Information
o Each provides a small listing of free EMBL-EBI/NCBI-supported proteomics databases and tools,
such as the PRIDE tool suite (MS-based proteomics data)
• GitHub
o A rich source of packages and scripts for proteomic and other bioinformatic analysis.
Searching for “proteo” returns tools written in R or Python.
• Galaxy
o Proteomic Galaxy tools are listed at https://toolshed.g2.bx.psu.edu/ which was previously for
tools related to the analysis of next-generation sequencing data.
• About proteomic pipelines
o Studies usually require their own unique sequence of analyses, so researchers must
assemble the needed scripts, tools, and software into customized analysis pipelines.
o There are several online proteomic pipelines available that integrate all the tools required
for certain applications into one suite of software. Some even integrate open-source tools
with commercial tools. Users can choose which of the tools on offer to incorporate into
their analysis. Examples include ProteoSuite, CPAS, CORRA, ProteoWizard, the
Trans-Proteomic Pipeline (TPP), PIPE, and the OpenMS Proteomics Pipeline (TOPP).
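As a rough sketch of what assembling a customized pipeline means in practice, the example below chains a few small steps so that the output of one becomes the input of the next. The step functions, their parameters and the peptide strings are hypothetical placeholders and do not correspond to any of the tools named above.

```python
from functools import partial


def run_pipeline(data, steps):
    """Apply each step in order, feeding its result to the next step."""
    for step in steps:
        data = step(data)
    return data


def filter_short_peptides(peptides, min_length=6):
    """Drop peptides shorter than min_length residues."""
    return [p for p in peptides if len(p) >= min_length]


def deduplicate(peptides):
    """Remove repeated peptides and return them in sorted order."""
    return sorted(set(peptides))


# Made-up peptide strings, used only to show the data flow.
peptides = ["PEPTIDEK", "AAK", "PEPTIDEK", "ELVISLIVESK"]

# A study-specific pipeline is just a chosen list of steps; partial()
# fixes the parameters of a step before it is placed in the list.
steps = [partial(filter_short_peptides, min_length=6), deduplicate]
print(run_pipeline(peptides, steps))
```

Swapping, removing or adding entries in `steps` changes the analysis without touching the individual tools, which is the same flexibility the pipeline suites offer when users pick which tools to include.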
