PRESIDENCY UNIVERSITY, BENGALURU
School of Engineering
Department of Computer Science & Engineering
Introduction to Bioinformatics
CSE 3069
V Semester 2023-24
Module 2
Genomic Database and
Sequence similarity
Bioinformatics Resources and Tools
• A database is a structured collection of records stored in a
computer system.
• Genomic databases typically store DNA or protein sequences
as well as annotated information about those sequences.
• There are hundreds of genomics databases: some are
comprehensive, but are not carefully curated (GenBank),
while others are carefully curated, but are narrow (FlyBase).
• Bioinformatic tools are computer programs that
analyze one or more sequences.
• There is a dizzying array of bioinformatic tools that can
analyze sequences to find protein domains (Pfam), or
that can search through databases of millions of
sequences to find ones that are similar (BLAST) or that
can find potential protein-coding regions (ORF-Finder).
Many are freely available over the web.
Nucleotide Sequence Databases (the principal
ones)
• NCBI - National Center for Biotechnology Information
• EBI - European Bioinformatics Institute
• DDBJ - DNA Data Bank of Japan
Database Searching by Sequence Similarity
• BLAST @ NCBI
• PSI-BLAST @ NCBI
• FASTA @ EBI
• BLAT
Protein Sequence Databases
• SWISS-PROT & TrEMBL - Protein sequence database
and computer annotated supplement
• UniProt - UniProt (Universal Protein Resource) is the
world's most comprehensive catalog of information
on proteins.
• PIR - Protein Information Resource
• MIPS - Munich Information centre for Protein
Sequences
• HUPO - HUman Proteome Organization
Sequence Alignment
• USC Sequence Alignment Server - align 2
sequences with all possible varieties of dynamic
programming
• ClustalW @ EBI - multiple sequence alignment
• Spidey - an mRNA-to-genomic alignment program
• Wise2 - align a protein or profile HMM against
genomic sequence to predict a gene structure, and
related tools
• PipMaker - computes alignments of similar regions
in two (long) DNA sequences
• VISTA - align + detect conserved regions in long
genomic sequence.
List of Open Source Tools available for
Bioinformatics
• geWorkbench
• BioPerl
• UGENE
• BioJava
• Biopython
• InterMine
• IGV (Integrative Genomics Viewer)
• GROMACS
• Taverna Workbench
• EMBOSS
• Clustal Omega
• BLAST
• BEDTools
• Bioclipse
• Bioconductor
Types and classification of genome
databases
Genomics refers to the study of the structure and function
of the entire genome of a living organism. Genome refers
to the basic set of chromosomes.
Plant Genomics:
• It deals with the study of the structure and function of
the entire genome of a plant species.
Animal Genomics:
• It deals with the study of the structure and function of
the entire genome of an animal species.
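Eukaryotic Genomics:
• It deals with the study of the structure and function of
the entire genome of eukaryotes, i.e. organisms whose
cells have a nucleus (including all higher,
multicellular organisms).
Prokaryotic Genomics:
• It deals with the study of the structure and function of
the entire genome of prokaryotes, i.e. unicellular
organisms without a nucleus (bacteria and archaea).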
Types of Databases
Primary databases
• Primary databases are also called archival databases.
• They are populated with experimentally derived data
such as nucleotide sequence, protein sequence or
macromolecular structure.
• Experimental results are submitted directly into the
database by researchers, and the data are essentially
archival in nature.
• Once given a database accession number, the data in
primary databases are never changed: they form part
of the scientific record.
Examples
• ENA, GenBank and DDBJ (nucleotide sequence)
• ArrayExpress Archive and GEO (functional genomics
data)
• Protein Data Bank (PDB; coordinates of three-
dimensional macromolecular structures)
Secondary databases
• Secondary databases comprise data derived from the
results of analysing primary data.
• Secondary databases often draw upon information
from numerous sources, including other databases
(primary and secondary), controlled vocabularies and
the scientific literature.
• They are highly curated, often using a complex
combination of computational algorithms and manual
analysis and interpretation to derive new knowledge
from the public record of science.
Examples
• InterPro (protein families, motifs and domains)
• UniProt Knowledgebase (sequence and functional
information on proteins)
• Ensembl (variation, function, regulation and more layered
onto whole genome sequences)
Composite Databases
• The data entered in these types of databases are first
compared and then filtered based on desired criteria.
• The initial data are taken from the primary database, and
then they are merged together based on certain
conditions.
• Because composite databases contain non-redundant data,
they allow sequences to be searched rapidly.
Examples:
• OWL, NRDB and Swiss-Prot + TrEMBL
Genomic Databases
They contain data related to the genome sequencing of
different organisms, and gene annotations.
• Human Genome Databases: include information on
human gene sequencing.
• Model Organism Databases (MOD): store data
coming from the sequencing projects of model organisms
(e.g., MATDB); they are also intended to support
the Human Genome Project (HGP).
• Other Organism Databases: store information derived
from sequencing projects not related to the HGP.
• Organelle Databases: store genomic data of cellular
organelles, such as mitochondria, that have their own
genome, distinct from the nuclear genome.
• Virus Databases: store virus genomes.
Database-Searching
• The amount of biologically relevant data is increasing so
rapidly that knowing how to access and search this
information is essential.
• There are three data retrieval systems of particular
relevance to molecular biologists:
– Sequence Retrieval System (SRS)
– Entrez
– DBGET
• These systems allow text searching of multiple molecular
biology databases and provide links to relevant information
for entries that match the search criteria.
• The three systems differ in the databases they search and
in the links they provide to other information.
Sequence Retrieval System (SRS)
• SRS is a homogeneous interface to over 80 biological
databases that was developed at the European
Bioinformatics Institute (EBI) at Hinxton, UK.
• It includes databases of sequences, metabolic
pathways, transcription factors, application results (like
BLAST, SSEARCH, FASTA), protein 3-D structures,
genomes, mappings, mutations, and locus specific
mutations.
• The web page listing all the databases contains a link
to a description page about the database including the
date on which it was last updated.
• The SRS is highly recommended for use.
Entrez
• Entrez is a molecular biology database and retrieval
system.
• It was developed by the National Center for Biotechnology
Information (NCBI).
• It is an entry point for exploring distinct but integrated
databases.
• Of the three text-based database systems, Entrez is
the easiest to use, but it also offers more limited
information to search.
DBGET
• DBGET is an integrated database retrieval system,
developed at the University of Tokyo.
• It provides access to 20 databases, one at a time.
• Because its options are more limited, DBGET is less
recommended than the other two.
File Formats in Bioinformatics
The FASTQ format was developed by the Sanger Institute to
group together a sequence and its quality scores. In
FASTQ files each entry consists of 4 lines.
• Line 1 begins with a '@' character, followed by a sequence
identifier and an optional description.
• Line 2 contains the sequence in standard one-letter code.
• Line 3 begins with a '+' character, optionally
followed by the same sequence identifier (and any
additional description) again.
• Line 4 encodes the quality values for the sequence in
Line 2, and must contain the same number of symbols
as there are letters in the sequence.
Example:
@K00188:208:HFLNGBBXX:3:1101:1428:1508 2:N:0:CTTGTA
ATAATAGGATCCCTTTTCCTGGAGCTGCCTTTAGGTAATGTAGTATCTNATNGACTGNCNCCANANGGCTAAAGT
+
AAAFFJJJJJJJJJJJJJJJJJFJJFJJJJJFJJJJJJJJJJJJJJJJ#FJ#JJJJF#F#FJJ#F#JJJFJJJJJ
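The four-line record layout can be read with a few lines of Python. The sketch below is only an illustration, not a full parser: the function name parse_fastq and the sample reads are hypothetical, records are assumed not to be wrapped across lines, and the quality line is assumed to use the common Phred+33 encoding (score = ASCII code minus 33), which the slide does not specify.

```python
def parse_fastq(text):
    """Parse FASTQ text into (identifier, sequence, qualities) tuples.

    Assumes unwrapped four-line records and Phred+33 quality encoding.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ records must have exactly four lines each")

    records = []
    for i in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        # Phred+33 (assumed): quality score = ASCII code - 33
        scores = [ord(symbol) - 33 for symbol in quality]
        records.append((header[1:], sequence, scores))
    return records


# Hypothetical two-read example (not taken from a real sequencing run)
example = "@read1 demo\nACGTN\n+\nIIII#\n@read2\nGGCC\n+read2\nFFFF\n"
for identifier, sequence, scores in parse_fastq(example):
    print(identifier, sequence, scores)
```

Under this encoding the '#' symbol decodes to a score of 2, which is why it typically accompanies an uncalled base (N).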
• The FASTA format is a simple way of representing
nucleotide sequences of nucleic acids or amino acid
sequences of proteins.
• This is a very basic format with a minimum of two lines.
The first line, referred to as the comment line, starts
with '>' and gives basic information about the sequence.
• After the comment line, the nucleic acid or protein
sequence is given in standard one-letter code.
• Any tabs, spaces, asterisks, etc. in the sequence are
ignored.
Example:
>XR_002086427.1 Candida albicans SC5314 uncharacterized ncRNA (SCR1), ncRNA
TGGCTGTGATGGCTTTTAGCGGAAGCGCGCTGTTCGCGTACCTGC
TGTTTGTTGAAAATTTAAGAGCAAAGTGTCCGGCTCGATCCCTGC
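A minimal sketch of reading this format in Python follows. The function name parse_fasta and the sample records are hypothetical; the sketch keeps only letters and gap hyphens from sequence lines, which is one simple way of ignoring the tabs, spaces and asterisks mentioned above.

```python
def parse_fasta(text):
    """Return a dict mapping each comment line (without '>') to its sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; everything after '>' is the comment line
            name = line[1:]
            records[name] = []
        elif name is not None:
            # Keep letters and gap symbols; drop spaces, tabs, asterisks, digits
            cleaned = "".join(ch for ch in line if ch.isalpha() or ch == "-")
            records[name].append(cleaned)
    return {key: "".join(parts) for key, parts in records.items()}


# Hypothetical two-record example
example = ">seq1 demo\nACGT ACGT\nTT*\n>seq2\nMKV\n"
print(parse_fasta(example))
```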
GenBank:
• The GenBank format, which is used by public
databases such as those at NCBI, is arguably the
de facto standard sequence file format.
• The GenBank format is very flexible, allowing
annotations, comments, and references to be included.
• Because the file is plain text, it can be viewed using a
text editor. The file extension '.gb' or '.genbank' is
commonly used for GenBank files.
• The start of the sequence is marked by a line
containing "ORIGIN" and the end of the sequence is
marked by two slashes ("//").
EMBL:
• The EMBL format, which is similar to the GenBank
format in appearance, is used by public databases such
as that of the European Molecular Biology Laboratory.
PDB:
• The PDB file format is used to store both
sequence data and, more importantly, three-dimensional
structure data.
• This information can be used to visualize a
molecule's crystal structure (typically a protein).
• PDB files are plain text files that can be viewed
with a text editor and usually have the extension
'.pdb'.
SAM:
• The Sequence Alignment Map (SAM) format is used to
store sequence alignment information in text format. It is
divided into an optional header section and an alignment
section. Each header line starts with '@', and the header
may span multiple lines.
PIR
A sequence in PIR format consists of:
• One line starting with
1. a ">" (greater-than) sign, followed by
2. a two-letter code describing the sequence type (P1, F1,
DL, DC, RL, RC, or XX), followed by
3. a semicolon, followed by the sequence identification.
• One line containing a textual description of the sequence.
• One or more lines containing the sequence itself. The end
of the sequence is marked by a "*" (asterisk) character.
• A file in PIR format may comprise more than one
sequence.
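The PIR layout described above can be parsed as sketched below. This is an illustrative reader, not a reference implementation: the function name parse_pir, the dictionary keys and the two sample entries are hypothetical.

```python
def parse_pir(text):
    """Parse PIR-formatted text into a list of record dictionaries."""
    records = []
    lines = iter(text.splitlines())
    for line in lines:
        line = line.strip()
        if not line.startswith(">"):
            continue
        # Header: '>' + two-letter type code + ';' + sequence identification
        type_code, _, identifier = line[1:].partition(";")
        # The next line is the free-text description
        description = next(lines, "").strip()
        chunks = []
        for seq_line in lines:
            seq_line = seq_line.strip()
            if seq_line.endswith("*"):
                # The asterisk marks the end of this sequence
                chunks.append(seq_line[:-1])
                break
            chunks.append(seq_line)
        sequence = "".join("".join(chunks).split())
        records.append({
            "type": type_code,
            "id": identifier,
            "description": description,
            "sequence": sequence,
        })
    return records


# Hypothetical file with two entries
example = (
    ">P1;DEMO1\nHypothetical demo protein\nMKT AIL\nGGW*\n"
    ">DL;DEMO2\nHypothetical demo DNA\nACGT\nACGT*\n"
)
for record in parse_pir(example):
    print(record["type"], record["id"], record["sequence"])
```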
Two letter code of the sequence:
• P1 - Protein (complete)
• F1 - Protein (fragment)
• D1 - DNA (e.g. EMBOSS seqret output)
• DL - DNA (linear)
• DC - DNA (circular)
• RL - RNA (linear)
• RC - RNA (circular)
• N3 - tRNA
• N1 - Other functional RNA
• XX - Unknown
CLUSTALW:
• The first line in the file must start with the words
"CLUSTAL W" or "CLUSTALW". Other information in the
first line is ignored.
• One or more empty lines follow.
• Then come one or more blocks of sequence data. Each
block consists of one line for each sequence in the
alignment.
• Each line consists of:
– The sequence name
– white space
– up to 60 sequence symbols.
– Optional - white space followed by a cumulative count
of residues for the sequences
• Some rules about representing sequences:
– Case doesn't matter.
– Sequence symbols should be from a valid alphabet.
– Gaps are represented using hyphens ("-").
GCG Format:
• A sequence file in GCG format contains exactly one
sequence. It begins with annotation lines, and the start
of the sequence is marked by a line ending with two
dot ("..") characters. This line also contains the
sequence identifier, the sequence length and a
checksum. This format should only be used if the file
was created with the GCG package.
An example sequence in GCG format is:
ID AB000263 standard; RNA; PRI; 368 BP.
XX
AC AB000263;
XX
DE Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.
XX
SQ Sequence 368 BP;
AB000263 Length: 368 Check: 4514 ..
  1 acaagatgcc attgtccccc ggcctcctgc tgctgctgct ctccggggcc acggccaccg
 61 ctgccctgcc cctggagggt ggccccaccg gccgagacag cgagcatatg caggaagcgg
121 caggaataag gaaaagcagc ctcctgactt tcctcgcttg gtggtttgag tggacctccc
181 aggccagtgc cgggcccctc ataggagagg aagctcggga ggtggccagg cggcaggaag
241 gcgcaccccc ccagcaatcc gcgcgccggg acagaatgcc ctgcaggaac ttcttctgga
301 agaccttctc ctcctgcaaa taaaacctca cccatgaatg ctcacgcaag tttaattaca
361 gacctgaa
MSF:
• MSF formatted multiple sequence files are most often
created when using programs of the GCG suite.
• MSF files include the sequence name and the sequence
itself, which is usually aligned with other sequences in
the file.
• You can specify a single sequence or many sequences
within an MSF file.
– The file begins with the line (all uppercase)
!!NA_MULTIPLE_ALIGNMENT 1.0 for nucleic acid
sequences or !!AA_MULTIPLE_ALIGNMENT 1.0 for
amino acid sequences.
– Do not edit or delete this file-type line if it is present.
Frequent Words and k-mers in Text
• A k-mer is just a sequence of k characters in a string
(or nucleotides in a DNA sequence).
• We say that Pattern is a most frequent k-mer in Text
if it maximizes Count(Text, Pattern), the number of times
Pattern appears in Text, among all k-mers.
• For example, "ACTAT" is a most frequent 5-mer in
"ACAACTATGCATCACTATCGGGAACTATCCT".
• "ATA" is a most frequent 3-mer of
"CGATATATCCATAG".
• Consider the DNA sequence "ACGAGGTACGA", which
consists of 11 nucleotides. Let's try to obtain all the
4-mers (substrings of length 4) in this DNA sequence.
• The idea is simple. We create a window of length 4
and slide it from left to right, shifting one character
at a time. If the length of the given DNA sequence is
N, we end up with (N – k) + 1 k-mers.
• Total no. of k-mers = (N – k) + 1
• In the above example, the given DNA sequence is 11
characters long (N = 11) and k = 4, thus we get
(11 – 4) + 1 = 8 4-mers.
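The sliding window can be written directly as a list comprehension. The helper name kmers is hypothetical; the sketch simply enumerates the windows for the sequence used in this example.

```python
def kmers(sequence, k):
    """Return all k-mers of sequence, left to right, including repeats."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


windows = kmers("ACGAGGTACGA", 4)
print(windows)
# ['ACGA', 'CGAG', 'GAGG', 'AGGT', 'GGTA', 'GTAC', 'TACG', 'ACGA']
print(len(windows))  # (11 - 4) + 1 = 8
```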
• The total count is the number of k-mer occurrences in the
given sequence, counting every repeat.
• The distinct k-mers are counted only once regardless of how
many times they appear.
• The unique k-mers are those which appear only once. In the
above example, since ACGA has appeared twice, it is
distinct but not unique.
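The three counts can be checked for the example sequence with a short sketch. The function name kmer_statistics is hypothetical, and the definitions used are the ones given in the bullets above.

```python
from collections import Counter


def kmer_statistics(sequence, k):
    """Return (total, distinct, unique) k-mer counts for sequence."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())                        # every occurrence
    distinct = len(counts)                              # each k-mer once
    unique = sum(1 for n in counts.values() if n == 1)  # seen exactly once
    return total, distinct, unique


print(kmer_statistics("ACGAGGTACGA", 4))  # (8, 7, 6)
```

ACGA occurs twice, so it contributes 2 to the total, 1 to the distinct count and 0 to the unique count.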
• What is the most frequent 5-mer in
"ACAACTAAGCATCACTAACGGGAACTAACCT"?
Ans: ACTAA
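The frequent-words examples can be verified by counting every k-mer and keeping those with the highest count. This is a minimal sketch; the names count_kmers and frequent_words are hypothetical, and overlapping occurrences are counted.

```python
from collections import Counter


def count_kmers(text, k):
    """Count every k-mer in text, including overlapping occurrences."""
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))


def frequent_words(text, k):
    """Return all k-mers with the maximum count, in alphabetical order."""
    counts = count_kmers(text, k)
    if not counts:
        return []
    best = max(counts.values())
    return sorted(kmer for kmer, n in counts.items() if n == best)


print(frequent_words("ACAACTATGCATCACTATCGGGAACTATCCT", 5))  # ['ACTAT']
print(frequent_words("CGATATATCCATAG", 3))                   # ['ATA']
print(frequent_words("ACAACTAAGCATCACTAACGGGAACTAACCT", 5))  # ['ACTAA']
```

The function returns a list because several k-mers can tie for the highest count, which is why the slides say "a most frequent k-mer" rather than "the most frequent k-mer".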
Why k-mers?
• Decomposing a sequence into its k-mers allows a set of
fixed-size chunks to be analysed rather than the
whole sequence, and this can be more efficient.
• K-mers are very useful in sequence matching: set
operations on them are fast and easy, and there are a lot
of readily available algorithms and techniques to work
with them.
• A simple example: to check if a sequence S comes
from organism A or from organism B, assuming the
genomes of A and B are known and sufficiently
different, we can check whether S shares more k-mers
with A or with B.
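The organism-assignment idea in the last bullet can be sketched with Python sets. Everything here is illustrative: the function names and the toy "genomes" are hypothetical, and real tools use far larger k and more careful statistics.

```python
def kmer_set(sequence, k):
    """Return the set of distinct k-mers in sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def closer_genome(s, genome_a, genome_b, k):
    """Return 'A', 'B' or 'tie' depending on which genome shares more k-mers with s."""
    s_kmers = kmer_set(s, k)
    shared_a = len(s_kmers & kmer_set(genome_a, k))
    shared_b = len(s_kmers & kmer_set(genome_b, k))
    if shared_a > shared_b:
        return "A"
    if shared_b > shared_a:
        return "B"
    return "tie"


# Toy sequences (hypothetical), with k = 3
genome_a = "ACGTACGTGGA"
genome_b = "TTTTCCCCAAAA"
print(closer_genome("GTACGT", genome_a, genome_b, 3))  # A
```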
44
Applications of k-mer Counting
Some of the applications where k-mer counting is
used:
• Genome assembly
• Sequence alignment
• Sequence clustering
• Error correction of sequencing reads
• Genome size estimation
• Repeat identification
Frequent k-mers with mismatches
Pattern does not need to appear exactly as
a substring of Text.
• Ex 1:
Pattern: ATGATCAAG
Text: atcaATGATCAACgtataagcATGATCAAGgtgct
• Ex 2:
Pattern: CTTGATCAT
Text: gaaagCATGATCATggctgCTTGATCATctgtt
Hamming Distance Problem
Example 1:
• Input:
GGGCCGTTGGT
GGACCGTTGAC
• Output: 3
Example 2:
• Input:
GGGCCGTTGGT
GGAGCGCTGAC
• Output: 5
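The Hamming distance between two strings of equal length is the number of positions at which they differ. A minimal sketch (the function name hamming_distance is hypothetical) reproduces both examples:

```python
def hamming_distance(p, q):
    """Return the number of positions at which equal-length strings differ."""
    if len(p) != len(q):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(a != b for a, b in zip(p, q))


print(hamming_distance("GGGCCGTTGGT", "GGACCGTTGAC"))  # 3
print(hamming_distance("GGGCCGTTGGT", "GGAGCGCTGAC"))  # 5
```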
Approximate Pattern Matching Problem
• Find all starting positions where Pattern appears
as a substring of Text with at most d mismatches.
• Input
Pattern: ATTCTGGA
Text: CGCCCGAATCCAGAACGCATTCCCATATTTCGGGACCACTGGCCTCCACGGTACGGACGTCAATCAAAT
d: 3
• Output (0-based positions):
6 7 26 27
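One straightforward way to solve this problem is to slide a window of length |Pattern| along Text and keep every position whose window is within d mismatches. The sketch below uses hypothetical function names and reports 0-based positions, which is the convention that reproduces the output above.

```python
def hamming_distance(p, q):
    """Number of mismatching positions between equal-length strings."""
    return sum(a != b for a, b in zip(p, q))


def approximate_matches(pattern, text, d):
    """Return 0-based start positions where pattern matches text with <= d mismatches."""
    k = len(pattern)
    return [
        i
        for i in range(len(text) - k + 1)
        if hamming_distance(pattern, text[i:i + k]) <= d
    ]


text = ("CGCCCGAATCCAGAACGCATTCCCATATTTCG"
        "GGACCACTGGCCTCCACGGTACGGACGTCAATCAAAT")
print(approximate_matches("ATTCTGGA", text, 3))  # [6, 7, 26, 27]
```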
Approximate Pattern Count
• Count_d(Text, Pattern) is the number of k-mer substrings
of Text whose Hamming distance from Pattern is at most d,
i.e. that match Pattern with at most d mismatches.
• Example:
Pattern: GAGG
Text: TTTAGAGCCTTCAGAGG
d (max mismatches): 2
• Output: 4
The four matching substrings are TAGA, GAGC, CAGA and GAGG.
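The approximate count follows from the same sliding-window comparison. This is a minimal sketch with a hypothetical function name, counting windows of Text that differ from Pattern in at most d positions.

```python
def approximate_pattern_count(text, pattern, d):
    """Count k-mers of text within Hamming distance d of pattern."""
    k = len(pattern)
    count = 0
    for i in range(len(text) - k + 1):
        window = text[i:i + k]
        mismatches = sum(a != b for a, b in zip(window, pattern))
        if mismatches <= d:
            count += 1
    return count


print(approximate_pattern_count("TTTAGAGCCTTCAGAGG", "GAGG", 2))  # 4
```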
Substring Reconstruction
• Given a string Text, its k-mer composition
Composition_k(Text) is the collection of all k-mer
substrings of Text (including repeated k-mers).
• Composition_3(TATGGGGTGC)
= {TAT, ATG, TGG, GGG, GGG, GGT, GTG, TGC}
• Note that we have to list k-mers in lexicographic
order (i.e., how they would appear in a
dictionary) rather than in the order of their
appearance in the Text:
Composition_3(TATGGGGTGC)
= {ATG, GGG, GGG, GGT, GTG, TAT, TGC, TGG}
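The composition in lexicographic order is just the sorted list of sliding-window k-mers. A minimal sketch (the function name composition is hypothetical):

```python
def composition(text, k):
    """Return the k-mer composition of text in lexicographic order, with repeats."""
    return sorted(text[i:i + k] for i in range(len(text) - k + 1))


print(composition("TATGGGGTGC", 3))
# ['ATG', 'GGG', 'GGG', 'GGT', 'GTG', 'TAT', 'TGC', 'TGG']
```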
String Reconstruction Problem
• 3-mer composition: AAT ATG GTT TAA TGT
• The most natural way to solve the String
Reconstruction Problem is to mimic the solution
of the Newspaper Problem and "connect" a pair
of k-mers if they overlap in k-1 symbols.
• It is easy to see that the string should start with
TAA because there is no 3-mer ending in TA.
• This implies that the next 3-mer in the string
should start with AA.
• There is only one 3-mer satisfying this
condition, AAT:
TAA AAT
• In turn, AAT can only be extended by ATG,
which can only be extended by TGT, and so on,
leading us to reconstruct TAATGTT:
TAA
 AAT
  ATG
   TGT
    GTT
TAATGTT
• Ex 2: AAT ATG ATG ATG CAT CCA GAT GCC
GGA GGG GTT TAA TGC TGG TGT
• Solution:
• If we start with TAA again, the next 3-mers must be
AAT and then ATG.
• ATG can be extended either by TGC, or TGG, or
TGT. Let's select TGT.
• After TGT, our only choice is GTT.
• But we should realize that we are approaching
a dead end, since no k-mers start
with TT. Here we need to backtrack and find an
alternative solution.
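• Alternative solution: backtrack to the first ATG and extend
it with TGC instead of TGT. One valid final sequence is
TAATGCCATGGGATGTT.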
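The "extend, and backtrack at a dead end" procedure described in these slides can be sketched as a depth-first search in which each k-mer is used exactly once. The function name reconstruct is hypothetical, and this brute-force search is only meant for small examples like the two above; it returns the first valid string it finds, and more than one may exist.

```python
from collections import Counter


def reconstruct(kmers):
    """Return one string whose k-mer composition equals kmers, or None."""
    remaining = Counter(kmers)
    total = len(kmers)

    def extend(path):
        if len(path) == total:
            return path
        suffix = path[-1][1:]
        for kmer in sorted(remaining):
            # A k-mer can follow if it is unused and overlaps in k-1 symbols
            if remaining[kmer] > 0 and kmer[:-1] == suffix:
                remaining[kmer] -= 1
                result = extend(path + [kmer])
                if result is not None:
                    return result
                remaining[kmer] += 1  # dead end: backtrack
        return None

    for start in sorted(remaining):
        remaining[start] -= 1
        path = extend([start])
        if path is not None:
            # Glue the path together: first k-mer plus last symbol of the rest
            return path[0] + "".join(kmer[-1] for kmer in path[1:])
        remaining[start] += 1
    return None


print(reconstruct("AAT ATG GTT TAA TGT".split()))  # TAATGTT
example_2 = ("AAT ATG ATG ATG CAT CCA GAT GCC "
             "GGA GGG GTT TAA TGC TGG TGT").split()
print(reconstruct(example_2))
```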
Substitution Matrix
• In bioinformatics, a substitution matrix describes
the frequency at which a character in a nucleotide
sequence or a protein sequence changes to other
character states over evolutionary time.
• The information is often in the form of log odds of
finding two specific character states aligned and
depends on the assumed number of evolutionary
changes or sequence dissimilarity between
compared sequences.
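One common way to write the log-odds score mentioned above is shown below. The symbols are assumptions introduced here for illustration and do not come from the slides: $q_{ij}$ is the observed frequency with which character states $i$ and $j$ are aligned in trusted alignments, $p_i$ and $p_j$ are their background frequencies, and $\lambda$ is a positive scaling constant.

```latex
S_{ij} = \frac{1}{\lambda} \log \frac{q_{ij}}{p_i \, p_j}
```

Under this form, a positive score means the pair is aligned more often than expected by chance, and a negative score means it is aligned less often.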
• Substitution matrices are usually seen in the
context of amino acid or DNA sequence
alignments, where they are used to calculate
similarity scores between the aligned
sequences.
• A substitution matrix is a collection of scores for
aligning nucleotides or amino acids with one
another. These scores generally represent the
relative ease with which one nucleotide or
amino acid may mutate into or substitute for
another, and they are used to measure
similarity in sequence alignments.
Example
In the process of evolution, from one generation to the next the
amino acid sequences of an organism's proteins are gradually
altered through the action of DNA mutations. For example, the
sequence
ALEIRYLRD
could mutate into the sequence
ALEINYLRD
in one step, and possibly
AQEINYQRD
over a longer period of evolutionary time.
• Each amino acid is more or less likely to mutate
into various other amino acids. For instance, a
hydrophilic residue such as arginine is more
likely to be replaced by another hydrophilic
residue such as glutamine, than it is to be
mutated into a hydrophobic residue such as
leucine.
• This is primarily due to redundancy in
the genetic code, which translates similar
codons into similar amino acids.
Gap Penalties
• A Gap penalty is a method of scoring alignments of
two or more sequences.
• When aligning sequences, introducing gaps in the
sequences can allow an alignment algorithm to
match more terms than a gap-less alignment can.
• However, minimizing gaps in an alignment is
important to create a useful alignment. Too many
gaps can cause an alignment to become
meaningless.
• Gap penalties are used to adjust alignment scores
based on the number and length of gaps.
Sequence Alignment
• Sequence alignment is the procedure of
comparing two (pair-wise alignment) or more
(multiple sequence alignment) sequences by
searching for a series of individual characters or
character patterns that are in the same order in
the sequences.
• Two sequences are aligned by writing them
across a page in two rows. Identical or similar
characters are placed in the same column, and
non-identical characters can either be placed in
the same column as a mis-match or opposite a
gap in the other sequence.
• In an optimal alignment, nonidentical characters
and gaps are placed to bring as many identical
or similar characters as possible into vertical
register.
• Sequences that can be readily aligned in this
manner are said to be similar.
• There are two types of sequence alignment,
global and local.
• In global alignment, an attempt is made to
align the entire sequence, using as many
characters as possible, up to both ends of each
sequence. Sequences that are quite similar and
approximately the same length are suitable
candidates for global alignment.
• In local alignment, stretches of sequence with
the highest density of matches are aligned, thus
generating one or more islands of matches or
subalignments in the aligned sequences. Local
alignments are more suitable for aligning
sequences that are similar along some of their
length but dissimilar in other parts.
Sequence Alignment Example
• Problem: Align abcdef with the somewhat similar
abdgf
• Solution: Write the second sequence below the first
one:
abcdef
abdgf
• Move the sequences to give maximum match
between them.
• Show characters that match using vertical bar.
abcdef
||
abdgf
• In order to maximise the alignment, we insert
gap between b and d in lower sequence to
allow d and f to align.
abcdef
|| | |
ab-dgf
• Note e and g don’t match
• We are looking for an alignment, which:
– Maximizes the number of base-to-base
matches
– If necessary to achieve this goal, inserts gaps
in either sequence (a gap means a base-to-
nothing match)
– The order of bases in each sequence must
remain preserved and
– Gap-to-Gap matches are not allowed.
• We need some scheme to evaluate the
goodness of alignment.
• The scoring scheme consists of character
substitution scores (i.e. score for each possible
character replacement) plus penalties for gaps.
• The alignment score is the sum of substitution
scores and gap penalties. The alignment score
reflects goodness of alignment.
• For DNA we can construct the following
substitution matrix: '+1' as a reward for a match,
'-1' as the penalty for a mismatch, and gaps ignored:
      C    T    A    G
 C   +1   -1   -1   -1
 T   -1   +1   -1   -1
 A   -1   -1   +1   -1
 G   -1   -1   -1   +1
• Using this scoring scheme, let us evaluate the
following alignments (the penalty for a gap is 0).
A T G G C G is the query sequence.
Ex. 1. A T G – A G   The score: +1 +1 +1 +0 -1 +1 = 3
Ex. 2. A – T G A G   The score: +1 +0 -1 +1 -1 +1 = 1
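The column-by-column scoring used in these two examples can be sketched as follows. The function name alignment_score and its keyword parameters are hypothetical; the defaults mirror the scheme above (+1 match, -1 mismatch, 0 per gap position), and gaps are written as '-'.

```python
def alignment_score(seq_a, seq_b, match=1, mismatch=-1, gap=0):
    """Score two already-aligned sequences of equal length, column by column."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" and b == "-":
            raise ValueError("gap-to-gap columns are not allowed")
        if a == "-" or b == "-":
            score += gap
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


print(alignment_score("ATGGCG", "ATG-AG"))  # 3
print(alignment_score("ATGGCG", "A-TGAG"))  # 1
```

Passing a negative value for gap turns this into the scheme with gap penalties described in the following slides.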
Find the score for following alignments
• ATCCGATTCGA
• GATCGTATCCA
• CCTCTCACTGAGACTCAGCC
• CGA-ACTCTT-GATC-TGCA
• CGTAG-ATAG-TGCTAG-AGAAT-GGG-CCACT
• GTAGCTGATC-ATCGATCGTACGTAGC-GCTGA
PAM
• One of the first amino acid substitution matrices,
the PAM (Point Accepted Mutation) matrix was
developed by Margaret Dayhoff in the 1970s.
• This matrix is calculated by observing the
differences in closely related proteins.
• Because very closely related homologs are used,
the observed mutations are not
expected to significantly change the common
functions of the proteins.
• Thus the observed substitutions (by point
mutations) are considered to be accepted by
natural selection.
BLOSUM
• Dayhoff's methodology of comparing closely
related species turned out not to work very well
for aligning evolutionarily divergent sequences.
• Sequence changes over long evolutionary time
scales are not well approximated by
compounding small changes that occur over
short time scales.
• The BLOSUM (BLOck SUbstitution Matrix) series
of matrices rectifies this problem.
• Henikoff & Henikoff constructed these matrices
using multiple alignments of evolutionarily
divergent proteins.
• Seq 1: MARSIFLT
• Seq 2: MADQLTEE
Types of Gap penalties
1. Constant gap penalty
This is the simplest type of gap penalty: a fixed
negative score is given to every gap, regardless of its
length. This encourages the algorithm to make fewer,
larger gaps, leaving larger contiguous sections.
Example: two short DNA sequences are aligned with 7
matches and a single gap, with each '-' depicting a gap of
one base pair. If each match is worth 1 point and the whole
gap -1, the total score is 7 − 1 = 6.
2. Linear gap penalty
Compared to the constant gap penalty, the linear gap
penalty takes into account the length (L) of each
insertion/deletion in the gap. Therefore, if the penalty
for each inserted/deleted element is B and the length
of the gap is L, the total gap penalty is the
product of the two, B·L. This method favors shorter
gaps, with the total score decreasing with each additional
gap position.
Unlike the constant gap penalty, the size of the gap is
considered. For 7 matches and a gap of length 3, with a
match score of 1 and a penalty of -1 per gap position, the
score is 7 − 3 = 4.
3. Affine gap penalty
In biological sequences, it is more likely that one
big gap of length 10 occurs in a sequence than 10
small gaps of length 1.
Therefore, affine gap penalties favour longer gaps
over single gaps of the same total length.
They use a gap opening penalty, o < 0, and a gap
extension penalty, e < 0, such that |e| < |o|, to
encourage gap extension rather than gap
introduction.
A gap of length L is then given a penalty
g = o + (L−1)e.
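The three gap penalty schemes can be compared side by side in a short sketch. The function names are hypothetical, and the affine parameters o = -5 and e = -1 are arbitrary values chosen only for illustration; the constant and linear cases use the 7-match, length-3 gap example from the earlier slides.

```python
def constant_gap_penalty(length, penalty=-1):
    """Fixed penalty for any gap, regardless of its length."""
    return penalty if length > 0 else 0


def linear_gap_penalty(length, per_position=-1):
    """Penalty B * L: proportional to the gap length."""
    return per_position * length


def affine_gap_penalty(length, gap_open=-5, gap_extend=-1):
    """Penalty g = o + (L - 1) * e for a gap of length L."""
    if length <= 0:
        return 0
    return gap_open + (length - 1) * gap_extend


matches = 7
print(matches + constant_gap_penalty(3))  # 7 - 1 = 6
print(matches + linear_gap_penalty(3))    # 7 - 3 = 4

# One gap of length 10 versus ten gaps of length 1 (assumed o = -5, e = -1)
print(affine_gap_penalty(10))      # -5 + 9 * (-1) = -14
print(10 * affine_gap_penalty(1))  # 10 * (-5) = -50
```

With these assumed values the single long gap is penalized far less than ten separate gaps of the same total length, which is the behaviour the affine scheme is designed to produce.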