Module 2 (Bioinformatics)
School of Engineering
Introduction to Bioinformatics
CSE 3069
V Semester 2023-24
Module 2
Sequence Alignment
Types and classification of genome
databases
Genomics refers to the study of the structure and function
of the entire genome of a living organism. Genome refers
to the basic (haploid) set of chromosomes of an organism.
Plant Genomics:
• It deals with the study of the structure and function of
the entire genome of plant species.
Animal Genomics:
• It deals with the study of the structure and function of
the entire genome of animal species.
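Eukaryotic Genomics:
• It deals with the study of the structure and function of
the entire genome of eukaryotes, i.e. organisms whose cells
have a nucleus (mostly higher, multicellular organisms, but
also unicellular ones such as yeast).
Prokaryotic Genomics:
• It deals with the study of the structure and function of
the entire genome of prokaryotes, i.e. unicellular organisms
without a nucleus (bacteria and archaea).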
Types of Databases
Primary databases
• Primary databases are also called archival databases.
Examples
• ENA, GenBank and DDBJ (nucleotide sequence)
Secondary databases
Examples
Composite Databases
• The initial data are taken from the primary database, and
then they are merged together based on certain
conditions.
Examples
• OWL, NRDB, and Swiss-Prot + TrEMBL
Genomic Databases
Database-Searching
• The amount of biologically relevant data is increasing so
rapidly that knowing how to access and search this
information is essential.
Entrez
DBGET
File Formats in Bioinformatics
Example (FASTQ record):
@K00188:208:HFLNGBBXX:3:1101:1428:1508 2:N:0:CTTGTA
ATAATAGGATCCCTTTTCCTGGAGCTGCCTTTAGGTAATGTAGTATCTNATNGACTGNCNCCANANGGCTAAAGT
+
AAAFFJJJJJJJJJJJJJJJJJFJJFJJJJJFJJJJJJJJJJJJJJJJ#FJ#JJJJF#F#FJJ#F#JJJFJJJJJ
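The record above follows the four-line FASTQ layout: a header starting with @, the sequence, a separator starting with +, and one quality character per base. A minimal parsing sketch follows. The function name `parse_fastq_record` and the toy record are hypothetical, and Phred+33 quality encoding is an assumption (it is consistent with the J and # characters in the example).

```python
def parse_fastq_record(lines):
    """Split one 4-line FASTQ record into (name, sequence, Phred scores)."""
    header, sequence, separator, quality = (line.strip() for line in lines)
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    # Assumption: Phred+33 encoding, so score = ASCII code - 33.
    scores = [ord(ch) - 33 for ch in quality]
    return header[1:], sequence, scores


# Hypothetical toy record, not taken from the slides.
record = ["@read1", "ACGTN", "+", "FFJJ#"]
name, seq, scores = parse_fastq_record(record)
print(name, seq, scores)  # read1 ACGTN [37, 37, 41, 41, 2]
```

Under this encoding the # character decodes to a score of 2, which is why it lines up with the uncalled bases (N) in the example.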
Example:
TGGCTGTGATGGCTTTTAGCGGAAGCGCGCTGTTCGCGTACCTGC
TGTTTGTTGAAAATTTAAGAGCAAAGTGTCCGGCTCGATCCCTGC
Genbank:
• The GenBank format, used by public databases such as
NCBI, is arguably the de facto standard format for
annotated sequence files.
PDB:
• The PDB file format stores both sequence data and, more
importantly, three-dimensional structure data.
• This information can be used to visualize a
molecule's crystal structure (typically a protein).
• PDB files are plain text files that can be viewed
with a text editor and usually have the extension
'.pdb'.
SAM:
• The Sequence Alignment Map (SAM) format stores sequence
alignment information as tab-delimited text. It is divided
into an optional header section and an alignment section.
Each header line starts with @, and the header may span
multiple lines.
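The header/alignment split described above can be sketched in a few lines. The function name `split_sam` and the sample text are hypothetical. The sketch relies only on the rule that header lines start with @ and that alignment lines are tab-separated.

```python
def split_sam(text):
    """Separate SAM header lines (starting with '@') from alignment lines."""
    header, alignments = [], []
    for line in text.splitlines():
        if not line:
            continue  # skip blank lines
        if line.startswith("@"):
            header.append(line)
        else:
            alignments.append(line.split("\t"))
    return header, alignments


# Hypothetical two-line header and one alignment record.
sam_text = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:chr1\tLN:1000\n"
    "read1\t0\tchr1\t7\t60\t8M\t*\t0\t0\tACGTACGT\tFFFFFFFF\n"
)
header, alignments = split_sam(sam_text)
print(len(header), alignments[0][0], alignments[0][3])  # 2 read1 7
```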
PIR
A sequence in PIR format consists of:
–One line starting with
1. a ">" (greater-than) sign, followed by
2. a two-letter code describing the sequence type (P1, F1,
DL, DC, RL, RC, or XX), followed by
3. a semicolon, followed by the sequence identification.
• P1 - Protein (complete)
• F1 - Protein (fragment)
• D1 - DNA (e.g. EMBOSS seqret output)
• DL - DNA (linear)
• DC - DNA (circular)
• RL - RNA (linear)
• RC - RNA (circular)
• N3 - tRNA
• N1 - Other functional RNA
• XX - Unknown
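The first-line rule and the type codes listed above can be checked with a small sketch. The names `PIR_TYPES` and `parse_pir_header` and the identifier `EXAMPLE_ID` are hypothetical; the code table is copied from the list above.

```python
# Two-letter sequence type codes, as listed on the slide.
PIR_TYPES = {
    "P1": "Protein (complete)", "F1": "Protein (fragment)",
    "D1": "DNA", "DL": "DNA (linear)", "DC": "DNA (circular)",
    "RL": "RNA (linear)", "RC": "RNA (circular)",
    "N3": "tRNA", "N1": "Other functional RNA", "XX": "Unknown",
}


def parse_pir_header(line):
    """Parse '>CC;identifier' into (code, description, identifier)."""
    if not line.startswith(">") or len(line) < 4 or line[3] != ";":
        raise ValueError("not a PIR header line")
    code = line[1:3]
    if code not in PIR_TYPES:
        raise ValueError("unknown sequence type code: " + code)
    return code, PIR_TYPES[code], line[4:].strip()


print(parse_pir_header(">P1;EXAMPLE_ID"))
# ('P1', 'Protein (complete)', 'EXAMPLE_ID')
```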
CLUSTALW:
• The first line in the file must start with the words
"CLUSTAL W" or "CLUSTALW". Other information in the first
line is ignored.
GCG Format:
MSF:
• MSF formatted multiple sequence files are most often
created when using programs of the GCG suite.
Frequent Words and k-mers in Text
• Consider the DNA sequence “ACGAGGTACGA” which
consists of 11 nucleotides. Let’s try to obtain all the 4-
mers (substrings of length 4) in this DNA sequence.
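A minimal sketch of this enumeration, sliding a window of length k across the sequence one position at a time (the function name `kmers` is a hypothetical helper):

```python
def kmers(sequence, k):
    """Return every substring of length k, in order of occurrence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


print(kmers("ACGAGGTACGA", 4))
# ['ACGA', 'CGAG', 'GAGG', 'AGGT', 'GGTA', 'GTAC', 'TACG', 'ACGA']
```

A sequence of length n contains n - k + 1 k-mers, here 11 - 4 + 1 = 8.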
• The total count is the number of k-mers in the sequence,
counting every occurrence (8 in the above example).
• The distinct count is the number of different k-mers; each
is counted only once regardless of how many times it appears
(7 in the above example).
• The unique count is the number of k-mers that appear
exactly once (6 in the above example). Since ACGA appears
twice, it does not contribute to the unique count.
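The three counts can be computed together. This is a sketch using Python's `collections.Counter`, with `kmer_counts` as a hypothetical helper name:

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Return (total, distinct, unique) k-mer counts for a sequence."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())                       # every occurrence
    distinct = len(counts)                             # different k-mers
    unique = sum(1 for c in counts.values() if c == 1)  # seen exactly once
    return total, distinct, unique


print(kmer_counts("ACGAGGTACGA", 4))  # (8, 7, 6)
```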
"CGATATATCCATAG".
"ACAACTAAGCATCACTAACGGGAACTAACCT"?
Ans: ACTAA
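The answer can be checked by counting every k-mer and keeping the most frequent ones. In this sketch the helper name `most_frequent_kmers` is hypothetical, and k = 5 is an assumption inferred from the length of the answer.

```python
from collections import Counter


def most_frequent_kmers(text, k):
    """Return (sorted list of most frequent k-mers, their count)."""
    counts = Counter(text[i:i + k] for i in range(len(text) - k + 1))
    best = max(counts.values())
    winners = sorted(kmer for kmer, c in counts.items() if c == best)
    return winners, best


print(most_frequent_kmers("ACAACTAAGCATCACTAACGGGAACTAACCT", 5))
# (['ACTAA'], 3)
```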
Why k-mers?
• Decomposing a sequence into its k-mers lets the analysis
work on a set of fixed-size chunks rather than on the
whole sequence, which can be more efficient.
Applications of k-mer Counting
• Genome assembly
• Sequence alignment
• Sequence clustering
• Error correction of sequencing reads
• Genome size estimation
• Repeat identification
Frequent k-mers with mismatches
• Ex 1:
Pattern: ATGATCAAG
Text: atcaATGATCAACgtataagcATGATCAAGgtgct
• Ex 2:
Pattern: CTTGATCAT
Text: gaaagCATGATCATggctgCTTGATCATctgtt
(Upper-case runs mark occurrences of Pattern with at most
one mismatch.)
Hamming Distance Problem
Example 1:
• Input:
GGGCCGTTGGT
GGACCGTTGAC
• Output: 3
Example 2:
• Input:
GGGCCGTTGGT
GGAGCGCTGAC
• Output: 5
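The Hamming distance between two strings of equal length is the number of positions at which they differ. A minimal sketch that reproduces both examples (the function name `hamming_distance` is a hypothetical helper):

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


print(hamming_distance("GGGCCGTTGGT", "GGACCGTTGAC"))  # 3
print(hamming_distance("GGGCCGTTGGT", "GGAGCGCTGAC"))  # 5
```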
Approximate Pattern Matching
Problem
• Find all starting positions where Pattern appears
as a substring of Text with at most d mismatches.
• Input
Pattern: ATTCTGGA
Text:
CGCCCGAATCCAGAACGCATTCCCATATTTCGGGACCACTGGCCTCCACGGTACGGACGTCAATCAAAT
d: 3
• Output (0-based positions):
6 7 26 27
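A brute-force sketch of this problem compares Pattern against every window of Text and keeps the windows with at most d mismatches. The function name `approximate_matches` is hypothetical, and positions are reported 0-based to match the output above.

```python
def approximate_matches(pattern, text, d):
    """Return 0-based starts where pattern matches text with <= d mismatches."""
    k = len(pattern)
    positions = []
    for i in range(len(text) - k + 1):
        mismatches = sum(a != b for a, b in zip(pattern, text[i:i + k]))
        if mismatches <= d:
            positions.append(i)
    return positions


text = ("CGCCCGAATCCAGAACGCATTCCCATATTTCG"
        "GGACCACTGGCCTCCACGGTACGGACGTCAATCAAAT")
print(approximate_matches("ATTCTGGA", text, 3))  # [6, 7, 26, 27]
```

The running time is proportional to len(text) × len(pattern), which is adequate for slide-sized inputs.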
Approximate Pattern Count
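• Count_d(Text, Pattern): the number of positions at which
Pattern occurs in Text with at most d mismatches.
Pattern: GAGG
Text: TTTAGAGCCTTCAGAGG
d (max mismatches): 2
• Output: 4
Matches (0-based start): TAGA (2), GAGC (4), CAGA (11), GAGG (13)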
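Counting the approximate occurrences uses the same window scan, returning a count instead of positions. This is a sketch, with `approximate_count` as a hypothetical helper name:

```python
def approximate_count(text, pattern, d):
    """Count windows of text that differ from pattern in at most d positions."""
    k = len(pattern)
    count = 0
    for i in range(len(text) - k + 1):
        mismatches = sum(a != b for a, b in zip(pattern, text[i:i + k]))
        if mismatches <= d:
            count += 1
    return count


print(approximate_count("TTTAGAGCCTTCAGAGG", "GAGG", 2))  # 4
```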
Substring Reconstruction
String Reconstruction Problem
• Solution:
• If we start with TAA again
Substitution Matrix
ALEIRYLRD
ALEINYLRD
AQEINYQRD
Gap Penalties
• In an optimal alignment, nonidentical characters
and gaps are placed to bring as many identical
or similar characters as possible into vertical
register.
• In global alignment, an attempt is made to
align the entire sequence, using as many
characters as possible, up to both ends of each
sequence. Sequences that are quite similar and
approximately the same length are suitable
candidates for global alignment.
• In local alignment, stretches of sequence with
the highest density of matches are aligned, thus
generating one or more islands of matches or
subalignments in the aligned sequences. Local
alignments are more suitable for aligning
sequences that are similar along some of their
lengths but dissimilar in others.
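The difference between the two modes shows up in their scores. The sketch below uses the standard dynamic-programming formulations (Needleman-Wunsch for global, Smith-Waterman for local); the slides do not name these algorithms, so treat them as an assumption. The scoring values and the two toy sequences are hypothetical.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch score: both sequences aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman score: best-scoring pair of substrings."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0, so an alignment can restart anywhere.
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[j])
        prev = curr
    return best


# Hypothetical sequences that share only the island "ACG".
print(global_score("TTTACGTTT", "GGGACGGGG"))  # -3
print(local_score("TTTACGTTT", "GGGACGGGG"))   # 3
```

The global score is dragged down by the dissimilar flanks, while the local score reports only the matching island.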
Sequence alignment Ex.
abcdef
||
abdgf
• In order to maximise the alignment, we insert a
gap between b and d in the lower sequence to
allow d and f to align.
abcdef
|| | |
ab-dgf
• Note that e and g do not match.
• We are looking for an alignment that:
– Maximizes the number of base-to-base
matches
• We need some scheme to evaluate the
quality of an alignment.
• For DNA we can construct the following
substitution matrix: '+1' as a reward for a match,
'-1' as the penalty for a mismatch, and ignore
gaps:
     C    T    A    G
C   +1   -1   -1   -1
T   -1   +1   -1   -1
A   -1   -1   +1   -1
G   -1   -1   -1   +1
• Using this scoring scheme, let us evaluate the
following alignments (the penalty for a gap is 0):
Find the score for the following alignments:
• ATCCGATTCGA
  GATCGTATCCA
• CCTCTCACTGAGACTCAGCC
  CGA-ACTCTT-GATC-TGCA
• CGTAG-ATAG-TGCTAG-AGAAT-GGG-CCACT
  GTAGCTGATC-ATCGATCGTACGTAGC-GCTGA
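Scoring an existing alignment is a column-by-column sum. The sketch below applies +1 for a match, -1 for a mismatch, and 0 for any column containing a gap, as in the scheme above. The function name `alignment_score` is hypothetical, and each consecutive pair of rows is assumed to form one alignment.

```python
def alignment_score(row1, row2, match=1, mismatch=-1, gap=0):
    """Sum per-column scores of two aligned rows of equal length."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for x, y in zip(row1, row2):
        if x == "-" or y == "-":
            score += gap        # column with a gap
        elif x == y:
            score += match      # identical characters
        else:
            score += mismatch   # different characters
    return score


pairs = [
    ("ATCCGATTCGA", "GATCGTATCCA"),
    ("CCTCTCACTGAGACTCAGCC", "CGA-ACTCTT-GATC-TGCA"),
    ("CGTAG-ATAG-TGCTAG-AGAAT-GGG-CCACT",
     "GTAGCTGATC-ATCGATCGTACGTAGC-GCTGA"),
]
for top, bottom in pairs:
    print(top, bottom, alignment_score(top, bottom))
```

Passing a negative `gap` value turns the same function into a scorer with a linear gap penalty.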
PAM
• One of the first amino acid substitution matrices,
the PAM (Point Accepted Mutation) matrix was
developed by Margaret Dayhoff in the 1970s.
• This matrix is calculated by observing the
differences in closely related proteins.
• Because very closely related homologs are
used, the observed mutations are not
expected to significantly change the common
functions of the proteins.
• Thus the observed substitutions (by point
mutations) are considered to be accepted by
natural selection.
BLOSUM
• Dayhoff's methodology of comparing closely
related species turned out not to work very well
for aligning evolutionarily divergent sequences.
• Sequence changes over long evolutionary time
scales are not well approximated by
compounding small changes that occur over
short time scales.
• The BLOSUM (BLOck SUbstitution Matrix) series
of matrices rectifies this problem.
• Henikoff & Henikoff constructed these matrices
using multiple alignments of evolutionarily
divergent proteins.
• Seq 1: MARSIFLT
• Seq 2: MADQLTEE
Types of Gap penalties
1. Constant gap penalty
This is the simplest type of gap penalty: a fixed
negative score is given to every gap, regardless of its
length. This encourages the algorithm to make fewer,
larger gaps, leaving larger contiguous sections.
2. Linear gap penalty
Compared to the constant gap penalty, the linear gap
penalty takes into account the length (L) of each
insertion/deletion. If the penalty for each
inserted/deleted element is B and the length of the
gap is L, the total gap penalty is the product B × L.
This method favors shorter gaps, with the total score
decreasing with each additional gap position.
3. Affine gap penalty
In biological sequences, one big gap of length 10 is
more likely to occur than 10 small gaps of length 1.
Therefore, affine gap penalties charge a gap-opening
penalty plus a smaller gap-extension penalty, which
favours one longer gap over several short gaps of the
same total length.
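The three schemes can be compared on the case above: one gap of length 10 versus ten gaps of length 1. In this sketch the function names and penalty values are hypothetical, and the affine form open + extend × (L - 1) is one common convention, assumed here because the slides give no formula. Penalties are returned as positive costs to subtract from the alignment score.

```python
def constant_gap_penalty(length, a=5):
    """Fixed cost per gap, regardless of its length."""
    return a if length > 0 else 0


def linear_gap_penalty(length, b=2):
    """Cost B for every gap position: B * L."""
    return b * length


def affine_gap_penalty(length, gap_open=5, gap_extend=1):
    """Opening cost for the first position, smaller cost for each extension."""
    return gap_open + gap_extend * (length - 1) if length > 0 else 0


for name, penalty in [("constant", constant_gap_penalty),
                      ("linear", linear_gap_penalty),
                      ("affine", affine_gap_penalty)]:
    one_long = penalty(10)        # a single gap of length 10
    ten_short = 10 * penalty(1)   # ten separate gaps of length 1
    print(name, one_long, ten_short)
# constant 5 50
# linear 20 20
# affine 14 50
```

Only the linear scheme is indifferent to how the ten gap positions are grouped. The affine scheme prefers the single long gap while still charging more for longer gaps.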