
Module 2 (Bioinformatics)

Uploaded by

Asmi Tanzaen H N
Copyright
© All Rights Reserved

PRESIDENCY UNIVERSITY, BENGALURU

School of Engineering

Department of Computer Science & Engineering

Introduction to Bioinformatics
CSE 3069

V Semester 2023-24
Module 2

Genomic Database and Sequence Similarity

Bioinformatics Resources and Tools

• A database is a structured collection of records stored in a computer system.

• Genomic databases typically store DNA or protein sequences as well as annotated information about those sequences.

• There are hundreds of genomics databases: some are comprehensive but not carefully curated (GenBank), while others are carefully curated but narrow (FlyBase).

Continued…

• Bioinformatic tools are computer programs that analyze one or more sequences.

• There is a dizzying array of bioinformatic tools: some analyze sequences to find protein domains (Pfam), some search databases of millions of sequences for similar ones (BLAST), and some find potential protein-coding regions (ORF-Finder). Many are freely available over the web.

Continued…

Nucleotide Sequence Databases (the principal ones)
• NCBI - National Center for Biotechnology Information
• EBI - European Bioinformatics Institute
• DDBJ - DNA Data Bank of Japan

Database Searching by Sequence Similarity
• BLAST @ NCBI
• PSI-BLAST @ NCBI
• FASTA @ EBI
• BLAT

Continued…

Protein Sequence Databases

• SWISS-PROT & TrEMBL - Protein sequence database and computer-annotated supplement
• UniProt - The Universal Protein Resource, the world's most comprehensive catalog of information on proteins
• PIR - Protein Information Resource
• MIPS - Munich Information Center for Protein Sequences
• HUPO - HUman Proteome Organization

Continued…
Sequence Alignment

• USC Sequence Alignment Server - aligns two sequences with all possible varieties of dynamic programming
• ClustalW @ EBI - multiple sequence alignment
• Spidey - an mRNA-to-genomic alignment program
• Wise2 - aligns a protein or profile HMM against a genomic sequence to predict a gene structure, plus related tools
• PipMaker - computes alignments of similar regions in two (long) DNA sequences
• VISTA - aligns long genomic sequences and detects conserved regions

List of Open Source Tools available for Bioinformatics
• geWorkbench
• BioPerl
• UGENE
• BioJava
• Biopython
• InterMine
• IGV (Integrative Genomics Viewer)
• GROMACS
• Taverna Workbench
• EMBOSS
• Clustal Omega
• BLAST
• BEDTools
• Bioclipse
• Bioconductor

Types and classification of genome databases
Genomics is the study of the structure and function of the entire genome of a living organism. The genome is the organism's basic set of chromosomes.

Plant Genomics:
• The study of the structure and function of the entire genome of plant species.

Animal Genomics:
• The study of the structure and function of the entire genome of animal species.

Continued…
Eukaryotic Genomics:
• The study of the structure and function of the entire genome of eukaryotes, that is, organisms whose cells have a nucleus (including all higher, multicellular organisms).

Prokaryotic Genomics:
• The study of the structure and function of the entire genome of prokaryotes, that is, unicellular organisms without a nucleus (bacteria and archaea).

Types of Databases

Primary databases
• Primary databases are also called archival databases.

• They are populated with experimentally derived data such as nucleotide sequences, protein sequences or macromolecular structures.

• Experimental results are submitted directly into the database by researchers, and the data are essentially archival in nature.

• Once given a database accession number, the data in primary databases are never changed: they form part of the scientific record.

Continued…

Examples
• ENA, GenBank and DDBJ (nucleotide sequence)

• ArrayExpress Archive and GEO (functional genomics data)

• Protein Data Bank (PDB; coordinates of three-dimensional macromolecular structures)

Continued…

Secondary databases

• Secondary databases comprise data derived from the results of analysing primary data.

• Secondary databases often draw upon information from numerous sources, including other databases (primary and secondary), controlled vocabularies and the scientific literature.

• They are highly curated, often using a complex combination of computational algorithms and manual analysis and interpretation to derive new knowledge from the public record of science.

Continued…

Examples

• InterPro (protein families, motifs and domains)

• UniProt Knowledgebase (sequence and functional information on proteins)

• Ensembl (variation, function, regulation and more, layered onto whole genome sequences)

Continued…
Composite Databases

• The data entered in these databases are first compared and then filtered based on the desired criteria.

• The initial data are taken from primary databases and then merged according to certain conditions.

• Composite databases contain non-redundant data, which helps in searching sequences rapidly.

Examples
• OWL, NRDB and Swiss-Prot + TrEMBL

Genomic Databases

They contain data from the genome sequencing of different organisms, together with gene annotations.

• Human Genome Databases: hold information on human gene sequences.

• Model Organism Databases (MOD): store data from the sequencing projects of model organisms (e.g., MATDB); they are also intended to support the Human Genome Project (HGP).

• Other Organism Databases: store information derived from sequencing projects not related to the HGP.

Continued…

• Organelle Databases: store genomic data of cellular organelles, such as mitochondria, that have their own genome, distinct from the nuclear genome.

• Virus Databases: store virus genomes.

Database-Searching
• Because the amount of biologically relevant data is increasing so rapidly, knowing how to access and search this information is essential.

• There are three data retrieval systems of particular relevance to molecular biologists:
– Sequence Retrieval System (SRS)
– Entrez
– DBGET

• These systems allow text searching of multiple molecular biology databases and provide links to relevant information for entries that match the search criteria.

• The three systems differ in the databases they search and in the links they provide to other information.

Sequence Retrieval System (SRS)
• SRS is a homogeneous interface to over 80 biological databases, developed at the European Bioinformatics Institute (EBI) at Hinxton, UK.

• It includes databases of sequences, metabolic pathways, transcription factors, application results (like BLAST, SSEARCH, FASTA), protein 3-D structures, genomes, mappings, mutations, and locus-specific mutations.

• The web page listing all the databases contains a link to a description page for each database, including the date on which it was last updated.

• The SRS is highly recommended for use.

Entrez

• Entrez is a molecular biology database and retrieval system.

• It was developed by the National Center for Biotechnology Information (NCBI).

• It is an entry point for exploring distinct but integrated databases.

• Of the three text-based database systems, Entrez is the easiest to use, but it also offers more limited information to search.

DBGET

• DBGET is an integrated database retrieval system, developed at the University of Tokyo.

• It provides access to 20 databases, one at a time.

• Because its options are more limited, DBGET is less recommended than the other two.

File Formats in Bioinformatics

The FASTQ format was developed at the Sanger Institute to store a sequence together with its quality scores. In FASTQ files each entry occupies 4 lines.

• Line 1 begins with a '@' character, followed by a sequence identifier and an optional description.
• Line 2 is the sequence in standard one-letter code.
• Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any additional description) again.
• Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as there are letters in the sequence.

Example:

@K00188:208:HFLNGBBXX:3:1101:1428:1508 2:N:0:CTTGTA
ATAATAGGATCCCTTTTCCTGGAGCTGCCTTTAGGTAATGTAGTATCTNATNGACTGNCNCCANANGGCTAAAGT
+
AAAFFJJJJJJJJJJJJJJJJJFJJFJJJJJFJJJJJJJJJJJJJJJJ#FJ#JJJJF#F#FJJ#F#JJJFJJJJJ

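The four-line FASTQ layout can be read with a few lines of Python. This is a minimal sketch, not a full parser: the names `FastqRecord`, `parse_fastq` and `phred_scores` and the tiny demo record are hypothetical, and the quality decoding assumes the common Phred+33 encoding, which the slides do not specify.

```python
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str  # text after '@' on line 1
    sequence: str    # line 2
    quality: str     # line 4, one symbol per base


def parse_fastq(text: str) -> Iterator[FastqRecord]:
    """Yield one record for every group of four non-empty lines."""
    lines = [line for line in text.splitlines() if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of 4 lines")
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:], seq, qual)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode quality symbols; the offset of 33 is an assumption."""
    return [ord(symbol) - offset for symbol in quality]


# Hypothetical record used only for illustration.
demo = "@read1 demo\nACGTN\n+\nIIII#\n"
records = list(parse_fastq(demo))
print(records[0].sequence, phred_scores(records[0].quality))
```

Under the Phred+33 assumption, the '#' symbols that sit under the N bases in the slide's example would decode to a quality of 2.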
Continued…

• The FASTA format is a simple way of representing the nucleotide or amino acid sequences of nucleic acids and proteins.

• It is a very basic format in which a record has at least two lines. The first line, called the comment line, starts with '>' and gives basic information about the sequence.

• After the comment line, the sequence of the nucleic acid or protein is given in standard one-letter code.

• Any tabs, spaces, asterisks etc. in the sequence are ignored.

Continued…

Example:

>XR_002086427.1 Candida albicans SC5314 uncharacterized ncRNA (SCR1), ncRNA
TGGCTGTGATGGCTTTTAGCGGAAGCGCGCTGTTCGCGTACCTGC
TGTTTGTTGAAAATTTAAGAGCAAAGTGTCCGGCTCGATCCCTGC

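A FASTA file as described above (a '>' comment line followed by sequence lines) can be read with a short loop. The sketch below is only illustrative: `parse_fasta` is a hypothetical helper, and it treats every character that is not a letter or a gap hyphen as ignorable, which is one possible reading of "tabs, spaces, asterisks etc. are ignored".

```python
from typing import Dict


def parse_fasta(text: str) -> Dict[str, str]:
    """Map each comment line (without the leading '>') to its sequence."""
    records: Dict[str, str] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Keep only letters and gap hyphens; this drops spaces,
            # tabs and asterisks inside the sequence.
            records[header] += "".join(
                ch for ch in line if ch.isalpha() or ch == "-"
            )
    return records


# Hypothetical input used only for illustration.
demo = ">seq1 demo\nACGT AC*\nGG\tTT\n"
print(parse_fasta(demo))
```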
Continued…
GenBank:
• The GenBank format, used by public databases such as NCBI's GenBank, is arguably the de facto standard sequence file format.

• The GenBank format is very flexible, allowing annotations, comments, and references to be included.

• Because the file is plain text, it can be viewed using a text editor. The file extension '.gb' or '.genbank' is commonly used for GenBank files.

• The start of the sequence is marked by a line containing "ORIGIN" and the end of the sequence is marked by two slashes ("//").

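The "ORIGIN" and "//" markers are enough to pull the raw sequence out of a GenBank record. The following sketch assumes a single record per input; `genbank_sequence` is a hypothetical helper and the demo record is a deliberately incomplete, made-up fragment rather than a valid GenBank entry.

```python
def genbank_sequence(text: str) -> str:
    """Return the sequence found between the ORIGIN line and '//'."""
    in_sequence = False
    pieces = []
    for line in text.splitlines():
        if line.startswith("ORIGIN"):
            in_sequence = True
        elif line.startswith("//"):
            break
        elif in_sequence:
            # Sequence lines hold a position number and space-separated
            # blocks of bases; keep only the letters.
            pieces.append("".join(ch for ch in line if ch.isalpha()))
    return "".join(pieces).upper()


# Made-up, minimal fragment used only for illustration.
demo = """LOCUS       DEMO1        12 bp    DNA
ORIGIN
        1 acgtacgtac gt
//
"""
print(genbank_sequence(demo))
```

For real work, an established parser such as the one in Biopython is usually a better choice than hand-written code.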
Continued…
EMBL:
• The EMBL format, which is similar in appearance to the GenBank format, is used by the public nucleotide database of the European Molecular Biology Laboratory.

PDB:
• The PDB file format is used to store both sequence data and, more importantly, three-dimensional structure data.
• This information can be used to visualize a molecule's crystal structure (typically a protein).
• PDB files are plain text files that can be viewed with a text editor and usually have the extension '.pdb'.

Continued…
SAM:
• The Sequence Alignment/Map (SAM) format is used to store sequence alignment information in text format. It consists of an optional header section and an alignment section. Each header line starts with '@', and the header may span multiple lines.

Continued…
PIR
A sequence in PIR format consists of:
• One line starting with
  1. a ">" (greater-than) sign, followed by
  2. a two-letter code describing the sequence type (P1, F1, DL, DC, RL, RC, or XX), followed by
  3. a semicolon, followed by the sequence identification.

• One line containing a textual description of the sequence.

• One or more lines containing the sequence itself. The end of the sequence is marked by a "*" (asterisk) character.

• A file in PIR format may comprise more than one sequence.

Continued…

Two letter code of the sequence:

• P1 - Protein (complete)
• F1 - Protein (fragment)
• D1 - DNA (e.g. EMBOSS seqret output)
• DL - DNA (linear)
• DC - DNA (circular)
• RL - RNA (linear)
• RC - RNA (circular)
• N3 - tRNA
• N1 - Other functional RNA
• XX - Unknown
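
The PIR layout (a ">" line with a two-letter code and identifier, a description line, then sequence lines ending in "*") can be parsed as sketched below. `parse_pir` and the two demo entries are hypothetical and serve only to illustrate the structure; no validation of the two-letter code is attempted.

```python
from typing import Dict, List


def parse_pir(text: str) -> List[Dict[str, str]]:
    """Parse PIR entries into dicts with type, id, description, sequence."""
    lines = [line.rstrip() for line in text.splitlines()]
    records = []
    i = 0
    while i < len(lines):
        if not lines[i].startswith(">"):
            i += 1
            continue
        # ">P1;IDENT" -> two-letter code before ';', identifier after it.
        code, _, identifier = lines[i][1:].partition(";")
        description = lines[i + 1] if i + 1 < len(lines) else ""
        i += 2
        chunks = []
        while i < len(lines):
            chunk = lines[i].replace(" ", "")
            i += 1
            if chunk.endswith("*"):  # the asterisk closes the sequence
                chunks.append(chunk[:-1])
                break
            chunks.append(chunk)
        records.append({
            "type": code,
            "id": identifier,
            "description": description,
            "sequence": "".join(chunks),
        })
    return records


# Made-up entries used only for illustration.
demo = (
    ">P1;DEMO1\nHypothetical demo protein\nMKT AYI\nAKQR*\n"
    ">DL;DEMO2\nHypothetical demo DNA\nACGT*\n"
)
for record in parse_pir(demo):
    print(record["type"], record["id"], record["sequence"])
```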

Continued…

CLUSTALW:

• The first line in the file must start with the word "CLUSTALW". Other information in the first line is ignored.

• One or more empty lines.

• One or more blocks of sequence data. Each block consists of one line for each sequence in the alignment.

Continued…

• Each line consists of:
  – The sequence name
  – White space
  – Up to 60 sequence symbols
  – Optionally, white space followed by a cumulative count of residues for the sequence

• Some rules about representing sequences:
  – Case doesn't matter.
  – Sequence symbols should be from a valid alphabet.
  – Gaps are represented using hyphens ("-").

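The block structure of a CLUSTALW file lends itself to a simple parser that concatenates each sequence's pieces across blocks. This is a sketch under assumptions: `parse_clustal` is a hypothetical helper, it accepts any header starting with "CLUSTAL", and it skips lines that begin with white space on the assumption that they hold conservation marks rather than sequence data.

```python
from typing import Dict


def parse_clustal(text: str) -> Dict[str, str]:
    """Return the aligned sequence for each name, joined across blocks."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith("CLUSTAL"):
        raise ValueError("first line must start with CLUSTAL")
    alignment: Dict[str, str] = {}
    for line in lines[1:]:
        # Blank lines separate blocks; lines starting with white space
        # are assumed to be conservation marks and are skipped.
        if not line.strip() or line[0].isspace():
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        # fields[0] is the name, fields[1] the symbols; an optional
        # cumulative residue count in fields[2] is ignored.
        name, symbols = fields[0], fields[1]
        alignment[name] = alignment.get(name, "") + symbols
    return alignment


# Made-up alignment used only for illustration.
demo = (
    "CLUSTALW demo\n\n"
    "seq1  ACG-T 4\nseq2  AC--T 3\n\n"
    "seq1  GGA 7\nseq2  G-A 5\n"
)
print(parse_clustal(demo))
```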
Continued…

GCG Format:

• A sequence file in GCG format contains exactly one sequence. It begins with annotation lines, and the start of the sequence is marked by a line ending with two dots (".."). This line also contains the sequence identifier, the sequence length and a checksum. This format should only be used if the file was created with the GCG package.

Continued…

An example sequence in GCG format is:


ID   AB000263 standard; RNA; PRI; 368 BP.
XX
AC   AB000263;
XX
DE   Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.
XX
SQ   Sequence 368 BP;
AB000263  Length: 368  Check: 4514  ..
       1  acaagatgcc attgtccccc ggcctcctgc tgctgctgct ctccggggcc acggccaccg
      61  ctgccctgcc cctggagggt ggccccaccg gccgagacag cgagcatatg caggaagcgg
     121  caggaataag gaaaagcagc ctcctgactt tcctcgcttg gtggtttgag tggacctccc
     181  aggccagtgc cgggcccctc ataggagagg aagctcggga ggtggccagg cggcaggaag
     241  gcgcaccccc ccagcaatcc gcgcgccggg acagaatgcc ctgcaggaac ttcttctgga
     301  agaccttctc ctcctgcaaa taaaacctca cccatgaatg ctcacgcaag tttaattaca
     361  gacctgaa

Continued…
MSF:
• MSF-formatted multiple sequence files are most often created by programs of the GCG suite.

• MSF files include the sequence name and the sequence itself, which is usually aligned with other sequences in the file.

• You can specify a single sequence or many sequences within an MSF file.

– The file begins with the line (all uppercase) !!NA_MULTIPLE_ALIGNMENT 1.0 for nucleic acid sequences or !!AA_MULTIPLE_ALIGNMENT 1.0 for amino acid sequences.
– Do not edit or delete the file type line if it is present.

Frequent Words and k-mers in Text

• A k-mer is just a sequence of k characters in a string (or nucleotides in a DNA sequence).

• We say that Pattern is a most frequent k-mer in Text if it maximizes Count(Text, Pattern) among all k-mers.

• For example, "ACTAT" is a most frequent 5-mer in "ACAACTATGCATCACTATCGGGAACTATCCT".

• "ATA" is a most frequent 3-mer of "CGATATATCCATAG".

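The definition of a most frequent k-mer translates directly into a counting routine. The sketch below uses Python's `collections.Counter`; the function name `frequent_words` is chosen here for illustration and is not taken from the slides.

```python
from collections import Counter
from typing import List


def frequent_words(text: str, k: int) -> List[str]:
    """Return every k-mer whose count in text is maximal, sorted."""
    counts = Counter(text[i:i + k] for i in range(len(text) - k + 1))
    if not counts:
        return []
    best = max(counts.values())
    return sorted(kmer for kmer, n in counts.items() if n == best)


print(frequent_words("ACAACTATGCATCACTATCGGGAACTATCCT", 5))
print(frequent_words("CGATATATCCATAG", 3))
```

Returning a list rather than a single string reflects the wording "a most frequent k-mer": several k-mers can tie for the maximum count.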
Continued…
• Consider the DNA sequence "ACGAGGTACGA", which consists of 11 nucleotides. Let's obtain all the 4-mers (substrings of length 4) in this DNA sequence: ACGA, CGAG, GAGG, AGGT, GGTA, GTAC, TACG, ACGA.

Continued…

• The idea is simple. We create a window of length 4 and slide it from left to right, shifting one character at a time. If the length of the given DNA sequence is N, we end up with (N – k) + 1 k-mers.

• Total no. of k-mers = (N – k) + 1

• In the above example, the given DNA sequence is 11 characters long (N = 11) and k = 4, so we get (11 – 4) + 1 = 8 4-mers.

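The sliding window can be written as a one-line list comprehension. This is a minimal sketch; the helper name `kmers` is an assumption made for illustration.

```python
from typing import List


def kmers(sequence: str, k: int) -> List[str]:
    """Slide a window of length k one character at a time."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


four_mers = kmers("ACGAGGTACGA", 4)
print(len(four_mers), four_mers)
```

The range runs over (N - k) + 1 start positions, which is exactly the count given by the formula.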
Continued…

• The total count of a k-mer is simply how many times it appears in the given sequence; summed over all k-mers, it equals (N – k) + 1.
• Distinct k-mers are counted only once, regardless of how many times they appear.
• Unique k-mers are those which appear only once. In the above example, since ACGA appears twice, its unique count is zero.

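The three kinds of count can be computed together from a single tally. A short sketch, with `kmer_summary` as a hypothetical name:

```python
from collections import Counter
from typing import Tuple


def kmer_summary(sequence: str, k: int) -> Tuple[int, int, int]:
    """Return (total, distinct, unique) k-mer counts for a sequence."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())                        # every occurrence
    distinct = len(counts)                              # each k-mer once
    unique = sum(1 for n in counts.values() if n == 1)  # seen exactly once
    return total, distinct, unique


print(kmer_summary("ACGAGGTACGA", 4))
```

For "ACGAGGTACGA" with k = 4, ACGA occurs twice, so there are 8 k-mers in total, 7 distinct ones, and 6 unique ones.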
Continued…

• "ATA" is a most frequent 3-mer of "CGATATATCCATAG".

• What is the most frequent 5-mer in "ACAACTAAGCATCACTAACGGGAACTAACCT"?

Ans: ACTAA

Why k-mers?
• Decomposing a sequence into its k-mers allows a set of fixed-size chunks to be analysed rather than the whole sequence, and this can be more efficient.

• k-mers are very useful in sequence matching: set operations on them are fast and simple, and there are many readily available algorithms and techniques for working with them.

• A simple example: to check whether a sequence S comes from organism A or from organism B, assuming the genomes of A and B are known and sufficiently different, we can check whether S contains more k-mers present in A or in B.

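The organism-assignment idea can be sketched with Python sets. Everything here is illustrative: the function names, the toy "genomes" and the choice k = 3 are assumptions, and a real classifier would use much longer k-mers and handle sequencing errors and reverse complements.

```python
from typing import Set


def kmer_set(sequence: str, k: int) -> Set[str]:
    """All distinct k-mers of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def closer_genome(read: str, genome_a: str, genome_b: str, k: int) -> str:
    """Say which genome shares more distinct k-mers with the read."""
    read_kmers = kmer_set(read, k)
    shared_a = len(read_kmers & kmer_set(genome_a, k))
    shared_b = len(read_kmers & kmer_set(genome_b, k))
    if shared_a == shared_b:
        return "undecided"
    return "A" if shared_a > shared_b else "B"


# Toy sequences invented for illustration.
genome_a = "ACGTACGTGACC"
genome_b = "TTTTGGGGCCCC"
print(closer_genome("GTACGT", genome_a, genome_b, 3))
print(closer_genome("GGGC", genome_a, genome_b, 3))
```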
Applications of k-mer Counting

Some of the applications where k-mer counting is used:

• Genome assembly
• Sequence alignment
• Sequence clustering
• Error correction of sequencing reads
• Genome size estimation
• Repeat identification

Frequent k-mers with mismatches

A Pattern does not need to appear exactly as a substring of Text; occurrences that differ from it in a few positions also count.

• Ex 1:
Pattern: ATGATCAAG
Text: atcaATGATCAACgtataagcATGATCAAGgtgct

• Ex 2:
Pattern: CTTGATCAT
Text: gaaagCATGATCATggctgCTTGATCATctgtt

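Hamming Distance Problem

The Hamming distance between two strings of equal length is the number of positions at which they differ.

Example 1:
• Input:
GGGCCGTTGGT
GGACCGTTGAC
• Output: 3

Example 2:
• Input:
GGGCCGTTGGT
GGAGCGCTGAC
• Output: 5
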
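The Hamming distance, the number of positions at which two equal-length strings differ, takes only a few lines of Python. A minimal sketch that reproduces the two examples:

```python
def hamming_distance(p: str, q: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(p) != len(q):
        raise ValueError("strings must have equal length")
    return sum(a != b for a, b in zip(p, q))


print(hamming_distance("GGGCCGTTGGT", "GGACCGTTGAC"))
print(hamming_distance("GGGCCGTTGGT", "GGAGCGCTGAC"))
```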
Approximate Pattern Matching Problem
• Find all starting positions where Pattern appears as a substring of Text with at most d mismatches.
• Input:
Pattern: ATTCTGGA
Text: CGCCCGAATCCAGAACGCATTCCCATATTTCGGGACCACTGGCCTCCACGGTACGGACGTCAATCAAAT
d: 3

• Output (0-based positions):
6 7 26 27

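Approximate pattern matching can be sketched as a scan over every window of Text, keeping the start positions whose mismatch count stays within d. The function name `approximate_matches` is chosen for illustration, and positions are 0-based, as in the example output.

```python
from typing import List


def approximate_matches(pattern: str, text: str, d: int) -> List[int]:
    """Start positions where pattern matches text with <= d mismatches."""
    k = len(pattern)
    positions = []
    for i in range(len(text) - k + 1):
        window = text[i:i + k]
        mismatches = sum(a != b for a, b in zip(pattern, window))
        if mismatches <= d:
            positions.append(i)
    return positions


text = (
    "CGCCCGAATCCAGAACGCATTCCCATATTTCG"
    "GGACCACTGGCCTCCACGGTACGGACGTCAATCAAAT"
)
print(approximate_matches("ATTCTGGA", text, 3))
```

This brute-force scan does work proportional to the text length times the pattern length, which is fine for short examples like this one.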
Approximate Pattern Count

• Count_d(Text, Pattern) is the number of k-mer substrings of Text whose Hamming distance from Pattern is at most d.

• Example:
Pattern: GAGG
Text: TTTAGAGCCTTCAGAGG
d (max mismatches): 2

• Output: 4

• The four matching substrings are TAGA, GAGC, CAGA and GAGG, starting at 0-based positions 2, 4, 11 and 13.

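Counting approximate occurrences follows the same window scan: compare Pattern with every k-mer of Text and count those within d mismatches. A minimal sketch, with `approximate_count` as a hypothetical name:

```python
def approximate_count(text: str, pattern: str, d: int) -> int:
    """Number of windows of text within d mismatches of pattern."""
    k = len(pattern)
    count = 0
    for i in range(len(text) - k + 1):
        mismatches = sum(a != b for a, b in zip(pattern, text[i:i + k]))
        if mismatches <= d:
            count += 1
    return count


print(approximate_count("TTTAGAGCCTTCAGAGG", "GAGG", 2))
```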
Substring Reconstruction

• Given a string Text, its k-mer composition Composition_k(Text) is the collection of all k-mer substrings of Text (including repeated k-mers).
• Composition_3(TATGGGGTGC) = {TAT, ATG, TGG, GGG, GGG, GGT, GTG, TGC}

• Note that we have to list k-mers in lexicographic order (i.e., how they would appear in a dictionary) rather than in the order of their appearance in the Text:
• Composition_3(TATGGGGTGC) = {ATG, GGG, GGG, GGT, GTG, TAT, TGC, TGG}

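The k-mer composition in lexicographic order is a sorted list of all windows, with repeats kept. A short sketch (the name `composition` is an assumption):

```python
from typing import List


def composition(text: str, k: int) -> List[str]:
    """All k-mers of text, repeats included, in lexicographic order."""
    return sorted(text[i:i + k] for i in range(len(text) - k + 1))


print(composition("TATGGGGTGC", 3))
```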
String Reconstruction Problem

• 3-mers: AAT ATG GTT TAA TGT

• The most natural way to solve the String Reconstruction Problem is to mimic the solution of the Newspaper Problem and "connect" a pair of k-mers if they overlap in k-1 symbols.

• The string must start with TAA because no 3-mer ends in TA.

• This implies that the next 3-mer in the string should start with AA.

Continued…

• There is only one 3-mer satisfying this condition, AAT:
TAA AAT
• In turn, AAT can only be extended by ATG, which can only be extended by TGT, and so on, leading us to reconstruct TAATGTT:
TAA
 AAT
  ATG
   TGT
    GTT
TAATGTT

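When every k-mer has exactly one possible successor, as in the five 3-mers AAT ATG GTT TAA TGT, the chaining can be sketched as below. `reconstruct_unambiguous` is a hypothetical helper that deliberately gives up (raises an error) as soon as a step has zero or several candidates; it is not a general solution.

```python
from typing import List


def reconstruct_unambiguous(kmers: List[str]) -> str:
    """Chain k-mers that overlap in k-1 symbols when each step is forced."""
    remaining = list(kmers)
    suffixes = {kmer[1:] for kmer in remaining}
    # The first k-mer is the one whose prefix is no other k-mer's suffix.
    starts = [kmer for kmer in remaining if kmer[:-1] not in suffixes]
    if len(starts) != 1:
        raise ValueError("no single obvious starting k-mer")
    current = starts[0]
    remaining.remove(current)
    text = current
    while remaining:
        candidates = [kmer for kmer in remaining if kmer[:-1] == current[1:]]
        if len(candidates) != 1:
            raise ValueError("ambiguous extension or dead end")
        current = candidates[0]
        remaining.remove(current)
        text += current[-1]  # each new k-mer adds one symbol
    return text


print(reconstruct_unambiguous(["AAT", "ATG", "GTT", "TAA", "TGT"]))
```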
Continued…

• Ex 2: AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT

• Solution:
• If we start with TAA again, the next 3-mers must be AAT and then ATG.

• ATG can be extended by TGC, TGG, or TGT. Let's select TGT.

Continued…

• After TGT, our only choice is GTT.

• But we have now reached a dead end, since no k-mer starts with TT while many 3-mers remain unused. Here we need to backtrack and find an alternative solution.

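Backtracking out of a dead end can be sketched as a depth-first search that undoes a choice whenever the path cannot be completed. This is one straightforward way to do it, not necessarily the method used in the course; the name `reconstruct` is an assumption, and the search can take exponential time on large inputs.

```python
from collections import Counter
from typing import List, Optional


def reconstruct(kmers: List[str]) -> Optional[str]:
    """Find a string whose k-mer composition equals kmers, or None."""
    remaining = Counter(kmers)

    def extend(path: List[str]) -> Optional[List[str]]:
        if len(path) == len(kmers):
            return path
        for kmer in sorted(remaining):
            if remaining[kmer] > 0 and kmer[:-1] == path[-1][1:]:
                remaining[kmer] -= 1
                result = extend(path + [kmer])
                if result is not None:
                    return result
                remaining[kmer] += 1  # dead end: undo and try another
        return None

    for start in sorted(set(kmers)):
        remaining[start] -= 1
        path = extend([start])
        if path is not None:
            return path[0] + "".join(kmer[-1] for kmer in path[1:])
        remaining[start] += 1
    return None


example = ("AAT ATG ATG ATG CAT CCA GAT GCC "
           "GGA GGG GTT TAA TGC TGG TGT").split()
print(reconstruct(example))
```

More than one string can share the same composition, so the result is one valid reconstruction rather than the only one.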
Continued…

[Figure: alternative solution and final sequence]

Substitution Matrix

• In bioinformatics, a substitution matrix describes the frequency at which a character in a nucleotide sequence or a protein sequence changes to other character states over evolutionary time.

• The information is often in the form of log odds of finding two specific character states aligned and depends on the assumed number of evolutionary changes or sequence dissimilarity between compared sequences.

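The "log odds" form mentioned above is commonly written as follows. The symbols are introduced here for illustration and do not come from the slides: p_ab is the observed frequency with which residues a and b are aligned in trusted alignments, q_a and q_b are the background frequencies of the individual residues, and λ is a scaling factor.

```latex
S(a, b) = \frac{1}{\lambda} \, \log \frac{p_{ab}}{q_a \, q_b}
```

A positive score means the pair is seen aligned more often than chance would predict; a negative score means it is seen less often.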
Continued…

• Substitution matrices are usually seen in the context of amino acid or DNA sequence alignments, where they are used to calculate similarity scores between the aligned sequences.

• A substitution matrix is a collection of scores for aligning nucleotides or amino acids with one another. These scores generally represent the relative ease with which one nucleotide or amino acid may mutate into or substitute for another, and they are used to measure similarity in sequence alignments.

Example

In the process of evolution, from one generation to the next, the amino acid sequences of an organism's proteins are gradually altered through the action of DNA mutations. For example, the sequence

ALEIRYLRD

could mutate into the sequence

ALEINYLRD

in one step, and possibly

AQEINYQRD

over a longer period of evolutionary time.

Continued…

• Each amino acid is more or less likely to mutate into various other amino acids. For instance, a hydrophilic residue such as arginine is more likely to be replaced by another hydrophilic residue such as glutamine than it is to be mutated into a hydrophobic residue such as leucine.

• This is primarily due to redundancy in the genetic code, which translates similar codons into similar amino acids.

Gap Penalties

• A gap penalty is a method of scoring alignments of two or more sequences.

• When aligning sequences, introducing gaps in the sequences can allow an alignment algorithm to match more terms than a gap-less alignment can.

• However, minimizing gaps in an alignment is important to create a useful alignment. Too many gaps can cause an alignment to become meaningless.

• Gap penalties are used to adjust alignment scores based on the number and length of gaps.

Sequence Alignment

• Sequence alignment is the procedure of comparing two (pair-wise alignment) or more (multiple sequence alignment) sequences by searching for a series of individual characters or character patterns that are in the same order in the sequences.
• Two sequences are aligned by writing them across a page in two rows. Identical or similar characters are placed in the same column, and non-identical characters can either be placed in the same column as a mismatch or opposite a gap in the other sequence.

• In an optimal alignment, nonidentical characters and gaps are placed to bring as many identical or similar characters as possible into vertical register.

• Sequences that can be readily aligned in this manner are said to be similar.

• There are two types of sequence alignment, global and local.

• In global alignment, an attempt is made to align the entire sequence, using as many characters as possible, up to both ends of each sequence. Sequences that are quite similar and approximately the same length are suitable candidates for global alignment.
• In local alignment, stretches of sequence with the highest density of matches are aligned, thus generating one or more islands of matches or subalignments in the aligned sequences. Local alignments are more suitable for aligning sequences that are similar along some of their lengths but dissimilar in others.

Sequence alignment Ex.

• Problem: Align abcdef with the somewhat similar sequence abdgf.

• Solution: Write the second sequence below the first one:
abcdef
abdgf

• Move the sequences to give the maximum match between them.

• Show characters that match using a vertical bar.

abcdef
||
abdgf
• To maximise the alignment, we insert a gap between b and d in the lower sequence so that d and f can align.
abcdef
|| | |
ab-dgf
• Note that e and g do not match.

• We are looking for an alignment which:
  – Maximizes the number of base-to-base matches.
  – If necessary to achieve this goal, inserts gaps in either sequence (a gap means a base-to-nothing match).
  – Preserves the order of bases in each sequence.
  – Contains no gap-to-gap matches.

• We need some scheme to evaluate the goodness of an alignment.

• The scoring scheme consists of character substitution scores (i.e. a score for each possible character replacement) plus penalties for gaps.

• The alignment score is the sum of substitution scores and gap penalties. The alignment score reflects the goodness of the alignment.

• For DNA we can construct the following substitution matrix: '+1' as a reward for a match and '-1' as the penalty for a mismatch, ignoring gaps:

      C    T    A    G
C    +1   -1   -1   -1
T    -1   +1   -1   -1
A    -1   -1   +1   -1
G    -1   -1   -1   +1

• Using this scoring scheme, let us evaluate the following alignments (the penalty for a gap is 0):

A T G G C G is the query sequence.

Ex. 1. A T G – A G   The score: +1 +1 +1 +0 -1 +1 = 3

Ex. 2. A – T G A G   The score: +1 +0 -1 +1 -1 +1 = 1

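The scoring scheme used here (+1 for a match, -1 for a mismatch, 0 for a gap) can be applied column by column. In this sketch the function name `alignment_score` and its keyword parameters are assumptions, and gaps are written with an ASCII hyphen.

```python
def alignment_score(top: str, bottom: str, match: int = 1,
                    mismatch: int = -1, gap: int = 0) -> int:
    """Sum the column scores of two aligned, equal-length rows."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += gap        # base aligned to nothing
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


print(alignment_score("ATGGCG", "ATG-AG"))
print(alignment_score("ATGGCG", "A-TGAG"))
```

Changing the `gap` argument to a negative value shows how quickly a gapped alignment loses its advantage once gaps are penalized.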
Find the score for following alignments

• ATCCGATTCGA
• GATCGTATCCA

• CCTCTCACTGAGACTCAGCC
• CGA-ACTCTT-GATC-TGCA

• CGTAG-ATAG-TGCTAG-AGAAT-GGG-CCACT
• GTAGCTGATC-ATCGATCGTACGTAGC-GCTGA

PAM
• One of the first amino acid substitution matrices, the PAM (Point Accepted Mutation) matrix, was developed by Margaret Dayhoff in the 1970s.
• This matrix is calculated by observing the differences in closely related proteins.
• Because very closely related homologs are used, the observed mutations are not expected to significantly change the common functions of the proteins.
• Thus the observed substitutions (by point mutations) are considered to be accepted by natural selection.

BLOSUM
• Dayhoff's methodology of comparing closely related species turned out not to work very well for aligning evolutionarily divergent sequences.
• Sequence changes over long evolutionary time scales are not well approximated by compounding small changes that occur over short time scales.
• The BLOSUM (BLOck SUbstitution Matrix) series of matrices rectifies this problem.
• Henikoff & Henikoff constructed these matrices using multiple alignments of evolutionarily divergent proteins.

• Seq 1: MARSIFLT
• Seq 2: MADQLTEE

Types of Gap penalties
1. Constant gap penalty
This is the simplest type of gap penalty: a fixed negative score is given to every gap, regardless of its length. This encourages the algorithm to make fewer, larger gaps, leaving larger contiguous sections.

Example: aligning two short DNA sequences, with each '-' depicting a gap of one base pair. If each match is worth 1 point and the whole gap -1, the total score is 7 − 1 = 6.

2. Linear gap penalty
Compared to the constant gap penalty, the linear gap penalty takes into account the length (L) of each insertion/deletion. If the penalty for each inserted/deleted element is B and the length of the gap is L, the total gap penalty is the product of the two, BL. This method favors shorter gaps, with the total score decreasing with each additional gap position.

Unlike the constant gap penalty, the size of the gap is considered. With a score of 1 for each match and -1 for each gap position, the score here is 7 − 3 = 4.

3. Affine gap penalty
In biological sequences, one big gap of length 10 is more likely to occur than 10 small gaps of length 1.
Therefore, affine gap penalties favour longer gaps over single gaps of the same total length.

They use a gap opening penalty, o < 0, and a gap extension penalty, e < 0, such that |e| < |o|, to encourage gap extension rather than gap introduction.
A gap of length L is then given a penalty
g = o + (L−1)e.
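
The three gap penalty schemes can be compared side by side. In this sketch the function names and the default values (-1 per gap or per position, opening -5, extension -1) are illustrative assumptions; only the formulas follow the definitions above.

```python
def constant_gap_penalty(length: int, penalty: int = -1) -> int:
    """Same fixed penalty for any gap, regardless of its length."""
    return penalty if length > 0 else 0


def linear_gap_penalty(length: int, per_position: int = -1) -> int:
    """Penalty B * L: proportional to the gap length."""
    return per_position * length


def affine_gap_penalty(length: int, opening: int = -5,
                       extension: int = -1) -> int:
    """Penalty g = o + (L - 1) * e for a gap of length L."""
    if length <= 0:
        return 0
    return opening + (length - 1) * extension


matches = 7  # seven matching columns, as in the two worked scores
print(matches + constant_gap_penalty(3))  # one gap of length 3
print(matches + linear_gap_penalty(3))    # same gap, linear scheme
# One gap of length 10 versus ten separate gaps of length 1:
print(affine_gap_penalty(10), 10 * affine_gap_penalty(1))
```

With these assumed values a single gap of length 10 costs -14, while ten separate gaps of length 1 cost -50, which is the preference for longer gaps that the affine scheme is meant to express.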