Module 2 (Bioinformatics)

Uploaded by

Asmi Tanzaen H N

PRESIDENCY UNIVERSITY, BENGALURU

School of Engineering

Department of Computer Science & Engineering

Introduction to Bioinformatics
CSE 3069

V Semester 2023-24
Module 2

Genomic Database and

Sequence similarity
Bioinformatics Resources and Tools

• A database is a structured collection of records stored in a

computer system.

• Genomic databases typically store DNA or protein sequences

as well as annotated information about those sequences.

• There are hundreds of genomics databases: some are

comprehensive, but are not carefully curated (GenBank),
while others are carefully curated, but are narrow (FlyBase).

Continued…

• Bioinformatic tools are computer programs that

analyze one or more sequences.

• There is a wide array of bioinformatic tools: some analyze sequences to find protein domains (Pfam), some search through databases of millions of sequences to find ones that are similar (BLAST), and some find potential protein-coding regions (ORF Finder). Many are freely available over the web.

Continued…

Nucleotide Sequence Databases (the principal ones)
• NCBI - National Center for Biotechnology Information (hosts GenBank)
• EBI - European Bioinformatics Institute (hosts ENA)
• DDBJ - DNA Data Bank of Japan

Database Searching by Sequence Similarity

• BLAST @ NCBI
• PSI-BLAST @ NCBI
• FASTA @ EBI
• BLAT

Continued…

Protein Sequence Databases

• SWISS-PROT & TrEMBL - Protein sequence database

and computer annotated supplement
• UniProt - UniProt (Universal Protein Resource) is the
world's most comprehensive catalog of information
on proteins.
• PIR - Protein Information Resource
• MIPS - Munich Information centre for Protein
Sequences
• HUPO - HUman Proteome Organization

Continued…
Sequence Alignment

• USC Sequence Alignment Server - align 2

sequences with all possible varieties of dynamic
programming
• ClustalW @ EBI - multiple sequence alignment
• Spidey - an mRNA-to-genomic alignment program
• Wise2 - align a protein or profile HMM against
genomic sequence to predict a gene structure, and
related tools
• PipMaker - computes alignments of similar regions
in two (long) DNA sequences
• VISTA - align + detect conserved regions in long
genomic sequence.
List of Open Source Tools available for
Bioinformatics
• geWorkbench
• BioPerl
• UGENE
• BioJava
• Biopython
• InterMine
• IGV (Integrative Genomics Viewer)
• GROMACS
• Taverna Workbench
• EMBOSS
• Clustal Omega
• BLAST
• BEDTools
• Bioclipse
• Bioconductor

Types and classification of genome
databases
Genomics is the study of the structure and function of the entire genome of a living organism. The genome is the complete set of genetic material of an organism (its full set of chromosomes).

Plant Genomics:
• It deals with the study of the structure and function of the entire genome of plant species.

Animal Genomics:
• It deals with the study of the structure and function of the entire genome of animal species.

Continued…
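Eukaryotic Genomics:
• It deals with the study of the structure and function of the entire genome of eukaryotes, i.e. organisms whose cells have a nucleus (animals, plants, fungi and protists).

Prokaryotic Genomics:
• It deals with the study of the structure and function of the entire genome of prokaryotes, i.e. unicellular organisms without a nucleus (bacteria and archaea).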

Types of Databases

Primary databases
• Primary databases are also called archival databases.

• They are populated with experimentally derived data

such as nucleotide sequence, protein sequence or
macromolecular structure.

• Experimental results are submitted directly into the

database by researchers, and the data are essentially
archival in nature.

• Once given a database accession number, the data in

primary databases are never changed: they form part
of the scientific record.

Continued…

Examples
• ENA, GenBank and DDBJ (nucleotide sequence)

• ArrayExpress Archive and GEO (functional genomics data)

• Protein Data Bank (PDB; coordinates of three-

dimensional macromolecular structures)

Continued…

Secondary databases

• Secondary databases comprise data derived from the

results of analysing primary data.

• Secondary databases often draw upon information

from numerous sources, including other databases
(primary and secondary), controlled vocabularies and
the scientific literature.

• They are highly curated, often using a complex

combination of computational algorithms and manual
analysis and interpretation to derive new knowledge
from the public record of science.

Continued…

Examples

• InterPro (protein families, motifs and domains)

• UniProt Knowledgebase (sequence and functional information on proteins)

• Ensembl (variation, function, regulation and more layered onto whole genome sequences)

Continued…
Composite Databases

• The data entered in these types of databases are first

compared and then filtered based on desired criteria.

• The initial data are taken from the primary database, and
then they are merged together based on certain
conditions.

• It helps in searching sequences rapidly. Composite

Databases contain non-redundant data.

Examples:
• OWL, NRDB, and SWISS-PROT + TrEMBL

Genomic Databases

They contain data from the genome sequencing of different organisms, together with gene annotations.

• Human Genome Databases: hold information on the sequencing of human genes.

• Model Organism Databases (MOD): store data from the sequencing projects of model organisms (e.g., MATDB); they are also intended to support the Human Genome Project (HGP).

• Other Organism Databases: store information derived from sequencing projects not related to the HGP.

Continued…

• Organelle Databases: store genomic data of cellular organelles, such as mitochondria, which have their own genome, distinct from the nuclear genome.

• Virus Databases: store virus genomes.

Database-Searching
• The amount of biologically relevant data is increasing so rapidly that knowing how to access and search this information is essential.

• There are three data retrieval systems of particular relevance to molecular biologists:
– Sequence Retrieval System (SRS)
– Entrez
– DBGET

• These systems allow text searching of multiple molecular biology databases and provide links to relevant information for entries that match the search criteria.

• The three systems differ in the databases they search and in the links they provide to other information.
Sequence Retrieval System (SRS)
• SRS is a uniform interface to over 80 biological databases. It was developed at the European Bioinformatics Institute (EBI) at Hinxton, UK.

• It includes databases of sequences, metabolic

pathways, transcription factors, application results (like
BLAST, SSEARCH, FASTA), protein 3-D structures,
genomes, mappings, mutations, and locus specific
mutations.

• The web page listing all the databases contains a link

to a description page about the database including the
date on which it was last updated.

• The SRS is highly recommended for use.

Entrez

• Entrez is a molecular biology database and retrieval

system.

• It was developed by the National Center for Biotechnology Information (NCBI).

• It is an entry point for exploring distinct but integrated databases.

• Of the three text-based database systems, Entrez is

the easiest to use, but also offers more limited
information to search.

DBGET

• DBGET is an integrated database retrieval system, developed at the University of Tokyo.

• It provides access to 20 databases, one at a time.

• Because its options are more limited, DBGET is less recommended than the other two systems.

File Formats in Bioinformatics

The FASTQ format was developed by the Sanger Institute to store a sequence together with its quality scores. In FASTQ files, each entry occupies four lines.

• Line 1 begins with a '@' character, followed by a sequence identifier and an optional description.
• Line 2 holds the sequence in standard one-letter code.
• Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any additional description) again.
• Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as there are letters in the sequence.

Example:

@K00188:208:HFLNGBBXX:3:1101:1428:1508 2:N:0:CTTGTA
ATAATAGGATCCCTTTTCCTGGAGCTGCCTTTAGGTAATGTAGTATCTNATNGACTGNCNCCANANGGCTAAAGT
+
AAAFFJJJJJJJJJJJJJJJJJFJJFJJJJJFJJJJJJJJJJJJJJJJ#FJ#JJJJF#F#FJJ#F#JJJFJJJJJ
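As a minimal sketch of the four-line layout described above, the following Python function splits FASTQ text into records. The name `parse_fastq` is hypothetical, and decoding each quality symbol as its ASCII code minus 33 (the Phred+33 convention) is an assumption that these slides do not state; other offsets exist.

```python
def parse_fastq(text):
    """Split FASTQ text into records of four lines each (sketch)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    records = []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("line 1 must start with '@' and line 3 with '+'")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        records.append({
            "id": header[1:].split()[0],          # identifier without '@'
            "seq": seq,
            # Assumption: Phred+33 encoding of the quality symbols.
            "qual": [ord(symbol) - 33 for symbol in qual],
        })
    return records
```

The length check mirrors the rule that Line 4 must contain as many symbols as Line 2 has letters.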

Continued…

• FASTA format is a simple text format for representing nucleotide sequences (nucleic acids) or amino acid sequences (proteins).

• It is a very basic format with a minimum of two lines. The first line, called the header (or comment) line, starts with '>' and gives basic information about the sequence.

• After the header line, the nucleic acid or protein sequence is given in standard one-letter code.

• Any tabs, spaces, asterisks, etc. in the sequence are ignored.

Continued…

Example:

>XR_002086427.1 Candida albicans SC5314 uncharacterized ncRNA (SCR1), ncRNA
TGGCTGTGATGGCTTTTAGCGGAAGCGCGCTGTTCGCGTACCTGC
TGTTTGTTGAAAATTTAAGAGCAAAGTGTCCGGCTCGATCCCTGC
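The FASTA rules above can be sketched in a few lines of Python. The function name `parse_fasta` is hypothetical, and using the first word of the header line as the record name is an assumption made for this illustration.

```python
def parse_fasta(text):
    """Return {record name: sequence} from FASTA text (sketch)."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            # Assumption: the first word of the header identifies the record.
            name = fields[0] if fields else ""
            records[name] = []
        elif name is not None:
            # Spaces, tabs and asterisks inside the sequence are ignored.
            cleaned = line.replace(" ", "").replace("\t", "").replace("*", "")
            records[name].append(cleaned)
    return {key: "".join(parts) for key, parts in records.items()}
```

Sequence lines belonging to one record are joined, so a sequence may be wrapped over any number of lines.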

Continued…
GenBank:
• The GenBank format, used by public databases such as NCBI's GenBank, is one of the most widely used sequence file formats.

• The GenBank format is flexible: it can hold annotations, comments, and references alongside the sequence.

• Because the file is plain text, it can be viewed in a text editor. GenBank files commonly use the extension '.gb' or '.genbank'.

• The start of the sequence is marked by a line containing "ORIGIN", and the end of the sequence is marked by two slashes ("//").
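The ORIGIN and "//" markers make the sequence easy to pull out of a record. The sketch below uses a hypothetical helper, `genbank_sequence`, and a made-up two-line record; it keeps only letters, because GenBank sequence lines also carry position numbers and spaces.

```python
def genbank_sequence(text):
    """Return the sequence between the ORIGIN line and the closing '//'."""
    in_sequence = False
    parts = []
    for line in text.splitlines():
        if line.startswith("ORIGIN"):
            in_sequence = True
        elif line.startswith("//"):
            break
        elif in_sequence:
            # Drop the leading position number and the spaces between blocks.
            parts.append("".join(ch for ch in line if ch.isalpha()))
    return "".join(parts)


# Hypothetical toy record, not taken from any database.
record = "LOCUS       DEMO\nORIGIN\n        1 acgtacgtac gt\n//\n"
print(genbank_sequence(record))
```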
Continued…
EMBL:
• The EMBL format, which is similar in appearance to the GenBank format, is used by the public databases of the European Molecular Biology Laboratory.

PDB:
• The PDB file format stores sequence data and, more importantly, three-dimensional structure data.
• This information can be used to visualize a molecule's structure (typically that of a protein).
• PDB files are plain text files that can be viewed with a text editor and usually have the extension '.pdb'.

Continued…
SAM:
• The Sequence Alignment Map (SAM) format stores sequence alignment information as text. It is divided into an optional header section and an alignment section. Each header line starts with '@', and the header may span multiple lines.

Continued…
PIR
A sequence in PIR format consists of:
–One line starting with
1. a ">" (greater-than) sign, followed by
2. a two-letter code describing the sequence type (P1, F1,
DL, DC, RL, RC, or XX), followed by
3. a semicolon, followed by the sequence identification.

• One line containing a textual description of the sequence.

• One or more lines containing the sequence itself. The end

of the sequence is marked by a "*" (asterisk) character.

• A file in PIR format may comprise more than one

sequence.
Continued…

Two letter code of the sequence:

• P1 - Protein (complete)
• F1 - Protein (fragment)
• D1 - DNA (e.g. EMBOSS seqret output)
• DL - DNA (linear)
• DC - DNA (circular)
• RL - RNA (linear)
• RC - RNA (circular)
• N3 - tRNA
• N1 - Other functional RNA
• XX - Unknown

Continued…

CLUSTALW:

• The first line in the file must start with "CLUSTAL W" or "CLUSTALW". Other information in the first line is ignored.

• One or more empty lines.

• One or more blocks of sequence data. Each block

consists of: – One line for each sequence in the
alignment.

Continued…

• Each line consists of:

– The sequence name
– white space
– up to 60 sequence symbols.
– Optional - white space followed by a cumulative count
of residues for the sequences

• Some rules about representing sequences:

– Case doesn't matter.
– Sequence symbols should be from a valid alphabet.
– Gaps are represented using hyphens ("-").

Continued…

GCG Format:

• A sequence file in GCG format contains exactly one sequence. It begins with annotation lines, and the start of the sequence is marked by a line ending with two dot ("..") characters. This line also contains the sequence identifier, the sequence length, and a checksum. This format should only be used if the file was created with the GCG package.

Continued…

An example sequence in GCG format is:

ID AB000263 standard; RNA; PRI; 368 BP.
XX
AC AB000263;
XX
DE Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.
XX
SQ Sequence 368 BP;
AB000263 Length: 368 Check: 4514 ..
  1 acaagatgcc attgtccccc ggcctcctgc tgctgctgct ctccggggcc acggccaccg
 61 ctgccctgcc cctggagggt ggccccaccg gccgagacag cgagcatatg caggaagcgg
121 caggaataag gaaaagcagc ctcctgactt tcctcgcttg gtggtttgag tggacctccc
181 aggccagtgc cgggcccctc ataggagagg aagctcggga ggtggccagg cggcaggaag
241 gcgcaccccc ccagcaatcc gcgcgccggg acagaatgcc ctgcaggaac ttcttctgga
301 agaccttctc ctcctgcaaa taaaacctca cccatgaatg ctcacgcaag tttaattaca
361 gacctgaa

Continued…
MSF:
• MSF formatted multiple sequence files are most often
created when using programs of the GCG suite.

• MSF files include the sequence name and the sequence

itself, which is usually aligned with other sequences in
the file.

• You can specify a single sequence or many sequences

within an MSF file.

– The file begins with the line (all uppercase) !!NA_MULTIPLE_ALIGNMENT 1.0 for nucleic acid sequences or !!AA_MULTIPLE_ALIGNMENT 1.0 for amino acid sequences.
– Do not edit or delete the file type line if it is present.
Frequent Words and k-mers in Text

• A k-mer is just a sequence of k characters in a string

(or nucleotides in a DNA sequence).

• We say that Pattern is a most frequent k-mer in Text if it maximizes Count(Text, Pattern) among all k-mers, where Count(Text, Pattern) is the number of times Pattern occurs in Text (overlapping occurrences included).

• For example, "ACTAT" is a most frequent 5-mer in "ACAACTATGCATCACTATCGGGAACTATCCT".

• "ATA" is a most frequent 3-mer of "CGATATATCCATAG".
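The definition above translates directly into a counting routine. This is a minimal sketch; the function names `count_kmers` and `most_frequent_kmers` are hypothetical.

```python
from collections import Counter


def count_kmers(text, k):
    """Count every k-mer of text, overlapping occurrences included."""
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))


def most_frequent_kmers(text, k):
    """Return all k-mers that reach the maximum count, in sorted order."""
    counts = count_kmers(text, k)
    if not counts:
        return []
    best = max(counts.values())
    return sorted(kmer for kmer, n in counts.items() if n == best)


print(most_frequent_kmers("ACAACTATGCATCACTATCGGGAACTATCCT", 5))  # ['ACTAT']
print(most_frequent_kmers("CGATATATCCATAG", 3))                   # ['ATA']
```

A list is returned because several k-mers can tie for the maximum count.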
Continued…
• Consider the DNA sequence “ACGAGGTACGA” which
consists of 11 nucleotides. Let’s try to obtain all the 4-
mers (substrings of length 4) in this DNA sequence.

Continued…

• The idea is simple. We create a window of length 4 and slide it from left to right, shifting one character at a time. If the length of the given DNA sequence is N, we end up with (N – k) + 1 k-mers.

• Total no. of k-mers = (N – k) + 1

• In the above example, the given DNA sequence is 11 characters long (N = 11) and k = 4, so we get (11 – 4) + 1 = 8 4-mers.
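The sliding window can be sketched as a single list comprehension; the function name `sliding_kmers` is hypothetical.

```python
def sliding_kmers(sequence, k):
    """List the k-mers of sequence in order, moving the window one step at a time."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


kmers = sliding_kmers("ACGAGGTACGA", 4)
print(kmers)       # ['ACGA', 'CGAG', 'GAGG', 'AGGT', 'GGTA', 'GTAC', 'TACG', 'ACGA']
print(len(kmers))  # 8, i.e. (11 - 4) + 1
```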

Continued…

• The total count of a k-mer is simply how many times it appears in the given sequence.
• For the distinct count, each k-mer is counted only once, regardless of how many times it appears.
• The unique k-mers are those that appear only once. In the above example, ACGA appears twice, so its unique count is zero.
• Summed over all 4-mers of the above example, this gives a total count of 8, a distinct count of 7, and a unique count of 6.
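The three counts can be computed together for the ACGAGGTACGA example. This is a sketch; the function name `kmer_statistics` is hypothetical.

```python
from collections import Counter


def kmer_statistics(sequence, k):
    """Return (total, distinct, unique) k-mer counts for sequence."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())                       # every occurrence
    distinct = len(counts)                             # each k-mer once
    unique = sum(1 for n in counts.values() if n == 1) # seen exactly once
    return total, distinct, unique


print(kmer_statistics("ACGAGGTACGA", 4))  # (8, 7, 6)
```

ACGA occurs twice, so it adds 2 to the total, 1 to the distinct count, and nothing to the unique count.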
Continued…

• What is the most frequent 5-mer in "ACAACTAAGCATCACTAACGGGAACTAACCT"?

Ans: ACTAA

Why k-mers?
• Decomposing a sequence into its k-mers allows a set of fixed-size chunks to be analysed instead of the whole sequence, which can be more efficient.

• K-mers are very useful in sequence matching: set operations on k-mers are fast and easy, and there are many readily available algorithms and techniques for working with them.

• A simple example: to check whether a sequence S comes from organism A or from organism B, assuming the genomes of A and B are known and sufficiently different, we can check whether S contains more k-mers present in A or in B.
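The organism example can be sketched with Python sets. The function names and the toy "genomes" below are hypothetical and far shorter than real data; the sketch only counts how many distinct k-mers of S occur in each genome.

```python
def kmer_set(sequence, k):
    """Return the set of distinct k-mers of sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def closer_genome(read, genome_a, genome_b, k):
    """Return 'A', 'B' or 'tie' by comparing shared k-mer counts."""
    read_kmers = kmer_set(read, k)
    shared_a = len(read_kmers & kmer_set(genome_a, k))
    shared_b = len(read_kmers & kmer_set(genome_b, k))
    if shared_a == shared_b:
        return "tie"
    return "A" if shared_a > shared_b else "B"


# Hypothetical toy genomes for illustration only.
genome_a = "ACGTACGTGACC"
genome_b = "TTGGCCAATTGG"
print(closer_genome("CGTACG", genome_a, genome_b, 3))  # A
```

Real tools use much larger k and more careful statistics; this only illustrates the set-intersection idea.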

Applications of k-mer Counting

Some of the applications where k-mer counting is

used:

• Genome assembly
• Sequence alignment
• Sequence clustering
• Error correction of sequencing reads
• Genome size estimation
• Repeat identification

Frequent k-mers with mismatches

Pattern does not need to appear exactly as a substring of Text; occurrences with a few mismatches also count.

• Ex 1:
Pattern: ATGATCAAG
Text: atcaATGATCAACgtataagcATGATCAAGgtgct

• Ex 2:
Pattern: CTTGATCAT
Text: gaaagCATGATCATggctgCTTGATCATctgtt

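Hamming Distance Problem

• The Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols differ.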

Example 1:
• Input:
GGGCCGTTGGT
GGACCGTTGAC
• Output: 3

Example 2:
• Input:
GGGCCGTTGGT
GGAGCGCTGAC
• Output: 5
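Both examples can be checked with a short function. This is a sketch; the name `hamming_distance` is chosen for illustration.

```python
def hamming_distance(p, q):
    """Count the positions at which two equal-length strings differ."""
    if len(p) != len(q):
        raise ValueError("strings must have equal length")
    return sum(1 for a, b in zip(p, q) if a != b)


print(hamming_distance("GGGCCGTTGGT", "GGACCGTTGAC"))  # 3
print(hamming_distance("GGGCCGTTGGT", "GGAGCGCTGAC"))  # 5
```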

Approximate Pattern Matching Problem
• Find all starting positions where Pattern appears as a substring of Text with at most d mismatches.
• Input
Pattern: ATTCTGGA
Text: CGCCCGAATCCAGAACGCATTCCCATATTTCGGGACCACTGGCCTCCACGGTACGGACGTCAATCAAAT
d: 3

• Output (0-based starting positions):
6 7 26 27
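A brute-force sketch of this problem compares Pattern with every window of Text. The function names are hypothetical, and positions are reported 0-based, which matches the output above.

```python
def hamming_distance(p, q):
    """Count mismatching positions between two equal-length strings."""
    return sum(1 for a, b in zip(p, q) if a != b)


def approximate_pattern_positions(pattern, text, d):
    """Return 0-based starts of windows within d mismatches of pattern."""
    k = len(pattern)
    return [i for i in range(len(text) - k + 1)
            if hamming_distance(pattern, text[i:i + k]) <= d]


text = "CGCCCGAATCCAGAACGCATTCCCATATTTCGGGACCACTGGCCTCCACGGTACGGACGTCAATCAAAT"
print(approximate_pattern_positions("ATTCTGGA", text, 3))  # [6, 7, 26, 27]
```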

Approximate Pattern Count

• Count_d(Text, Pattern) is the number of k-mer substrings of Text whose Hamming distance from Pattern is at most d (i.e., that match Pattern with at most d mismatches).

• Example:
Pattern: GAGG
Text: TTTAGAGCCTTCAGAGG
d (max mismatches): 2

• Output: 4

The four matching 4-mers of Text are TAGA, GAGC, CAGA, and GAGG.
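A minimal sketch of the count, using the same window-by-window comparison; the function names are hypothetical.

```python
def hamming_distance(p, q):
    """Count mismatching positions between two equal-length strings."""
    return sum(1 for a, b in zip(p, q) if a != b)


def approximate_pattern_count(text, pattern, d):
    """Count the k-mers of text within d mismatches of pattern."""
    k = len(pattern)
    return sum(1 for i in range(len(text) - k + 1)
               if hamming_distance(pattern, text[i:i + k]) <= d)


print(approximate_pattern_count("TTTAGAGCCTTCAGAGG", "GAGG", 2))  # 4
```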

Substring Reconstruction

• Given a string Text, its k-mer composition Composition_k(Text) is the collection of all k-mer substrings of Text (including repeated k-mers).
• Composition_3(TATGGGGTGC) = {TAT, ATG, TGG, GGG, GGG, GGT, GTG, TGC}

• Note that we list the k-mers in lexicographic order (i.e., how they would appear in a dictionary) rather than in the order of their appearance in Text, because the correct order is unknown when the k-mers come from sequencing reads:
• Composition_3(TATGGGGTGC) = {ATG, GGG, GGG, GGT, GTG, TAT, TGC, TGG}
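The composition in lexicographic order is a one-liner in Python; the function name `composition` is chosen for this sketch.

```python
def composition(text, k):
    """Return all k-mers of text, repeats included, in lexicographic order."""
    return sorted(text[i:i + k] for i in range(len(text) - k + 1))


print(composition("TATGGGGTGC", 3))
# ['ATG', 'GGG', 'GGG', 'GGT', 'GTG', 'TAT', 'TGC', 'TGG']
```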

String Reconstruction Problem

• Given 3-mer composition: AAT ATG GTT TAA TGT

• The most natural way to solve the String

Reconstruction Problem is to mimic the solution
of the Newspaper Problem and "connect" a pair
of k-mers if they overlap in k-1 symbols.

• It is easy to see that the string should start with

TAA because there is no 3-mer ending in TA.

• This implies that the next 3-mer in the string

should start with AA.

Continued…

• There is only one 3-mer satisfying this condition, AAT:
TAA
 AAT
• In turn, AAT can only be extended by ATG, which can only be extended by TGT, and so on, leading us to reconstruct TAATGTT:
TAA
 AAT
  ATG
   TGT
    GTT
TAATGTT

Continued…

• Ex 2: AAT ATG ATG ATG CAT CCA GAT GCC

GGA GGG GTT TAA TGC TGG TGT

• Solution:
• If we start with TAA again, the next 3-mers must be AAT and then ATG.

• ATG can be extended by TGC, TGG, or TGT. Let's select TGT.

Continued…

• After TGT, our only choice is GTT.

• But now we have reached a dead end, since no 3-mer starts with TT and many 3-mers remain unused. Here we need to backtrack and find an alternative solution.

Continued…

[Figure: alternative solution and final sequence]
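The extend-and-backtrack procedure described in these examples can be sketched as a depth-first search. This is an illustrative brute-force sketch with a hypothetical function name, `reconstruct`; it is not an efficient method for large inputs.

```python
from collections import Counter


def reconstruct(kmers):
    """Return one string whose k-mer composition equals kmers, or None."""
    remaining = Counter(kmers)
    total = len(kmers)

    def extend(path):
        if len(path) == total:
            return path
        suffix = path[-1][1:]  # the next k-mer must overlap in k-1 symbols
        for kmer in sorted(remaining):
            if remaining[kmer] > 0 and kmer.startswith(suffix):
                remaining[kmer] -= 1
                result = extend(path + [kmer])
                if result is not None:
                    return result
                remaining[kmer] += 1  # dead end: backtrack
        return None

    for start in sorted(set(kmers)):
        remaining[start] -= 1
        path = extend([start])
        if path is not None:
            return path[0] + "".join(kmer[-1] for kmer in path[1:])
        remaining[start] += 1
    return None


print(reconstruct(["AAT", "ATG", "GTT", "TAA", "TGT"]))  # TAATGTT
```

For the 15 3-mers of Ex 2, more than one string can share the same composition, so the sketch simply returns the first one it finds.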

Substitution Matrix

• In bioinformatics, a substitution matrix describes

the frequency at which a character in a nucleotide
sequence or a protein sequence changes to other
character states over evolutionary time.

• The information is often in the form of log odds of

finding two specific character states aligned and
depends on the assumed number of evolutionary
changes or sequence dissimilarity between
compared sequences.
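The slides do not give a formula for the log odds. One common definition, assumed here, is the logarithm of the observed frequency of an aligned pair divided by the frequency expected if the two characters were paired by chance. The function name, parameter names, and example frequencies below are hypothetical.

```python
import math


def log_odds_score(p_ab, q_a, q_b):
    """log2 of observed pair frequency over the frequency expected by chance."""
    return math.log2(p_ab / (q_a * q_b))


# Made-up frequencies: a pair seen twice as often as chance scores +1,
# a pair seen half as often as chance scores -1.
print(log_odds_score(0.125, 0.25, 0.25))    # 1.0
print(log_odds_score(0.03125, 0.25, 0.25))  # -1.0
```

Published matrices usually scale and round such values to integers.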

Continued…

• Substitution matrices are usually seen in the

context of amino acid or DNA sequence
alignments, where they are used to calculate
similarity scores between the aligned
sequences.

• A substitution matrix is a collection of scores for

aligning nucleotides or amino acids with one
another. These scores generally represent the
relative ease with which one nucleotide or
amino acid may mutate into or substitute for
another, and they are used to measure
similarity in sequence alignments.
Example

In the process of evolution, from one generation to the next the

amino acid sequences of an organism's proteins are gradually
altered through the action of DNA mutations. For example, the
sequence

ALEIRYLRD

could mutate into the sequence

ALEINYLRD

in one step, and possibly

AQEINYQRD

over a longer period of evolutionary time.

Continued…

• Each amino acid is more or less likely to mutate

into various other amino acids. For instance, a
hydrophilic residue such as arginine is more
likely to be replaced by another hydrophilic
residue such as glutamine, than it is to be
mutated into a hydrophobic residue such as
leucine.

• This is primarily due to redundancy in

the genetic code, which translates similar
codons into similar amino acids.

Gap Penalties

• A Gap penalty is a method of scoring alignments of

two or more sequences.

• When aligning sequences, introducing gaps in the

sequences can allow an alignment algorithm to
match more terms than a gap-less alignment can.

• However, minimizing gaps in an alignment is

important to create a useful alignment. Too many
gaps can cause an alignment to become
meaningless.

• Gap penalties are used to adjust alignment scores

based on the number and length of gaps.
Sequence Alignment

• Sequence alignment is the procedure of

comparing two (pair-wise alignment) or more
(multiple sequence alignment) sequences by
searching for a series of individual characters or
character patterns that are in the same order in
the sequences.
• Two sequences are aligned by writing them
across a page in two rows. Identical or similar
characters are placed in the same column, and
non-identical characters can either be placed in
the same column as a mis-match or opposite a
gap in the other sequence.

• In an optimal alignment, nonidentical characters
and gaps are placed to bring as many identical
or similar characters as possible into vertical
register.

• Sequences that can be readily aligned in this

manner are said to be similar.

• There are two types of sequence alignment,

global and local.

• In global alignment, an attempt is made to
align the entire sequence, using as many
characters as possible, up to both ends of each
sequence. Sequences that are quite similar and
approximately the same length are suitable
candidates for global alignment.
• In local alignment, stretches of sequence with
the highest density of matches are aligned, thus
generating one or more islands of matches or
subalignments in the aligned sequences. Local
alignments are more suitable for aligning
sequences that are similar along some of their
lengths but dissimilar in others

Sequence alignment Ex.

• Problem: Align abcdef with the somewhat similar abdgf.

• Solution: Write the second sequence below the first one:
abcdef
abdgf

• Move the sequences to give the maximum match between them.

• Show characters that match using a vertical bar.

abcdef
||
abdgf
• To maximise the alignment, we insert a gap between b and d in the lower sequence to allow d and f to align.
abcdef
|| | |
ab-dgf
• Note that c is aligned with the gap, and e and g don't match.

• We are looking for an alignment, which:
– Maximizes the number of base-to-base
matches

– If necessary to achieve this goal, inserts gaps

in either sequence (a gap means a base-to-
nothing match)

– The order of bases in each sequence must

remain preserved and

– Gap-to-Gap matches are not allowed.

• We need some scheme to evaluate the
goodness of alignment.

• The scoring scheme consists of character

substitution scores (i.e. score for each possible
character replacement) plus penalties for gaps.

• The alignment score is the sum of substitution

scores and gap penalties. The alignment score
reflects goodness of alignment.
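Given such a scoring scheme, the best global alignment score can be found by dynamic programming. The sketch below uses the Needleman-Wunsch recurrence, which the slides do not name; the function name and the default scores (match +1, mismatch -1, gap -1) are assumptions chosen for illustration.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of a and b (Needleman-Wunsch sketch)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # prefix of a aligned to gaps only
    for j in range(1, cols):
        score[0][j] = j * gap          # prefix of b aligned to gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    return score[-1][-1]


print(global_alignment_score("abcdef", "abdgf"))  # 2
```

With these scores, abcdef and abdgf reach 2: four matches (a, b, d, f), one gap opposite c, and one mismatch (e with g).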

• For DNA we can construct the following substitution matrix: '+1' as the reward for a match and '-1' as the penalty for a mismatch, ignoring gaps:

     C    T    A    G
C   +1   -1   -1   -1
T   -1   +1   -1   -1
A   -1   -1   +1   -1
G   -1   -1   -1   +1

• Using this scoring scheme, let us evaluate the following alignments (the penalty for a gap is 0):

A T G G C G is the query sequence

Ex. 1. A T G - A G    The score: +1 +1 +1 +0 -1 +1 = 3

Ex. 2. A - T G A G    The score: +1 +0 -1 +1 -1 +1 = 1
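The two scores can be reproduced by walking down the alignment columns. This is a sketch with a hypothetical function name; it assumes gaps are written as the ASCII hyphen '-' and that both rows have the same length.

```python
def alignment_score(top, bottom, match=1, mismatch=-1, gap=0):
    """Sum column scores of an alignment given as two equal-length rows."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += gap        # base-to-nothing column
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


print(alignment_score("ATGGCG", "ATG-AG"))  # 3
print(alignment_score("ATGGCG", "A-TGAG"))  # 1
```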

Find the score for following alignments

• ATCCGATTCGA
• GATCGTATCCA

• CCTCTCACTGAGACTCAGCC
• CGA-ACTCTT-GATC-TGCA

• CGTAG-ATAG-TGCTAG-AGAAT-GGG-CCACT
• GTAGCTGATC-ATCGATCGTACGTAGC-GCTGA

PAM
• One of the first amino acid substitution matrices,
the PAM (Point Accepted Mutation) matrix was
developed by Margaret Dayhoff in the 1970s.
• This matrix is calculated by observing the
differences in closely related proteins.
• Because very closely related homologs are used, the observed mutations are not expected to significantly change the common functions of the proteins.
• Thus the observed substitutions (by point
mutations) are considered to be accepted by
natural selection.

BLOSUM

• Dayhoff's methodology of comparing closely
related species turned out not to work very well
for aligning evolutionarily divergent sequences.
• Sequence changes over long evolutionary time
scales are not well approximated by
compounding small changes that occur over
short time scales.
• The BLOSUM (BLOck SUbstitution Matrix) series
of matrices rectifies this problem.
• Henikoff & Henikoff constructed these matrices
using multiple alignments of evolutionarily
divergent proteins.
• Seq 1: MARSIFLT
• Seq 2: MADQLTEE

Types of Gap penalties
1. Constant gap penalty
This is the simplest type of gap penalty: a fixed negative score is given to every gap, regardless of its length. This encourages the algorithm to make fewer, larger gaps, leaving larger contiguous sections.

Example: when aligning two short DNA sequences (with '-' depicting a gap of one base pair), if each match is worth 1 point and the whole gap -1, then 7 matches and one gap give a total score of 7 − 1 = 6.

2. Linear gap penalty
Compared to the constant gap penalty, the linear gap penalty takes into account the length (L) of each insertion/deletion in the gap. Therefore, if the penalty for each inserted/deleted element is B and the length of the gap is L, the total gap penalty is the product of the two, BL. This method favors shorter gaps, with the total score decreasing with each additional gap position.

Unlike the constant gap penalty, the size of the gap is considered. With a score of 1 for each match and -1 for each gap position, 7 matches and a gap of length 3 give a score of 7 − 3 = 4.

3. Affine gap penalty
In biological sequences, one large gap of length 10 is more likely to occur than 10 small gaps of length 1.
Therefore, affine gap penalties favour longer gaps over several single gaps of the same total length.

They use a gap opening penalty, o < 0, and a gap extension penalty, e < 0, such that |e| < |o|, to encourage gap extension rather than gap introduction.
A gap of length L is then given a penalty
g = o + (L−1)e.
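The three gap penalty types can be compared side by side. This is a sketch; the function names and default values (a = -1, B = -1, o = -5, e = -1) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def constant_gap_penalty(length, a=-1):
    """Fixed penalty a for any gap, regardless of its length."""
    return a if length > 0 else 0


def linear_gap_penalty(length, b=-1):
    """Penalty B for every gap position: B * L."""
    return b * length


def affine_gap_penalty(length, o=-5, e=-1):
    """Opening penalty o plus extension penalty e: o + (L - 1) * e."""
    return o + (length - 1) * e if length > 0 else 0


print(7 + constant_gap_penalty(1))  # 6, as in the constant example
print(7 + linear_gap_penalty(3))    # 4, as in the linear example
print(affine_gap_penalty(10))       # -14: one gap of length 10
print(10 * affine_gap_penalty(1))   # -50: ten gaps of length 1
```

The last two lines show why the affine scheme favours one long gap over many single gaps of the same total length.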