Selected Topic in Cs 1

Uploaded by

malkmoh781.mm

Selected Topic in CS 1

Computational Biology

1.1 What Is Bioinformatics?
Bioinformatics is concerned with extracting the information embedded within human DNA.
1.2 Applications of Bioinformatics
Some of the common applications of bioinformatics are listed below:
1- Protein Folding. A protein can function only after it has acquired its 3D structure.
Predicting the structure of a protein computationally from its sequence (a sequence of
amino acids) remains an unsolved problem.
2- Alignment and Homology (Homology/Similarity)
3- Information Retrieval and Data Mining from Biological Databases
4- Analysis of Biological Sequences [1] and Pattern Discovery: extracting all features →
learning algorithm → identifying/understanding/classifying some features.
1.3 DNA: Deoxyribonucleic Acid
A DNA molecule consists of two complementary strands (chains) that are twisted together. It is
built from four nucleotide bases belonging to two classes: the purines adenine (A) and guanine (G),
and the pyrimidines cytosine (C) and thymine (T). C always pairs with G, while A always pairs
with T.
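The pairing rule can be illustrated with a few lines of Python. This is a minimal sketch, not part of the original notes; the function names complement and reverse_complement are chosen here for illustration only.

```python
# Base-pairing rule from the text: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(COMPLEMENT[base] for base in strand.upper())


def reverse_complement(strand: str) -> str:
    """Return the complementary strand read in the opposite direction."""
    return complement(strand)[::-1]


print(complement("ATGC"))          # TACG
print(reverse_complement("ATGC"))  # GCAT
```

The reverse complement is included because the two strands of a DNA molecule are conventionally read in opposite directions.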
1.4. Biological Databases
There are many databases related to the biological sciences.
1.4.1 Nucleotide Database (GENBANK)
GenBank is an annotated genetic sequence database. A sample entry/record of the GenBank
database is shown in the following figure.
Let us discuss the fields of a GenBank entry in the following sections.

[1] Biological sequences = sequences of DNA letters

1.5.1.1 Header or Locus Line
The header (LOCUS line) of a GenBank sequence record specifies:
a- Locus name/accession number (e.g., “AY279110”): a unique locus name for each
record/DNA sequence.
b- Length of the DNA sequence (e.g., “2704 bp”): the sequence length ranges from 1 to 350,000
base pairs (bp) in a single record.
c- Date (e.g., “09-MAR-2005”): when the sequence record was made public or last updated.

1.5.1.2 Definition, Accession, Version, and Keywords


a- Definition line (or “def line”): it gives the full genus and species names (e.g., the genus is
“Alouatta” and the species is “belzebul”), followed by the product name (e.g., “beta globin”).

b- Accession number (e.g., “AY279110”): it is the primary key of a record in the database.
An accession number has one of two formats: the older “1+5” format, a single letter followed
by five digits, or the newer “2+6” format, two letters followed by six digits.

c- VERSION line: it has the form accession.version and specifies the version of the sequence. It
also contains a GI (GenInfo identifier) number. When the sequence data changes, the version
number is incremented by one and a new GI number is assigned, so two versions of the same
sequence have different GI numbers.
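The two accession formats and the accession.version form can be checked mechanically. The sketch below is illustrative only: the helper names are hypothetical, the version number in “AY279110.1” is an assumed example value, and only the two formats described above are recognized.

```python
import re


def accession_format(accession: str):
    """Classify an accession number as '1+5', '2+6', or None if neither."""
    if re.fullmatch(r"[A-Z]\d{5}", accession):
        return "1+5"  # one letter followed by five digits
    if re.fullmatch(r"[A-Z]{2}\d{6}", accession):
        return "2+6"  # two letters followed by six digits
    return None


def split_version(text: str):
    """Split 'accession.version' into (accession, version number)."""
    accession, _, version = text.partition(".")
    if accession_format(accession) is None or not version.isdigit():
        raise ValueError(f"not an accession.version string: {text!r}")
    return accession, int(version)


print(accession_format("AY279110"))  # 2+6
print(accession_format("M10051"))    # 1+5
print(split_version("AY279110.1"))   # ('AY279110', 1)
```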

1.5.1.3. Sequence Origin and Taxonomy Lines


a- SOURCE line: it contains the scientific or common name of the organism that is the source
of the sequenced DNA (i.e., the genus name and species name followed by the common
name).

1.5.1.4. Sequence References


Every GenBank record must have at least one associated reference (i.e., a scientific paper that
cites this sequence because it was used in a research study).

1.5.1.5. Sequence Features Table
a- CDS feature: it gives the locations in the nucleotide sequence that are joined and translated
to make the protein sequence. The series of locations specifying the exons is given by the
annotation “join”. In the example below, three segments (exons) of a nucleotide sequence are
joined together (from 952 to 1043, from 1174 to 1396, and from 2228 to 2356) to yield a
product (protein) called “beta globin”, whose amino acid sequence is given by the
annotation “/translation”.
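The effect of the “join” annotation can be sketched in Python as follows. The record sequence below is a placeholder of the right length (2704 bp), not the real AY279110 data, and join_segments is a hypothetical helper name. GenBank positions are 1-based and inclusive, which is the only subtle point.

```python
def join_segments(sequence: str, segments) -> str:
    """Concatenate 1-based, inclusive (start, end) segments of a sequence."""
    return "".join(sequence[start - 1:end] for start, end in segments)


# Exon locations quoted in the text: join(952..1043, 1174..1396, 2228..2356)
EXONS = [(952, 1043), (1174, 1396), (2228, 2356)]

record = "ACGT" * 676          # placeholder sequence of 2704 bases
coding = join_segments(record, EXONS)

print(len(coding))             # 92 + 223 + 129 = 444
print(len(coding) % 3)         # 0, i.e. a whole number of codons
```

The joined length of 444 bases is a multiple of three, as expected for a coding sequence that is translated codon by codon.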

1.5.2. Protein Sequence Databases (Swiss-Prot)


Proteins are chains of variable length built from 20 different amino acids. The two most common
protein databases are PIR and SWISS-PROT. These databases may be considered secondary
sequence databases because the majority of their sequences are derived from the translation of
nucleotide sequences in GenBank.
Swiss-Prot is an annotated protein sequence database. An annotated image of the parts of a
sequence entry is shown in the following figure.

1.5.2.1. Entry name (sequence identifier)
The Swiss-Prot entry name (i.e., ID) consists of up to 11 uppercase alphanumeric characters (i.e.,
A to Z and 0 to 9) of the form X_Y. X is a protein name code of at most 5 alphanumeric
characters representing the protein name (e.g., B2MG for Beta-2-microglobulin, HBA for
Hemoglobin alpha, and INS for Insulin). Y is a species identification code of at most 5
alphanumeric characters. Y is generally made of the first three letters of the genus and the first
two letters of the species. For example, the code ALLMI in the entry name HBB_ALLMI denotes
that the source of the Hemoglobin beta chain sequence (HBB) is Alligator mississippiensis, where
Alligator is the genus and mississippiensis is the species.
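The naming rule can be sketched as below. This is only an illustration of the general "three letters plus two letters" rule stated above; the function names are hypothetical, and real species codes may deviate from the rule.

```python
def species_code(genus: str, species: str) -> str:
    """First three letters of the genus plus first two of the species."""
    return (genus[:3] + species[:2]).upper()


def entry_name(protein_code: str, genus: str, species: str) -> str:
    """Build an X_Y entry name and check the length limits from the text."""
    x = protein_code.upper()
    y = species_code(genus, species)
    if len(x) > 5 or len(y) > 5:
        raise ValueError("each part may have at most 5 characters")
    return f"{x}_{y}"


print(species_code("Alligator", "mississippiensis"))       # ALLMI
print(entry_name("HBB", "Alligator", "mississippiensis"))  # HBB_ALLMI
```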
1.5.2.2. Date
The first DT line indicates when the entry first appeared in the database. The second DT line
indicates when the sequence data was last modified. The third DT line indicates when data other
than the sequence was last modified.

1.5.2.3. Description:
The DE (DEscription) line always starts with the proposed official name of the protein. Synonyms
are indicated between brackets.

1.5.2.4. Taxonomy Information:


The GN (Gene Name) line indicates the name(s) of the gene(s) that code for the stored protein
sequence. The OS (Organism Species) line specifies the organism that was the source of the stored
sequence (e.g., genus “Mus” and species “musculus”). The OC (Organism Classification) lines
contain the complete taxonomic classification of the source organism.

1.5.2.5. Reference(s):
The references section comprises the following lines: RN (Reference Number); RX (Reference
cross-reference), which gives the identifier assigned to the reference in a bibliographic database;
RA (Reference Author); RT (Reference Title), which gives the title of the paper; and RL
(Reference Location).

1.5.2.6. Comments:
The CC lines are free text comments on the entry. These lines are grouped together in comment
blocks; a block is made up of 1 or more comment lines.
1.5.2.7. Database References:
The DR (Database cross-Reference) lines are used as pointers to entries found in other data
collections (databases other than Swiss-Prot). These entries provide related information.
1.5.2.8. Feature Table Data:
The FT (Feature Table) lines describe regions of the sequence. According to the following figure,
the complete sequence “CHAIN” has length 251 (from 1 to 251), followed by a description.
In addition, the region called a zinc finger, or “ZN_FING”, inside that “CHAIN” runs from
position 126 to position 150 and is followed by a description specifying its type.

1.5.2.9. Sequence:
The SQ (SeQuence header) line marks the beginning of the sequence data (the sequence begins
directly on the next line) and gives a quick summary of its content. The format of the SQ line is:

The SQ line contains the length of the sequence in amino acids (AA) and a 64-bit Cyclic
Redundancy Check (CRC64) value, a 16-hex-digit checksum computed over the full sequence.
This number can be used to check whether the sequence has been changed.
1.5.3. Biological Patterns Databases (PROSITE)
PROSITE is a database of patterns (regular expressions) found in protein sequences.

According to the regular expression for S-100 proteins, [LIVMFYW](2) denotes the occurrence of
two letters from the set {L,I,V,M,F,Y,W}, while [LK] denotes the occurrence of exactly one letter
from the set {L,K}. The residue x denotes a wild card (any letter). Thus, a motif such as
“LL-LD-K-D-LDL-D-LDL-N-F-D-E-F-DN-L-L” will match the above regular expression.
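The notation maps directly onto ordinary regular expressions. The sketch below converts a PROSITE-style pattern into a Python regular expression; it handles only the elements described above plus {..} exclusions and (n) or (n,m) repeats, and the example pattern is a short illustrative fragment rather than the full S-100 pattern. The function name prosite_to_regex is hypothetical.

```python
import re

# One pattern element: x, a single residue, [set] or {excluded set},
# optionally followed by a repeat count (n) or a range (n,m).
ELEMENT = re.compile(
    r"^(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?$"
)


def prosite_to_regex(pattern: str) -> str:
    """Translate a dash-separated PROSITE-style pattern into a regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        match = ELEMENT.match(element)
        if match is None:
            raise ValueError(f"unsupported element: {element!r}")
        residue, low, high = match.groups()
        if residue == "x":
            residue = "."                        # wild card
        elif residue.startswith("{"):
            residue = "[^" + residue[1:-1] + "]"  # any residue except these
        if low and high:
            residue += "{%s,%s}" % (low, high)
        elif low:
            residue += "{%s}" % low
        parts.append(residue)
    return "".join(parts)


regex = prosite_to_regex("[LIVMFYW](2)-x(2)-[LK]-D")
print(regex)                                   # [LIVMFYW]{2}.{2}[LK]D
print(bool(re.search(regex, "AALLGGKDAA")))    # True
```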
1.6. Gene Ontology Database
The Gene Ontology (GO) database aims to provide a set of terms (words) for describing the
biological domain, where terms are linked together using only two relationships, IS-A and
PART-OF. The main purpose of any ontological database is to provide a unified layer for
generalization. The ontologies are structured as directed acyclic graphs in which a child (a more
specialized term) can have many parent (more general) terms. Each entry in GO has a unique
numerical identifier of the form GO:nnnnnnn, and an associated term.
1.6.1. MATLAB Interface to GO
The latest version of the Gene Ontology database can be downloaded from the web into
MATLAB using the function geneont with the live parameter set to true. Information about the
ontology or its terms is returned by the MATLAB function get.
1.6.2. Searching
Ontology objects may be searched using the regular expression function regexpi. For example,
the gene ontology object “go” (which holds all terms) may be searched for the terms whose names
contain the word ‘ribosome’. Each term has a set of properties/attributes (id, name, etc.). The
regexpi function matches the regular expression ‘ribosome’ against the name property of every
term. The result is an array, comparison, equal in size to the array Terms. Each element of
comparison is either empty (if the name property did not match) or the location of the match (if it
did). The indices of the non-empty (~, i.e., not empty) cells of comparison are stored in the array
indices.

1.6.3. Ancestors, Descendants and Relatives
Create a sub-ontology from the ancestors of a specific term (the term whose id property is 46680).
These ancestors are collected upward until the root is reached (the maximum height is usually 5).

Create a sub-ontology from the descendants of the specific term. These descendants are collected
downward until a specified depth (the maximum depth is usually 5).

Create a sub-ontology from both the ancestors and the descendants of the specific term. The
ancestors are collected upward until a specified height (e.g., 3) and the descendants are collected
downward until a specified depth (e.g., 2).
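The three queries amount to bounded walks over a directed acyclic graph. The following Python sketch is not the MATLAB toolbox code; it uses a tiny made-up ontology (the ids GO:1 to GO:6 are hypothetical) to show how ancestors, descendants, and relatives can be collected with height and depth limits.

```python
from collections import deque

# Hypothetical toy ontology: each term maps to its parent terms.
PARENTS = {
    "GO:1": [],
    "GO:2": ["GO:1"],
    "GO:3": ["GO:1"],
    "GO:4": ["GO:2"],
    "GO:5": ["GO:3", "GO:4"],   # a child may have several parents
    "GO:6": ["GO:5"],
}


def invert(parents):
    """Build the child lists from the parent lists."""
    children = {term: [] for term in parents}
    for child, its_parents in parents.items():
        for parent in its_parents:
            children[parent].append(child)
    return children


CHILDREN = invert(PARENTS)


def reach(start, edges, limit=None):
    """Breadth-first walk along edges, at most `limit` levels from start."""
    found = set()
    queue = deque([(start, 0)])
    while queue:
        term, level = queue.popleft()
        if limit is not None and level == limit:
            continue
        for neighbour in edges[term]:
            if neighbour not in found:
                found.add(neighbour)
                queue.append((neighbour, level + 1))
    return found


def ancestors(term, height=None):
    return reach(term, PARENTS, height)


def descendants(term, depth=None):
    return reach(term, CHILDREN, depth)


def relatives(term, height, depth):
    return ancestors(term, height) | descendants(term, depth) | {term}


print(sorted(ancestors("GO:5")))             # ['GO:1', 'GO:2', 'GO:3', 'GO:4']
print(sorted(ancestors("GO:5", height=1)))   # ['GO:3', 'GO:4']
print(sorted(relatives("GO:5", 1, 1)))       # ['GO:3', 'GO:4', 'GO:5', 'GO:6']
```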

1.6.4. Matrix of Relationships
The MATLAB function getmatrix converts a GO object into a matrix of relationship values
between nodes. A value of 0 in this matrix indicates no relationship, a value of 1 indicates an “is_a”
relationship, and a value of 2 indicates a “part_of” relationship. For example, the relation
“(1, 5) = 2” means that the node at index 1 of the returned id list is part_of the node at index 5.
In other words, the node whose name property is ‘mitochondrial envelope’ is part of the node
‘mitochondrial membrane’. Likewise, the relation “(4, 5) = 1” means that the node whose id is
stored at the fourth index of the returned id list is_a (is an instance of) the node whose id is stored
at the fifth index of the returned id list.

Chapter 2
Sequence Homology

The quality of retrieved information is often measured by two parameters: precision and
recall. Precision measures the number of relevant records in the result set as a fraction of all the
records retrieved. Recall measures the number of relevant records that were retrieved as a fraction
of all the relevant records that the database contains. For example, if there are T records in the
database that are relevant to the user query, and the result set contains S records out of which R are
relevant, precision and recall are defined by the following equations:
Recall = R/T (1)
Precision = R/S (2)
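Equations (1) and (2) can be checked on a small made-up example. In the sketch below the record identifiers are hypothetical; S is the size of the result set, T the number of relevant records, and R the size of their intersection.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a result set and the relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)                       # R
    precision = hits / len(retrieved) if retrieved else 0.0  # R / S
    recall = hits / len(relevant) if relevant else 0.0       # R / T
    return precision, recall


# S = 4 retrieved records, T = 5 relevant records, R = 2 in common.
print(precision_recall(["a", "b", "c", "d"], ["a", "b", "x", "y", "z"]))
# (0.5, 0.4)
```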
2.2. Dot Plots
A dot plot is a simple visual representation of the similarity between two nucleic acid or
protein sequences. The technique uses a two-dimensional matrix in which one sequence runs
along the vertical axis and the other along the horizontal axis. Whenever the residues of the two
sequences match at a position of the matrix, a black dot is placed there. Identical sequences
produce a diagonal line through the center of the matrix.

To enhance the selectivity of the matches reported by dot plots, and to reduce the noise, a
window (tuple size) and a threshold are used. The following code illustrates the use of the MATLAB
function seqdotplot for comparing two sequences, seq1 and seq2. The function is called with a
window (tuple) size of 6 and requires at least 5 matching characters within the window before a dot
is placed.

Both sequences are split into overlapping blocks of 6 nucleotides; seq1, with 50 nucleotides,
yields 45 blocks (Seq1Length “50” – windowSize “6” + “1” = 45): ACCTGA, CCTGAC, …TGCCT, and
TGCCTT. The same is done for seq2. Then each block of seq1 is compared with every block of seq2.
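The windowed comparison described above can be sketched in plain Python. This is an illustration of the idea, not the implementation of seqdotplot; the function name dot_plot and its parameter names are chosen here.

```python
def dot_plot(seq1: str, seq2: str, window: int = 6, min_matches: int = 5):
    """Return the set of (i, j) block positions that deserve a dot.

    Block i of seq1 and block j of seq2 (both of length `window`) get a
    dot when at least `min_matches` of their aligned characters agree.
    """
    dots = set()
    for i in range(len(seq1) - window + 1):
        block1 = seq1[i:i + window]
        for j in range(len(seq2) - window + 1):
            block2 = seq2[j:j + window]
            matches = sum(a == b for a, b in zip(block1, block2))
            if matches >= min_matches:
                dots.add((i, j))
    return dots


sequence = "ACGTTGCAAGCT"
dots = dot_plot(sequence, sequence)
# A sequence compared with itself always shows the main diagonal.
print(all((i, i) in dots for i in range(len(sequence) - 6 + 1)))  # True
print(dot_plot("ACGTAC", "ACGTAA"))                               # {(0, 0)}
```

Raising min_matches towards the window size removes more noise at the cost of missing weaker similarities.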
2.3. Sequence Alignment
Sequence alignment essentially involves placing one sequence above the other and
comparing the aligned vertical pairs. The format for representing the alignment between two DNA
sequences is shown below. A matching pair in the two sequences is shown as a (—). The (:) is
often used to mark an undefined base (i.e., not A, C, T, or G). A mismatch occurs between the
bases G and T at positions 16 and 26 of the upper and lower sequences, respectively. This
mismatch is represented by a blank.

2.3.1. Edit Distance
Naturally, there are a number of ways to edit the source string and transform it into the target
string. The sets of operations required for two such transformations of the string "PASTRY" into
the string "FACTORY" are listed below, using D to denote a deletion, I to denote an insertion, and
R to denote a replacement. The character M denotes a character match between the first and the
second string, which requires no edit.

Under the unit cost model, where every edit operation has a cost of 1, the first set of edit
operations has a cost of 3. The second set requires two delete operations and three insert
operations, for a total cost of 5. The transformation with the minimum distance is preferred.
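An edit transcript written with the letters M, R, I, and D can be replayed and costed mechanically. The sketch below is illustrative: the transcripts "RMRMIMM" (cost 3) and "DIMDIMIMM" (two deletions, three insertions, cost 5) are assumed examples consistent with the costs quoted above, since the original listing is not reproduced here.

```python
def apply_transcript(source: str, target: str, ops: str):
    """Replay an edit transcript and return (result string, unit cost)."""
    i = j = cost = 0
    result = []
    for op in ops:
        if op == "M":                     # match: characters must agree
            if source[i] != target[j]:
                raise ValueError("M used on differing characters")
            result.append(source[i])
            i, j = i + 1, j + 1
        elif op == "R":                   # replace source char by target char
            result.append(target[j])
            i, j, cost = i + 1, j + 1, cost + 1
        elif op == "I":                   # insert a target char
            result.append(target[j])
            j, cost = j + 1, cost + 1
        elif op == "D":                   # delete a source char
            i, cost = i + 1, cost + 1
        else:
            raise ValueError(f"unknown operation {op!r}")
    if i != len(source) or j != len(target):
        raise ValueError("transcript does not consume both strings")
    return "".join(result), cost


print(apply_transcript("PASTRY", "FACTORY", "RMRMIMM"))    # ('FACTORY', 3)
print(apply_transcript("PASTRY", "FACTORY", "DIMDIMIMM"))  # ('FACTORY', 5)
```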
2.4. Dynamic Programming Algorithm
Evaluating all possible ways of aligning one sequence against another is not feasible,
because the number of possible alignments grows exponentially with the lengths of the two
sequences.
Dynamic programming is an efficient technique for problems of this kind, where brute-force
enumeration would take exponential time. Combined with an optimization criterion (a
distance-based or a similarity-based scoring measure), it finds an alignment that achieves the
minimum distance (or the maximum similarity) without explicitly trying all possible alignments of
the two sequences.
A distance function defines the distance between matching characters as zero, assigns
positive values to mismatches and gaps, and then aims at minimizing the total distance. A
similarity function, on the other hand, assigns a high (positive) value to matches and low
(negative) values to gaps and mismatches, and then maximizes the resulting score. In computing
the scores, each cell takes constant time to compute, so the overall algorithm has time
complexity O(mn), where m and n are the lengths of the two sequences.
2.4.1. Distance-Based Alignment (Needleman-Wunsch global alignment)
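Let us assume the following unit cost scoring model, where the distance between matching
characters is defined to be “0” and the distance between mismatching characters, or between a
character and a gap (indel, “-”), is defined to be “1”.

$d(a, b) = \begin{cases} 0 & \text{if } a = b \\ 1 & \text{if } a \neq b \text{ or either symbol is a gap} \end{cases}$ (3)

$D_{i,j} = \min\{\, D_{i-1,j-1} + d(a_i, b_j),\; D_{i-1,j} + 1,\; D_{i,j-1} + 1 \,\}$ (4)

Each cell D(i,j) is computed from its three neighbors:
D(i-1,j-1): diagonal (match/mismatch)      D(i-1,j): insertion or deletion
D(i,j-1): insertion or deletion            D(i,j)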
The result of building the grid is shown in the following figure (alignment of the sequence
‘ACCG’ to the sequence ‘TCCTG’). The construction of the optimal global alignment begins
at the bottom-right corner, i.e., cell (m,n), and traces back to cell (0,0).

Total score = d(A,T) + d(C,C) + d(C,C) + d(-,T) + d(G,G) = 1 + 0 + 0 + 1 + 0 = 2 operations

Total score = d(A,T) + 2d(C,C) + d(-,T) + d(G,G) = 1 + 0 + 1 + 0 = 2
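The grid construction and traceback can be sketched in Python under the unit cost model. This is a minimal illustration, not MATLAB's nwalign; the function name global_alignment and the tie-breaking order (diagonal first, then up, then left) are choices made here, so other equally good alignments may exist.

```python
def global_alignment(s: str, t: str, gap: int = 1, mismatch: int = 1):
    """Distance-based global alignment; returns (distance, top, bottom)."""
    m, n = len(s), len(t)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i * gap                 # s aligned against gaps only
    for j in range(1, n + 1):
        D[0][j] = j * gap                 # t aligned against gaps only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if s[i - 1] == t[j - 1] else mismatch
            D[i][j] = min(D[i - 1][j - 1] + sub,   # diagonal
                          D[i - 1][j] + gap,       # gap in t
                          D[i][j - 1] + gap)       # gap in s

    # Trace back from cell (m, n) to cell (0, 0).
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        sub = 0 if i > 0 and j > 0 and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + sub:
            top.append(s[i - 1]); bottom.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + gap:
            top.append(s[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1]); j -= 1
    return D[m][n], "".join(reversed(top)), "".join(reversed(bottom))


print(global_alignment("ACCG", "TCCTG"))   # (2, 'ACC-G', 'TCCTG')
```

The result reproduces the total of 2 computed above: one mismatch (A against T) and one gap (against T).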

In MATLAB, the nwalign function computes a Needleman-Wunsch alignment, with the
scoring matrix specified by the parameter ‘scoringmatrix’. The NUC44 scoring matrix is the
default value of this parameter for nucleotide sequences, which are composed of the ‘NT’
alphabet (A, T, G, C). BLOSUM50, on the other hand, is the default value for amino acid
sequences, which are composed of the ‘AA’ alphabet (I, G, R, etc.).

According to NUC44, gap = -8, match = 5, and mismatch = -4. So the total score =
3d(A,T) + 2d(C,C) + 2d(G,G) + 3d(T,T) + d(C,T) + d(G,T) = 3×(-4) + 2×5 + 2×5 + 3×5 + (-4) + (-4)
= 35 - 20 = 15.
2.4.2 Similarity-Based Alignment (Smith-Waterman algorithm)
The example shown in the following figure performs an optimization based on
maximization of similarity between the two sequences using the following unit scoring model.

Alignment Table (S)

Similar to distance-based alignments, similarity-based alignments can be computed
in MATLAB by using the swalign (Smith-Waterman alignment) function.

To get rid of negative values, the Smith-Waterman scoring rule replaces any negative value by
zero during the construction of the alignment table. To obtain the path, start from the cell with the
highest score and backtrack, following the maximum values, until a zero is reached (with 1 for a
match and -1 for a mismatch or a gap).
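The zero-floor rule and the traceback from the best cell can be sketched as follows. This is a plain Python illustration with match = 1 and mismatch = gap = -1, not MATLAB's swalign; the example strings are made up and the function name local_alignment is chosen here.

```python
def local_alignment(s: str, t: str, match=1, mismatch=-1, gap=-1):
    """Similarity-based local alignment; returns (score, top, bottom)."""
    m, n = len(s), len(t)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_cell = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            score = match if s[i - 1] == t[j - 1] else mismatch
            H[i][j] = max(0,                        # negative values become 0
                          H[i - 1][j - 1] + score,  # diagonal
                          H[i - 1][j] + gap,        # gap in t
                          H[i][j - 1] + gap)        # gap in s
            if H[i][j] > best:
                best, best_cell = H[i][j], (i, j)

    # Trace back from the best-scoring cell until a zero is reached.
    top, bottom = [], []
    i, j = best_cell
    while i > 0 and j > 0 and H[i][j] > 0:
        score = match if s[i - 1] == t[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + score:
            top.append(s[i - 1]); bottom.append(t[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(s[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(local_alignment("TTACGG", "AACGT"))   # (3, 'ACG', 'ACG')
```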

2.5. Alignment Types
There are three types of alignment: global, local, and fit (semiglobal). In global alignment,
the comparison covers the entire length of two sequences that are approximately equal in
length or belong to the same domain (e.g., dogs and wolves), so homology between them is
expected and the segments that differ can be detected.
Local alignment aims to find the maximally similar sub-sequences of the two sequences.
Fit/semiglobal alignment, used for pattern detection, matches a sub-sequence of one
sequence against the entire other sequence (i.e., the pattern). It is a hybrid of global and local
alignment.

Chapter 3
Multiple Sequence Alignment

Multiple Sequence Alignment (MSA) is a generalization of pairwise (two-sequence)
alignment to multiple sequences. Progressive alignment is a commonly used method for
developing multiple sequence alignments.
3.1. Progressive Alignment Methods
Most algorithms use a guide tree to establish the order in which the sequences are
merged into the progressively growing multiple alignment. A guide tree is formed by applying
agglomerative clustering to construct a binary tree whose leaves represent sequences
and whose internal nodes represent alignments.
3.1.1. Constructing the Guide Tree
The construction of the guide tree for a set of N sequences essentially proceeds as follows:
1. The pairwise similarity/distance score matrix is computed between every two sequences. For
example, the pairwise alignment between S1 and S2 using the unit cost distance gives

score = 4
2. Each of the N sequences is considered a singleton group.
3. Groups are merged by choosing the two most similar groups and then recomputing the
similarity/distance scores between the new merged group and the remaining groups.
4. The merging process stops when all sequences belong to one large group containing all N
sequences.

Consider the following set of sequences:


Scoring Matrix
     A  T  G  C
A    0  1  1  1
T    1  0  1  1
G    1  1  0  1
C    1  1  1  0

Based on the distance values, the closest sequences in the set, S1 and S5, are merged
into a new group S6. The distance from the merged group S6 to each of the remaining sequences
is then computed as the average of the distances of S1 and S5 to that sequence.

The smallest distance between any two groups in the above array is now 5. We can choose
to merge any of the three pairs that have this minimum distance. Let us choose to merge S6
and S4 into a new group S7. Since group S6 already contains two sequences, the distance
between S7 and S2 is computed as a weighted average:
$d(S_7, S_2) = \frac{2 \times d(S_6, S_2) + 1 \times d(S_4, S_2)}{2 + 1} = 5.66$
The distance between S7 and S3 is computed similarly:
$d(S_7, S_3) = \frac{2 \times d(S_6, S_3) + 1 \times d(S_4, S_3)}{2 + 1} = 7$
Upon completion of this step, the new distance matrix looks as follows:

The closest pair to merge next is S2 and S3. The merged group is labeled S8:

The final step is simply a merge of groups S7 and S8 to yield the final group S9, which comprises
the entire sequence set. The guide tree provides the order in which the sequences will be merged
into the progressively growing alignment.
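The merge-and-recompute loop can be sketched in Python. The distance values below are hypothetical (the original distance matrix is not reproduced here), so the merge order differs from the worked example; what the sketch shows is the size-weighted average used when a merged group already contains several sequences. The function name guide_tree is chosen for illustration.

```python
from itertools import combinations


def guide_tree(names, dist):
    """Agglomerative clustering with size-weighted average distances.

    Returns a list of merges (new group, group a, group b, distance).
    """
    d = {frozenset(pair): value for pair, value in dist.items()}
    size = {name: 1 for name in names}
    active = list(names)
    merges = []
    next_id = len(names) + 1
    while len(active) > 1:
        # Pick the closest pair of groups.
        a, b = min(combinations(active, 2), key=lambda p: d[frozenset(p)])
        new = f"S{next_id}"
        next_id += 1
        # Distance from the new group to every other group.
        for other in active:
            if other not in (a, b):
                d[frozenset((new, other))] = (
                    size[a] * d[frozenset((a, other))]
                    + size[b] * d[frozenset((b, other))]
                ) / (size[a] + size[b])
        size[new] = size[a] + size[b]
        merges.append((new, a, b, d[frozenset((a, b))]))
        active = [g for g in active if g not in (a, b)] + [new]
    return merges


# Hypothetical pairwise distances between five sequences.
DIST = {
    ("S1", "S2"): 6, ("S1", "S3"): 8, ("S1", "S4"): 5, ("S1", "S5"): 2,
    ("S2", "S3"): 4, ("S2", "S4"): 6, ("S2", "S5"): 7,
    ("S3", "S4"): 7, ("S3", "S5"): 9, ("S4", "S5"): 6,
}

merges = guide_tree(["S1", "S2", "S3", "S4", "S5"], DIST)
for new, a, b, distance in merges:
    print(new, "=", a, "+", b, "at distance", round(distance, 2))
```

The list of merges, read in order, is the guide tree: each entry names the two groups that are aligned together at that step.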

3.1.2. Constructing MSA using the Guide Tree
While the progressive alignment is being built, the gaps introduced in a pairwise alignment are
replaced with a special character, such as an X. This lets the gaps persist through to the end. The
dynamic programming alignment algorithm must be adjusted so that Alignment_Score(X,
anything) = 0.

According to the guide tree, the next sequence to be aligned is S4. This sequence is aligned
with each of the sequences in the group according to the unit cost model.

If any gaps are introduced in S1 (the sequence at the smaller distance) by the alignment operation,
the same gaps are added to each of the other sequences in the group to keep the MSA consistent.

A similar process is used for grouping S2 and S3.

The next step in the progressive alignment process involves the merging of the two sequence
groups, {S1, S4, S5} and {S2, S3}. For an alignment of two groups, all sequence pairs across the
two groups are tried and the best-scoring alignment between the groups is used.

3.1.3. Modeling MSA as Profiles
MSA data can be represented numerically (profile representation). The profile
representation of an MSA with L columns is a 5 × L matrix, with one row for each of the symbols
A, C, G, T, and - (gap). Each entry is the count of the symbol in that column divided by the number
of sequences.

For example, in column (1): count(A) / number of sequences = 1/3, count(C) / number of
sequences = 0/3, …, count(−) / number of sequences = 2/3.
In column (2): count(A) / number of sequences = 2/3, count(C) / number of sequences = 2/3, …,
count(−) / number of sequences = 2/3.
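A profile of this kind can be computed column by column. The sketch below uses a made-up three-sequence alignment (not the one in the figure) whose first column contains one A and two gaps; the function name msa_profile is hypothetical.

```python
ALPHABET = "ACGT-"


def msa_profile(alignment):
    """Return {symbol: [fraction in column 0, column 1, ...]}."""
    n = len(alignment)
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    return {
        symbol: [sum(row[c] == symbol for row in alignment) / n
                 for c in range(length)]
        for symbol in ALPHABET
    }


# Hypothetical alignment of three sequences with four columns.
profile = msa_profile(["AC-T", "-CGT", "-CGA"])
for symbol in ALPHABET:
    print(symbol, [round(value, 2) for value in profile[symbol]])
```

Every column of the resulting 5 × L matrix sums to 1, which is a quick sanity check for a profile.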

3.2. Progressive Alignment in MATLAB
MATLAB provides the nwalign function for performing global alignment.

Chapter 4

Biolinguistic Methods

A newer method for sequence comparison is based on k-mer word frequency profiles. In this
method, the distribution of the k-mer words is treated as the signature of a sequence (e.g., with
3-mers, a sequence is converted into the set of its substrings of 3 letters together with their
frequencies).
In this manner, each sequence is represented by a set of k-mer words. A sequence of
length N contains (N − k + 1) k-mer words. Because the number of distinct words grows
exponentially with k, as 4^k for nucleotide sequences and 20^k for protein sequences, some
reduction technique may be needed. For example, 3-mers over nucleotide sequences comprise
4^3 = 64 words or categories.
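The word counts above are easy to verify in Python. This is a minimal sketch with illustrative function names; it uses the sequence S1 from the examples in this chapter.

```python
from collections import Counter


def kmers(sequence: str, k: int):
    """Return the overlapping k-mer words of a sequence, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def kmer_counts(sequence: str, k: int) -> Counter:
    """Return the frequency of every k-mer word in the sequence."""
    return Counter(kmers(sequence, k))


S1 = "ACCTGGTATCCATTGCCA"           # N = 18
print(len(kmers(S1, 2)))             # N - k + 1 = 17
print(kmer_counts(S1, 2)["CC"])      # 3
print(4 ** 3)                        # 64 possible nucleotide 3-mers
```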
4.1. Sequence Profiles (Vector Space Comparison)
The comparison of k-mer profiles may be accomplished using cosine similarity. The two
profiles are treated as vectors in the k-mer space, and their similarity is computed using
Equation 1.1:

$\cos(P_1, P_2) = \frac{\sum_{w} P_1(w)\, P_2(w)}{\sqrt{\sum_{w} P_1(w)^2}\; \sqrt{\sum_{w} P_2(w)^2}}$ (1.1)
As a simple example, consider the cosine similarity computed using 1-mer frequency
vectors over the following three sequences:
S1 = ACCTGGTATCCATTGCCA
S2 = CCTTAATTGGGTT
S3 = TTCCGGTAGCGATACAATTAAC
There are four 1-mers (4^1) for DNA sequences. The 1-mer profiles of these sequences
may be computed by taking the frequencies of the four nucleotides and applying Laplace’s rule
(which eliminates zero probabilities by adding one to every frequency). As shown
in the following table, f1(A) = 4 becomes 5 and f1(A+C+G+T) = 18 becomes 22.

Pairwise cosine similarity between these 1-mer profiles can be computed as:

Based on the cosine similarity values for the one-mer frequencies, sequences S1 and S3 are the
closest neighbors.
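The 1-mer comparison can be reproduced with a short Python sketch. The function names are chosen here for illustration; the profiles use Laplace's rule as described above, and the cosine is computed directly from its definition.

```python
import math
from collections import Counter


def laplace_profile(sequence: str, alphabet: str = "ACGT"):
    """1-mer probabilities with one added to every frequency."""
    counts = Counter(sequence)
    total = len(sequence) + len(alphabet)
    return [(counts[letter] + 1) / total for letter in alphabet]


def cosine(p, q):
    """Cosine similarity between two profile vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)


S1 = "ACCTGGTATCCATTGCCA"
S2 = "CCTTAATTGGGTT"
S3 = "TTCCGGTAGCGATACAATTAAC"
P1, P2, P3 = (laplace_profile(s) for s in (S1, S2, S3))

print(round(cosine(P1, P2), 3))
print(round(cosine(P1, P3), 3))
print(round(cosine(P2, P3), 3))
```

With these profiles the pair (S1, S3) has the largest cosine, in agreement with the conclusion above.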

As a more complex example, consider the cosine similarity computations using 2-mer document
frequency vectors computed over the previous three sequences:
2-mer (S1) = {AC, CC, CT, TG, GG, GT, TA, AT, TC, CC, CA, AT, TT, TG, GC, CC, CA}.
2-mer (S2) = {CC, CT, TT, TA, AA, AT, TT, TG, GG, GG, GT, TT}.
2-mer (S3) = {TT, TC, CC, CG, GG, GT, TA, AG, GC, CG, GA, AT, TA, AC, CA, AA, AT, TT, TA, AA, AC}
Word   F1  F2  F3   P1     P2     P3     P1^2    P2^2    P3^2    P1*P2   P1*P3   P2*P3
AA     0   1   2    1/33   2/28   3/37   0.001   0.005   0.007   0.002   0.002   0.006
AC     1   0   2    2/33   1/28   3/37   0.004   0.001   0.007   0.002   0.005   0.003
AG     0   0   1    1/33   1/28   2/37   0.001   0.001   0.003   0.001   0.002   0.002
AT     2   1   2    3/33   2/28   3/37   0.008   0.005   0.007   0.006   0.007   0.006
CA     2   0   1    3/33   1/28   2/37   0.008   0.001   0.003   0.003   0.005   0.002
CC     3   1   1    4/33   2/28   2/37   0.015   0.005   0.003   0.009   0.007   0.004
CG     0   0   2    1/33   1/28   3/37   0.001   0.001   0.007   0.001   0.002   0.003
CT     1   1   0    2/33   2/28   1/37   0.004   0.005   0.001   0.004   0.002   0.002
GA     0   0   1    1/33   1/28   2/37   0.001   0.001   0.003   0.001   0.002   0.002
GC     1   0   1    2/33   1/28   2/37   0.004   0.001   0.003   0.002   0.003   0.002
GG     1   2   1    2/33   3/28   2/37   0.004   0.012   0.003   0.006   0.003   0.006
GT     1   1   1    2/33   2/28   2/37   0.004   0.005   0.003   0.004   0.003   0.004
TA     1   1   3    2/33   2/28   4/37   0.004   0.005   0.012   0.004   0.007   0.008
TC     1   0   1    2/33   1/28   2/37   0.004   0.001   0.003   0.002   0.003   0.002
TG     2   1   0    3/33   2/28   1/37   0.008   0.005   0.001   0.006   0.002   0.002
TT     1   3   2    2/33   4/28   3/37   0.004   0.020   0.007   0.009   0.005   0.012
Total  17  12  21   1.0    1.0    1.0    0.073   0.077   0.0694  0.065   0.061   0.064

$\cos(P_1, P_2) = \frac{\text{Total}(P_1 P_2)}{\sqrt{\text{Total}(P_1^2)}\,\sqrt{\text{Total}(P_2^2)}} = \frac{0.065}{\sqrt{0.073}\,\sqrt{0.077}} = 0.867$

$\cos(P_1, P_3) = \frac{\text{Total}(P_1 P_3)}{\sqrt{\text{Total}(P_1^2)}\,\sqrt{\text{Total}(P_3^2)}} = \frac{0.061}{\sqrt{0.073}\,\sqrt{0.069}} = 0.859$

$\cos(P_2, P_3) = \frac{\text{Total}(P_2 P_3)}{\sqrt{\text{Total}(P_2^2)}\,\sqrt{\text{Total}(P_3^2)}} = \frac{0.064}{\sqrt{0.077}\,\sqrt{0.069}} = 0.878$
4.2. Divergence Measures (5 measures)
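A profile-based divergence measure is a distance metric whose value falls between 0.0
and 1.0. Several techniques have been proposed for measuring divergence, such as the variation
distance V(p1, p2), the divergence I(p1, p2), and the divergence J(p1, p2). The L(p1, p2)
divergence measure utilized in our research is derived from the relative entropy, or K-divergence.

$V(p_1, p_2) = \frac{1}{2} \sum_{w \in W} |p_1(w) - p_2(w)|$ (1.3)

$I(p_1, p_2) = \sum_{w \in W} p_1(w) \ln \frac{p_1(w)}{p_2(w)}$ (1.4)

$J(p_1, p_2) = I(p_1, p_2) + I(p_2, p_1)$ (1.5)

$K(p_1, p_2) = \sum_{w \in W} p_1(w) \ln \frac{p_1(w)}{(p_1(w) + p_2(w))/2}$ (1.6)

$L(p_1, p_2) = \frac{K(p_1, p_2) + K(p_2, p_1)}{2}$ (1.7)

where W represents the different words in a profile.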


The divergence measure may also be used for measuring the similarity between sequences A and
B.

(1.8)

According to the previous 1-mer matrix of frequencies for the three DNA
sequences (S1, S2, and S3):

V(P1, P2) = 0.5(|0.051| + |0.142| + |-0.053| + |-0.139|) = 0.5(0.051 + 0.142 + 0.053 + 0.139) = 0.193


In the same manner, V(P1, P3) = 0.091 and V(P2, P3) = 0.186 are calculated.

According to the previous 2-mer matrix, the variational distances can be calculated
according to the following table:

Word    |P1 – P2|   |P1 – P3|   |P2 – P3|
AA      0.041       0.051       0.01
AC      0.025       0.02        0.045
AG      0.005       0.024       0.018
AT      0.019       0.01        0.01
CA      0.055       0.037       0.018
CC      0.05        0.067       0.017
CG      0.005       0.051       0.045
CT      0.011       0.034       0.044
GA      0.005       0.024       0.018
GC      0.025       0.007       0.018
GG      0.047       0.007       0.053
GT      0.011       0.007       0.017
TA      0.011       0.048       0.037
TC      0.025       0.007       0.018
TG      0.019       0.064       0.044
TT      0.082       0.02        0.062
Total   0.437       0.475       0.477
V(P1, P2) = 0.5 × Total(|P1 – P2|) = 0.5 × 0.437 ≈ 0.2186
V(P1, P3) = 0.5 × Total(|P1 – P3|) = 0.5 × 0.475 ≈ 0.2375
V(P2, P3) = 0.5 × Total(|P2 – P3|) = 0.5 × 0.477 ≈ 0.2384

The other divergence measures (i.e., I, J, K, and L) are computed similarly. For I, the following
tables compute the values from the 1-mer matrix mentioned previously:
1-mer   P1/P2   P1/P3   P2/P3   P1×Ln(P1/P2)   P1×Ln(P1/P3)   P2×Ln(P2/P3)
A       1.288   0.739   0.574   0.057          -0.07          -0.098
C       1.803   1.379   0.765   0.188          0.102          -0.047
G       0.773   0.945   1.224   -0.05          -0.01          0.047
T       0.662   1.013   1.529   -0.11          0.004          0.175
Total   4.526   4.076   4.091   0.086          0.027          0.077
I(P1,P2) = Total(P1 × Ln(P1/P2)) = 0.086
I(P1,P3) = Total(P1 × Ln(P1/P3)) = 0.027
I(P2,P3) = Total(P2 × Ln(P2/P3)) = 0.077
1-mer   P2/P1   P3/P1   P3/P2   P2×Ln(P2/P1)   P3×Ln(P3/P1)   P3×Ln(P3/P2)
A       0.776   1.354   1.744   -0.04          0.0932         0.171
C       0.555   0.725   1.308   -0.1           -0.074         0.062
G       1.294   1.058   0.817   0.061          0.0108         -0.039
T       1.51    0.987   0.654   0.17           -0.003         -0.114
Total   4.135   4.124   4.522   0.082          0.0264         0.08
I(P2,P1) = Total(P2 × Ln(P2/P1)) = 0.082
I(P3,P1) = Total(P3 × Ln(P3/P1)) = 0.0264
I(P3,P2) = Total(P3 × Ln(P3/P2)) = 0.08
From the values of the I measure, the J measure can be computed as follows:
J(P1,P2) = J(P2,P1) = I(P1,P2) + I(P2,P1) = 0.086 + 0.082 = 0.168
J(P1,P3) = J(P3,P1) = I(P1,P3) + I(P3,P1) = 0.027 + 0.0264 = 0.053
J(P2,P3) = J(P3,P2) = I(P2,P3) + I(P3,P2) = 0.077 + 0.08 = 0.157

For the relative entropy, or K-divergence, the following tables compute the K
values from the 1-mer matrix, where M1,2 = (P1+P2)/2, M1,3 = (P1+P3)/2, and M2,3 = (P2+P3)/2:
1-mer   M1,2    M1,3    M2,3    P1×Ln(P1/M1,2)   P1×Ln(P1/M1,3)   P2×Ln(P2/M2,3)
A       0.202   0.267   0.242   0.0269           -0.037           -0.056
C       0.247   0.274   0.204   0.0802           0.047            -0.025
G       0.209   0.187   0.214   -0.025           -0.005           0.0225
T       0.343   0.271   0.341   -0.062           0.0018           0.0783
Total   1       1       1       0.0202           0.0066           0.0198
K(P1,P2) = Total(P1 × Ln(P1/M1,2)) = 0.0202
K(P1,P3) = Total(P1 × Ln(P1/M1,3)) = 0.0066
K(P2,P3) = Total(P2 × Ln(P2/M2,3)) = 0.0198
1-mer   M2,1    M3,1    M3,2    P2×Ln(P2/M2,1)   P3×Ln(P3/M3,1)   P3×Ln(P3/M3,2)
A       0.202   0.267   0.242   -0.024           0.0431           0.0738
C       0.247   0.274   0.204   -0.06            -0.04            0.0289
G       0.209   0.187   0.214   0.0284           0.0053           -0.02
T       0.343   0.271   0.341   0.0761           -0.0017          -0.063
Total   1       1       1       0.0212           0.0066           0.0191

K(P2,P1) = Total(P2 × Ln(P2/M2,1)) = 0.0212
K(P3,P1) = Total(P3 × Ln(P3/M3,1)) = 0.0066
K(P3,P2) = Total(P3 × Ln(P3/M3,2)) = 0.0191
From the values of the K measure, the L measure can be computed for the 1-mer matrix
mentioned previously as follows:
L(P1,P2) = L(P2,P1) = (K(P1,P2) + K(P2,P1))/2 = (0.0202 + 0.0212)/2 = 0.0207
L(P1,P3) = L(P3,P1) = (K(P1,P3) + K(P3,P1))/2 = (0.0066 + 0.0066)/2 = 0.0066
L(P2,P3) = L(P3,P2) = (K(P2,P3) + K(P3,P2))/2 = (0.0198 + 0.0191)/2 = 0.0195
For the 1-mer profiles, all the divergence measures agree that sequences S1 and S3 are the
closest neighbors.
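The 1-mer computations in this section can be cross-checked with a few lines of Python. The sketch below writes the five measures as they are used in the worked tables (natural logarithm, midpoint profile M for K, and the average of the two K values for L); the function names are chosen here, and the profiles are the Laplace-smoothed 1-mer profiles of S1, S2, and S3.

```python
import math


def variation(p, q):
    """Variation distance V: half the sum of absolute differences."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def divergence_i(p, q):
    """Divergence I: sum of p * ln(p / q)."""
    return sum(a * math.log(a / b) for a, b in zip(p, q))


def divergence_j(p, q):
    """Divergence J: I in both directions, added."""
    return divergence_i(p, q) + divergence_i(q, p)


def divergence_k(p, q):
    """Divergence K: like I, but against the midpoint (p + q) / 2."""
    return sum(a * math.log(a / ((a + b) / 2)) for a, b in zip(p, q))


def divergence_l(p, q):
    """Divergence L: average of K in both directions."""
    return (divergence_k(p, q) + divergence_k(q, p)) / 2


# Laplace-smoothed 1-mer profiles in the order A, C, G, T.
P1 = [5 / 22, 7 / 22, 4 / 22, 6 / 22]
P2 = [3 / 17, 3 / 17, 4 / 17, 7 / 17]
P3 = [8 / 26, 6 / 26, 5 / 26, 7 / 26]

for name, measure in [("V", variation), ("J", divergence_j), ("L", divergence_l)]:
    print(name,
          round(measure(P1, P2), 4),
          round(measure(P1, P3), 4),
          round(measure(P2, P3), 4))
```

Small differences in the last digit relative to the tables come from the rounding used in the hand computation.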

Exercises

1- Answer the following questions for the ontology graph shown in the following figure. Assume
that the arrows represent an “is a” relationship:

a. What are the descendants for GO:0030684?


b. What are the ancestors for GO:0009547?
c. What are the relatives for GO:000313?
d. geneont is a MATLAB bioinformatics function for downloading the current version of the GO
database from the Web. (answered)
e. go.terms(183).id displays the GO identifier of the 183rd term in the geneont object, go. (answered)
f. get(go(213).terms, ‘name’) displays the property name of the Go identifier “213” in the geneont object, go.
(answered)

g. num2goid(go.terms(183).id) formats the GO identifier into a character array. (answered)

h. Find the index or position number (in the object go) of the term object whose name property is 'membrane'.
(answered)

2- Given a sequence entry from the GENBANK shown in the following figure, provide a short
description for each of the following:
a. What is the accession number for this sequence?
b. What is the length of this sequence?
c. Which genus and species of the organism is this sequence derived from?
d. If a researcher wants to read further about the gene reported in this sequence, what would
be a good place to start their research? Be specific.
e. Use MATLAB help to retrieve the sequence with the accession number M10051.
Answer: S = getgenbank('M10051');

3- Find the optimal local alignment between strings s = ‘AAAGGCATT’ and t =
‘ATAGAGCCAGTT’ assuming a unit cost model.
4- Find optimal global alignment between strings s = ‘AAAGGC’ and t = ‘AAGGC’ assuming a
unit cost model.
5- Given that the scoring scheme is S(a,a) = +1, and S(a,-) = S(-, a) = S(a,b) = -1, find the global
alignment between sequences, ‘ACCCGGGTTGC’ and ‘ATCCTGGGC’, below. Clearly show
the alignment grid and list all the optimal alignments.
6- Nucleotide similarity scores under the unit cost model are defined as s(u, u) = +1, s(u, v) = -1,
s(u,-) = -1, and s(-, u) = -1. Compute the local alignment of U = ‘ATTAGGAATTAA’ in
sequence V = ‘CCACCATTTAATTT’.
7- Which of the following is not an alignment between sequences s = ‘ATTACG’ and t = ‘TTAG’
(note that “:” means mismatch) ?

8- Assume that the score for a match is a +1. The penalty for a mismatch is -2 and for every gap
is -1. What is the score for each of the following alignments?

9- Which of the following represents a local alignment between sequences s = ‘AATACG’ and t
= ‘TTTACT’?

10- Given the set of four sequences being aligned,


S1 = ’ATTGGCACCA’
S2 = ’ATTTGGACCA’
S3 = ’TGGTTCCA’
S4 = ’ATTCCACCAC’
The distance scores have been computed under the unit model, as follows:

(a) Compute the guide tree. Show the values of the distance matrix after each merge operation.
(b) Assume that the distance scores between the sequences continue to be the same as the process of
progressive alignment proceeds, determine what the multiple alignment would look like.
11- Consider the following set of sequences in DNA database:
- Seq-1: AGACTGTTACCCAGAAAACTTACAAATTGTAAATGAGAGGTTAGTGAAGAT
Seq-2: GGATCCAGCCTGACCTTGTAAAATAGCCTAACGTGTGTTCCCTAG
Seq-3: CTTAAGACATCAAACAATGTATGTTGAGTTTAACAAGGGAACACAACAAGATG
Seq-4: CTGCTCTAGGAAAAAATGCCTAGATACAAATAAAGACTTT
Seq-5: CTCAGCTTTTGTTTGTCTTGGAAAGTTGTTATTTTTCCCTCATTTCTGAAGGTCA
a. Develop a 2-gram frequency profile for the sequence collection.
b. Use cosine similarity to find the nearest neighbor for the query sequence below. Compare your
results with those obtained with the variational distance:
Query: AAGATAACACATACAGAAAATGTGAGAAAA
31. Given the set of four sequences being aligned:
S1 = ’ATTGGCACCA’
S2 = ’GGTTCCA’
S3 = ’AAATTGGACC’
S4 = ’TATTCCACCA’
The similarity scores have been computed, as follows:

a. Compute the guide tree. Show the values of the similarity matrix after each merge operation.
b. Assume that the similarity scores between the sequences continue to be the same as the process of
progressive alignment proceeds; can you come up with what the multiple alignment would look
like?

32. What is the main goal of the Bioinformatics field?
33. Suppose there are four types of databases (A- Relational Database, B- Annotated Database, C-
Regular Expression Database, D- knowledge base), assign the matching choice to the following
databases: GenBank (.…....), SwissProt (….....), Gene ontology (.…....), and PROSITE (.…....).
34. According to the following definition line of a GenBank record, answer the following:

a. What is the species of the associated organism? ………….
b. What is the genus of the associated organism? ..…………
c. What is the protein name of the associated nucleotide sequence? .……
d. Suppose that the protein code is “HBB”; suggest a suitable entry name
for the given protein.
e. Why is the SWISS-PROT database considered a secondary sequence
database?
f. Given a set of IDs (ATCG279110, AY279110, AT279110CG,
A279110Y), which one may be considered a GenBank accession
number? ……
g. According to the following CDS feature section of a GenBank record,
how many amino acids are in the beta globin product?

35. According to the following feature table lines of a SWISS-PROT record, what do terms
“BTEB4”, “C2H2 Zinc Fing”, and “Pro-rich” denote?

36. According to the SQ line of a SWISS-PROT record, what do terms “251 AA”, and
“3F0D7739BF7B1FA4” denote?

37. According to the given PROSITE signature of a calcium-binding protein, suggest two amino acid
sequences accepted by this signature, where the amino acid alphabet is
{ACDEFGHIKLMNPQRSTVWY}

38. Convert the following gene-ontology object into its corresponding relationship matrix

39. Ontologies are structured as directed acyclic graphs, in the same context answer the following:
a. Can a child have many parents in ontological databases?
b. What is the main task of ontological databases?
c. What are the two common relationships used in ontological databases?
40. The goodness of retrieved information is often measured by two parameters (i.e., precision and
recall). According to the given counts for a specific request to an information retrieval system,
how good is this system?
                Retrieved    Not Retrieved
Relevant           30             20
Not Relevant       10             40
41. Dot plots are a simple form of data visualization. In the same context, answer the following:
a. How do identical sequences appear in this visualization?
b. How can the selectivity of matches reported by dot plots be enhanced?
42. There are three types of alignment, give a brief description for each type.
43. Which of the following represents a local alignment between sequences s = ‘AATACG’ and t
= ‘TTTACT’?

44. Given the following five sequences and the distance values among them, answer
the following points:

a. Systematically construct the guide tree needed to complete the multi-sequence
alignment (MSA) process.
b. Assume that the final loop of the MSA process with the unit cost scoring
model (+1 for a mismatch or a gap, and zero otherwise) outputs two blocks
of aligned sequences (i.e., {S1, S4, S5} and {S2, S3}). What would the MSA of
the five sequences look like according to the following six aligned pairs of
the two blocks?
c. Construct a profile for the five aligned sequences.

DNA General View2
(Sequence modeling)

5.1 Independent Identical Distribution (IID)

5.2 Markov Chain Model


In a first-order Markov Chain Model (MCM), the probability of observing a
nucleotide at location (i) depends only on the nucleotide at location (i-1).
Similarly, in a second-order MCM the probability of observing a nucleotide at
location (i) depends only on the nucleotides at locations (i-1) and (i-2).

(3)

(4)
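
The first-order dependency described above can be sketched as follows. This is a minimal illustration under stated assumptions: the training sequence "AATTAATA" is hypothetical, and the conditional probabilities are estimated as simple relative frequencies of adjacent pairs, which may differ in detail from the estimator used elsewhere in these notes.

```python
from collections import Counter

def first_order_model(seq):
    """Estimate P(x) and P(y | x) from single-symbol and adjacent-pair counts."""
    initial = {x: n / len(seq) for x, n in Counter(seq).items()}
    pair_counts = Counter(zip(seq, seq[1:]))
    # How often each symbol is followed by another symbol (the last position is not).
    followed = Counter(seq[:-1])
    transition = {(x, y): n / followed[x] for (x, y), n in pair_counts.items()}
    return initial, transition

def chain_probability(pattern, initial, transition):
    """P(x1) * product of P(x_i | x_{i-1}); unseen transitions contribute 0."""
    prob = initial.get(pattern[0], 0.0)
    for prev, cur in zip(pattern, pattern[1:]):
        prob *= transition.get((prev, cur), 0.0)
    return prob

# Hypothetical background sequence used only for illustration.
initial, transition = first_order_model("AATTAATA")
print(chain_probability("ATTA", initial, transition))
```

A second-order model would condition on the pair of preceding nucleotides instead, so its table of conditional probabilities would be indexed by triples rather than pairs.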

6.1. Weight Matrices
A DNA sequence matrix is a set of fixed-length DNA sequence segments
aligned with respect to an experimentally determined, biologically significant
function. A DNA sequence motif can be defined as a matrix of depth 4 together with a
cut-off value. The 4-column/mononucleotide matrix description of a genetic signal
is based on the assumption that the motif is of fixed length, and that each nucleotide
(i.e., A, C, G, and T) is independently recognized by a trans-acting mechanism. If a
set of aligned signal sequences of length L corresponds to the functional signal under
consideration, then F = [f_bi], (b ∈ Σ), (i = 1, ..., L) is the nucleotide frequency matrix,
where f_bi is the absolute frequency of occurrence of nucleotide b from
the set Σ = {A, C, G, T} at the i-th position along the functional site. A method for
converting the frequency matrix into a weight matrix has been proposed. This
method is based on the weights at a given position being proportional to the logarithm
of the observed base frequencies. These are increased by a small term that prevents
taking the logarithm of zero and minimizes sampling errors. The weight matrix is computed
as shown in Eq. 6.1. The term f_bi is the frequency of base b at position i, e_b
represents the expected frequency of base b, c_i is a column-specific constant, and s is a
smoothing percentage.

$$W(b, i) = \log_2\left(\frac{f_{bi}}{e_b} + \frac{s}{100}\right) + c_i \qquad (1)$$
These optimized weight matrices can be used to search for functional signals
in nucleotide sequences. Any nucleotide fragment of length L is analyzed and tested
for assignment to the proper functional signal. A matching score of $\sum_{i=1}^{L} W(b_i, i)$ is
assigned to the nucleotide position being examined along the sequence. In the search
formulation, bi is the base at position i along the sequence, and W(bi, i) represents the
corresponding weight matrix entry for base bi occurring along the ith position in the
motif. For example, the weight matrix reported for the functional pattern commonly
known as the TATAA-Box is shown in Table 6.1. Matrices such as PAM and
BLOSUM matrices are derived in a manner similar to the process described above.
The term Position Specific Scoring Matrix, (PSSM), as these are called, is often used
to define the individual score profile within the various columns of the pattern. A
PSSM can be used to search for a match in a longer sequence by evaluating a score
Sj for each starting point j in the sequence from position 1 to (N − L + 1) where L is
the length of the PSSM.
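
The search procedure described above (one score S_j per starting point j) can be sketched as follows. The 3-column weight matrix and the sequence are hypothetical values chosen for illustration; they are not the TATAA-box matrix of Table 6.1.

```python
def pssm_scores(seq, pssm):
    """Score every window of length L: S_j = sum_i W(b_i, i) for j = 1 .. N - L + 1."""
    length = len(next(iter(pssm.values())))
    return [
        sum(pssm[base][i] for i, base in enumerate(seq[j:j + length]))
        for j in range(len(seq) - length + 1)
    ]

# Hypothetical weight matrix: one row per base, one column per motif position.
pssm = {
    "A": [2, -1, -1],
    "C": [-1, -1, -1],
    "G": [-1, -1, 2],
    "T": [-1, 2, -1],
}

scores = pssm_scores("CCATGA", pssm)
best = max(range(len(scores)), key=scores.__getitem__)
print(scores, "best start (1-based):", best + 1)
```

In practice, a window would be reported as a match only when its score exceeds the cut-off value associated with the motif.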
Consider a block of DNA sequences representing an un-gapped alignment:

The computation of the frequency values used in the above score matrices utilizes the Laplace rule,
such that all frequency values are incremented by 1 to avoid occurrences of zero probability. Thus, the
frequencies of {A, C, G, T} in the first column are set to {(5+1), (1+1), (0+1), (0+1)}.

Thus, PA = 26 / 60 = 13 / 30, PC = 9 / 60 = 3 / 20, PG = 8 / 60 = 2 / 15, PT = 17 / 60.

Next, the weight matrix is constructed by considering the log-odds score of (f_bi / e_b). For example,
assuming that s = 10 and c = 0, the log-odds score of the nucleotide A at column 1 is
$\log_2\left(\frac{f_{A1}}{e_A} + \frac{s}{100}\right) = \log_2\left(\frac{6/10}{13/30} + \frac{10}{100}\right) = 0.570$.
The value thus obtained is multiplied by 100, and
the fractional part is dropped, yielding a weight of 57 for nucleotide A in column 1 of the matrix
shown below. Other values are similarly computed.
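
The worked value of 57 can be checked with a short sketch. The function name `weight` is hypothetical, and treating c as an additive constant is an assumption (it does not matter here because c = 0).

```python
import math

def weight(count, column_total, expected, s=10.0, c=0.0):
    """Log-odds weight log2(f / e + s / 100) + c, with f = count / column_total."""
    f = count / column_total
    return math.log2(f / expected + s / 100.0) + c

# First-column counts of {A, C, G, T} from the text, before the Laplace increment.
counts = {"A": 5, "C": 1, "G": 0, "T": 0}
smoothed = {base: n + 1 for base, n in counts.items()}   # Laplace rule
total = sum(smoothed.values())                            # 10

# Expected (background) frequencies quoted in the text.
background = {"A": 26 / 60, "C": 9 / 60, "G": 8 / 60, "T": 17 / 60}

w = weight(smoothed["A"], total, background["A"])
print(round(w, 3), int(w * 100))
```

The same call with `smoothed["C"]` and `background["C"]` gives the entry for nucleotide C in column 1, and so on for the remaining cells.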

6.1.1. Conversion of sequence to position probability matrix


A position weight matrix (PWM) has one row for each symbol of the alphabet (4 rows
for nucleotides in DNA sequences or 20 rows for amino acids in protein sequences) and one
column for each position in the pattern. In the first step of constructing a PWM, a basic position
frequency matrix (PFM) is created by counting the occurrences of each nucleotide at each position.
From the PFM, a position probability matrix (PPM) can then be created by dividing each
nucleotide count at each position by the number of sequences, thereby normalizing the values. For
example, given the following DNA sequences:

The probability of a sequence given a PPM model can be calculated by multiplying the relevant
probabilities at each position. For example, the probability of the sequence S = “GAGGTAAAC”
given the above PPM model M can be calculated as follows:
P(S|M) = 0.1 × 0.6 × 0.7 × 1.0 × 1.0 × 0.6 × 0.7 × 0.2 × 0.2 = 0.0007056
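
The PFM-to-PPM normalization and the product above can be sketched as follows. The figure with the original sequences is not reproduced in this text, so the ten sequences below are an assumed stand-in, chosen so that the resulting PPM gives the per-position probabilities quoted in the calculation of P(S|M).

```python
from collections import Counter

ALPHABET = "ACGT"

# Assumed stand-in for the ten aligned sequences of the missing figure.
SEQUENCES = [
    "GAGGTAAAC", "TCCGTAAGT", "CAGGTTGGA", "ACAGTCAGT", "TAGGTCATT",
    "TAGGTACTG", "ATGGTAACT", "CAGGTATAC", "TGTGTGAGT", "AAGGTAAGT",
]

def position_probability_matrix(seqs):
    """Count each nucleotide per column (PFM) and divide by the number of sequences."""
    ppm = []
    for column in zip(*seqs):
        counts = Counter(column)
        ppm.append({base: counts[base] / len(seqs) for base in ALPHABET})
    return ppm

def sequence_probability(seq, ppm):
    """Multiply the probability of the observed base at each position."""
    prob = 1.0
    for base, column in zip(seq, ppm):
        prob *= column[base]
    return prob

ppm = position_probability_matrix(SEQUENCES)
print(sequence_probability("GAGGTAAAC", ppm))   # about 0.0007056
```

Note that any sequence containing a base with zero count at some position gets probability 0 under this model, which is one motivation for pseudocounts such as the Laplace rule.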
6.1.2. Conversion of position probability matrix to position weight matrix
Most often the elements in PWMs are calculated as log likelihoods, as shown in Eq. 6.1.
If the column-specific constant (c_i) and the smoothing percentage (s) are set to zero, the
elements of a PPM are transformed as follows:
M_{b,i} = log2 (f_bi / e_b)

The score is 0 if the sequence has the same probability of being a functional site and of
being a random site. The score is greater than 0 if it is more likely to be a functional site
than a random site (that is, the nucleotides are strongly associated with their positions in the given set of
sequences), and less than 0 if it is more likely to be a random site than a
functional site.
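
The transformation and the sign interpretation above can be sketched as follows. The single PPM column and the uniform background of 0.25 are hypothetical values; a probability of zero is mapped to negative infinity here, which is the case that pseudocounts are meant to avoid.

```python
import math

def ppm_to_pwm(ppm, background):
    """Convert each PPM entry p into the log-odds weight log2(p / e_b)."""
    return [
        {base: (math.log2(p / background[base]) if p > 0 else float("-inf"))
         for base, p in column.items()}
        for column in ppm
    ]

# Hypothetical one-column PPM and a uniform background model.
ppm = [{"A": 0.5, "C": 0.25, "G": 0.125, "T": 0.125}]
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

pwm = ppm_to_pwm(ppm, background)
print(pwm[0])   # A: 1.0 (functional-like), C: 0.0 (neutral), G and T: -1.0 (random-like)
```

The score of a whole sequence is the sum of these per-position weights, so it is positive when the functional-site model explains the sequence better than the background does.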

6.2 Position Dependent Markov Models
Markov models have been considered as a means to define the background DNA sequence.
Markov models enable us to define the probability of a nucleotide (i.e., A, C, G, T) conditioned
upon the nucleotides that occur in the preceding position. However, the modeled dependency is
position-invariant. A position dependent Markov model may be utilized for the representation of a
sequence signal/motif/pattern. This model is defined on the sample space Σ∗ and assigns to every
sequence x on Σ∗ a probability given by Eq. 6.2 below:

$$P(x) = P_1(x_1) \prod_{i=2}^{n} P_i(x_i \mid x_{i-1}) \qquad (2)$$
This model has |Σ| + (n − 1) × |Σ|^2 parameters. The first |Σ| parameters are
probabilities estimated using the occurrences of the symbol α ∈ Σ at the first position of a
pattern. The other (n − 1) × |Σ|^2 probability values are for the conditional occurrence (i.e., first-order
Markov process) of symbol α at position i given that symbol β occurred at position (i − 1). These
|Σ|^2 parameters need to be estimated for each of the remaining (n − 1) positions in the pattern.
Thus, the position-specific dependencies on the previous position are determined by allowing a
unique set of transition probabilities to be associated with each position along the signal. Generally,
this model assumes that an ungapped multiple sequence alignment of the pattern instances is
available, and that the number of sequences is sufficient to train the position-specific Markov
probabilities.
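
The parameter count and the position-specific counting can be sketched as follows. The three aligned sequences are hypothetical and far too few for real training; they only show which counts are collected at which position.

```python
from collections import Counter

def parameter_count(alphabet_size, n_positions):
    """|Sigma| first-position parameters plus |Sigma|^2 for each later position."""
    return alphabet_size + (n_positions - 1) * alphabet_size ** 2

def train_position_dependent(aligned):
    """Collect first-position counts and one (previous, current) count table per position."""
    first = Counter(seq[0] for seq in aligned)
    transitions = [
        Counter((seq[i - 1], seq[i]) for seq in aligned)
        for i in range(1, len(aligned[0]))
    ]
    return first, transitions

# Hypothetical ungapped alignment of pattern instances.
aligned = ["ATTCA", "ATGCA", "TTTCA"]
first, transitions = train_position_dependent(aligned)

print(parameter_count(4, 5))   # 4 + 4 * 16 = 68
print(first)
print(transitions[0])          # counts for position 2 given position 1
```

Each position keeps its own count table, which is what distinguishes this model from the position-invariant Markov chain used for background sequence.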
An ungapped alignment of a set of sequences and the corresponding parameters
are shown below:

Position specific Markov frequencies for positions 2-N

The frequency values listed in the above table may be converted to probability values by
multiplying them by (1/21). Therefore, the probability of matching the subsequence “ATTCA” with
the model pattern can be calculated using Eq. 6.2 as P1(A) P2(T|A) P3(T|T) P4(C|T) P5(A|C) = 6/9 ×
3/21 × 2/21 × 1/21 × 1/21 = 0.000021.
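
The product above can be checked with exact fractions. Only the five factors quoted in the text are used; the underlying frequency table is not reproduced here.

```python
import math
from fractions import Fraction

# P1(A), P2(T|A), P3(T|T), P4(C|T), P5(A|C) as quoted in the text.
factors = [
    Fraction(6, 9),
    Fraction(3, 21),
    Fraction(2, 21),
    Fraction(1, 21),
    Fraction(1, 21),
]

probability = math.prod(factors)
print(probability, float(probability))   # 4/194481, about 0.000021
```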
6.3. Hidden Markov Models
There are several extensions to the classical Markov chains, and Hidden Markov Models
(HMM) are one such extension. HMM utilizes a set of hidden states with an emission/observation
of the symbols associated with each state. An N-state HMM is parameterized using the set λ = {A,B,
π} defined as follows:

Matrix A represents the transition probabilities among the hidden states (e.g., weather states SK:
cloudy, rainy, and sunny), while matrix B represents the corresponding emission
probabilities of the observed variables Ox (i.e., emitted symbols, such as sad and
happy) that depend on the weather states (i.e., hidden states).

The probability of the previous situation, P(Y, X = sunny-cloudy-sunny), where Y is the observed
sequence, can be calculated as follows:

P(X1 = sunny) = frequency of sunny days / total number of days. Suppose that the stationary distribution is
π = [rainy, cloudy, sunny] = [0.218, 0.273, 0.509]; then P(Y, X = sunny-cloudy-sunny) = 0.509
× 0.8 × 0.3 × 0.4 × 0.4 × 0.2 = 0.00391.
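
The joint probability of a hidden path and an observation sequence can be sketched as follows. The figure holding the A and B matrices is not reproduced in this text, so the matrices below are assumed values, chosen to be consistent with the six factors in the product above and with the stated π. The observation sequence happy-happy-sad is likewise an assumption inferred from those factors.

```python
# Initial distribution taken from the text; A and B are assumed, illustrative values.
PI = {"rainy": 0.218, "cloudy": 0.273, "sunny": 0.509}

A = {  # transition probabilities P(next state | current state)
    "rainy":  {"rainy": 0.5, "cloudy": 0.3, "sunny": 0.2},
    "cloudy": {"rainy": 0.4, "cloudy": 0.2, "sunny": 0.4},
    "sunny":  {"rainy": 0.0, "cloudy": 0.3, "sunny": 0.7},
}

B = {  # emission probabilities P(observation | state)
    "rainy":  {"sad": 0.9, "happy": 0.1},
    "cloudy": {"sad": 0.6, "happy": 0.4},
    "sunny":  {"sad": 0.2, "happy": 0.8},
}

def joint_probability(hidden, observed):
    """P(Y, X) = pi(x1) b(y1|x1) * product over t of a(x_{t-1} -> x_t) b(y_t|x_t)."""
    prob = PI[hidden[0]] * B[hidden[0]][observed[0]]
    for prev, cur, obs in zip(hidden, hidden[1:], observed[1:]):
        prob *= A[prev][cur] * B[cur][obs]
    return prob

p = joint_probability(["sunny", "cloudy", "sunny"], ["happy", "happy", "sad"])
print(round(p, 5))   # 0.00391
```

Summing this quantity over all possible hidden paths gives the probability of the observations alone, which is what the forward algorithm computes efficiently.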

Although a general topology of a fully connected HMM allows state transitions from any
state to any other, this structure is almost never used. Consequently, more restrictive HMMs reduce
the model’s complexity and the number of model parameters that are needed. One such model is
defined to be the profile-HMM, which is induced from a multiple sequence alignment.
Let us consider a set of six DNA sequences shown in Fig. 6.1(a). A multiple sequence
alignment of these sequences is the first step towards inducing the Hidden Markov
Model. As shown in this figure, each match state (Mi) at position “i” is followed by an insert state
at the same position (Ii), a match state at the next position (Mi+1), or a delete state at the next position (Di+1).
Model Topology: The topology of the HMM is established using the consensus/canonical
sequence that can be generated from the aligned columns. For example, the second, third, and fourth
aligned columns in Fig. 6.1(b) have “A”, “G”, and “C” as the dominant nucleotides with
proportions 5/6 > 50%, 5/6 > 50%, and 4/6 > 50%, respectively. Therefore, the consensus sequence
is “-AGC---”. If none of the 4 nucleotides exceeds 50% in a column, the corresponding column in the
consensus sequence is denoted by “N”, or by “-” when gaps dominate.
The aligned columns of symbols correspond either to emissions from the same match state
or to emissions from the same insert state. In this formalism, therefore, the columns that carry a
dominant nucleotide are used to define the match states of the HMM architecture. As shown in
Fig. 6.2, these columns are marked as M1, M2 and M3 (“-AGC---” → “*M1M2M3***”).
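
The consensus rule can be sketched as follows. The alignment of Fig. 6.1(b) is not reproduced in this text, so the six rows below are an assumed reconstruction that agrees with the column proportions and the seq-1 row quoted in the text; the rule "gaps dominate" is interpreted here as "the gap is the most frequent symbol in the column".

```python
from collections import Counter

# Assumed reconstruction of the aligned sequences (rows seq-1 .. seq-6).
ALIGNED = [
    "--GCCCA",
    "-AGC---",
    "AAGC---",
    "-AGAC--",
    "-AGAA--",
    "-AAC---",
]

def consensus(aligned, gap="-", threshold=0.5):
    """Per column: a nucleotide above the threshold, else '-' if gaps lead, else 'N'."""
    result = []
    for column in zip(*aligned):
        symbol, count = Counter(column).most_common(1)[0]
        if symbol != gap and count / len(column) > threshold:
            result.append(symbol)
        elif symbol == gap:
            result.append(gap)
        else:
            result.append("N")
    return "".join(result)

print(consensus(ALIGNED))   # -AGC---
```

The columns that receive a nucleotide in the consensus become the match states M1, M2 and M3; the remaining columns are explained by insert states.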

Fig. 6.1. The set of sequences in (a) are aligned and shown in (b)
Now, we compare the states of the consensus sequence “*M1M2M3***” with each
aligned sequence. For example, when the consensus sequence is matched with seq-1
“**GCCCA”, ‘M1’ is missing (delete state) and the last three positions “CCA” are extra (insert
states). Therefore, the state path of seq-1 is “M0 D1 M2 M3 I3 I3 I3 M4”.

Fig. 6.2(a): the consensus columns are used to define the match states M1, M2 and M3 for the HMM. After having
defined the match states, the corresponding insert and delete states are defined to complete the profile-HMM
topology.
seq-1 M0 D1 M2 M3 I3 I3 I3 M4
seq-2 M0 M1 M2 M3 M4
seq-3 M0 I0 M1 M2 M3 M4
seq-4 M0 M1 M2 M3 I3 M4
seq-5 M0 M1 M2 M3 I3 M4
seq-6 M0 M1 M2 M3 M4
Fig. 6.2(b): the consensus columns are used to define the match states M1, M2 and M3 for the HMM. After having
defined the match states, the corresponding insert and delete states are defined to complete the profile-HMM
topology.

Transition Probabilities: the value of each transition probability is computed using the
frequency of the transitions as each sequence is considered. The model parameters are computed
using the state transition sequences defined in Fig. 6.2(b). The frequency of each of the transitions
and corresponding probabilities are shown in Fig. 6.3.
Transition   State 0    State 1    State 2    State 3
             (0→0,1)    (1→1,2)    (2→2,3)    (3→3,4)
MM              4          5          6          3
MD              1          0          0          -
MI              1          0          0          3
IM              1          0          0          3
ID              0          0          0          -
II              0          0          0          2
DM              -          1          0          0
DD              -          0          0          -
DI              -          0          0          0
Fig. 6.3(a): The state transitions inferred from the above topology are used to compute the frequency of transitions
for the various states in the model and the state transition sequences in Fig. 6.2, where ‘-’ means no state (e.g., D0 and
D4). These frequencies are subsequently used to compute the transition probabilities shown on the model above.
The Laplace rule is used to avoid zero probabilities (‘-’ entries are not considered).

Fig. 6.3(b): The state transitions inferred from the above topology are used to compute the frequency of transitions
for the various states in the model and the state transition sequences in Fig. 6.2, where ‘-’ means no state (e.g., D0 and
D4). These frequencies are subsequently used to compute the transition probabilities shown on the model above.
The Laplace rule is used to avoid zero probabilities (‘-’ entries are not considered).
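
The transition frequencies of Fig. 6.3(a) can be recomputed from the state paths of Fig. 6.2(b) with a short sketch. The helper names are hypothetical, and the smoothing shown (add one to every allowed transition out of a state) is one reading of the Laplace rule described in the caption.

```python
from collections import Counter
from fractions import Fraction

# State paths of the six sequences, as listed in Fig. 6.2(b).
PATHS = [
    "M0 D1 M2 M3 I3 I3 I3 M4",
    "M0 M1 M2 M3 M4",
    "M0 I0 M1 M2 M3 M4",
    "M0 M1 M2 M3 I3 M4",
    "M0 M1 M2 M3 I3 M4",
    "M0 M1 M2 M3 M4",
]

def transition_counts(paths):
    """Count every (source, destination) pair of consecutive states."""
    counts = Counter()
    for path in paths:
        states = path.split()
        counts.update(zip(states, states[1:]))
    return counts

def laplace_probabilities(counts, source, targets):
    """Add-one smoothing over the transitions that exist in the topology."""
    total = sum(counts[(source, t)] for t in targets) + len(targets)
    return {t: Fraction(counts[(source, t)] + 1, total) for t in targets}

counts = transition_counts(PATHS)
print(counts[("M0", "M1")], counts[("M0", "D1")], counts[("M0", "I0")])   # 4 1 1
print(laplace_probabilities(counts, "M0", ["M1", "D1", "I0"]))
print(laplace_probabilities(counts, "M3", ["M4", "I3"]))   # D4 does not exist
```

For M3 only two targets are passed in because D4 is not part of the topology, which corresponds to the ‘-’ entries being left out of the smoothing.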

Emission Probabilities: Having thus specified the state transition sequence, the emission
probability of each symbol α ∈ Σ (i.e., A, C, T, and G) is computed for each
match and insert state, k, in the model. The emission probability is computed using the formula in
Eq. 6.3. Thus, an emission probability is associated with each state, and specifies the probability of
emitting each of the symbols in Σ in the state k.

(3)
Using the above formulation, the emission probability for each state is computed as shown in Fig.
6.4.

Fig. 6.4(a): The state specific frequency of observation of a symbol is used for determining the probabilities of
emissions. Again, Laplace rule is used to avoid zero probabilities.

Fig. 6.4(b): The state specific frequency of observation of a symbol is used for determining the probabilities of
emissions. Again, Laplace rule is used to avoid zero probabilities.

For constructing the emission probabilities shown in Fig. 6.4(a), only the match and insert states
are considered. There are three match states (i.e., M1, M2, and M3) in positions (1, 2, and 3)
and two insert states (I0, I3). According to Fig. 6.1(b), for M1, “A” is observed 5 times in position
“1”, while the other nucleotides (C, T, G) are not observed in the same position. For M2, “A” and “G” are
observed once and 5 times, respectively, in position “2”. For M3, “A” and “C” are observed 2 and 4
times, respectively, in position “3”. For I0, “A” is observed once in position “0”. For I3, “A” and
“C” are observed once and twice, respectively, in position “4”, as I3 comes after M3.
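
The counts listed above can be turned into emission probabilities with a short sketch. The emission formula itself is not reproduced in this text, so the add-one (Laplace) form used here, (count + 1) / (total + |Σ|), is an assumption based on the figure captions.

```python
from fractions import Fraction

ALPHABET = "ACGT"

# Observation counts per state, as listed in the text.
COUNTS = {
    "M1": {"A": 5},
    "M2": {"A": 1, "G": 5},
    "M3": {"A": 2, "C": 4},
    "I0": {"A": 1},
    "I3": {"A": 1, "C": 2},
}

def emission_probabilities(counts):
    """Add-one smoothing: (count + 1) / (total count + alphabet size)."""
    total = sum(counts.get(s, 0) for s in ALPHABET) + len(ALPHABET)
    return {s: Fraction(counts.get(s, 0) + 1, total) for s in ALPHABET}

for state, counts in COUNTS.items():
    print(state, emission_probabilities(counts))
```

For example, M1 emits “A” with probability (5 + 1) / (5 + 4) = 2/3 under this assumption, and each unobserved nucleotide keeps a small non-zero probability of 1/9.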

From the standpoint of the parameters used to characterize an HMM, all states have an emission
probability vector associated with them. It may be easy to think of the emission matrix as an
N × (|Σ| + 1) matrix, corresponding to the extension of the base alphabet with the symbol φ used for
denoting a deletion. The probability of the emission of the symbol φ is set to 0 for the insert and
the match states. Correspondingly, the probability of emission of symbol φ is set to 1 for the delete
state, with all other probabilities being set to 0.
A similar process is followed for inducing the HMM for an aligned set of protein sequences
(i.e., 20 amino acids instead of 4 nucleotides). Consider the following sequence alignment
representing a protein motif.

6.4 Hidden Markov Models with MATLAB
A multiple sequence alignment must be generated as a prerequisite to constructing Hidden
Markov models. The MSA is then fed into the HMM generation function which analyzes the
alignment in a column wise fashion and comes up with the requisite number of match, insert, and
delete states that are maximally probable from the observations of the multiple sequence alignment.
So, the process is begun by generating a MSA as shown below.

Exercises

1- Assume that the background sequence is given as shown below:
CCTTA ATTAC CAAGG CATTA CCGAT
a. Construct an IID model for the sequence and compute the probability of finding the
pattern CAAT.
b. Compute the probability of finding CAAT under a first order Markov model. You
only need to compute the conditional probabilities that you require for this purpose.

2- Given the sequence:


ATATTATGCCGTATAACCGGTT
Construct its IID model and a first order Markov chain model. Using these models, estimate
the probability of finding sequence, ATTA in the sequence.

3- Assume that the background sequence is given as shown below. Construct an IID model for
the sequence and compute the probability of finding the pattern ATTA.
ATTTT CTGGG ATATC CGGAG GATAT GGGAC CCTAG
4- What is the probability of finding the pattern ATTA in a sequence with an IID model
characterized by PA = PC = PG = PT = 1/4 (model with equal probabilities)?
5-

Also, construct a profile hidden Markov Chain model


6- Consider the following sequence set:
ATG
GTG
ATG
TTG
ATG
ATG
GTG
ATG
These sequences represent the translation start sites obtained as a result of multiple sequence
alignments of a family of genes. Show the profile matrix resulting from these sites. Also,
develop the position scoring weight matrix.

7- What is the probability of finding a pattern ATTACG in a DNA sequence where all bases (i.e.,
A, C, T, and G) are equally likely? How many occurrences of this pattern do you expect to see
in a sequence that is 10^8 base pairs in length?
Ans:
- PA = PC = PT = PG = 1/4 → P(ATTACG) = PA PT PT PA PC PG = (0.25)^6 = 0.000244.
- 0.000244 × 10^8 = 24,400
8- Assume that PA = PT = 0.2 and PG = PC = 0.3. Calculate the expected number of occurrences of the pattern
ACCTGACC in a sequence window 500 bp long.
Ans:
- Probability of one occurrence: P(ACCTGACC) = (0.2)^3 × (0.3)^5 = 0.00001944.
Therefore, the expected number of occurrences of this pattern = 0.00001944 ×
500 = 0.00972 = λ.
9- MCQ
- For a sequence of length 99 and 3-mer word, what is the number of words in this
sequence?
A- 99 B- 3 C- 33 D- 97 (D)
- What is the possible number of 3-mer words obtained in case of nucleotide sequences?
A- 4 B- 3 C- 64 D- 16 (C)
- How does increasing the mer length affect the computational time?
A- exponentially B- linearly C- logarithmically D- polynomial (A)
- Suppose we have two sequences S1=”TATA” & S2=”TTAT” for constructing profile,
probability of nucleotide “T” for S1 with considering one-gram frequency and Laplace
is ………………..
A- 3/4 B- 1/2 C- 3/8 D- 1/4 (C).
- According to the previous statement (4), the “V” measure to calculate the divergence
between S1 and S2 equals …..………..
A- 1/4 B- 1/2 C- 1/8 D- 3/8 (C)
- How many occurrences of the pattern “ATTA” are expected in a DNA sequence of length 10^4
using IID modeling where all nucleotide bases are equally likely?
A- 39 B- 25 C- 65 D- zero (A)
- How many parameters are needed to construct a first-order Markov model for modeling
nucleotide sequences?
A- 8 B- 12 C- 16 D- 20 (D)
- How many parameters are needed to construct a position-dependent Markov model for
modeling nucleotide sequences aligned into five positions?
A- 25 B- 68 C- 21 D- N/A (B)
- Considering the sequence analysis, the conditional probability P(β|α) =
…………………
A- P(αβ) / P(α) B- P(βα) / P(α) C- P(αβ) / P(β) D- P(βα) / P(β) (A)

- Given the probabilities P(A)=5/24, P(C)=7/24, P(AC)=1/24, and P(CA)=2/24, and
considering sequence analysis, what is the probability of finding “C” at position
“i+1” given that “A” has been found at position “i”?
A- 1/7 B- 2/7 C- 2/5 D- 1/5 (D)
- Considering Weight Matrices for multi sequence modeling, what is the weight matrix
size needed for modeling ten DNA sequences aligned into 7 positions or columns?
A- 10×7 B- 4×7 C- 7×7 D- 4×10 (B)
- Considering the weight matrix for modeling multi-sequences, what does a score of zero
refer to?
A- Functional site B- Random site C- Functional or random site D- Non-existent site (C)
- Considering the weight matrix for modeling multi-sequences, what does a score > 0
refer to?
A- Functional site B- Random site C- Functional or random site D- Non-existent site (A)
- Considering the weight matrix for modeling multi-sequences, what does a score < 0
refer to?
A- Functional site B- Random site C- Functional or random site D- Non-existent site (B)
- The Maximum Likelihood Estimation procedure proposed by Baum-Welch
is…………HMM.
A- Generalized B- Restricted C- Free connected D- Fully connected (B)
- Considering profile-HMM for modeling sequences, the emission probabilities are
computed for each…………..state
A- match & delete B- match & insert C- delete & insert D- match & delete & insert (B)
- Sequence mining considers ………………
A- order B- adjacency C- order & adjacency D- no-repeat (A).
- The greater the degree of divergence, the higher the similarity (True, False)
- Laplace rule eliminates the occurrence of zero probabilities by adding a one to all the
frequencies (True, False)
- L-divergence measure is derived from the relative entropy or so-called K-divergence
(True, False)
- According to the first order Markov, one-mer is the minimum allowed length of a word
(True, False)

- Position dependent markov can be used for modeling multi-aligned sequences (True,
False)
- Independent identical distribution is a suitable algorithm for modeling multi-sequences
(True, False)
10- Topology of HMM is established using consensus sequence. According to the following
five aligned protein sequences, what is the length of the produced consensus sequence?
A- 8 B- 9 C- 6 D- 3 (D)

How many delete states in the constructed topology?


A- 8 B- 9 C- 6 D- 3 (D)
What is the transition probability from match state 1 to match state 2 considering Laplace?
A- 1/4 B- 1/5 C- 1/3 D- 1/2 (A)
What is the emission probability at match state 1 for the amino acid “V” considering
Laplace?
A- 3/5 B- 3/20 C- 3/25 D- 5/25 (D)
11- Considering a 3-state hidden Markov model (HMM) parameterized using the set {A, B,
π}, where “A” stands for the transition probabilities of three hidden states, “B” stands for
the emission probabilities for two observed states, and π is the stationary probabilities of the
three hidden states, what is the size of the “A” matrix?
A- 3×3 B- 2×2 C- 3×2 D- 1×3 (A)
what is the size of the “B” matrix?
A- 3×3 B- 2×2 C- 3×2 D- 1×3 (C)
what is the size of the “π” matrix?
A- 3×3 B- 2×2 C- 3×2 D- 1×3 (D)
