BioInfor Assignment

Bioinformatics assignment

Uploaded by

cath

NAME: STANSLAUS GEDE

STUDENT NUMBER: N02021956F

COURSE: BIOINFORMATICS

COURSE CODE: SIA 4101

LECTURER: MRS CHIVASA

1. The Damerau-Levenshtein distance is a string metric used in computational linguistics and
computer science that measures the minimum number of edit operations required to transform one
string into another. It extends the Levenshtein distance with one additional operation: the
transposition of two adjacent characters.

Operations in Damerau-Levenshtein Distance

1. Insertion: Adding a character (e.g., transform kitte to kitten by inserting n).

2. Deletion: Removing a character (e.g., transform kitten to kitte by deleting n).

3. Substitution: Replacing one character with another (e.g., transform kitten to sitten by
replacing k with s).

4. Transposition: Swapping two adjacent characters (e.g., transform abc to acb by swapping b
and c).

Use Cases

• Spell Checkers: To find the closest word to a misspelled input.

• Data Deduplication: Identifying records with similar text, even if there are typos or
transpositions.

• DNA Sequencing: Comparing DNA sequences that differ by small mutations, including swaps of
two adjacent bases.

• Natural Language Processing (NLP): For tasks such as autocorrect or fuzzy string matching.

Complexity

The algorithm typically runs in O(m × n) time, where m and n are the lengths of the two
strings. It uses a matrix to calculate the distance iteratively, filling each cell based on the
costs of the operations.
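
The matrix-filling procedure described above can be sketched in Python as follows. This is a minimal sketch of the restricted variant (often called optimal string alignment), which assumes that no substring is edited more than once; the function name osa_distance is an illustrative choice, not a library function.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    m, n = len(a), len(b)
    # d[i][j] holds the distance between the first i characters of a
    # and the first j characters of b.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all i characters
    for j in range(n + 1):
        d[0][j] = j  # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]


print(osa_distance("kitten", "sitten"))  # one substitution
print(osa_distance("abc", "acb"))        # one transposition
```

The two nested loops visit each of the (m + 1) × (n + 1) cells once and do constant work per cell, which is where the O(m × n) bound comes from.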

Example Calculation

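To transform ca into abc:

1. Transpose 'c' and 'a': ca → ac

2. Insert 'b' between 'a' and 'c': ac → abc

The Damerau-Levenshtein distance is 2 because no single operation is enough and these two
operations complete the transformation. (The restricted variant, known as optimal string
alignment, does not allow a transposed pair to be edited again and gives 3 for this pair.)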
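
The result for ca and abc can be checked with a short Python sketch of the unrestricted Damerau-Levenshtein distance. It follows the commonly described dynamic-programming formulation that remembers the last row in which each character was seen; the function and variable names are illustrative assumptions.

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Unrestricted Damerau-Levenshtein distance between a and b."""
    m, n = len(a), len(b)
    max_dist = m + n  # sentinel larger than any real distance
    last_row = {}     # last row (1-based) in which each character of a was seen

    # The matrix has one extra border row and column holding the sentinel.
    d = [[0] * (n + 2) for _ in range(m + 2)]
    d[0][0] = max_dist
    for i in range(m + 1):
        d[i + 1][0] = max_dist
        d[i + 1][1] = i
    for j in range(n + 1):
        d[0][j + 1] = max_dist
        d[1][j + 1] = j

    for i in range(1, m + 1):
        last_match_col = 0  # last column in this row where the characters matched
        for j in range(1, n + 1):
            k = last_row.get(b[j - 1], 0)
            l = last_match_col
            if a[i - 1] == b[j - 1]:
                cost = 0
                last_match_col = j
            else:
                cost = 1
            d[i + 1][j + 1] = min(
                d[i][j] + cost,   # substitution (or match)
                d[i + 1][j] + 1,  # insertion
                d[i][j + 1] + 1,  # deletion
                # transposition, plus the edits between the swapped characters
                d[k][l] + (i - k - 1) + 1 + (j - l - 1),
            )
        last_row[a[i - 1]] = i
    return d[m + 1][n + 1]


print(damerau_levenshtein("ca", "abc"))
```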

2. The Hamming distance is a metric that counts the number of positions at which the
corresponding elements of two equal-length strings (or sequences) differ. In bioinformatics, it is
commonly applied to analyze DNA, RNA, or protein sequences.

How It Works

1. Definition: The Hamming distance is applicable only to strings of the same length. It
calculates the total number of mismatches between two strings.

Formula:

H(s_1, s_2) = \sum_{i=1}^{n} \delta(s_1[i], s_2[i])

Where:

o n is the common length of the two strings.

o s_1[i] and s_2[i] are the characters at position i in strings s_1 and s_2, respectively.

o \delta(a, b) = 1 if a \neq b, and 0 otherwise.

2. Example: Comparing two DNA sequences:

o Sequence 1: GATTACA

o Sequence 2: GACTATA

o Differences occur at positions 3 and 6 (counting from 1). Therefore, the Hamming distance = 2.
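
The definition and the example above can be sketched in a few lines of Python. The function names hamming_distance and mismatch_positions are illustrative; positions are reported counting from 1, to match the index i in the formula.

```python
def hamming_distance(s1: str, s2: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance requires sequences of equal length")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))


def mismatch_positions(s1: str, s2: str) -> list[int]:
    """Return the 1-based positions at which the two sequences differ."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance requires sequences of equal length")
    return [i for i, (c1, c2) in enumerate(zip(s1, s2), start=1) if c1 != c2]


print(hamming_distance("GATTACA", "GACTATA"))
print(mismatch_positions("GATTACA", "GACTATA"))
```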


Applications in Bioinformatics

1. Sequence Alignment:

o Hamming distance provides a simple measure to compare sequences of equal length that are
already aligned position by position. It is useful for identifying point mutations, such as
single-nucleotide polymorphisms (SNPs), in DNA.

2. Phylogenetics:

o Hamming distances between sequences are used to infer evolutionary relationships by
constructing phylogenetic trees. Closely related species tend to have smaller Hamming
distances between their DNA or protein sequences.

3. Gene and Protein Matching:

o Hamming distance can be used to compare protein sequences to find functional or structural
similarities. For proteins, this is typically limited to direct substitutions and doesn't
account for insertions or deletions.

4. Error Detection and Correction:

o In sequencing technologies, Hamming distance helps identify errors in DNA reads. For
instance, if a sequence read from a machine differs from exactly one known reference by a
small Hamming distance, the read can usually be corrected to that reference.
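
The error-correction idea can be illustrated with a small Python sketch. It assumes a hypothetical list of known reference sequences (for example, sample barcodes) and corrects a read only when a single closest reference lies within a chosen threshold; the names correct_read and max_distance and the example sequences are made up for illustration.

```python
def hamming_distance(s1: str, s2: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance requires sequences of equal length")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))


def correct_read(read: str, references: list[str], max_distance: int = 1):
    """Return the unique closest reference within max_distance, or None."""
    # Only references of the same length can be compared.
    scored = [(hamming_distance(read, ref), ref)
              for ref in references if len(ref) == len(read)]
    if not scored:
        return None
    best = min(dist for dist, _ in scored)
    if best > max_distance:
        return None  # too many mismatches to correct safely
    winners = [ref for dist, ref in scored if dist == best]
    # An ambiguous read (two equally close references) is left uncorrected.
    return winners[0] if len(winners) == 1 else None


# Hypothetical reference sequences, for illustration only.
references = ["ACGTAC", "TTGCAA", "GGATCC"]
print(correct_read("ACGTAA", references))  # one mismatch from ACGTAC
print(correct_read("AAAAAA", references))  # no reference is close enough
```

Refusing to correct ambiguous reads is a design choice in this sketch: guessing between two equally close references would introduce errors rather than remove them.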

Limitations

• Length Requirement: The Hamming distance requires sequences of equal length, making it
unsuitable for comparing sequences with insertions or deletions.

• Context Blind: It does not account for the biological or evolutionary significance of
mismatches.

o Example: A substitution in a critical codon position may have more significant implications
than in a non-coding region, but the Hamming distance treats all substitutions equally.
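
The effect of insertions and deletions can be seen in a short Python sketch. The second sequence below is the first with its leading G removed and a G appended, so the lengths still match but every later base is shifted by one position. The sequences and function names are illustrative assumptions.

```python
def hamming_distance(s1: str, s2: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance requires sequences of equal length")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))


def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance allowing insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (or match)
            ))
        prev = curr
    return prev[-1]


original = "GATTACA"
shifted = "ATTACAG"  # leading G deleted, G appended at the end

print(hamming_distance(original, shifted))      # large: most positions are shifted
print(levenshtein_distance(original, shifted))  # small: one deletion, one insertion
```

A single shift makes the Hamming distance large even though the two sequences share almost all of their content.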

Comparison to Other Metrics

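• Levenshtein Distance: Unlike Hamming distance, Levenshtein distance accounts for insertions,
deletions, and substitutions, making it more flexible but more computationally expensive:
O(m × n) time for strings of lengths m and n, versus O(n) for Hamming distance.

• Jukes-Cantor Model: In evolutionary studies, this model builds on the Hamming distance by
converting the proportion of differing sites into an estimated number of substitutions per
site, correcting for multiple mutations at the same site over time.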
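
As a rough illustration of how the Jukes-Cantor model relates to the Hamming distance, the sketch below applies the standard Jukes-Cantor correction for nucleotides, d = -(3/4) ln(1 - (4/3) p), where p is the Hamming distance divided by the sequence length. The function name is an illustrative choice, and the sketch assumes aligned DNA sequences without gaps.

```python
import math


def jukes_cantor_distance(s1: str, s2: str) -> float:
    """Estimate substitutions per site from two aligned DNA sequences."""
    if len(s1) != len(s2) or not s1:
        raise ValueError("sequences must be non-empty and of equal length")
    mismatches = sum(c1 != c2 for c1, c2 in zip(s1, s2))  # Hamming distance
    p = mismatches / len(s1)                              # proportion of differing sites
    if p >= 0.75:
        # The correction is undefined once the sequences look random.
        raise ValueError("sequences are too divergent for the Jukes-Cantor correction")
    return -0.75 * math.log(1 - 4 * p / 3)


p = 2 / 7  # GATTACA vs GACTATA differ at 2 of 7 positions
d = jukes_cantor_distance("GATTACA", "GACTATA")
print(p, d)  # the corrected distance is larger than the raw proportion
```

The corrected distance is always at least as large as p, because some sites may have mutated more than once and those extra changes are invisible to a simple mismatch count.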

Conclusion
The Hamming distance is a fast and effective tool for comparing biological sequences of the same
length. While limited in its scope, it serves as a foundational measure in bioinformatics for tasks like
SNP detection and error correction in DNA sequencing.
