Comparative Sequence Analysis
Department of Life Sciences, SBASSE, LUMS
Genome to Gene
Heredity Unit
Latest on Genome Sequencing
• Human Genome Project (1990 – 2003)
Now!
Our Genome and Need for Comparative Genomics
• Number of bases: 3.2 billion bases
• Number of chromosomes: 23 pairs
• Percentage of genes: Only about 1-2% of the genome codes for proteins
• Protein-coding Gene Number: 20,000 - 25,000
• Average gene size: ~ 3000 bases & huge variation
• Largest known human gene consists of 2.4 million bases (dystrophin)
• Repetition: Almost 45-50% of the DNA is repetitive
• Similarity between individuals: Almost all (99.9%) nucleotide bases are exactly the same
in all people
Proteome to Protein
Genes: 30,000
Alternative Splicing: 2 - 3 per gene
3 x 30,000 = 90,000 proteins
Post translational modifications
10 x 90,000 = 900,000 proteins
Peng and Gygi, JMS 2001
Asa Wheelock
Need for Comparative Proteomics
• Number of reported proteins: 150 million and counting
Benefits of Comparative Genomics
• Comparison of whole genome sequences provides a highly detailed
view of how organisms are related to each other at the genetic level
• Comparative genomics also provides a powerful tool for studying
evolutionary changes among organisms
• Helps to identify genes that are conserved or common among species,
as well as genes that give each organism its unique characteristics
Fly vs. Humans
Comparison of the fruit fly genome with the human genome:
• about 75% of genes are conserved
• the two organisms appear to share a core set of genes
• two-thirds of human genes known to be involved in cancer have
counterparts in the fruit fly
Evolutionary Relationship
COV2
http://bacterialphylogeny.info/overview.html
What have we done and what’s next?
DONE: Gene and Protein Sequences
• GenBank (DNA Sequences)
• UniProt (Protein Sequences)
• GeneMark (Gene Prediction)
NEXT: Sequence & Structure Analysis
• BLAST (nucleotide, protein)
• PDB
• I-TASSER
From Sequences to Comparisons
• Problem: If we sequence a new gene or protein, can we compare it
with the existing information in GenBank or UniProt?
• Idea: Compare NOVEL sequences with KNOWN (previously
characterized) genes or proteins.
• Benefit: STRUCTURAL, FUNCTIONAL and EVOLUTIONARY
information can be inferred from WELL DESIGNED comparisons.
• The most common tool used is called BLAST.
BLAST?
• Basic Local Alignment Search Tool
• A method for rapid searching of sequence databases, for both
nucleotides and proteins.
• The BLAST algorithm detects local alignments, which can cover a whole
sequence or only part of it, so regions of similarity embedded in otherwise
unrelated proteins can be found.
• Uses statistical theory to determine if a match might have occurred by
chance.
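The "statistical theory" bullet can be made concrete with the Karlin-Altschul formula, E = K * m * n * exp(-lambda * S), which estimates how many alignments with raw score S or better would be expected by chance. The sketch below is only an illustration: the function names are hypothetical, and the lambda and K values are commonly quoted figures for ungapped BLOSUM62 scoring, used here as assumptions rather than values taken from these slides.

```python
import math

def bit_score(raw_score, lam, K):
    """Normalize a raw alignment score so it is comparable across scoring systems."""
    return (lam * raw_score - math.log(K)) / math.log(2)

def e_value(raw_score, query_len, db_len, lam, K):
    """Expected number of chance alignments scoring at least raw_score."""
    return K * query_len * db_len * math.exp(-lam * raw_score)

# Assumed parameters (illustrative, ungapped BLOSUM62); real searches report their own.
LAMBDA, K_PARAM = 0.318, 0.134

# Hypothetical search: a 250-residue query against a 50-million-residue database.
for score in (40, 60, 80):
    print(score,
          round(bit_score(score, LAMBDA, K_PARAM), 1),
          e_value(score, 250, 50_000_000, LAMBDA, K_PARAM))
```

The smaller the E-value, the less likely the match arose by chance; doubling the database size doubles the E-value for the same score.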
https://blast.ncbi.nlm.nih.gov/Blast.cgi
BLAST - Workflow
1. BLAST avoids running full “Dynamic Programming” against every database sequence; it
applies it only to “promising” sequences.
2. Promising sequences are found by indexing short sub-strings so that perfectly matching
sub-strings can be looked up very quickly. BLAST uses a word lookup table for this; a
suffix tree is a related index structure that other tools use to find the longest
matching sub-string between two strings.
3. BLAST creates a list of all “words” (short subsequences) that score at least a “threshold”
T when compared with the words of the query sequence. Words are consecutive residues:
typically 3 amino acids for proteins, and longer for nucleotides (for example, 11 by
default for blastn and 28 for megablast).
4. A lookup hash table is made of the words present in the query sequence and their
“neighboring” words (rather than just random words).
5. When a BLAST search is run, candidate sequences from the database are picked based on
perfect matches to these small sub-sequences (words) derived from the query sequence.
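The seeding idea in the workflow above can be sketched with a plain dictionary used as the word lookup table. This is a minimal illustration, not BLAST's actual implementation: the function names and the toy database are made up, and here the database side is indexed, whereas a real implementation may index the query words instead.

```python
from collections import defaultdict

def build_word_index(database, k=3):
    """Map each k-letter word to the (sequence id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((seq_id, i))
    return index

def find_seeds(query, index, k=3):
    """Return (query offset, sequence id, database offset) for every exact word hit."""
    seeds = []
    for q in range(len(query) - k + 1):
        # .get() avoids inserting empty entries into the defaultdict
        for seq_id, d in index.get(query[q:q + k], []):
            seeds.append((q, seq_id, d))
    return seeds

# Toy, made-up database: only seq1 shares a word (PQG) with the query.
database = {"seq1": "METPQGIAV", "seq2": "AAAAAAA"}
index = build_word_index(database)
seeds = find_seeds("PQGELV", index)
print(seeds)
```

Only sequences that produce at least one seed go on to the more expensive alignment step, which is where the speed-up comes from.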
BLOSUM62 Match/Mismatch Matrix
• Here the word is PQG, and its neighboring words are all words with a score
of at least T = 13 (for three letters) as calculated by the given scoring
system (e.g., BLOSUM62).
• T is a user-provided threshold!
• PSG is a neighboring word (score 13); PQA is not (score 12).
Example BLAST search method
Query sequence: PQGELV
• Make a list of all possible k-mer words (length 3 for proteins)
• Assign scores from BLOSUM62 (each word scored against itself):
PQG (score 18)
QGE (score 16)
GEL (score 15)
ELV (score 13)
• Use those with score >= 13
• In total we get: PQG, QGE, GEL & ELV
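The word scores and the neighboring-word rule (threshold T = 13) can be checked with a short script. This is a sketch under stated assumptions: the BLOSUM62 values are typed in by hand in the residue order ARNDCQEGHILKMFPSTWYV, and the helper names (`word_score`, `neighborhood`, `query_words`) are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
# BLOSUM62 rows, one per residue, in the same order as AMINO_ACIDS.
_ROWS = """
 4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
-1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
-2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
-2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
 0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
-1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
-1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
 0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
-2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
-1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
-1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
-1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
-1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
-2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
-1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
 1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
 0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
-3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3
-2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
 0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
"""
BLOSUM62 = {}
for a, row in zip(AMINO_ACIDS, _ROWS.strip().splitlines()):
    for b, value in zip(AMINO_ACIDS, row.split()):
        BLOSUM62[a, b] = int(value)

def word_score(w1, w2):
    """Sum of BLOSUM62 scores over aligned letters of two equal-length words."""
    return sum(BLOSUM62[a, b] for a, b in zip(w1, w2))

def neighborhood(word, threshold):
    """All words of the same length scoring at least `threshold` against `word`."""
    hits = {}
    for letters in product(AMINO_ACIDS, repeat=len(word)):
        candidate = "".join(letters)
        score = word_score(word, candidate)
        if score >= threshold:
            hits[candidate] = score
    return hits

def query_words(query, k=3):
    """Overlapping k-letter words of the query, left to right."""
    return [query[i:i + k] for i in range(len(query) - k + 1)]

neighbors = neighborhood("PQG", threshold=13)
print(sorted(neighbors.items(), key=lambda kv: -kv[1]))
print([(w, word_score(w, w)) for w in query_words("PQGELV")])
```

With these matrix values, PSG scores exactly 13 against PQG and is kept, while PQA scores 12 and is dropped. Raising the threshold shrinks the neighborhood (faster, less sensitive); lowering it does the opposite.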
Example BLAST search method (continued)
• Make k-mers (word size 3) of all sequences in the database
• Store them in an index (a fast lookup structure, such as a hash table or suffix tree, for finding identical matches)
• Find all database sequences that have at least 2 matches among our 3 words
• PQG, GEL & PEG
• Find a database hit and extend the alignment (High-scoring Segment Pair, HSP):
Query:    M E T P Q G I A V
Database: - - - P Q G E L V
Scores:         8 5 5 2 0 8
• HSP: PQGI (score 8+5+5+2)
• If 2 HSPs in the query sequence are < 40 positions apart, run a full alignment on the query and hit sequences
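The "two hits less than 40 positions apart" rule can be sketched as follows. It is an illustration only: the function name is hypothetical, and requiring both hits to lie on the same diagonal (database offset minus query offset) is an assumption taken from the usual description of the two-hit method, not something stated on the slide.

```python
from collections import defaultdict

def two_hit_pairs(hits, word_size=3, window=40):
    """Find pairs of word hits that would trigger an extension.

    hits: list of (query_pos, db_pos) word hits against ONE database sequence.
    Returns (diagonal, first_query_pos, second_query_pos) for each pair of
    non-overlapping hits on the same diagonal that are < `window` apart.
    """
    by_diagonal = defaultdict(list)
    for q, d in hits:
        by_diagonal[d - q].append(q)      # same diagonal = same relative shift

    pairs = []
    for diagonal, positions in by_diagonal.items():
        positions.sort()
        for first, second in zip(positions, positions[1:]):
            distance = second - first
            if word_size <= distance < window:
                pairs.append((diagonal, first, second))
    return pairs

# Made-up hits: two close hits on diagonal -3, one far hit, one on another diagonal.
hits = [(3, 0), (13, 10), (60, 57), (5, 30)]
print(two_hit_pairs(hits))
```

Requiring two nearby hits before extending means far fewer extensions are attempted than if every single word hit were extended.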
Advantages of BLAST
• The BLAST algorithm was designed to balance speed with
increased sensitivity for finding distant sequence relationships.
• Speed is achieved by:
1. Pre-indexing the database before the search
2. Parallel processing
3. Hash table that contains neighborhood words rather than just random words.
• BLAST emphasizes regions of local alignment to detect
relationships among sequences having isolated regions of
similarity between them.
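To show what "local alignment" means, here is a minimal dynamic-programming sketch in the style of Smith-Waterman, the exact method that BLAST's heuristics approximate. The scoring values (match +2, mismatch -1, gap -2) and the function name are illustrative assumptions, not BLAST defaults.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between strings a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)   # previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets an alignment restart anywhere: this makes it local.
            curr[j] = max(0, diagonal, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

# An isolated shared region (ACGT) inside otherwise unrelated sequences.
print(smith_waterman_score("GGGACGTCCC", "TTTACGTAAA"))
```

The flanking regions that do not match contribute nothing to the score, which is why isolated regions of similarity can still be detected.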
BLAST for Nucleotides and Proteins
• Nucleotides
• blastn
• Compares a nucleotide query sequence against a nucleotide sequence
database.
• Proteins
• blastp
• Compares an amino acid query sequence against a protein sequence
database.
Comparing an unknown nucleotide sequence with possible “protein” sequences!!
• blastx
> but what about the 6 possible reading frames?
• Compares a nucleotide query sequence translated in all six reading
frames against a protein sequence database.
• This option may be used to find potential translation products of
an unknown nucleotide sequence.
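The six reading frames are three offsets on the given strand plus three on its reverse complement. The sketch below shows one way to produce them; it assumes the standard genetic code, and the function names are hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Sequence of the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]

def translate(dna):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frame_translations(dna):
    """Return {'+1': ..., '+2': ..., '+3': ..., '-1': ..., '-2': ..., '-3': ...}."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

for name, protein in six_frame_translations("ATGGCCTAA").items():
    print(name, protein)
```

Each of the six translations is then compared against the protein database, so the correct frame does not need to be known in advance.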
How about the reverse of blastx?
• tblastn
• Compares a protein query sequence against a nucleotide
sequence database dynamically translated in all reading
frames.
Comparing all translated reading frames of a nucleotide sequence with all translated reading frames of a nucleotide DB
• tblastx
• Compares the six-frame translations of a nucleotide query
sequence against the six-frame translations of a nucleotide
sequence database.
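The five programs differ only in the type of query, the type of database, and which side is translated. A small lookup makes the choice explicit; the function and argument names below are hypothetical and exist only for illustration.

```python
# (query type, database type) -> program when nothing extra is translated
PROGRAMS = {
    ("nucleotide", "nucleotide"): "blastn",
    ("protein", "protein"): "blastp",
    ("nucleotide", "protein"): "blastx",    # query translated in six frames
    ("protein", "nucleotide"): "tblastn",   # database translated in six frames
}

def choose_blast_program(query_type, database_type, translate_both=False):
    """Pick a BLAST program name from the sequence types being compared."""
    if translate_both:
        if (query_type, database_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and a nucleotide database")
        return "tblastx"                    # both sides translated in six frames
    return PROGRAMS[(query_type, database_type)]

print(choose_blast_program("nucleotide", "protein"))
print(choose_blast_program("nucleotide", "nucleotide", translate_both=True))
```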
Getting started with BLAST
Getting started:
http://www.ncbi.nlm.nih.gov/
http://www.ncbi.nlm.nih.gov/BLAST/
and
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html
So what if we find the Alien Gene in GenBank?
• Homologs
• Features (including DNA and protein sequences) in species being compared that are similar
because they are ancestrally related
• Homologs can be either Orthologs or Paralogs
• Orthologs
• Homologous genes (or any DNA sequences) that separated because of a speciation event
• Derived from the same gene in the last common ancestor
• Paralogs
• Homologous genes that separated because of gene duplication events within the same species