Module5 Session3
Introduction to Bioinformatics Online Course: IBT Genomics
Comparative Genomics

Session 3: Comparative Genomics
Part 1: (intro) Comparative Genomics: what is it?
Part 2: Genomic Variation / Comparative Genomics: WWWH?
Part 3: The input
Part 4: The methods
Part 5: The output

Session 1: Comparative Genomics
Navigate through genomic resources to:
(intro) Comparative Genomics: What is it?

WHY? WHEN? WHAT? HOW?

WHY? Why would a genome evolve?
→ Genomic plasticity allows an organism to:
- Adapt to environmental changes
- Find the best evolutionary path
- Acquire virulence genes, enhanced pathogenicity
- Resist drugs
- Increase the survival chances of members of a population
- …

WHEN? Factors/events
- Gene transfer
- Environmental pressure for selection
  - pH
  - temperature
  - host
  - pathogen
- …

WHAT? What could be affected?
- Overall genomic sequence (rearrangements)
- DNA structure
- Regulatory elements
- Gene size, number, function, density
- Nucleotide composition
- …

HOW? How could this happen?
- Large genetic structural variations (duplication, recombination…)
- Transposable elements (retrotransposons…)
- Evolution of multigene families
- Evolution of genes with novel functions
- Exon shuffling
- Tandem repeat modification
- …
http://www.mrschamberlain.com/
Introduction to Bioinformatics Online Course: IBT Genomics | Fatma Guerfali

Genomic variation: WWWH

How to measure the changes in a genome?
- Sequence variation (Comparative Genomics)
  - between 2 genomes (1 reference)
  - between several genomes
- Other variations (structure, folding…)

WHY? WHEN? WHAT? HOW?

Why would we compare genomes?
- Identify evolutionary history
- Highlight synteny
- Identify genomic rearrangements (large SV events…)
- Study convergent evolution for some organisms (e.g. viruses)
- Understand disease outbreaks
- Identify pathogenicity markers, drug targets
- …

Why would we compare "portions" of genomes?
- Comparing smaller portions of a genome makes it possible to zoom into regions of genomic rearrangement

When do we need to use comparative genomics?
→ To establish genetic and evolutionary relationships between:
- Entire organisms
- Sequences

What we generally compare are features of 1 or more genomes to features of another genome (the reference).
A genome is complex and composed of different elements (regulatory, structural…).
In fact, there are different types of DNA features that can be compared between 2 genomes:
- DNA sequences (small, large, coding/non-coding)
- Genes (nature, order…)
- Regulatory elements
- ...

Comparative Genomics: WWWH

Could be classified in:
- Genome structure
- Genome function (coding / non-coding)
- Genome evolution

Comparative genomics uses Sequence Alignment
Comparative genomics is based on Phylogeny that relies on several key issues:
- Several genomes are sequenced and available
- Homology between genes (shared ancestry, often with similar functions)
- …

Algorithms/programs
in vitro:
- Fluorescence In Situ Hybridization (FISH)
- Spectral Karyotyping (SKY) and Multiplex-FISH (M-FISH)
- Comparative Genomic Hybridization
DATABASES

Comparative Genomics: The input
http://www.ncbi.nlm.nih.gov/genome/browse/

THE INPUT
https://nar.oxfordjournals.org/
http://database.oxfordjournals.org/

THE INPUT: File formats
Whole sequence (or subsequence)
Accession
http://database.oxfordjournals.org/
http://blast.ncbi.nlm.nih.gov/

Part 4
Comparative Genomics: The methods

Example alignment with a GAP ("-") in each sequence:
    ATCTTCGT-TG
    ATCT-CGTATG

2 key approaches
- Global Alignment → Optimizes the alignment to span the full length of the sequences that are aligned.
    ATCATTCGTTGACTGTG
    A---TT-G-TGAC--TG
- Local Alignment → Optimizes the alignment to take into account the regions of highest similarity between divergent sequences.
    ATCATTCGTTGACTGTG
    ---ATT-G-TGACTG--

Pairwise Alignment
A Pairwise Alignment is an optimized local or global alignment of 2 sequences.
- 3 methods:
  - Dot-matrix
  - Dynamic programming
  - Word-based
NB: Efficiency can be reduced in low-complexity regions (repetitive sequences…)
Can be evaluated by the MUM (Maximal Unique Match) → Long MUM sequences = more related sequences

THE METHODS
Pairwise Alignment: Dot-matrix method
INSERTION INTO QUERY
R: AB
Q: AIB
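
The dot-matrix idea can be illustrated with a minimal sketch, assuming the simplest possible rule (one dot per identical pair of characters, no window or threshold filtering). The function names `dot_matrix` and `render` are hypothetical and chosen only for this illustration; the sequences are the R and Q of the slide.

```python
def dot_matrix(ref: str, query: str) -> list[list[bool]]:
    """Return a matrix with True wherever ref[i] == query[j]."""
    return [[r == q for q in query] for r in ref]


def render(ref: str, query: str) -> str:
    """Draw the dot matrix as text: rows follow ref, columns follow query."""
    lines = ["  " + query]
    for base, row in zip(ref, dot_matrix(ref, query)):
        lines.append(base + " " + "".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


print(render("AB", "AIB"))
# prints:
#   AIB
# A *..
# B ..*
```

The diagonal is broken between A and B: the dot for B is shifted one column to the right, which is how an insertion into the query shows up in a dot plot.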

Pairwise Alignment: Dynamic programming
Dynamic programming can:
- Use a scoring matrix
- Assign a match score (+), a mismatch score (-), and a gap penalty (-).
- Use two different gap penalties for opening a gap and for extending a gap (gap opening >>> gap extension) → generally results in fewer gaps in an alignment, and gaps are grouped together = more biological relevance.
Different algorithms for Global and Local Alignments
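
To make the effect of the two gap penalties concrete, here is a small sketch that scores an alignment that has already been built. The function name `score_alignment` and the numeric values (match +1, mismatch -1, gap opening -5, gap extension -1) are assumptions picked for illustration, not values taken from the course.

```python
def score_alignment(top: str, bottom: str, match: int = 1, mismatch: int = -1,
                    gap_open: int = -5, gap_extend: int = -1) -> int:
    """Score two equal-length aligned strings in which '-' marks a gap."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    gap_row = None  # row holding the current gap run: "top", "bottom" or None
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            row = "top" if a == "-" else "bottom"
            # First column of a gap run pays the opening penalty,
            # every further column of the same run pays the extension penalty.
            score += gap_extend if gap_row == row else gap_open
            gap_row = row
        else:
            score += match if a == b else mismatch
            gap_row = None
    return score


# One gap of length 2: 8 matches + opening + extension
print(score_alignment("ACGT--ACGT", "ACGTTTACGT"))    # 2
# Two separate gaps of length 1: 9 matches + 2 openings
print(score_alignment("ATCTTCGT-TG", "ATCT-CGTATG"))  # -1
```

With these assumed values, one grouped gap costs less than two scattered gaps, which is why an optimizer using this scheme tends to group gaps together.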
https://en.wikipedia.org/wiki/Sequence_alignment#cite_note-mount-1

Pairwise Alignment: Dynamic programming
- Global Alignment → Needleman–Wunsch algorithm.
- Local Alignment → Smith–Waterman algorithm.

Pairwise Alignment
Global Alignment → Needleman–Wunsch algorithm.
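
A compact sketch of Needleman–Wunsch global alignment follows. It assumes a single linear gap penalty (not the separate opening/extension penalties) and illustrative scores (match +1, mismatch -1, gap -2); the function name `needleman_wunsch` is chosen for this example only.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Global alignment with a linear gap penalty; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap  # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap  # b[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Traceback from the bottom-right corner back to the origin
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("AB", "AIB"))  # (0, 'A-B', 'AIB')
```

Every position of both sequences ends up in the alignment, which is the defining property of a global alignment.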

Pairwise Alignment
Local Alignment → Smith–Waterman algorithm.
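
A matching sketch of Smith–Waterman local alignment, again assuming a linear gap penalty and illustrative scores (match +2, mismatch -1, gap -2). The differences from the global case are that cell scores never drop below 0 and that the traceback starts at the best-scoring cell and stops at the first 0.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2):
    """Local alignment with a linear gap penalty; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 lets an alignment restart anywhere
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Traceback from the best cell until the score reaches 0
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("TTTACGTTT", "GGACGGG"))  # (6, 'ACG', 'ACG')
```

Only the shared region is reported; the divergent flanks of both sequences are left out of the alignment.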

Pairwise Alignment: Word-based method
- Optimal alignment not guaranteed, but efficient and faster than dynamic programming
- Useful for database searches
- "Words" are small portions (length k) of the query sequence that are used to screen the database → Ex: BLAST
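
The word-screening step can be sketched as follows, assuming exact word matches only (real tools such as BLAST add refinements that are not shown here). The names `words` and `find_seeds`, the toy database and the word length k = 4 are all hypothetical.

```python
from collections import defaultdict


def words(seq: str, k: int) -> list[tuple[int, str]]:
    """All overlapping words of length k in seq, with their start positions."""
    return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]


def find_seeds(query: str, database: dict[str, str], k: int = 4) -> list[tuple[str, int, int]]:
    """Return (sequence name, query position, database position) for each exact word match."""
    index = defaultdict(list)  # word -> [(sequence name, position in that sequence)]
    for name, seq in database.items():
        for pos, word in words(seq, k):
            index[word].append((name, pos))
    seeds = []
    for qpos, word in words(query, k):
        for name, dpos in index.get(word, []):
            seeds.append((name, qpos, dpos))
    return seeds


toy_db = {"s1": "GGGATTCGGG", "s2": "CCCCCCCC"}
print(find_seeds("ATTC", toy_db))  # [('s1', 0, 3)]
```

Building the index once and looking words up in it is what makes this approach faster than running dynamic programming against every database sequence.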

Pairwise Alignment: Word-based method - BLAST (Basic Local Alignment Search Tool)
- Algorithm to compare a query sequence to a library or database of sequences
- Makes it possible to estimate identity with a certain confidence threshold
- Popular in the scientific community (time efficiency…)

Pairwise Alignment: Word-based method - BLAST
► The best word match is followed by an extension in both directions, with scoring
► The extension continues only while the alignment score stays above the threshold
► The contiguous alignment with the highest score, originally without gaps (gapped extension is now possible), is the HSP (High-scoring Segment Pair)

QUERY     ATCATTCGTTGACTGTG
words     ATCATTCGTTG
             ATTCGTTGACT
               TCGTTGACTGT
                CGTTGACTGTG
DATABASE  ATCATTCGTTGACTGTG
EXTENSION (in both directions)
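
The extension step can be sketched as an ungapped extension of a word hit. This is a simplified illustration, not BLAST's actual implementation: the names `extend_one_way` and `extend_seed`, the scores (match +1, mismatch -2) and the stopping rule (`drop`, the maximum allowed fall below the best score seen so far) are assumptions.

```python
def extend_one_way(query, subject, qi, si, step, match, mismatch, drop):
    """Walk in one direction; return (best score gain, number of columns kept)."""
    gain = best = 0
    length = kept = 0
    while 0 <= qi < len(query) and 0 <= si < len(subject):
        gain += match if query[qi] == subject[si] else mismatch
        length += 1
        if gain > best:
            best, kept = gain, length   # remember the best extension so far
        elif best - gain > drop:
            break                       # score fell too far: stop extending
        qi += step
        si += step
    return best, kept


def extend_seed(query, subject, qpos, spos, k, match=1, mismatch=-2, drop=3):
    """Extend an exact word match of length k without gaps.

    Returns (score, query_start, query_end) with query_end exclusive.
    """
    right, r_len = extend_one_way(query, subject, qpos + k, spos + k, +1, match, mismatch, drop)
    left, l_len = extend_one_way(query, subject, qpos - 1, spos - 1, -1, match, mismatch, drop)
    return k * match + left + right, qpos - l_len, qpos + k + r_len


seq = "ATCATTCGTTGACTGTG"
print(extend_seed(seq, seq, qpos=3, spos=3, k=4))                      # (17, 0, 17)
print(extend_seed("AAAACGTAAAA", "TTTACGTTTTT", qpos=3, spos=3, k=4))  # (4, 3, 7)
```

In the first call the word hit grows to cover the whole sequence; in the second the flanks do not match, so the segment stays limited to the original word.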

Pairwise Alignment: Word-based method - BLAST
► A scoring matrix is used to evaluate the quality of the alignment
► A scoring matrix is a predefined substitution matrix (match = 1, mismatch = 0…)
► ex: BLOSUM

Pairwise Alignment: Word-based method - BLAST output
A list of sequences that have the best match to the query

Pairwise Alignment: Word-based method - BLAST output
e-value: expected number of alignments with a score at least this good that would be found by chance in a database of this size (the lower the e-value, the more interesting the match)
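
The dependence of the e-value on score and search-space size can be sketched with the standard relation E = m × n × 2^(-S'), where S' is the bit score, m the query length and n the total database length. The function name `expected_hits` is hypothetical, and the sketch ignores the length corrections that real BLAST applies, so its numbers will not exactly match BLAST output.

```python
def expected_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """E = m * n * 2^(-S'), using raw (uncorrected) lengths as a simplification."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Same bit score, larger database: more chance hits are expected
print(expected_hits(30, 100, 1_000_000))
print(expected_hits(30, 100, 100_000_000))
# Higher bit score, same search space: fewer chance hits are expected
print(expected_hits(60, 100, 1_000_000))
```

Because the e-value is an expected count rather than a probability, it can be greater than 1 for weak matches in large databases.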

Pairwise Alignment: Word-based method - BLAST output
Alignment details: sequences (query and database) aligned, with % identity…
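
Percent identity can be computed from an aligned pair as sketched below, assuming the common convention of dividing the number of identical columns by the alignment length (gap columns included). The function name `percent_identity` is chosen for this illustration.

```python
def percent_identity(aligned_query: str, aligned_subject: str) -> float:
    """Identical columns divided by alignment length, as a percentage."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_query:
        raise ValueError("alignment is empty")
    identical = sum(1 for q, s in zip(aligned_query, aligned_subject)
                    if q == s and q != "-")
    return 100.0 * identical / len(aligned_query)


# 9 identical columns out of 11
print(round(percent_identity("ATCTTCGT-TG", "ATCT-CGTATG"), 1))  # 81.8
```

Other denominators are in use (for example the length of the shorter sequence), so the convention should always be stated next to a reported value.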

Pairwise Alignment: Word-based method - BLAST
BLAST has different variants according to the type of query sequence (Q) and the type of sequence in the database (R):

Program   Q                            R
BLASTN    Nucleic Acid             →   Nucleic Acid
BLASTX    Translated Nucleic Acid  →   Protein
TBLASTX   Translated Nucleic Acid  →   Translated Nucleic Acid
TBLASTN   Protein                  →   Translated Nucleic Acid
BLASTP    Protein                  →   Protein

→ MSA
SIMILARITY
http://www.cbs.dtu.dk

Comparative Genomics: The output
- Genomic Structure
- Genomic Function
- Genomic Evolution
http://www.cbs.dtu.dk

THE OUTPUT

Genomic Evolution
Comparative Genomics and Genome Evolution in easy words
- COMPARING CLOSELY RELATED SPECIES
- COMPARING EVOLUTIONARILY DISTANT SPECIES
http://www.beller.no
http://www.notcot.org/

Genomic Evolution
→ Ebolavirus genomes are very similar but differ in intergenic regions and in genes of specific function = potential vaccine candidates. (Jun et al., 2015)

Genomic Structure
Whole Genome Comparison of (A) all strains, (B) Enterobacter, (C) Erwinia, (D) Pantoea

Genomic Evolution: Phylogenetic relationships
"relationship of the Malassezia genus with respect to other fungi with sequenced genomes."
G: gene family gain; L: gene family loss

Genomic Evolution: Synteny
► Defined as the overall conservation of (gene/block) order in chromosomes between different genomes.
► Evaluated in whole genomes; blocks can include large portions of genomes.
► Recombination / crossing over affects groups of adjacent genes in a chromosome → linkage group.
http://www.cbs.dtu.dk
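
Conservation of gene order can be illustrated with a toy sketch that finds runs of genes that are adjacent and in the same order in two genomes. It assumes unique gene names shared between the genomes and ignores strand and inversions; the function name `collinear_blocks` and the gene labels are hypothetical.

```python
def collinear_blocks(order_a: list[str], order_b: list[str], min_len: int = 2) -> list[list[str]]:
    """Runs of genes that are adjacent and in the same order in both genomes."""
    pos_b = {gene: i for i, gene in enumerate(order_b)}
    blocks, current = [], []
    for gene in order_a:
        extends = (gene in pos_b and current
                   and pos_b[gene] == pos_b[current[-1]] + 1)
        if extends:
            current.append(gene)
            continue
        # The run is broken: keep it if it is long enough, then start over
        if len(current) >= min_len:
            blocks.append(current)
        current = [gene] if gene in pos_b else []
    if len(current) >= min_len:
        blocks.append(current)
    return blocks


genome_a = ["g1", "g2", "g3", "g4", "g5", "g6"]
genome_b = ["g4", "g5", "g6", "g1", "g2", "g3"]  # same genes, two blocks swapped
print(collinear_blocks(genome_a, genome_b))
# [['g1', 'g2', 'g3'], ['g4', 'g5', 'g6']]
```

Here a single rearrangement splits the genome into two conserved blocks; gene order is kept inside each block even though the blocks have changed places.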

Genomic Evolution: Homology
2 genes are homologs if they …
ORTHOLOGS

Functional/Structural Predictions
► Impact of a mutation on the function
Analysis of the impact of aa substitution → Structural and/or Functional effect of single point mutations (SNPs)
- PolyPhen-2 (http://genetics.bwh.harvard.edu/pph2/…)
- SIFT (http://sift.jcvi.org)
- VEP (http://www.ensembl.org/Homo_sapiens/Tools/VEP)
► Impact of the mutation on the Structure
Impact on gene coding portions (gain/loss) or non-coding portions

Browsers
VISTA (http://genome.lbl.gov/vista/index.shtml)
Collection of resources for comparative genomics
VISTA browsers can be used to analyze pre-computed alignments or user-generated or queried sequences
VISTA servers
- mVISTA (query sequences vs multi-species sequences)
- rVISTA (identification of regulatory TF binding sites)
- gVISTA (query sequences vs whole-genome assemblies)
- wgVISTA (alignment of 10Mb sequences (finished/draft): microbes…)
…

Browsers
Ensembl Browser (http://www.ensembl.org)
► Comparative analyses at the genome and gene levels
► Genome sequences compared using pairwise and multiple whole-genome alignments
► These alignments help to determine
- Synteny
- Sequence conservation scores
- Gene homology relationships (GeneTrees)

Input / Output
► DNA Sequences (genome, gene…)
► Homology, similarity, evolutionary distance
Alignment
► Whole genome: MUMmer…
► Multiple genomes: MGA…
► Multiple Sequence Alignment: Clustal…
► Global/Local Sequence Alignment: BLAST…
Input / Output files
► Fasta/GenBank to alignment or phylogenetic distances
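
As an illustration of the FASTA input format, here is a minimal reader, assuming plain multi-record FASTA text in which each record starts with a ">" header line and the first word of the header is used as the identifier. The function name `parse_fasta` and the sample records are hypothetical; for real work an established parser is preferable.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Map each record identifier to its sequence, joining wrapped sequence lines."""
    records: dict[str, list[str]] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


sample = ">seq1 example record\nATCA\nTTCG\n>seq2\nGGG\n"
print(parse_fasta(sample))  # {'seq1': 'ATCATTCG', 'seq2': 'GGG'}
```

The resulting dictionary can be passed directly to alignment code that expects plain sequence strings.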