Genome Analysis
Class I: LTR+retrotransposons
Retroposons (SINE, LINE)
Retrovirus-like elements with LTR
Recombination
[Figure: recombination produces two recombinant chromosomes; one recombinant chromosome is inherited in the next generation.]
By using microsatellite markers, you can label each chromosome and follow its segregation.
[Figure: pedigree with microsatellite marker alleles (1 to 8) flanking the disease allele (Gene *), showing the two recombinant chromosomes and the recombinant chromosome inherited in the next generation.]
Recessive pattern of inheritance
Left panel: AR hereditary disorder, no recombinations. Disease chromosome: red and green. Disease gene: located on the red and green chromosomes.
Right panel: AR hereditary disorder, 1 recombination (arrow). Disease chromosome: red and green. Disease gene: located on the short arm.
[Figure: pedigrees for the two recessive cases, with marker alleles and the disease allele (Gene *); the recombination is marked in the second pedigree.]
D1S1324 1/10
Gene */wt
D1S1546 8/12
AD trait
Linkage to chromosome 1
Disease haplotype 1,*,5
Two recombination events (red arrows)
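The idea of following a chromosome through its marker alleles can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the markers are listed in chromosomal order, both haplotypes of the parent are already phased, and the allele numbers and the function name `crossover_intervals` are hypothetical, not taken from the pedigree on the slide.

```python
def crossover_intervals(child, hap_a, hap_b):
    """Return (i, j) marker index pairs between which the transmitted
    haplotype switches from one parental chromosome to the other."""
    crossovers = []
    last_origin = None  # parental chromosome seen at the last informative marker
    last_index = None
    for i, (c, a, b) in enumerate(zip(child, hap_a, hap_b)):
        if a == b or c not in (a, b):
            continue  # uninformative marker, or an inconsistent genotype
        origin = "A" if c == a else "B"
        if last_origin is not None and origin != last_origin:
            crossovers.append((last_index, i))
        last_origin, last_index = origin, i
    return crossovers


# Hypothetical alleles at three ordered markers on the two chromosomes of one parent.
parent_hap_a = [1, 5, 2]
parent_hap_b = [3, 7, 4]

print(crossover_intervals([1, 5, 2], parent_hap_a, parent_hap_b))  # non-recombinant
print(crossover_intervals([1, 5, 4], parent_hap_a, parent_hap_b))  # one crossover
```

Each reported pair brackets the interval in which a recombination event must have occurred; the disease gene is then confined to the segment that all affected individuals share.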
Genome analysis: genotyping and
haplotype analysis
Microsatellite markers flanking KCNQ4
Genome analysis: List of markers
Genome analysis: type of sequences
Sequence repeats:
Class I: LTR+retrotransposons
Retroposons (SINE, LINE)
Retrovirus-like elements with LTR
The output of the program is a detailed annotation of the repeats that are
present in the query sequence as well as a modified version of the query
sequence in which all the annotated repeats have been masked (default:
replaced by Ns).
Input format:
>Sequence1
ACGTGCGCGATCGCCTGCTAGGCGTACGTCGCAG
GCGATCGATGTGCTAGATCAGATGACA
>Sequence2
GGGCTAGATTAGCACCACATACATCGCTCA
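As an illustration of the input format and of what "masked" means, the following Python sketch reads FASTA text and replaces given regions with Ns. It is not RepeatMasker's code: the function names `parse_fasta` and `mask` and the masked interval are hypothetical, and the coordinates are assumed to be 1-based and inclusive.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping sequence name to sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def mask(seq, intervals):
    """Replace each 1-based, inclusive (start, end) interval with Ns."""
    chars = list(seq)
    for start, end in intervals:
        for i in range(start - 1, min(end, len(chars))):
            chars[i] = "N"
    return "".join(chars)


fasta_text = """>Sequence1
ACGTGCGCGATCGCCTGCTAGGCGTACGTCGCAG
GCGATCGATGTGCTAGATCAGATGACA
>Sequence2
GGGCTAGATTAGCACCACATACATCGCTCA
"""

sequences = parse_fasta(fasta_text)
# Hypothetical repeat annotation: positions 5 to 12 of Sequence2.
print(mask(sequences["Sequence2"], [(5, 12)]))
```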
Searching sequence repeats
RepeatMasker: http://www.repeatmasker.org/
Size limitations
In principle, there is no limit to the length of the query sequence or size of the
batch file.
However, the most common error message that users receive is caused by the
connection timing out while long sequences are being submitted.
The program returns three or four output files for each query.
4. Optionally a file with alignments of the query with the matching repeats
will be returned as well.
In the "html" return format (default when the browser runs on PC) all output is
returned to your screen in one file.
In the "tar file" return format the masked sequence(s) and alignments can be
saved as compressed files.
The "links" return format returns links to these output files in a text format
(they look bad on the browser, but are fine when saved to your computer).
Options
1. Show alignments
When checked, alignments are returned in a file (ending in .aln) or to
the screen. Alignments are shown in order of appearance in the query
sequence.
With the option "Do not mask simple..." only interspersed repeats are
masked. This may, for example, be preferred when the masked
sequence will be fed to a gene prediction program.
Alternatively, with the option "Only mask simple...", one can mask only
these low complexity regions, e.g. when you only want to quickly locate
polymorphic simple repeats in a sequence.
5. Fixed-width columns
The column widths in the annotation table are adjusted to the
maximum length of any string occurring in a column; this allows long
sequence names to be spelled out completely.
DNA source
About half of the repeats identified in human DNA are specific to primates,
i.e. they amplified after the eutherian radiation some 100 million years ago.
Most repeats that can be identified in mouse DNA are specific to rodents,
due to higher activity and faster mutation rates in the rodent lineage.
Note that the majority of sequences against which rodent and especially
other mammalian queries are compared are repeats identified in the human
genome and thought to predate the mammalian radiation.
Speed and sensitivity
For longer sequences the required time increases roughly linearly with
the sequence length. Sequences shorter than 10 kb are analyzed
disproportionately faster.
The "slow" setting will take about 3 times longer and will find and mask
0-5% more repetitive DNA sequences than the default setting.
If there are any false matches, they tend to have scores close to the cutoff:
180 for the very old MIR, LINE2 and MER5 sequences.
Use in database searches
Many people mask repeats before designing primers or oligo probes from
sequence data.
The alignments can help in designing primers from sequences that are
completely masked. Regions that diverge strongly from the repeat consensus
are less likely to misbehave than others.
How to read the results
The annotation table lists all best matches (above a set minimum score)
between the query sequence and any of the sequences in the repeat database
or with low complexity DNA.
The term "best matches" reflects that a match is not shown if its domain is
over 80% contained within the domain of a higher scoring match, where the
"domain" of a match is the region in the query sequence that is defined by
the alignment start and stop.
These domains have been masked in the returned masked sequence file.
In the output, matches are ordered by query name, and for each query by
position of the start of the alignment.
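The "best matches" rule can be illustrated with a short Python sketch. It is a simplified reading of the description above, not RepeatMasker's own implementation: matches are hypothetical `(score, start, end)` tuples with 1-based inclusive coordinates, and a match is compared only against higher-scoring matches that were themselves kept.

```python
def best_matches(matches, max_containment=0.8):
    """Drop a match if more than 80% of its domain lies within the domain
    of a higher-scoring match; return the rest ordered by start position."""
    kept = []
    # Visit matches from highest to lowest score.
    for score, start, end in sorted(matches, key=lambda m: -m[0]):
        length = end - start + 1
        contained = False
        for _, k_start, k_end in kept:
            overlap = min(end, k_end) - max(start, k_start) + 1
            if overlap > max_containment * length:
                contained = True
                break
        if not contained:
            kept.append((score, start, end))
    return sorted(kept, key=lambda m: m[1])


# Hypothetical matches on one query sequence.
matches = [(1306, 100, 400), (300, 150, 350), (250, 380, 600)]
print(best_matches(matches))
```

In this example the second match lies entirely inside the first and is dropped, while the third overlaps the first by only 21 bases and is kept.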
Example:
The simple repeat is derived from the poly A of the Alu element.
The first line is interpreted like this:
1306 15.6 6.2 0.0 HSU08988 6563 6781 (22462) C MER7A DNA/MER2_type (0) 336 103
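A line of the annotation table can be split into named fields with a few lines of Python. The sketch below assumes the column order of the example line (score, percent divergence, percent deletions, percent insertions, query name, query begin and end, bases left in the query, strand, repeat name, repeat class, and three repeat-position columns whose order depends on the strand); the field names and the function `parse_annotation_line` are my own labels, not part of the program's output.

```python
def parse_annotation_line(line):
    """Split one annotation line into a dict of named fields."""
    f = line.split()
    record = {
        "sw_score": int(f[0]),
        "perc_div": float(f[1]),
        "perc_del": float(f[2]),
        "perc_ins": float(f[3]),
        "query": f[4],
        "query_begin": int(f[5]),
        "query_end": int(f[6]),
        "query_left": int(f[7].strip("()")),
        "strand": "-" if f[8] == "C" else "+",
        "repeat": f[9],
        "class_family": f[10],
    }
    # Assumption: for complement (C) matches the bracketed "left" count
    # comes first; for forward matches it comes last.
    if f[8] == "C":
        left, pos1, pos2 = f[11], f[12], f[13]
    else:
        pos1, pos2, left = f[11], f[12], f[13]
    record["repeat_left"] = int(left.strip("()"))
    record["repeat_positions"] = (int(pos1), int(pos2))
    return record


line = "1306 15.6 6.2 0.0 HSU08988 6563 6781 (22462) C MER7A DNA/MER2_type (0) 336 103"
record = parse_annotation_line(line)
print(record["query"], record["query_begin"], record["query_end"], record["repeat"])
```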
2) Alignments
These alignments may be most generally useful for designing PCR primers in a
region full of repeats.
A primer can work in a whole genome when its 3' end lies in a region of a repeat
(even a common one) that is very different from the consensus.
Alignments are shown in the orientation of the query sequence unless the option
-inv is typed into the option box.
[Figure: alignment in which a deleted Alu is marked by one X.]
The position of the deleted Alu in the query is indicated with an "X".
The lines in the annotation table describing this match appear as:
3) The summary (.tbl) file
However, one may opt to skip the low-complexity masking, for example
when using RepeatMasker in conjunction with a gene prediction program.
1. By typing the name of a gene we can get access to all the disorders
associated with this gene.
TEXT
A number sign (#) is used with this entry because this form of autosomal
dominant progressive sensorineural hearing loss, DFNA20/26, is caused by
mutation in the gamma-actin gene (ACTG1; 102560) on chromosome
17q25.3.
Clinical features
Mapping
Molecular genetics
References
PolyPhen-2: prediction of functional effects of
human nonsynonymous SNPs
Sequence-based features
Structural features
The sequence identity threshold is set to 50%, since this value guarantees the
conservation of basic structural characteristics.
(2) chooses closely related sequences that may share similar function to the
query sequence
(4) calculates normalized probabilities for all possible substitutions from the
alignment.
Positions with normalized probabilities less than 0.05 are
predicted to be deleterious.
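The cutoff rule can be written as a one-line classification. The Python sketch below only applies the 0.05 threshold to normalized probabilities that are already available; it does not reproduce how SIFT computes them from the alignment, and the example values and the function name `classify_substitutions` are hypothetical.

```python
def classify_substitutions(normalized_probs, cutoff=0.05):
    """Label each amino acid at one position by its normalized probability."""
    return {
        aa: ("deleterious" if p < cutoff else "tolerated")
        for aa, p in normalized_probs.items()
    }


# Hypothetical normalized probabilities for three substitutions at one position.
position_probs = {"K": 1.00, "R": 0.30, "W": 0.01}
print(classify_substitutions(position_probs))
```

A value exactly equal to the cutoff is treated as tolerated here, because the rule reads "less than 0.05".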
http://sift.jcvi.org/www/SIFT_seq_submit2.html
If you get back too many results you can narrow it down by
specifying the organism.
Your protein should be first in the alignment. The length of the alignment should
correspond to the query protein and there should be no gaps in the query protein
sequence.
Since steps (1) through (3) are skipped in the SIFT procedure, you will get your
results SUPER-DUPER FAST and we encourage you to use this submission
form instead of the others.
Sift program
E) Submitting Substitutions
The format for a substitution is to have X#Y where X is the original amino acid, # is
the position of the substitution and Y is the new amino acid. One substitution per
line is allowed.
Example:
M1Y
K3S
T4P
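Before submitting, a substitution list in the X#Y format can be checked with a short Python sketch. The function name `parse_substitutions` is hypothetical, and restricting X and Y to the 20 standard one-letter amino acid codes is an assumption of this example, not a documented rule of the submission form.

```python
import re

# One substitution per line: original residue, position, new residue.
SUBSTITUTION = re.compile(r"^([ACDEFGHIKLMNPQRSTVWY])(\d+)([ACDEFGHIKLMNPQRSTVWY])$")


def parse_substitutions(text):
    """Return a list of (original, position, new) tuples, one per line."""
    parsed = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = SUBSTITUTION.match(line)
        if match is None:
            raise ValueError(f"not in X#Y format: {line!r}")
        parsed.append((match.group(1), int(match.group(2)), match.group(3)))
    return parsed


print(parse_substitutions("M1Y\nK3S\nT4P"))
```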
The first column indicates the variant submitted. If alleles are submitted with
respect to the - strand, they will be automatically converted to + strand.
Please note that if you do not submit the variant correctly, it will default to a
synonymous change.
One way to check: if the reference and non-reference alleles in the coordinates
column match, you most likely did not submit your variant correctly.
The second column denotes the codon that has been changed, the bases are
with respect to + mRNA orientation.
C, W, D, F, M, I, Y, V, G, P, S, H, N, A, L, T, E are predicted to be
deleterious because they have normalized probabilities < 0.05 and none of
these appear in the alignment (small letters).
Amino acids are color coded: nonpolar, uncharged polar, basic, acidic.
SIFT Sequence program
http://sift.jcvi.org/www/SIFT_seq_submit2.html
SIFT BLink program
http://sift.jcvi.org/www/SIFT_BLink_submit.html
Probabilities
Each row corresponds to a position in the reference protein.
Below each position is the fraction of sequences that contain one of
the basic amino acids.
A low fraction indicates the position is either severely gapped or
unalignable and has little information. Expect poor prediction at these
positions.