Genome Analysis

This document discusses genome analysis techniques. It describes eukaryotic genome features like large size and presence of introns. It also covers concepts like inheritance patterns, genotypes, haplotypes, and different types of genetic maps. The document discusses genome sequence variations like SNPs, CNVs, and repeats that can be useful for finding genetic causes of diseases. It provides details on different types of sequence repeats and how they can explain chromosomal rearrangements linked to pathology.

Bioinformatics

UNIT 4: Genome analysis


Genome analysis

Features of eukaryotic genomes:

1. A visible nucleus with visible chromosomes.
2. They are commonly large.
3. They contain tandem repeats of sequences.
4. Their genes contain introns.
Concepts
Recombination: events that generate genetic diversity (during meiosis).
Inheritance patterns:
AD (autosomal dominant): one mutant allele is enough to cause the disorder.
AR (autosomal recessive): two mutant alleles are required.
X-linked: expression depends on sex; X-linked dominant (X-AD) and X-linked recessive (X-AR).
Mitochondrial: maternal inheritance. Both sexes can be affected, but only mothers transmit the trait.
Genotype: list of the different alleles specific to a given subject.
Haplotype: combination of markers that are inherited together.
Genome analysis: genetic map
Genome analysis: cytogenetic map
Genome analysis: sequence variations
Genome analysis: type of sequences

Genome sequences of individuals vary

1. Sequence variations can be found in humans and other species.
2. These variations are classified as single nucleotide polymorphisms (SNPs), copy number variations (CNVs) and sequence repeats.
3. These variations can help to find the genetic cause of a given hereditary disease when they are combined with haplotype information and linkage analysis.
Genome analysis: SNPs variations
Genome analysis: Haplotypes
Genome analysis: type of sequences
Sequence repeats:

1. Sequence repeats and satellite DNA:

Satellites, minisatellites and microsatellites.

2. Repeats due to transposable elements (TEs), which move from one chromosome to another.

Class I: LTR retrotransposons and retroposons
Retroposons (SINE, LINE)
Retrovirus-like elements with LTRs

Class II: DNA-based mechanisms for transposons.

Class III: Miniature inverted-repeat TEs (MITEs).


Sequence repeats and satellite DNA

Satellites are composed of repeats of one thousand to several thousand bases in very large tandem arrays up to 100 million bases long, and are typically found near centromeres and telomeres.

Minisatellites are made up of repeats approximately 15 bases in length, in arrays that are highly variable in length, from a few hundred bases up to thousands of kilobases. They are also known as variable number tandem repeats (VNTRs).

Microsatellites are made up of short repeats 2-6 bases long, in arrays that are highly variable in length, from 10 to 100 bases. The length is inherited from one generation to the next, which makes them very useful for genetic and evolutionary analyses of genomes. They are also known as simple sequence repeats (SSRs) and short tandem repeats (STRs).
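As an illustration of the microsatellite definition above, the following minimal sketch scans a DNA string for tandem arrays of a 2-6 base unit. The function name and the `min_copies` threshold are assumptions made for this example; dedicated tools use more careful scoring.

```python
import re


def find_microsatellites(seq, min_copies=3):
    """Return (start, unit, copies) for tandem arrays of a 2-6 base unit.

    `min_copies` is an illustrative threshold, not a standard value.
    Note that a homopolymer run such as AAAAAA is reported with unit "AA".
    """
    # Group 1 is the repeat unit; \1{n,} requires n further tandem copies.
    pattern = re.compile(r"([ACGT]{2,6}?)\1{%d,}" % (min_copies - 1))
    hits = []
    for m in pattern.finditer(seq.upper()):
        unit = m.group(1)
        copies = len(m.group(0)) // len(unit)
        hits.append((m.start(), unit, copies))
    return hits


print(find_microsatellites("TTGACACACACACAGGT"))  # [(3, 'AC', 5)]
```

The start position is 0-based. Because the array length differs between alleles, the `copies` value is what a genotyping assay would effectively measure.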
Summary
High resolution maps: HRMs
Use of HRMs to identify disease genes
First disease gene identified: positional cloning of the cystic fibrosis (CF) gene
Genotyping and haplotype analysis

Formal naming of microsatellite markers:

The name starts with a D (DNA), followed by the chromosome number, then an S (site) and the number that identifies the marker.
Example: D1S1450, D3S2089, D5S79.
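The naming rule can be checked mechanically. This is a small sketch, assuming marker names of exactly the form described above; accepting X and Y as chromosome labels is an extra assumption that the slide does not state.

```python
import re

# D + chromosome + S + site number. Allowing X and Y is an assumption.
MARKER_RE = re.compile(r"^D(\d{1,2}|X|Y)S(\d+)$")


def parse_marker(name):
    """Split a marker name such as 'D1S1450' into chromosome and site number."""
    m = MARKER_RE.match(name)
    if m is None:
        raise ValueError(f"not a D-number marker name: {name!r}")
    return {"chromosome": m.group(1), "site": int(m.group(2))}


for marker in ("D1S1450", "D3S2089", "D5S79"):
    print(marker, parse_marker(marker))
```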

For linkage analysis, select highly heterozygous microsatellites (those with a high number of different alleles) close to the candidate gene, so that genotyping gives informative results when the segregation analysis is performed.

The combination of alleles from different microsatellites on a given chromosome defines one haplotype. This allows each chromosome to be identified in linkage analysis.

When linkage exists between the different markers and the disease, the disease haplotype is inherited ONLY by the affected subjects in the family.
Dominant pattern of inheritance

[Figure: two pedigrees for an AD hereditary disorder. Left panel: no recombinations; the disease chromosome is red and carries the disease gene. Right panel: 1 recombination (arrow); the disease chromosome is red and the disease gene is placed in the short arm. The recombination produces two recombinant chromosomes, one of which is inherited in the next generation.]

By using microsatellite markers you can label each chromosome to follow its segregation.

[Figure: the same two pedigrees with microsatellite alleles (numbered 1 to 8) and the disease gene (*) marked on each chromosome.]
Recessive pattern of inheritance

[Figure: two pedigrees for an AR hereditary disorder. Left panel: no recombinations; the disease chromosomes are red and green and both carry the disease gene. Right panel: 1 recombination (arrow); the disease chromosomes are red and green and the disease gene is placed in the short arm. The recombinant chromosome is inherited in the next generation. The figure is shown twice, the second time with microsatellite alleles (numbered 1 to 8) and the disease gene (*) marked on each chromosome.]

Affected subjects need two affected chromosomes (RED/GREEN).
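Genotyping and haplotype analysis: Linkage to AD trait

Physical map (marker order): D1S1324 - Gene - D1S1546

First generation
Marker     Individual 1   Individual 2
D1S1324    1/2            3/4
Gene       */wt           wt/wt
D1S1546    5/6            7/8

Second generation
Marker     Individual 1   Individual 2   Individual 3   Individual 4
D1S1324    9/10           1/4            2/4            1/3
Gene       wt/wt          */wt           wt/wt          */wt
D1S1546    11/12          5/8            6/8            5/8

Third generation
Marker     Individual 1
D1S1324    1/10
Gene       */wt
D1S1546    8/12

AD trait
Linkage to chromosome 1
Disease haplotype: 1,*,5
Two recombination events (red arrows in the original figure)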
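The two recombination events can be recovered from the genotypes with a short check. The sketch below makes two assumptions: in each genotype the alleles written on the left of the slash sit on one chromosome and those on the right on the other (the slide does not state the phase), and the individuals are related as drawn in a three-generation pedigree. The function name is hypothetical.

```python
def classify_transmission(parent_haplotypes, transmitted):
    """Compare a transmitted haplotype with the two phased parental haplotypes."""
    first, second = parent_haplotypes
    if transmitted in (first, second):
        return "non-recombinant"
    # Recombinant: each allele still comes from one of the two parental chromosomes.
    if all(t in (a, b) for t, a, b in zip(transmitted, first, second)):
        return "recombinant"
    return "inconsistent"


# Marker order along the chromosome: D1S1324, Gene, D1S1546 (phase assumed).
affected_grandparent = (("1", "*", "5"), ("2", "wt", "6"))
unaffected_grandparent = (("3", "wt", "7"), ("4", "wt", "8"))
affected_parent = (("1", "*", "5"), ("4", "wt", "8"))

# Disease haplotype passed intact to the second generation.
print(classify_transmission(affected_grandparent, ("1", "*", "5")))    # non-recombinant
# Second-generation subject with genotype 1/3, */wt, 5/8 received 3-wt-8.
print(classify_transmission(unaffected_grandparent, ("3", "wt", "8")))  # recombinant
# Third-generation subject with genotype 1/10, */wt, 8/12 received 1-*-8.
print(classify_transmission(affected_parent, ("1", "*", "8")))          # recombinant
```

Under these assumptions the check finds exactly two recombinant transmissions, matching the count given on the slide.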
Genome analysis: genotyping and
haplotype analysis
Genome analysis: genotyping and haplotype analysis

Flanking microsatellite
markers to KCNQ4
Genome analysis: List of markers
Summary

Sequence repeats and transposable elements can explain many of the chromosomal rearrangements that can lead to pathology:
1) Cytogenetic-based disorders
2) Copy number variations
Searching sequence repeats
RepeatMasker: http://www.repeatmasker.org/

RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences.

The output of the program is a detailed annotation of the repeats that are present in the query sequence, as well as a modified version of the query sequence in which all the annotated repeats have been masked (default: replaced by Ns).

On average, almost 50% of a human genomic DNA sequence currently will be masked by the program.

Sequence comparisons in RepeatMasker are performed by the program cross_match, an efficient implementation of the Smith-Waterman-Gotoh algorithm developed by Phil Green.
Searching sequence repeats
RepeatMasker: http://www.repeatmasker.org/

Input format:

Sequences can be pasted in or uploaded as files, both in FASTA format.

Multiple FASTA format sequences may be pasted in at once or may be contained within a file. FASTA format looks like this:

>Sequence1
ACGTGCGCGATCGCCTGCTAGGCGTACGTCGCAG
GCGATCGATGTGCTAGATCAGATGACA

>Sequence2
GGGCTAGATTAGCACCACATACATCGCTCA
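
A multi-sequence FASTA input like the one above can be read with a few lines of code. This is a minimal sketch with a hypothetical function name; it keeps the whole header line as the sequence name and does no validation of the bases.

```python
def parse_fasta(text):
    """Return a dict mapping each sequence name to its joined sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines between records are ignored
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


example = """>Sequence1
ACGTGCGCGATCGCCTGCTAGGCGTACGTCGCAG
GCGATCGATGTGCTAGATCAGATGACA

>Sequence2
GGGCTAGATTAGCACCACATACATCGCTCA
"""
for seq_name, seq in parse_fasta(example).items():
    print(seq_name, len(seq))
```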
Searching sequence repeats
RepeatMasker: http://www.repeatmasker.org/

Size limitations

In principle, there is no limit to the length of the query sequence or the size of the batch file.

However, the most common error message that users see is caused by the connection timing out during the submission of long sequences.

Furthermore, longer sequences (> 50 kb) are queued (when necessary), whereas shorter sequences are handled instantly (see also "Speed and sensitivity" below).
Searching sequence repeats
RepeatMasker: http://www.repeatmasker.org/

Output / return format

The program returns three or four output files for each query.

1. One contains the submitted sequence(s) in which all recognized interspersed or simple repeats have been masked. In the masked areas, each base is replaced with an N, so that the returned sequence is of the same length as the original.

2. A table annotating the masked sequences.

3. A table summarizing the repeat content of the query sequence, which is returned to your screen.

4. Optionally, a file with alignments of the query with the matching repeats.
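
The masking step in item 1 can be pictured as follows. This is only a sketch of the idea, not RepeatMasker's own code; the function names are hypothetical and the 1-based inclusive coordinates are an assumption chosen to match the way positions appear in the annotation table.

```python
def mask_intervals(seq, intervals, mask_char="N"):
    """Replace every base inside the given 1-based inclusive intervals."""
    chars = list(seq)
    for start, end in intervals:
        for i in range(max(start - 1, 0), min(end, len(chars))):
            chars[i] = mask_char
    return "".join(chars)  # same length as the input sequence


def masked_fraction(masked, mask_char="N"):
    """Fraction of positions that carry the masking character."""
    return masked.count(mask_char) / len(masked) if masked else 0.0


masked = mask_intervals("ACGTACGTAC", [(3, 6)])
print(masked, masked_fraction(masked))
```

Passing `mask_char="X"` gives the behaviour of the "Mask with Xs" option, which keeps masked regions distinguishable from Ns already present in the query.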
Searching sequence repeats
RepeatMasker: http://www.repeatmasker.org/

Output / return format

In the "html" return format (default when the browser runs on PC) all output is
returned to your screen in one file.

In the "tar file" return format the masked sequence(s) and alignments can be
saved as compressed files.

The "links" return format returns links to these output files in a text format
(they look bad on the browser, but are fine when saved to your computer).
Searching sequence repeats
Options

1. Show alignments
When checked, alignments are returned in a file (ending in .aln) or to
the screen. Alignments are shown in order of appearance in the query
sequence.

2. Do not mask simple.../Only mask simple...

Regions of low complexity, like simple tandem repeats, polypurine and AT-rich regions, can lead to spurious matches in database searches. By default they are masked along with the interspersed repeats.

With the option "Do not mask simple..." only interspersed repeats are masked. This may, for example, be preferred when the masked sequence will be fed to a gene prediction program.

Alternatively, with the option "Only mask simple...", one can mask only these low complexity regions, e.g. when you are only interested in quickly locating polymorphic simple repeats in a sequence.
Searching sequence repeats
Options

3. Only mask Alus

By checking this option, you limit the masking and annotation to (primate) Alu repeats. 7SL RNA (the ancestral sequence of Alus), SVA (which contains several Alu sequences and a fragment of LTR5) and LTR5 are masked as well. This option only works for primate DNA.

4. Mask with Xs...

When checked, the repeat sequences are replaced by Xs instead of Ns. This allows one to distinguish the masked areas from ambiguous bases or other stretches of Ns that may already exist in the original sequence. However, when running BLAST searches (and maybe other programs), Xs are deleted from the query, so the returned BLAST matches will have position numbers that do not necessarily correspond to those of the original sequence.

5. Fixed-width columns
The column widths in the annotation table are adjusted to the
maximum length of any string occurring in a column; this allows long
sequence names to be spelled out completely.
Searching sequence repeats
DNA source

Interspersed repeats are specific to a (group of) species, depending on the time of activity of the source transposable element.

About half of the repeats identified in human DNA are specific to primates, i.e. they amplified after the eutherian radiation some 100 million years ago.

Most repeats that can be identified in mouse DNA are specific to rodents, due to higher activity and faster mutation rates in the rodent lineage.

RepeatMasker has separate protocols optimized for the analysis of rodent and primate genomes.

Interspersed repeats in other mammals have not been catalogued as well yet. Among these, artiodactyl queries are treated best by RepeatMasker, but repeats specific to other orders are also present.
Searching sequence repeats
DNA source

The number of different repeat consensus sequences against which queries of different species are compared gives an impression of how far the different libraries have been developed:

[Table: number of repeat consensus sequences per species library]

Note that the majority of sequences against which rodent and especially other mammalian queries are compared are repeats identified in the human genome and thought to predate the mammalian radiation.
Searching sequence repeats
Speed and sensitivity

On average, with default settings, a 10 kb human cosmid will be analyzed in about 30-40 seconds.

For longer sequences the required time increases roughly linearly with the sequence length. Sequences shorter than 10 kb are analyzed disproportionately faster.

The speed also depends somewhat on the repeat content of the sequence; repeat-dense regions, especially Alu-rich regions, are analyzed faster.

The program can be run at different levels of speed or sensitivity.

The "slow" setting will take about 3 times longer and will find and mask 0-5% more repetitive DNA sequences than the default setting.

The "quick" settings miss 5-10% of the sequences masked by default, but will be 3 to 6 times faster. The alignments may also extend further or be somewhat more accurate in the more sensitive settings.
Searching sequence repeats
Selectivity and matches to coding sequences

The cutoff Smith-Waterman scores for masking interspersed repeats are conservative, since masking one short, potentially interesting region is generally more harmful than failing to mask a number of hard-to-find matches.

If there are any false matches, they tend to have scores close to the cutoff:

225 for most repeats,

300 for the low-complexity LINE1 search,

180 for the very old MIR, LINE2 and MER5 sequences.
Searching sequence repeats
Use in database searches

RepeatMasker is most commonly used to avoid spurious matches in database searches.

Generally this step is strongly recommended before doing BLASTN or BLASTX equivalent searches with mammalian DNA sequence.

The most common concern is, of course, whether RepeatMasker ever masks coding regions.

In the majority of the cases where a coding region is masked, the sequences appear to be improperly annotated or to represent either artificially or naturally defective mRNAs (e.g. alternatively spliced exons comprised of a small fragment of a repeat).

Genuine overlaps of interspersed repeats with coding sequences usually involve terminal regions of the ORFs. Since the region derived from the transposable element is unique to the protein in that (group of) species, the masking does not interfere with database searches.
Searching sequence repeats
Other uses

Many people mask repeats before designing primers or oligo probes from
sequence data.

Primers/probes designed from regions left unmasked by RepeatMasker have a much better success rate.

The alignments can help in designing primers from sequences that are
completely masked. Regions that diverge much from the consensus are less
likely to misbehave than others.
Searching sequence repeats
How to read the results

1) The annotation file contains the cross_match output lines.

It lists all best matches (above a set minimum score) between the query
sequence and any of the sequences in the repeat database or with low
complexity DNA.

The term "best matches" reflects that a match is not shown if its domain is
over 80% contained within the domain of a higher scoring match, where the
"domain" of a match is the region in the query sequence that is defined by
the alignment start and stop.

These domains have been masked in the returned masked sequence file.
In the output, matches are ordered by query name, and for each query by
position of the start of the alignment.
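
The "best matches" rule can be sketched as a filter over (score, start, end) triples. This is an illustration of the 80% containment rule as worded above, not the cross_match implementation; the function name and the example scores and coordinates are made up.

```python
def best_matches(matches, max_contained=0.80):
    """Drop a match when more than 80% of its domain lies inside the
    domain of a higher scoring match.

    `matches` holds (score, start, end) with 1-based inclusive coordinates.
    """
    ordered = sorted(matches, key=lambda m: m[0], reverse=True)
    kept = []
    for i, (score, start, end) in enumerate(ordered):
        length = end - start + 1
        hidden = any(
            min(end, h_end) - max(start, h_start) + 1 > max_contained * length
            for h_score, h_start, h_end in ordered[:i]
            if h_score > score
        )
        if not hidden:
            kept.append((score, start, end))
    # Output is ordered by the start of the alignment, as in the annotation file.
    return sorted(kept, key=lambda m: m[1])


print(best_matches([(1306, 100, 300), (400, 120, 200), (350, 280, 400)]))
```

In this example the second match is fully contained in the first and is dropped, while the third overlaps it by only 21 bases and is kept.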
Searching sequence repeats
How to read the results

Example:

This is a sequence in which a Tigger1 DNA transposon has integrated into a MER7 DNA transposon copy.

Subsequently two Alus integrated in the Tigger1 sequence.

The simple repeat is derived from the poly A of the Alu element.
Searching sequence repeats
The first line is interpreted like this:

1306 15.6 6.2 0.0 HSU08988 6563 6781 (22462) C MER7A DNA/MER2_type (0) 336 103
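
A line of this kind can be split into named fields as sketched below. The field names are assumptions based on the usual column layout of the RepeatMasker annotation table (score, percent divergence, percent deleted, percent inserted, query name, query begin and end, bases left in the query, strand, repeat name, repeat class/family, then three position columns in the repeat consensus); the slide's own field-by-field callouts did not survive, so the last three columns are kept as raw strings.

```python
def parse_annotation_line(line):
    """Split one annotation line into a dict (field names are assumed)."""
    f = line.split()
    return {
        "sw_score": int(f[0]),
        "perc_div": float(f[1]),
        "perc_del": float(f[2]),
        "perc_ins": float(f[3]),
        "query": f[4],
        "query_begin": int(f[5]),
        "query_end": int(f[6]),
        "query_left": int(f[7].strip("()")),
        "strand": f[8],  # "C" marks a match to the complement of the consensus
        "repeat": f[9],
        "class_family": f[10],
        "repeat_coords": f[11:14],  # left as strings; their order depends on strand
    }


line = ("1306 15.6 6.2 0.0 HSU08988 6563 6781 (22462) "
        "C MER7A DNA/MER2_type (0) 336 103")
rec = parse_annotation_line(line)
print(rec["repeat"], rec["query_end"] - rec["query_begin"] + 1)
```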
Searching sequence repeats
2) Alignments

Alignments are shown in order of appearance in the query sequence.

These alignments are most useful for designing PCR primers in a region full of repeats.

It is possible to get primers that work in a whole genome when the 3' end of the primer lies in a region of a repeat (even a common one) that is very different from the consensus.

Alignments are shown in the orientation of the query sequence unless the option -inv is typed in the option box.
Searching sequence repeats
Alignments

[Figure: example alignment in which the deleted Alu is marked by one X.]
Searching sequence repeats
Alignments

In cross_match alignments the mismatches are indicated:

"-" indicates an insertion/deletion.

"i" a transition (G<->A, C<->T).

"v" a transversion (all other substitutions).

The position of the deleted Alu in the query is indicated with an "X".
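
The mismatch symbols can be reproduced with a small classifier. This is a sketch with a hypothetical function name; it handles only the four unambiguous bases and the gap character.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def mismatch_symbol(a, b):
    """Return ' ' for a match, '-' for an indel, 'i' for a transition,
    'v' for a transversion."""
    a, b = a.upper(), b.upper()
    if a == b:
        return " "
    if "-" in (a, b):
        return "-"
    # Purine<->purine or pyrimidine<->pyrimidine changes are transitions.
    if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
        return "i"
    return "v"


query = "ACGT-A"
consensus = "ATGCAA"
print("".join(mismatch_symbol(q, c) for q, c in zip(query, consensus)))
```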

The lines in the annotation table describing this match appear as:
Searching sequence repeats
3) The summary (.tbl) file

The summary file is largely self-explanatory. Below is an example.
Searching sequence repeats
Low-complexity DNA and simple repeats

By default, along with the interspersed repeats, RepeatMasker masks low-complexity DNA:

Simple repeats (microsatellites) can originate at any site in the genome, and therefore have an interspersed character.

Other low-complexity DNA consists primarily of poly-purine/poly-pyrimidine stretches.

Regions of extremely high AT or GC content will also result in spurious matches in some database searches (especially in ungapped BLASTN searches).

However, one may opt to skip the low-complexity masking, for example when using RepeatMasker in conjunction with a gene prediction program.

Under the current settings a 100 bp stretch of DNA is masked when it is >87% AT or >89% GC; a 30 bp stretch has to contain 29 A/T (or G/C) nucleotides.
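The 100 bp composition rule can be written as a simple check. This sketch applies only the two percentages quoted above to one window; the function name is hypothetical, and how RepeatMasker slides or merges windows is not covered.

```python
def is_compositionally_biased(window):
    """True when the window is >87% A/T or >89% G/C (the 100 bp thresholds)."""
    w = window.upper()
    at = w.count("A") + w.count("T")
    gc = w.count("G") + w.count("C")
    # Integer arithmetic avoids floating-point edge cases at the threshold.
    return 100 * at > 87 * len(w) or 100 * gc > 89 * len(w)


print(is_compositionally_biased("A" * 88 + "C" * 12))  # True: 88% A/T
print(is_compositionally_biased("A" * 87 + "C" * 13))  # False: exactly 87% A/T
```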
RepeatMasker Interface
RepeatMasker:annotation file
RepeatMasker:masked file
RepeatMasker:alignment file
OMIM: Online Mendelian Inheritance in Man

It is an Online catalog of Human Genes and Genetic disorders.

1. By typing the name of a gene we can get access to all the disorders
associated with this gene.

2. By typing a disorder we can obtain information about the genes involved in that pathology.
OMIM: Online Mendelian Inheritance in Man

By typing the name of a gene, e.g. ACTG1


OMIM: Online Mendelian Inheritance in Man

By typing the name of a disorder, e.g. DFNA20 hearing loss


OMIM: Online Mendelian Inheritance in Man

OMIM: link to Gene Map related to DFNA20


OMIM: Online Mendelian Inheritance in Man

OMIM: link to Clinical Synopses related to DFNA20


OMIM: Online Mendelian Inheritance in Man

OMIM: link to Clinical Synopses

It contains information about phenotype-gene relationships.

TEXT

A number sign (#) is used with this entry because this form of autosomal
dominant progressive sensorineural hearing loss, DFNA20/26, is caused by
mutation in the gamma-actin gene (ACTG1; 102560) on chromosome
17q25.3.

Clinical features

Mapping

Molecular genetics

References
PolyPhen-2: prediction of functional effects of human nonsynonymous SNPs

PolyPhen-2 (Polymorphism Phenotyping v2) is a tool that predicts the possible impact of an amino acid substitution on the structure and function of a human protein.

For a given amino acid substitution in a protein, PolyPhen-2 extracts various sequence and structure-based features of the substitution site and feeds them to a probabilistic classifier.
PolyPhen-2: prediction of functional effects of human nonsynonymous SNPs

Sequence-based features

A substitution may occur at a specific site, e.g., active or binding, or in a non-globular, e.g., trans-membrane, region.

PolyPhen-2 tries to identify a query protein as an entry in the human proteins subset of the UniProtKB/Swiss-Prot database and uses the feature table (FT) section of the corresponding entry.

PolyPhen-2 checks if the amino acid replacement occurs at a site which is annotated as:

DISULFID, CROSSLNK bond, BINDING, ACT_SITE, LIPID, METAL, SITE, MOD_RES, CARBOHYD, NON_STD site
PolyPhen-2: prediction of functional effects of human nonsynonymous SNPs
Sequence-based features

At this step PolyPhen-2 memorizes all positions which are annotated in the query protein as BINDING, ACT_SITE, LIPID, and METAL.

At a later stage, if the search for a homologous protein with known 3D structure is successful, it checks whether the substitution site is in spatial contact with these residues, which are critical for protein function.

PolyPhen-2 also checks if the substitution site is located in a region annotated as:

TRANSMEM, INTRAMEM, COMPBIAS, REPEAT, COILED, SIGNAL, PROPEP

For a substitution in an annotated or predicted trans-membrane region, PolyPhen-2 uses the PHAT trans-membrane specific matrix score to evaluate the possible functional effect of an nsSNP.
PolyPhen-2: prediction of functional effects of human nonsynonymous SNPs

Structural features

Mapping of an amino acid replacement to the known 3D structure reveals whether the replacement is likely to destroy the hydrophobic core of a protein, electrostatic interactions, interactions with ligands or other important features of a protein.

If the spatial structure of a query protein is unknown, one can use homologous proteins with known structure.
PolyPhen-2: prediction of functional effects of human nonsynonymous SNPs

Mapping of the substitution site to known protein 3D structures

PolyPhen-2 BLASTs the query sequence against the protein structure database (PDB) and by default retains all hits that meet the given criteria:

The sequence identity threshold is set to 50%, since this value guarantees the conservation of basic structural characteristics.

The minimal hit length is set to 100.

The maximal number of gaps is set to 20.

By default, a hit is rejected if its amino acid at the corresponding position differs from the amino acid in the input sequence.
The position of the substitution is then mapped onto the corresponding positions in all retained hits.
Hits are sorted according to the sequence identity or E-value of the sequence alignment with the query protein.
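
The default hit filter can be sketched as follows. This is an illustration of the criteria listed above, not PolyPhen-2 code: the `Hit` record, the function name and the example identifiers are hypothetical, and treating the thresholds as inclusive is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    structure_id: str      # hypothetical identifier of the structure hit
    identity: float        # sequence identity as a fraction (0 to 1)
    length: int            # alignment length in residues
    gaps: int              # number of gaps in the alignment
    residue_at_site: str   # amino acid aligned to the substitution site


def retain_hits(hits, query_residue, min_identity=0.50, min_length=100, max_gaps=20):
    """Apply the default criteria, then sort by identity (highest first)."""
    kept = [
        h for h in hits
        if h.identity >= min_identity
        and h.length >= min_length
        and h.gaps <= max_gaps
        and h.residue_at_site == query_residue
    ]
    return sorted(kept, key=lambda h: h.identity, reverse=True)


hits = [
    Hit("hitA", 0.62, 240, 3, "R"),
    Hit("hitB", 0.45, 300, 1, "R"),   # rejected: identity below 50%
    Hit("hitC", 0.91, 180, 0, "R"),
    Hit("hitD", 0.80, 150, 2, "K"),   # rejected: different residue at the site
]
print([h.structure_id for h in retain_hits(hits, "R")])
```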
PolyPhen-2: prediction of functional effects of human nonsynonymous SNPs

1. Identify the protein sequence.
2. Identify the amino acid position.
3. Recreate the substitution.
4. Run the program.
PolyPhen-2: prediction of functional effects of human nonsynonymous SNPs

Protein: Human EYA1


Position: 12
Amino acid change: Arg to His
SIFT program

SIFT predicts whether an amino acid substitution affects protein function.

The SIFT prediction is based on the degree of conservation of amino acid residues in sequence alignments derived from closely related sequences, collected through PSI-BLAST.

SIFT can be applied to naturally occurring nonsynonymous polymorphisms or laboratory-induced missense mutations.
How does SIFT work?
SIFT takes a query sequence and uses multiple alignment information to predict
tolerated and deleterious substitutions for every position of the query sequence.

SIFT is a multistep procedure that

(1) searches for similar sequences,

(2) chooses closely related sequences that may share similar function to the
query sequence

(3) obtains the alignment of these chosen sequences, and

(4) calculates normalized probabilities for all possible substitutions from the
alignment.
Substitutions with normalized probabilities less than 0.05 are predicted to be deleterious.

Those with normalized probabilities greater than or equal to 0.05 are predicted to be tolerated.
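The decision rule amounts to a single threshold comparison. A minimal sketch, assuming the normalized probability for the substitution has already been computed by SIFT (the function name is hypothetical):

```python
def sift_call(normalized_probability, cutoff=0.05):
    """Classify one substitution from its SIFT normalized probability."""
    return "deleterious" if normalized_probability < cutoff else "tolerated"


for p in (0.01, 0.05, 0.40):
    print(p, sift_call(p))
```

Note that a value of exactly 0.05 falls on the tolerated side of the cutoff.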
SIFT programs
SIFT program

http://sift.jcvi.org/www/SIFT_seq_submit2.html

Input for SIFT

You can submit a protein sequence (slow), your query sequence along with related sequences (fast), or your query sequence aligned with related sequences (even faster).

A) Submitting an NCBI GI number (SIFT-BLink)

You can submit an NCBI GI number to obtain SIFT predictions. Predictions are based on pre-computed BLAST searches and are returned within a minute. This is the preferred method of submission.
SIFT program

http://sift.jcvi.org/www/SIFT_seq_submit2.html

To find an NCBI GI number for a particular protein sequence, go to the NCBI protein database and type in the gene name.

If you get back too many results, you can narrow them down by specifying the organism.

For example, if looking for the human MLH1 gene, type "MLH1"[GENE] AND "homo sapiens"[ORGANISM] into the NCBI text box and a shorter list of genes restricted to human will be returned.
SIFT program

B) Submitting a sequence (SIFT-Sequence)

You can submit a protein sequence in FASTA format. The entire SIFT procedure will be executed and the results will be returned to you. This procedure is slow; if you have additional information about the protein, you can get your results much faster.

C) Submitting a group of related sequences (SIFT-related sequences)

If you know of proteins related to your query protein, you can get results much faster by submitting your sequence and the related sequences. Steps (1) and (2) of the SIFT procedure are skipped. Submit in FASTA format with your protein of interest as the first sequence in the file.
SIFT program

D) Submitting a multiple alignment (SIFT-Aligned sequences)

If you have a multiple alignment containing your protein of interest, you can submit the alignment in CLUSTAL, MSF, or FASTA format.

Your protein should be first in the alignment. The length of the alignment should correspond to the query protein and there should be no gaps in the query protein sequence.

Since steps (1) through (3) of the SIFT procedure are skipped, you will get your results SUPER-DUPER FAST, and we encourage you to use this submission form instead of the others.
SIFT program

E) Submitting Substitutions

SIFT will return predictions on whether your substitutions are tolerated or not, based on the scores.

The format for a substitution is X#Y, where X is the original amino acid, # is the position of the substitution and Y is the new amino acid. Only one substitution per line is allowed.

Example:
M1Y
K3S
T4P
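
A substitution list in this X#Y format can be validated before submission. This is a minimal sketch with hypothetical names; it accepts only the twenty standard one-letter amino acid codes and does not check the substitutions against a protein sequence.

```python
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
SUBSTITUTION_RE = re.compile(rf"^([{AMINO_ACIDS}])(\d+)([{AMINO_ACIDS}])$")


def parse_substitutions(text):
    """Parse one X#Y substitution per line into (original, position, new)."""
    substitutions = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        m = SUBSTITUTION_RE.match(line)
        if m is None:
            raise ValueError(f"not in X#Y format: {line!r}")
        substitutions.append((m.group(1), int(m.group(2)), m.group(3)))
    return substitutions


print(parse_substitutions("M1Y\nK3S\nT4P"))
```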
SIFT program

The first column indicates the variant submitted. If alleles are submitted with respect to the - strand, they are automatically converted to the + strand.

Please note that if you do not submit the variant correctly, it will default to a synonymous change.

One way to check: if the reference and non-reference alleles in the coordinates column now match, you most likely did not submit your variant correctly.

The second column denotes the codon that has been changed; the bases are given with respect to the + mRNA orientation.

If dbSNP has a variant overlapping the same position, the rs ID is displayed. However, the alleles may not be the same.
SIFT program

Single Protein Output

For single protein submissions, the following output is also returned:

A table of probabilities
Here is an example of one of the rows in the table.

This row lists the normalized probabilities for position 9I of the query sequence.

Underneath 9I is the fraction of sequences that are represented at this position. In this case, 75% of the sequences had one of the twenty standard amino acids at this position; 25% had either gaps or Xs. The normalized probability for an I->W substitution is < 0.05, so it is predicted to be deleterious and is highlighted in red.
SIFT program

Predictions for each position

Here is an example of the output.

At position 7Q in the query sequence, 95% of the sequences have an amino acid appearing at this position. K, Q, R are predicted as tolerated and are observed in the alignment (capitalized).

C, W, D, F, M, I, Y, V, G, P, S, H, N, A, L, T, E are predicted to be deleterious because they have normalized probabilities < 0.05, and none of these appear in the alignment (small letters).

Amino acids are color coded: nonpolar, uncharged polar, basic, acidic.
SIFT Sequence program

http://sift.jcvi.org/www/SIFT_seq_submit2.html
SIFT BLink program

http://sift.jcvi.org/www/SIFT_BLink_submit.html
SIFT program

Probabilities
Each row corresponds to a position in the reference protein.
Below each position is the fraction of sequences that contain one of the twenty standard amino acids at that position.
A low fraction indicates that the position is either severely gapped or unalignable and carries little information. Expect poor predictions at these positions.

Each column corresponds to one of the twenty amino acids.

Each entry contains the score at a particular position (row) for an amino acid substitution (column).

Substitutions predicted to be not tolerated are highlighted in red.

