Some Significant Databases, BLAST

The document discusses several significant biological databases including the International Nucleotide Sequence Database Collaboration (INSDC), which consists of EMBL, GenBank, and DDBJ that contain the same DNA and RNA sequence data. It also mentions NCBI's Genome database and Ensembl database. The majority of the document focuses on describing the BLAST algorithm, including its input, output, process, programs, uses, and definitions.

Uploaded by

Adnan Raza
Copyright
© All Rights Reserved

Lecture 2:

Some significant databases


BLAST
Instructor: Dr. M. Asif Rasheed
International Nucleotide Sequence Database Collaboration (INSDC)
INSDC is a joint effort to collect and disseminate databases containing DNA and RNA sequences.

Database   Housed at   Institution
EMBL       EBI         European Bioinformatics Institute
GenBank    NCBI        National Center for Biotechnology Information
DDBJ       Japan       National Institute of Genetics

These three databases exchange data daily, so they contain essentially the same data at any given time!
Genome
This resource organizes information on genomes including sequences,
maps, chromosomes, assemblies, and annotations.
http://www.ncbi.nlm.nih.gov/genome/
http://www.genome.jp/kegg/
Ensembl
BLAST
• Basic Local Alignment Search Tool
• Performs local rather than global alignment
• Compares biological sequences
  - A query is compared against a library or database of sequences
  - Identifies library sequences that resemble the query above a certain threshold
• One of the most widely used programs for sequence searching
• Heuristic algorithm
  - Faster than calculating an optimal alignment with a full alignment procedure (Smith–Waterman algorithm)
  - Cannot guarantee optimal alignments
  - Time-efficient: searches only for the more significant patterns in the sequences
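For comparison with the heuristic approach, the full alignment procedure named above can be sketched in a few lines. This is a minimal illustration of Smith–Waterman local alignment scoring, assuming a simple match/mismatch scheme and a linear gap penalty; the function name and the scoring values are illustrative choices, not taken from the lecture.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    The scoring values are illustrative assumptions; real searches use a
    substitution matrix and usually affine gap penalties.
    """
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

Every cell of the table is filled, so the work grows with the product of the two sequence lengths. That cost is what a heuristic search avoids by examining only promising regions.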
Input
• Input sequences in FASTA format
• Weight matrix
• Searching criteria
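As an illustration of the FASTA input format, the sketch below reads records in which a header line starts with ">" and the sequence follows on one or more lines. The function name and the example records are made up for this sketch.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence data may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical example input
example = """>query1 example protein
GLKFA
MKV
>query2
ACGT
"""

for name, seq in parse_fasta(example):
    print(name, seq)
```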
Output
• Formats: HTML, plain text and XML
• On NCBI's web page
  - HTML is the default
  - Graphical format showing the hits found
  - Table showing the sequences identified, with scoring data
Process
• Heuristic method
  - Looks for short matches between the two sequences
• Seeding
  - Finding these first matches
  - Local alignments begin from the seeds
• Default word size is 3 for protein sequences
  - GLKFA
  - GLK, LKF, KFA
• The heuristic algorithm
  - Locates common three-letter words between the query and the database sequences
  - Uses them to build an alignment
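The word list for the GLKFA example can be reproduced with a short sketch. The function name is hypothetical, and the word size of 3 follows the slide.

```python
def query_words(sequence, k=3):
    """Return all overlapping k-letter words of a sequence, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


print(query_words("GLKFA"))  # ['GLK', 'LKF', 'KFA']
```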
Process…
• A word match is used as a seed only if its score is at least the threshold T
• Default scoring matrix
  - BLOSUM62
• The alignment is extended from the seed in both directions
  - Score higher than T: included in the results
  - Otherwise: discarded
• Increasing T limits the search. What happens to
  - the amount of search?
  - the process speed?
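The extension step can be illustrated with a small ungapped extension routine. This is a sketch under stated assumptions: `score_pair` is a toy match/mismatch stand-in for a real matrix such as BLOSUM62, the drop-off limit `x_drop` and the `cutoff` argument are hypothetical parameters, and the function names are made up for this example.

```python
def score_pair(x, y, match=5, mismatch=-4):
    # Toy scoring (assumption); a real search would look up BLOSUM62 here.
    return match if x == y else mismatch


def extend_seed(query, subject, q_start, s_start, k=3, x_drop=10):
    """Extend a k-letter seed in both directions without gaps.

    Returns (score, q_begin, q_end, s_begin) for the best segment found.
    Extension stops when the running score falls more than x_drop below
    the best score seen so far.
    """
    best = sum(score_pair(query[q_start + n], subject[s_start + n]) for n in range(k))
    q_begin, q_end = q_start, q_start + k

    # Extend to the right of the seed.
    run = best
    i, j = q_start + k, s_start + k
    while i < len(query) and j < len(subject):
        run += score_pair(query[i], subject[j])
        if run > best:
            best, q_end = run, i + 1
        elif best - run > x_drop:
            break
        i += 1
        j += 1

    # Extend to the left of the seed, starting from the best score so far.
    run = best
    i, j = q_start - 1, s_start - 1
    while i >= 0 and j >= 0:
        run += score_pair(query[i], subject[j])
        if run > best:
            best, q_begin = run, i
        elif best - run > x_drop:
            break
        i -= 1
        j -= 1

    s_begin = s_start - (q_start - q_begin)
    return best, q_begin, q_end, s_begin


def keep_hsp(score, cutoff):
    """Keep an extended alignment only if it scores above the cutoff."""
    return score > cutoff


hsp = extend_seed("AAGLKFAWW", "CCGLKFAYY", q_start=2, s_start=2)
print(hsp, keep_hsp(hsp[0], cutoff=20))
```

With this toy scoring the seed GLK grows to the segment GLKFA, because the flanking mismatches only lower the running score.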
Overview of the BLAST algorithm
• Remove low-complexity regions in the query sequence
• Make a list of k-letter words from the query sequence
• Organize the high-scoring words into a search tree
• Scan the database sequences for exact matches
• Extend the exact matches to high-scoring segment pairs (HSPs)
• List all HSPs in the database that score higher than T
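The word-list and database-scan steps can be sketched as follows. Note the substitution: a Python dictionary (hash table) stands in for the search tree mentioned in the slide, and only exact query words are indexed. The function names and the tiny database are hypothetical.

```python
from collections import defaultdict


def build_word_index(query, k=3):
    """Map each k-letter word of the query to the positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    return index


def find_seeds(query, database, k=3):
    """Scan database sequences for exact word matches with the query.

    database maps a sequence name to its sequence.
    Returns a list of (name, query_position, subject_position) seeds.
    """
    index = build_word_index(query, k)
    seeds = []
    for name, subject in database.items():
        for j in range(len(subject) - k + 1):
            for i in index.get(subject[j:j + k], ()):
                seeds.append((name, i, j))
    return seeds


# Hypothetical two-sequence database
database = {"s1": "MMGLKFAQ", "s2": "PPPP"}
print(find_seeds("GLKFA", database))
```

Each seed found this way would then be passed to an extension step to produce a candidate HSP.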
BLAST Programs
• Nucleotide-nucleotide BLAST (blastn)
  - DNA query, DNA database
• Protein-protein BLAST (blastp)
  - Protein query, protein database
• Position-Specific Iterative BLAST (PSI-BLAST)
  - Used to find distant relatives of a protein
  - First, a list of all closely related proteins is created
  - These proteins are combined into a profile
  - A query against the protein database is then run using this profile
  - Iterative: the newly found proteins are used to build the next profile
BLAST Programs…
• Nucleotide 6-frame translation-protein (blastx)
  - Compares the six-frame translation products of a nucleotide query against a protein database
• Nucleotide 6-frame translation-nucleotide 6-frame translation (tblastx)
  - Slowest of the BLAST family
  - Translates the query nucleotide sequence in all six possible frames
  - Compares it against the six-frame translations of a nucleotide sequence database
  - Used to find very distant relationships between nucleotide sequences
• Protein-nucleotide 6-frame translation (tblastn)
  - Compares a protein query against all six reading frames of a nucleotide database
• Large numbers of query sequences (megablast)
  - For comparing large numbers of input sequences via the command-line BLAST
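The six-frame translation used by blastx, tblastn and tblastx can be illustrated with a short sketch: three reading frames on the given strand and three on its reverse complement. It assumes the standard genetic code and unambiguous A/C/G/T input; the function names and frame labels are illustrative.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X', stops become '*'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Return a dict of the six translations, keyed '+1'..'+3' and '-1'..'-3'."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translations("ATGGCCTAA")
for label, protein in frames.items():
    print(label, protein)
```

Because each nucleotide sequence yields six protein sequences to compare, translating both the query and the database, as tblastx does, multiplies the work accordingly.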
Uses of BLAST
• Identifying species
  - Correctly identify a species or find homologous species
• Locating domains
  - Locate known domains within the sequence of interest
• Establishing phylogeny
  - Create a phylogenetic tree from the different alignments given by BLAST
• DNA mapping
  - Compare an unknown sequence against known sequences
Some definitions in BLAST
• Score
  - Score of the alignment
• Query coverage
  - Percentage of the query that is covered by the alignment with the database sequence
• Identity
  - Percentage of identical positions within the covered part of the query
• E value (Expect value)
  - Describes the random background noise
  - The number of hits one can "expect" to see by chance when searching a database of a particular size
  - Decreases exponentially as the score (S) of the match increases
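The definitions above can be made concrete with a small sketch. For the E value it uses the standard relation E = m × n × 2^(−S'), where S' is the bit score, m the query length and n the database length. Real BLAST reports use length-corrected "effective" values, so the numbers here are only illustrative, and the function names are hypothetical.

```python
def expect_value(bit_score, query_length, database_length):
    """E = m * n * 2**(-S') for bit score S' (no effective-length correction)."""
    return query_length * database_length * 2.0 ** (-bit_score)


def percent_identity(aligned_query, aligned_subject):
    """Identical positions as a percentage of the alignment length.

    Both strings must be the same length; '-' marks a gap.
    """
    matches = sum(1 for q, s in zip(aligned_query, aligned_subject) if q == s and q != "-")
    return 100.0 * matches / len(aligned_query)


def query_coverage(q_begin, q_end, query_length):
    """Percentage of the query spanned by the alignment [q_begin, q_end)."""
    return 100.0 * (q_end - q_begin) / query_length


print(expect_value(10, 100, 1000))
print(expect_value(11, 100, 1000))
print(percent_identity("GLKFA", "GLRFA"))
print(query_coverage(2, 7, 10))
```

Raising the bit score by one halves the E value, which is the exponential decrease described in the last bullet.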
