Blast Introduction
David Fristrom Bibliographer/Librarian Science and Engineering Library [email protected] 617 358-4124
What is BLAST?
Free, online service from National Center for Biotechnology Information (NCBI)
http://blast.ncbi.nlm.nih.gov/Blast.cgi
What is BLAST?
BLAST :
as
Google : Internet
What is BLAST?
Alignment
AACGTTTCCAGTCCAAATAGCTAGGC
===--===   =-===-==-======
AACCGTTC---TACAATTACCTAGGC
Hits (+1): 18
Misses (-2): 5
Gaps (existence -2, extension -1): 1, length 3
Score = 18 * 1 + 5 * (-2) - 2 - 2 = 4 (the gap costs -2 for existence, then -1 for each of its 2 further positions)
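The tally on this slide can be reproduced with a short script. This is a minimal sketch, not BLAST's own scoring code: the function name and the convention that the existence penalty covers the first gap position (with the extension penalty charged for each further position) are assumptions, and other tools charge gaps differently.

```python
def score_alignment(top, bottom, match=1, mismatch=-2, gap_open=-2, gap_extend=-1):
    """Tally hits, misses and gaps for two equal-length gapped strings ('-' marks a gap)."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    hits = misses = gaps = gap_positions = score = 0
    in_gap = False
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            gap_positions += 1
            if in_gap:
                score += gap_extend   # each further position of an open gap
            else:
                gaps += 1
                score += gap_open     # assumption: existence penalty covers the first position
                in_gap = True
        else:
            in_gap = False
            if a == b:
                hits += 1
                score += match
            else:
                misses += 1
                score += mismatch
    return {"hits": hits, "misses": misses, "gaps": gaps,
            "gap_positions": gap_positions, "score": score}


top = "AACGTTTCCAGTCCAAATAGCTAGGC"
bottom = "AACCGTTC---TACAATTACCTAGGC"
print(score_alignment(top, bottom))
```

Changing the keyword arguments shows how strongly the final score depends on the chosen penalties.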
Global Alignment
Compares total length of two sequences
Needleman, S.B. and Wunsch, C.D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 48(3):443-53 (1970).
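Global alignment is usually computed by dynamic programming over a score table. The following is a minimal sketch in the spirit of Needleman-Wunsch, assuming a simple linear gap penalty and illustrative scores; the function name and parameters are hypothetical and it returns only the best score, not the alignment itself.

```python
def global_alignment_score(a, b, match=1, mismatch=-2, gap=-2):
    """Best score for aligning all of a against all of b (linear gap penalty)."""
    rows, cols = len(a) + 1, len(b) + 1
    dp = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        dp[i][0] = i * gap
    for j in range(1, cols):
        dp[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            dp[i][j] = max(diag, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[-1][-1]


print(global_alignment_score("ACGT", "ACGT"))
print(global_alignment_score("ACGT", "AGT"))
```

Because every residue of both sequences must be accounted for, unmatched ends always cost gap penalties.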
Local Alignment
Compares segments of sequences
Finds cases where one sequence is part of another, or where the sequences match only in parts
Smith, T.F. and Waterman, M.S. Identification of common molecular subsequences. J Mol Biol. 147(1):195-7 (1981)
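Local alignment uses the same kind of score table as global alignment, with two changes: a cell may never go below zero, and the answer is the best cell anywhere in the table. A minimal sketch in the spirit of Smith-Waterman, assuming a linear gap penalty and illustrative scores (the function name and parameters are hypothetical):

```python
def local_alignment_score(a, b, match=1, mismatch=-2, gap=-2):
    """Best score over all pairs of segments of a and b (linear gap penalty)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets an alignment start fresh at any cell.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# The shared segment ACGT is found even though the flanking regions differ.
print(local_alignment_score("TTTACGTTT", "GGACGTGG"))
```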
Search Tool
By aligning a query sequence against every sequence in a database, alignment can be used to search the database for similar sequences
But alignment algorithms are slow
What is BLAST?
Quick, heuristic alignment algorithm
Divides the query sequence into short words, initially looks only for (exact) matches of these words, then tries to extend the alignment
Much faster, but can miss some alignments
Altschul, S.F. et al. Basic local alignment search tool. J Mol Biol. 215(3):403-10(1990).
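The word-matching idea can be sketched in a few lines. This is a toy illustration under stated assumptions, not the published algorithm: it uses exact word matches only, extends without gaps, and stops when the running score falls a fixed amount (`x_drop`, an illustrative value) below the best seen. The function names and example sequences are hypothetical.

```python
from collections import defaultdict


def find_seeds(query, subject, word_size=4):
    """Return (query_pos, subject_pos) pairs where a query word occurs exactly in subject."""
    index = defaultdict(list)
    for i in range(len(query) - word_size + 1):
        index[query[i:i + word_size]].append(i)
    seeds = []
    for j in range(len(subject) - word_size + 1):
        for i in index.get(subject[j:j + word_size], []):
            seeds.append((i, j))
    return seeds


def extend_seed(query, subject, i, j, word_size=4, match=1, mismatch=-2, x_drop=4):
    """Ungapped extension of one seed; returns (score, query_start, query_end_exclusive)."""
    def extend(qi, sj, step):
        score = best = best_len = length = 0
        while 0 <= qi < len(query) and 0 <= sj < len(subject):
            score += match if query[qi] == subject[sj] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score >= x_drop:
                break  # score has dropped too far below the best; give up
            qi += step
            sj += step
        return best, best_len

    right_score, right_len = extend(i + word_size, j + word_size, +1)
    left_score, left_len = extend(i - 1, j - 1, -1)
    score = word_size * match + left_score + right_score
    return score, i - left_len, i + word_size + right_len


query = "CCACGTTGCAGG"
subject = "TTACGTTGCATT"
seeds = find_seeds(query, subject)
print(seeds)
print(extend_seed(query, subject, *seeds[0]))
```

Only positions that share a word are ever examined, which is where the speed comes from; an alignment with no exact word match is never seeded, which is why some alignments are missed.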
What is BLAST?
BLAST is not Google
BLAST is like doing an experiment: to get good, meaningful results, you need to optimize the experimental conditions
Sample Search
Human beta globin (HBB)
Subunit of hemoglobin
Accession number: NP_000509
Limit to mouse to more easily show differences between searches
Interpreting Results
Score: Normalized (bit) score of the alignment, derived from the substitution matrix and gap penalties. Can be compared across searches
Max score: Score of the single best aligned segment for a database sequence
Total score: Sum of the scores of all aligned segments for that database sequence
Interpreting Results
Query coverage: Percentage of the query sequence that is aligned
E value: Number of matches with an equal or better score expected by chance in a search of this database. For low values, approximately equal to p, the probability of finding at least one such alignment by chance
Typically, E < 0.05 is required for a match to be considered significant
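The link between E and p follows from treating chance hits as a Poisson process, which gives p = 1 - exp(-E). A short sketch (the function name is hypothetical) shows why the two are nearly equal for small E and diverge for large E:

```python
import math


def p_value_from_e(e_value):
    """Probability of at least one chance hit with this score or better: 1 - exp(-E)."""
    return -math.expm1(-e_value)  # expm1 keeps precision for very small E


for e in (10, 1, 0.05, 1e-5):
    print(f"E = {e:<8g} p = {p_value_from_e(e):.6g}")
```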
Protein Databases
Non-redundant protein sequences (nr)
Kitchen-sink:
Translations of GenBank coding sequences (CDS)
RefSeq Proteins
PDB (RCSB Protein Data Bank, 3D structures)
SwissProt
Protein Information Resource (PIR)
Protein Research Foundation (Japanese DB)
Protein Databases
Patented protein sequences (pat)
Patented sequences
Nucleotide Databases
Human genomic + transcript
http://www.ncbi.nlm.nih.gov/genome/guide/human/
Kitchen sink, but excludes HTGS (phases 0, 1, 2), EST, GSS, STS, PAT, and WGS sequences
Nucleotide Databases
Reference mRNA sequences (refseq_rna)
Reference genomic sequences (refseq_genomic)
NCBI Reference Sequences: Comprehensive, integrated, non-redundant, well-annotated set of sequences
Nucleotide Databases
Expressed sequence tags (est)
Non-human, non-mouse ESTs (est_others)
http://www.ncbi.nlm.nih.gov/About/primer/est.html
http://www.ncbi.nlm.nih.gov/dbEST/index.html
http://www.ncbi.nlm.nih.gov/dbGSS/index.html
Nucleotide Databases
High throughput genomic sequences (HTGS)
Unfinished sequences (phases 1-2). Finished sequences are already in nr/nt
http://www.ncbi.nlm.nih.gov/HTGS/
Nucleotide Databases
Human ALU repeat elements (alu_repeats)
Database of repetitive elements
Nucleotide Databases
Environmental samples (env_nt)
Nucleotide sequences from environmental samples (not associated with known organism)
Database Options
Limit to (or exclude) an organism
Exclude Models (XM/XP)
Model reference sequences produced by NCBI's Genome Annotation project. These records represent the transcripts and proteins that are annotated on the NCBI Contigs which may have been generated from incomplete data.
Entrez Query
Use Entrez query syntax to limit search
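As an illustration, the sample search (NP_000509 limited to mouse) could be expressed as a request to the BLAST web service with an Entrez query attached. This sketch only builds the URL and sends nothing. The parameter names (CMD, PROGRAM, DATABASE, QUERY, ENTREZ_QUERY, EXPECT) are what I understand NCBI's BLAST URL API to use and should be checked against the current NCBI documentation; the helper name is hypothetical.

```python
from urllib.parse import urlencode

BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_blast_request(query, program="blastp", database="nr",
                        entrez_query=None, expect=None):
    """Build (but do not send) a search-submission URL."""
    params = {"CMD": "Put", "PROGRAM": program, "DATABASE": database, "QUERY": query}
    if entrez_query:
        params["ENTREZ_QUERY"] = entrez_query  # Entrez syntax, e.g. organism limits
    if expect is not None:
        params["EXPECT"] = expect              # E value cutoff
    return BLAST_URL + "?" + urlencode(params)


url = build_blast_request("NP_000509", entrez_query="Mus musculus[Organism]")
print(url)
```

The same Entrez string can be typed directly into the Entrez Query box on the search form.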
blastp
Protein-protein BLAST
Standard protein BLAST
PSI-BLAST
Protein-protein BLAST
Position-Specific Iterated BLAST
Finds more distantly related matches
Iterates: initial search results provide information on allowed mutations; subsequent searches use these to create a custom substitution matrix
PHI-BLAST
Protein-protein BLAST
Pattern Hit Initiated BLAST
Variation of PSI-BLAST
Specify a pattern that hits must match
Use when you know the protein family has a signature pattern: active site, structural domain, etc.
Better chance of eliminating false positives
Example: VKAHGKKV
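Pattern matching of this kind can be tried out locally before running a search. The sketch below assumes PROSITE-style pattern syntax and supports only a small subset of it (literal residues, `x` for any residue, `[...]` sets, `{...}` exclusions, and `(n)` or `(n,m)` repeats). The function names are hypothetical and the protein fragment is an illustrative string, not taken from the slides.

```python
import re

# One pattern element, optionally followed by a repeat count such as (2) or (2,4).
TOKEN = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def pattern_to_regex(pattern):
    """Translate a small subset of PROSITE-style syntax into a compiled regex."""
    compact = pattern.replace("-", "")
    parts, pos = [], 0
    while pos < len(compact):
        m = TOKEN.match(compact, pos)
        if not m:
            raise ValueError(f"unsupported pattern syntax at position {pos}")
        core, low, high = m.groups()
        if core == "x":
            core = "."                          # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"      # any residue except these
        if low:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
        pos = m.end()
    return re.compile("".join(parts))


def pattern_hits(pattern, sequence):
    """Return (start, matched_text) for each non-overlapping match."""
    return [(m.start(), m.group()) for m in pattern_to_regex(pattern).finditer(sequence)]


fragment = "MGNPKVKAHGKKVLGAFSDGLAHL"  # illustrative fragment (assumption)
print(pattern_hits("VKAHGKKV", fragment))
print(pattern_hits("V-K-x-H-G-x(2)-V", fragment))
```

A looser pattern such as the second one admits more family members, at the cost of more chance matches.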
megablast
Nucleotide BLAST
Finds highly similar sequences
Very fast
Use to identify a nucleotide sequence
blastn
Nucleotide BLAST
Use to find less similar sequences
discontiguous megablast
Nucleotide BLAST
Finds even more dissimilar sequences
Use to find diverged sequences (possible homologies) from different organisms
Ma, B., Tromp, J. and Li, M. PatternHunter: faster and more sensitive homology search. Bioinformatics. 18(3):440-5 (2002).
Algorithm Parameters
Scoring Matrix: PAM: Accepted Point Mutation
Empirically derived chance that a substitution will be accepted, based on closely related proteins
Higher PAM numbers correspond to greater evolutionary distance
Compositional adjustment changes matrix to take into account overall composition of sequence
Algorithm Parameters
Filters and Masking
Can ignore low complexity regions in searching
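To give a feel for what masking does, here is a toy filter that replaces low complexity stretches with X before a search. It is a stand-in based on Shannon entropy in a sliding window, not the filter BLAST itself applies; the window size, threshold, function names, and example sequence are all illustrative assumptions.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=12, threshold=1.5, mask_char="X"):
    """Mask every window whose composition entropy falls below the threshold."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            masked[start:start + window] = mask_char * window
    return "".join(masked)


seq = "MKTAYIAKQR" + "S" * 14 + "GLNDEVWHKT"  # made-up sequence with a serine run
print(mask_low_complexity(seq))
```

Because whole windows are masked, a few residues flanking the repetitive run are masked along with it.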
Additional Sources
Pevsner, Jonathan. Bioinformatics and Functional Genomics, 2nd ed. (Wiley-Blackwell, 2009)
BLAST help pages: http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs
Slides from class on similarity searching; lots of technical details on algorithms and similarity matrices: http://www.ncbi.nlm.nih.gov/Class/NAWBIS/Modules/Similarity/simsrchlast.html