Blast Introduction

The document provides an introduction to BLAST (Basic Local Alignment Search Tool), which is a free online tool from the National Center for Biotechnology Information (NCBI) that is used to compare a query DNA or protein sequence against a database of sequences and identify sequences that resemble the query sequence above a certain threshold. The document discusses what BLAST is, some common uses of BLAST, how BLAST works by performing local alignments of sequences, and tips for optimizing BLAST searches such as choosing the appropriate database and algorithm parameters.

Uploaded by

Keri Gobin Samaroo

Introduction to BLAST

David Fristrom
Bibliographer/Librarian
Science and Engineering Library
[email protected]
617 358-4124

What is BLAST?
Free, online service from National Center for Biotechnology Information (NCBI)

http://blast.ncbi.nlm.nih.gov/Blast.cgi

What is BLAST?

BLAST : Nucleotide/Protein Sequence Databases
as
Google : Internet

Some Uses for BLAST

Identify an unknown sequence
Build a homology tree for a protein
Get clues about protein structure by finding similar proteins with known structures
Map a sequence in a genome
Etc., etc.

What is BLAST?

Basic Local Alignment Search Tool

Alignment
AACGTTTCCAGTCCAAATAGCTAGGC
===--===   =-===-==-======
AACCGTTC   TACAATTACCTAGGC
Hits (+1): 18
Misses (-2): 5
Gaps (existence -2, extension -1): 1, length 3
Score = 18 × 1 + 5 × (-2) - 2 - 2 = 4 (gap: -2 for existence, then -1 for each of the two further gap positions)
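The scoring on this slide can be checked with a few lines of Python. This is only a sketch: the function name and the `-` gap character are assumptions made for illustration, and it assumes the first position of a gap pays the existence penalty while each further position pays the extension penalty, which gives 18 - 10 - 2 - 2 = 4 for the alignment shown.

```python
def score_alignment(top, bottom, match=1, mismatch=-2, gap_open=-2, gap_extend=-1):
    """Score two equal-length aligned strings; '-' marks a gap position."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    in_gap = False
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            # First gap position pays the existence cost,
            # every further position pays the extension cost.
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if a == b else mismatch
            in_gap = False
    return score


top = "AACGTTTCCAGTCCAAATAGCTAGGC"
bottom = "AACCGTTC---TACAATTACCTAGGC"
print(score_alignment(top, bottom))  # 4
```

Other tools may charge gaps differently (for example, existence plus extension for every gap position), so the same alignment can receive a different score under another convention.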

Global Alignment
Compares total length of two sequences
Needleman, S.B. and Wunsch, C.D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 48(3):443-53 (1970).

Local Alignment
Compares segments of sequences
Finds cases where one sequence is part of another sequence, or where the sequences match only in parts
Smith, T.F. and Waterman, M.S. Identification of common molecular subsequences. J Mol Biol. 147(1):195-7 (1981).
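The difference between global and local alignment is easiest to see when a short sequence is embedded in a longer one. The sketch below computes only the optimal scores (not the alignments themselves) with a simple linear gap penalty; the scoring values and function names are assumptions chosen for illustration, not the parameters used in the cited papers.

```python
def global_score(a, b, match=1, mismatch=-2, gap=-2):
    """Needleman-Wunsch style score: both sequences are aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def local_score(a, b, match=1, mismatch=-2, gap=-2):
    """Smith-Waterman style score: cells are floored at 0 and the best cell wins."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best


query = "ACGTACGT"
subject = "GGGGGACGTACGTCCCCC"
print(global_score(query, subject))  # -12: the ten unmatched flanking bases are all charged as gaps
print(local_score(query, subject))   # 8: only the matching segment is scored
```

The local score ignores the flanking regions entirely, which is why local alignment suits database searching, where a query often matches only part of a longer record.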

Search Tool
By aligning a query sequence against all sequences in a database, alignment can be used to search the database for similar sequences
But alignment algorithms are slow

What is BLAST?
Quick, heuristic alignment algorithm
Divides the query sequence into short words and initially looks only for (exact) matches of these words, then tries extending the alignment
Much faster, but can miss some alignments
Altschul, S.F. et al. Basic local alignment search tool. J Mol Biol. 215(3):403-10 (1990).
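The word-matching idea can be sketched in Python. This is a deliberately simplified illustration, not the actual BLAST implementation: it uses exact word matches only, ungapped extension, and a simple drop-off rule, and every name and parameter value (`word_size`, `x_drop`, the scores) is an assumption.

```python
from collections import defaultdict


def extend(query, subject, qi, sj, step, match, mismatch, x_drop):
    """Ungapped extension from (qi, sj); returns (best gain, length at best)."""
    score = best = best_len = length = 0
    while 0 <= qi < len(query) and 0 <= sj < len(subject):
        score += match if query[qi] == subject[sj] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break  # give up once the score falls too far below the best seen
        qi += step
        sj += step
    return best, best_len


def seed_and_extend(query, subject, word_size=4, match=1, mismatch=-2, x_drop=3):
    """Return (score, query_start, subject_start, length) of the best ungapped hit."""
    # Index every word of the query by its start position.
    words = defaultdict(list)
    for i in range(len(query) - word_size + 1):
        words[query[i:i + word_size]].append(i)

    best = (0, 0, 0, 0)
    for j in range(len(subject) - word_size + 1):
        for i in words.get(subject[j:j + word_size], []):
            right, r_len = extend(query, subject, i + word_size, j + word_size,
                                  1, match, mismatch, x_drop)
            left, l_len = extend(query, subject, i - 1, j - 1,
                                 -1, match, mismatch, x_drop)
            total = word_size * match + left + right
            if total > best[0]:
                best = (total, i - l_len, j - l_len, l_len + word_size + r_len)
    return best


# The seed ACGT is extended through one mismatch.
print(seed_and_extend("ACGTACGTAC", "GGGACGTTCGTACGGG"))  # (7, 0, 3, 10)
# 6 of 8 positions agree, but no 4-letter word is shared, so the hit is missed.
print(seed_and_extend("ACGTACGT", "ACGAACGA"))  # (0, 0, 0, 0)
```

The second call shows the trade-off named on the slide: skipping everything without a word match is what makes the search fast, and also what lets some real similarities go undetected.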

What is BLAST?
BLAST is not Google
BLAST is like doing an experiment: to get good, meaningful results, you need to optimize the experimental conditions

Sample Search
Human beta globin (HBB)
Subunit of hemoglobin

Accession number: NP_000509
Limit to mouse to more easily show differences between searches

Interpreting Results
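Score: Normalized (bit) score of the alignment, computed from the substitution matrix and gap penalties. Can be compared across searches
Max score: Score of the single best-aligned segment between the query and a database sequence
Total score: Sum of the scores of all aligned segments from the same database sequence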

Interpreting Results
Query coverage: Percentage of the query sequence that is aligned
E value: Number of matches with the same or a better score expected by chance. For low values, approximately equal to p, the probability of finding at least one such match by chance
Typically, E < 0.05 is required for a match to be considered significant
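The relationship between bit score, E value, and p can be made concrete with the standard Karlin-Altschul formulas: E = m × n × 2^(-S'), where S' is the bit score, m the query length and n the database length, and p = 1 - e^(-E). The sketch below uses raw lengths and made-up numbers purely for illustration; real BLAST output applies corrected "effective" lengths, so its E values will differ.

```python
import math


def expect_value(bit_score, query_len, db_len):
    """E = m * n * 2**(-S'), with S' the bit score (no length correction)."""
    return query_len * db_len * 2.0 ** (-bit_score)


def p_value(e_value):
    """Probability of at least one chance match: p = 1 - exp(-E)."""
    return -math.expm1(-e_value)


# Hypothetical search: 100-residue query, 1,000,000-residue database.
for bits in (20, 30, 40):
    e = expect_value(bits, 100, 1_000_000)
    print(f"bit score {bits}: E = {e:.3g}, p = {p_value(e):.3g}")

# For small E the two numbers nearly coincide; for large E, p saturates at 1.
for e in (0.01, 0.05, 1.0, 10.0):
    print(f"E = {e}: p = {p_value(e):.4f}")
```

Because E scales with the size of the database, the same alignment can be significant against a small database and not significant against a very large one.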

Getting the most out of BLAST

1. What kind of BLAST?
2. Pick an appropriate database
3. Pick the right algorithm
4. Choose parameters

Step 0: Do you need to use BLAST?

Step 1: Nucleotide BLAST vs. protein BLAST

Largely determined by your query sequence
BUT: If your nucleotide sequence can be translated to a peptide sequence, you probably want to do so (use a tool such as the ExPASy Translate Tool)
Protein BLASTs are more sensitive and more biologically significant
Sometimes it makes sense to use other BLASTs

Specialized Search: blastx

Search a protein database using a translated nucleotide query
Use to find proteins homologous to a nucleotide coding region
Translates the query sequence in all six reading frames
Often the first analysis performed with a newly determined nucleotide sequence
http://www.ncbi.nlm.nih.gov/blast/producttable.shtml#blastx
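Six-frame translation means reading the sequence from three starting offsets on the forward strand and three on the reverse complement. A minimal sketch using the standard genetic code follows; the function names are illustrative assumptions, and the code ignores ambiguity codes other than marking unknown codons as `X`.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons only; '*' is a stop, 'X' an unknown codon."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(dna):
    """Return the three forward and three reverse-complement translations."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


for name, protein in six_frames("ATGGCC").items():
    print(name, protein)
```

For the input `ATGGCC`, frame +1 reads `ATG GCC` and gives `MA`, while frame -1 reads the reverse complement `GGC CAT` and gives `GH`. A translated search compares each of these six products against the protein database.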

Specialized Search: tblastn

Search a translated nucleotide database using a protein query
Does six-frame translations of the nucleotide database
Find homologous protein-coding regions in unannotated nucleotide sequences such as expressed sequence tags (ESTs) and draft genome records (HTG)
http://www.ncbi.nlm.nih.gov/blast/producttable.shtml#tblastn

Specialized Search: tblastx

Search a translated nucleotide database using a translated nucleotide query
Both translations use all six frames
Useful in identifying potential proteins encoded by single-pass-read ESTs
Good tool for identifying novel genes
Computationally intensive
http://www.ncbi.nlm.nih.gov/blast/producttable.shtml#tblastx

Even More Specialized

Make specific primers with Primer-BLAST
Search trace archives
Find conserved domains in your sequence (cds)
Find sequences with similar conserved domain architecture (cdart)
Search sequences that have gene expression profiles (GEO)
Search immunoglobulins (IgBLAST)
Search for SNPs (snp)
Screen sequence for vector contamination (vecscreen)
Align two (or more) sequences using BLAST (bl2seq)
Search protein or nucleotide targets in PubChem BioAssay
Search SRA transcript libraries
Constraint Based Protein Multiple Alignment Tool

Step 2: Choose a Database

Too large:
Takes longer
Too many results
More random, meaningless matches

Too small or wrong one:

Miss significant matches

Protein Databases
Non-redundant protein sequences (nr)
Kitchen-sink:
Translations of GenBank coding sequences (CDS)
RefSeq Proteins
PDB (RCSB Protein Data Bank - 3D structure)
SwissProt
Protein Information Resource (PIR)
Protein Research Foundation (Japanese DB)

Reference proteins (refseq_protein)

NCBI Reference Sequences: Comprehensive, integrated, nonredundant, well-annotated set of sequences

Swissprot protein sequences (swissprot)

Swiss-Prot: European protein database (no incremental updates)

Protein Databases
Patented protein sequences (pat)
Patented sequences

Protein Data Bank proteins (pdb)

Sequences from RCSB Protein Data Bank with experimentally determined structures

Environmental samples (env_nr)

Protein sequences from environmental samples (not associated with known organism)

Nucleotide Databases
Human genomic + transcript
http://www.ncbi.nlm.nih.gov/genome/guide/human/

Mouse genomic + transcript

http://www.ncbi.nlm.nih.gov/genome/guide/mouse/

Nucleotide collection (nr/nt)

nr stands for non-redundant, but it isn't
GenBank (NCBI)
EMBL (European Nucleotide Sequence Database)
DDBJ (DNA Databank of Japan)
PDB (RCSB Protein Data Bank - 3D structure)

Kitchen sink, but excludes HTGS phases 0, 1, and 2, EST, GSS, STS, PAT, and WGS

Nucleotide Databases
Reference mRNA sequences (refseq_rna)
Reference genomic sequences (refseq_genomic)
NCBI Reference Sequences: Comprehensive, integrated, non-redundant, well-annotated set of sequences

NCBI Genomes (chromosome)

Complete genomes and chromosomes from Reference Sequences

Nucleotide Databases
Expressed sequence tags (est)
Non-human, non-mouse ESTs (est_others)
http://www.ncbi.nlm.nih.gov/About/primer/est.html
http://www.ncbi.nlm.nih.gov/dbEST/index.html

Genomic survey sequences (gss)

Like EST, but genomic rather than cDNA (mRNA)
Random "single pass read" genome survey sequences
Cosmid/BAC/YAC end sequences
Exon-trapped genomic sequences
Alu PCR sequences
Transposon-tagged sequences

http://www.ncbi.nlm.nih.gov/dbGSS/index.html

Nucleotide Databases
High throughput genomic sequences (HTGS)
Unfinished sequences (phase 1-2). Finished sequences are already in nr/nt
http://www.ncbi.nlm.nih.gov/HTGS/

Patent sequences (pat)

Patented genes

Protein Data Bank (pdb)

Sequences from RCSB Protein Data Bank with experimentally determined structures
http://www.rcsb.org/pdb/home/home.do

Nucleotide Databases
Human ALU repeat elements (alu_repeats)
Database of repetitive elements

Sequence tagged sites (dbsts)

Short sequences with known locations from GenBank, EMBL, DDBJ

Whole-genome shotgun reads (wgs)

http://www.ncbi.nlm.nih.gov/Genbank/wgs.html

Nucleotide Databases
Environmental samples (env_nt)
Nucleotide sequences from environmental samples (not associated with known organism)

Database Options
Limit to (or exclude) an organism
Exclude Models (XM/XP)
Model reference sequences produced by NCBI's Genome Annotation project. These records represent the transcripts and proteins that are annotated on the NCBI Contigs which may have been generated from incomplete data.

Entrez Query
Use Entrez query syntax to limit search

Step 3: Choose an Algorithm

How close a match are you looking for?
Determines how similarities are scored
Affects speed of search and chance of missing a match
Again, what is the goal of the search?

blastp
Protein-protein BLAST
Standard protein BLAST

PSI-BLAST
Protein-protein BLAST
Position-Specific Iterated BLAST
Finds more distantly related matches
Iterates: Initial search results provide information on allowed mutations; subsequent searches use this information to build a custom, position-specific substitution matrix
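The idea of turning first-round hits into a custom, position-specific matrix can be sketched as follows. This is a toy illustration only: it assumes the hits are already aligned without gaps, uses a four-letter alphabet with a uniform background and a flat pseudocount, and omits the sequence weighting that PSI-BLAST itself applies. All names and values are assumptions.

```python
import math
from collections import Counter


def build_pssm(aligned, background, pseudocount=1.0):
    """Log-odds (base 2) scores per column from equal-length ungapped sequences."""
    alphabet = sorted(background)
    pssm = []
    for column in zip(*aligned):
        counts = Counter(column)
        total = len(column) + pseudocount * len(alphabet)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background[aa])
            for aa in alphabet
        })
    return pssm


def score(pssm, segment):
    """Score a segment of the same length as the matrix."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


# Toy alphabet and toy "hits" from a first search round.
background = {aa: 0.25 for aa in "ACDE"}
hits = ["ACDE", "ACDA", "ACEE"]
pssm = build_pssm(hits, background)

print(round(score(pssm, "ACDE"), 2))  # positive: agrees with the conserved columns
print(round(score(pssm, "EDCA"), 2))  # negative: disagrees with them
```

Unlike a general matrix such as BLOSUM62, the score for a given residue now depends on the column: a substitution that is tolerated at a variable position is penalized at a conserved one.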

PHI-BLAST
Protein-protein BLAST
Pattern Hit Initiated BLAST
Variation of PSI-BLAST
Specify a pattern that hits must match
Use when you know the protein family has a signature pattern: active site, structural domain, etc.
Better chance of eliminating false positives
Example: VKAHGKKV
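Requiring hits to contain a pattern amounts to a filter on the candidate sequences. The sketch below converts a small subset of PROSITE-style pattern syntax into a Python regular expression and keeps only the sequences that match. The helper names, the second pattern, and both sequence fragments are illustrative assumptions; characters the tokenizer does not recognize are silently skipped, so this is not a validator.

```python
import re

# One pattern element: [ABC], {ABC}, a residue letter or x, with optional (n) or (n,m).
TOKEN = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Convert a subset of PROSITE syntax, e.g. [LIV]-x(2)-G, to a regex string."""
    parts = []
    for atom, low, high in TOKEN.findall(pattern.replace("-", "")):
        if atom == "x":
            atom = "."                          # any residue
        elif atom.startswith("{"):
            atom = "[^" + atom[1:-1] + "]"      # any residue except these
        if low:
            atom += "{" + low + ("," + high if high else "") + "}"
        parts.append(atom)
    return "".join(parts)


def phi_filter(pattern, sequences):
    """Return the names of sequences that contain the pattern."""
    regex = re.compile(prosite_to_regex(pattern))
    return [name for name, seq in sequences.items() if regex.search(seq)]


sequences = {
    "globin_like": "MGNPKVKAHGKKVLGAF",   # illustrative fragment
    "unrelated": "MSTNPKPQRKTKRNTNRR",    # illustrative fragment
}
print(prosite_to_regex("[LIV]-x(2)-G-{P}-K(1,2)"))  # [LIV].{2}G[^P]K{1,2}
print(phi_filter("VKAHGKKV", sequences))            # ['globin_like']
```

A plain string such as the slide's VKAHGKKV passes through unchanged, so it acts as an exact-match requirement.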

megablast
Nucleotide BLAST
Finds highly similar sequences
Very fast
Use to identify a nucleotide sequence

blastn
Nucleotide BLAST
Use to find less similar sequences

discontiguous megablast
Nucleotide BLAST
Bioinformatics. 2002 Mar;18(3):440-5. PatternHunter: faster and more sensitive homology search. Ma B, Tromp J, Li M.

Even more dissimilar sequences
Use to find diverged sequences (possible homologies) from different organisms
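The reason a discontiguous seed finds more diverged sequences can be shown with a toy example. In a seed template, `1` marks positions that must match and `0` marks positions that are ignored. The template below skips every third base, where coding sequences often differ; it is a hypothetical template chosen for illustration, not one of the templates used by discontiguous megablast or PatternHunter.

```python
def seed_hits(query, subject, template):
    """Return (i, j) pairs where query and subject agree at every '1' of the template."""
    care = [k for k, c in enumerate(template) if c == "1"]
    span = len(template)
    index = {}
    for i in range(len(query) - span + 1):
        key = "".join(query[i + k] for k in care)
        index.setdefault(key, []).append(i)
    hits = []
    for j in range(len(subject) - span + 1):
        key = "".join(subject[j + k] for k in care)
        hits.extend((i, j) for i in index.get(key, []))
    return hits


# Two toy coding sequences that differ at every third (wobble) position.
query = "GCTGAAAAGCTGGTT"
subject = "GCCGAGAAACTCGTA"

print(seed_hits(query, subject, "11111111"))     # [] : no contiguous 8-base match
print(seed_hits(query, subject, "11011011011"))  # [(0, 0), (3, 3)]
```

Both templates require eight matching bases, but only the spaced one tolerates the regularly placed differences, so it still produces seeds from which an alignment can be extended.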

Step 4: Algorithm Parameters

Fine-tune the algorithm
Short Queries
Expect threshold: The lower it is, the fewer false positives (but you might miss real hits)

Algorithm Parameters
Scoring Matrix:
PAM: Accepted Point Mutation
Empirically derived chance a substitution will be accepted, based on closely related proteins
Higher PAM numbers correspond to greater evolutionary distance

BLOSUM: Blocks Substitution Matrix

Another empirically derived matrix, based on more distantly related proteins
Lower BLOSUM numbers correspond to greater evolutionary distance

Compositional adjustment changes matrix to take into account overall composition of sequence
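Why higher PAM numbers mean greater distance follows from how the series is built: the PAM1 mutation probability matrix describes one accepted change per 100 residues, and PAM-n is obtained by applying it n times, that is, by raising it to the n-th power. The sketch below uses a made-up three-letter alphabet and invented probabilities purely to show the trend; it is not Dayhoff's data.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Toy "PAM1": each residue has a 1% chance of changing per step.
TOY_PAM1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]

for n in (1, 30, 250):
    unchanged = mat_pow(TOY_PAM1, n)[0][0]
    print(f"toy PAM{n}: P(residue unchanged) = {unchanged:.3f}")
```

As n grows, the probability that a residue is unchanged falls toward its background frequency, so a high-numbered PAM matrix rewards identities less and tolerates substitutions more. BLOSUM numbering runs the other way because the number refers to the percent identity at which the source sequences were clustered.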

Algorithm Parameters
Filters and Masking
Can ignore low-complexity regions in searching
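Low-complexity regions (long runs or simple repeats) tend to produce high-scoring but biologically meaningless matches, which is why they are masked before searching. BLAST uses dedicated programs for this (SEG for proteins, DUST for nucleotides); the sketch below is a much simpler stand-in based on Shannon entropy in a sliding window, and the window size and threshold are arbitrary assumptions.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=12, threshold=2.2, mask_char="X"):
    """Replace every window whose entropy is below the threshold with mask_char."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            masked[start:start + window] = mask_char * window
    return "".join(masked)


seq = "ACDEFGHIKLMN" + "Q" * 12 + "PRSTVWYACDEF"
print(mask_low_complexity(seq))
```

In this example the run of Q is masked together with a few flanking residues, because any window dominated by the run falls below the threshold. Masked positions are then skipped when looking for word matches.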

Additional Sources
Pevsner, Jonathan. Bioinformatics and Functional Genomics, 2nd ed. (Wiley-Blackwell, 2009)
BLAST help pages: http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs
Slides from class on similarity searching; lots of technical details on algorithms and similarity matrices: http://www.ncbi.nlm.nih.gov/Class/NAWBIS/Modules/Similarity/simsrchlast.html
