BIOPERL TUTORIAL (ABBREV.)
Getting started in Bio::Perl
1) Simple script to get a sequence by ID and write it to a specified format
use Bio::Perl;
# This script only works if the computer you are using has an internet
# connection. The databases you can get sequences from are
# 'swiss', 'genbank', 'genpept', 'embl', and 'refseq'.
# ROA1_HUMAN is a Swiss-Prot entry name, so it is fetched from 'swiss'.
$seq_object = get_sequence('swiss',"ROA1_HUMAN");
write_sequence(">roa1.fasta",'fasta',$seq_object);
The second argument of write_sequence, 'fasta', is the sequence format. You can choose among all the formats supported by SeqIO. Try writing the sequence file in 'genbank' format.
2) Another example is the ability to BLAST a sequence using the facilities at NCBI. Please be careful not to abuse the resources that NCBI provides, and use this only for individual searches. If you want to do a large number of BLAST searches, please download the BLAST package and install it locally.
use Bio::Perl;
# ROA1_HUMAN is a Swiss-Prot entry name, so it is fetched from 'swiss'
$seq = get_sequence('swiss',"ROA1_HUMAN");
# uses the default database - nr in this case
$blast_result = blast_sequence($seq);
write_blast(">roa1.blast",$blast_result);
Bio::Perl has a limited number of functions to retrieve and manipulate sequence data. (Bio::Perl manpage)
Sequence objects (Seq, PrimarySeq, RichSeq, LargeSeq, LocatableSeq, RelSegment, LiveSeq, SeqWithQuality, SeqI)
This section describes various Bioperl sequence objects. Typically you don't need to know the type of Sequence object because SeqIO assesses and creates the right type of object when given a file, filehandle or string.
Seq
The central sequence object in bioperl. When in doubt this is probably the object that you want to use to describe a DNA, RNA or protein sequence in bioperl. Most common sequence manipulations can be performed with Seq. (Bio::Seq manpage).
Seq objects may be created for you automatically when you read in a file containing sequence data using the SeqIO object (see below). In addition to storing its identification labels and the sequence itself, a Seq object can store multiple annotations and associated ``sequence features'', such as those contained in most Genbank and EMBL sequence files. This capability can be very useful - especially in development of automated genome annotation systems.
PrimarySeq
A stripped-down version of Seq. It contains just the sequence data itself and a few identifying labels (id, accession number, alphabet = dna, rna, or protein), and no features. For applications with hundreds or thousands of sequences, using PrimarySeq objects can significantly speed up program execution and decrease the amount of RAM the program requires. (Bio::PrimarySeq manpage).
RichSeq
Stores additional annotations beyond those used by standard Seq objects. If you are using sources with very rich sequence annotation, you may want to consider using these objects. RichSeq objects are created automatically when Genbank, EMBL, or Swissprot format files are read by SeqIO. (Bio::Seq::RichSeqI manpage)
LargeSeq
Used for handling very long sequences (e.g. > 100 MB). (Bio::Seq::LargeSeq manpage).
LocatableSeq
Might be more appropriately called an ``AlignedSeq'' object. It is a Seq object which is part of a multiple sequence alignment. It has start and end positions indicating from where in a larger sequence it may have been extracted. It also may have gap symbols corresponding to the alignment to which it belongs. It is used by the alignment object SimpleAlign and other modules that use SimpleAlign objects (e.g. AlignIO.pm, pSW.pm).
LocatableSeq objects will be made for you automatically when you create an alignment (using pSW, Clustalw, Tcoffee, Lagan, or bl2seq) or when you input an alignment data file using AlignIO. However, if you need to input a sequence alignment by hand (e.g. to build a SimpleAlign object), you will need to input the sequences as LocatableSeqs. Other sources of information include the Bio::LocatableSeq manpage, the Bio::SimpleAlign manpage, the Bio::AlignIO manpage, and the Bio::Tools::pSW manpage.
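As a rough illustration of entering an alignment by hand, the sketch below adds two LocatableSeq objects to an empty SimpleAlign. The sequences, ids and coordinates are made up for the example, and it assumes a BioPerl installation that provides Bio::LocatableSeq and Bio::SimpleAlign.

```perl
use strict;
use warnings;
use Bio::LocatableSeq;
use Bio::SimpleAlign;

# Hypothetical aligned sequences; '-' marks a gap in the alignment.
# -start and -end are coordinates on the ungapped source sequence,
# so seq1 (8 residues plus one gap) runs from 1 to 8.
my $seq1 = Bio::LocatableSeq->new(-seq   => 'ACGT-ACGT',
                                  -id    => 'seq1',
                                  -start => 1,
                                  -end   => 8);
my $seq2 = Bio::LocatableSeq->new(-seq   => 'ACGTTACGT',
                                  -id    => 'seq2',
                                  -start => 1,
                                  -end   => 9);

# Build the alignment one row at a time.
my $aln = Bio::SimpleAlign->new();
$aln->add_seq($seq1);
$aln->add_seq($seq2);

my @members = $aln->each_seq;
print "Alignment length: ", $aln->length, "\n";
print "Number of sequences: ", scalar(@members), "\n";
print $_->id, "\t", $_->seq, "\n" for @members;
```

The gapped strings must all have the same length, since every row of a SimpleAlign spans the same alignment columns.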
SeqWithQuality
Used to manipulate sequences with quality data, like those produced by phred. (Bio::Seq::SeqWithQuality manpage).
RelSegment
Useful when you want to be able to manipulate the origin of the genomic coordinate system. This situation may occur when looking at a sub-sequence (e.g. an exon) which is located on a longer underlying sequence such as a chromosome or a contig. Such manipulations may be important, for example, when designing a graphical genome browser. If your code may need such a capability, look at the documentation in the Bio::DB::GFF::RelSegment manpage, which describes this feature in detail.
LiveSeq
Addresses the problem of features whose location on a sequence changes over time. This can happen, for example, when sequence feature objects are used to store gene locations on newly sequenced genomes - locations which can change as higher quality sequencing data becomes available. Although a LiveSeq object is not implemented in the same way as a Seq object, LiveSeq does implement the SeqI interface (see below). Consequently, most methods available for Seq objects will work fine with LiveSeq objects. Section III.7.4 and the Bio::LiveSeq manpage contain further discussion of LiveSeq objects.
SeqI
Seq ``interface objects''. They are used to ensure bioperl's compatibility with other software packages. SeqI and other interface objects are not likely to be relevant to the casual Bioperl user. (Bio::SeqI manpage)
Accessing sequence data from local and remote databases
Example: create a simple Seq object. Can you make it print the accession number, alphabet, and sequence, each on a separate line?
use Bio::Seq;
$seq = Bio::Seq->new(-seq              => 'actgtggcgtcaact',
                     -description      => 'Sample Bio::Seq object',
                     -display_id       => 'something',
                     -accession_number => 'BIOL_5310',
                     -alphabet         => 'dna' );
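One possible way to approach the exercise is sketched here, assuming BioPerl is installed. It rebuilds the same Seq object and calls three of its accessor methods; the variable names are only illustrative.

```perl
use strict;
use warnings;
use Bio::Seq;

# Build the same Seq object as in the example.
my $seq = Bio::Seq->new(-seq              => 'actgtggcgtcaact',
                        -description      => 'Sample Bio::Seq object',
                        -display_id       => 'something',
                        -accession_number => 'BIOL_5310',
                        -alphabet         => 'dna');

# Each accessor returns a plain string; print one value per line.
print $seq->accession_number, "\n";
print $seq->alphabet, "\n";
print $seq->seq, "\n";
```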
In most cases, you will probably be accessing sequence data from some online data file or database.
Accessing remote databases
Example: The following code example will get 3 different sequences from GenBank.
use Bio::DB::GenBank;
$gb = Bio::DB::GenBank->new();
# this returns a Seq object :
$seq1 = $gb->get_Seq_by_id('MUSIGHBA1');
# this also returns a Seq object :
$seq2 = $gb->get_Seq_by_acc('AF303112');
# this returns a SeqIO object, which can be used to get a Seq object :
$seqio = $gb->get_Stream_by_id(["J00522","AF303112","2981014"]);
$seq3 = $seqio->next_seq;
Can you make it print all the sequence descriptions?
Transforming sequence files (SeqIO)
A common - and tedious - bioinformatics task is that of converting sequence data among the many widely used data formats. Bioperl's SeqIO object, however, makes this chore a breeze. SeqIO can read a stream of sequences - located in a single or in multiple files - in a number of formats including Fasta, EMBL, GenBank, Swissprot, SCF, phd/phred, Ace, fastq, exp, chado, or raw (plain sequence). SeqIO can also parse tracefiles in alf, ztr, abi, ctf, and ctr format.
Once the sequence data has been read in with SeqIO, it is available to Bioperl in the form of Seq, PrimarySeq, or RichSeq objects, depending on what the sequence source is. Moreover, the sequence objects can then be written to another file using SeqIO in any of the supported data formats, making data converters simple to implement, for example:
use Bio::SeqIO;
$in = Bio::SeqIO->new(-file => "inputfilename",
-format => 'Fasta');
$out = Bio::SeqIO->new(-file => ">outputfilename",
-format => 'EMBL');
while ( my $seq = $in->next_seq() ) {$out->write_seq($seq); }
In addition, the Perl ``tied filehandle'' syntax is available to SeqIO, allowing you to use the standard <> and print operations to read and write sequence objects, e.g.:
$in  = Bio::SeqIO->newFh(-file   => "inputfilename" ,
                         -format => 'fasta');
$out = Bio::SeqIO->newFh(-format => 'embl');
print $out $_ while <$in>;
If the ``-format'' argument isn't used then Bioperl will try to determine the format based on the file's suffix, in a case-insensitive manner. If there's no suffix available then SeqIO will attempt to guess the format based on actual content. If it can't determine the format then it will assume ``fasta''. A complete list of formats and suffixes can be found in the SeqIO HOWTO (http://www.bioperl.org/wiki/HOWTO:SeqIO).
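For trying out a format conversion without creating files, the sketch below reads two made-up FASTA records from a string and writes them in GenBank format to another string. It assumes that SeqIO's -fh argument accepts Perl's in-memory filehandles; the record names and sequences are invented for the example.

```perl
use strict;
use warnings;
use Bio::SeqIO;

# Two hypothetical FASTA records held in a string instead of a file.
my $fasta = ">seq1 first example\nACGTACGTAA\n"
          . ">seq2 second example\nTTGACCAGTC\n";

# In-memory filehandles: one to read the FASTA text, one to collect output.
open(my $in_fh, '<', \$fasta) or die "cannot read from string: $!";
my $genbank = '';
open(my $out_fh, '>', \$genbank) or die "cannot write to string: $!";

# -format is given explicitly because there is no file suffix to guess from.
my $in  = Bio::SeqIO->new(-fh => $in_fh,  -format => 'fasta');
my $out = Bio::SeqIO->new(-fh => $out_fh, -format => 'genbank');

my @ids;
while (my $seq = $in->next_seq) {
    push @ids, $seq->display_id;
    $out->write_seq($seq);
}
$out->close;    # finish writing so that $genbank holds all the output

print "Converted: @ids\n";
print $genbank;
```

Replacing the two filehandles with -file => "inputfilename" and -file => ">outputfilename" gives the file-based form of the same converter.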
Practice: Get the sequences with accession numbers AJ312413, NP_001073624 and XM_001807534 from GenBank and write them to a file in fasta format.
Transforming alignment files (AlignIO)
Data files storing multiple sequence alignments also appear in varied formats. AlignIO is the Bioperl object for conversion of alignment files. AlignIO is patterned on the SeqIO object and its commands have many of the same names as the commands in SeqIO. Just as in SeqIO the AlignIO object can be created with ``-file'' and ``-format'' options:
use Bio::AlignIO;
my $io = Bio::AlignIO->new(-file   => "example.aln",
                           -format => "clustalw" );
If the ``-format'' argument isn't used then Bioperl will try to determine the format based on the file's suffix, in a case-insensitive manner. Here is the current set of suffixes:
Format      Suffixes                      Comment
bl2seq
clustalw    aln
emboss*     water|needle
fasta       fasta|fast|seq|fa|fsa|nt|aa
maf         maf
mase                                      Seaview
mega        meg|mega
meme        meme
metafasta
msf         msf|pileup|gcg                GCG
nexus       nexus|nex
pfam        pfam|pfm
phylip      phylip|phlp|phyl|phy|phy|ph   interleaved
po                                        POA
prodom
psi         psi                           PSI-BLAST
selex       selex|slx|selx|slex|sx        HMMER
stockholm
*water, needle, matcher, stretcher, merger, and supermatcher.
Unlike SeqIO, AlignIO cannot create output files in every format. AlignIO currently supports output in these 7 formats: fasta, mase, selex, clustalw, msf/gcg, phylip (interleaved), and po.
Another significant difference between AlignIO and SeqIO is that AlignIO handles IO for only a single alignment at a time but SeqIO.pm handles IO for multiple sequences in a single stream. Syntax for AlignIO is almost identical to that of SeqIO:
use Bio::AlignIO;
$in = Bio::AlignIO->new(-file => "inputfilename" ,
-format => 'clustalw');
$out = Bio::AlignIO->new(-file => ">outputfilename",
-format => 'fasta');
while ( my $aln = $in->next_aln() ) { $out->write_aln($aln); }
The only difference is that the returned object reference, $aln, is to a SimpleAlign object rather than to a Seq object. AlignIO also supports the tied filehandle syntax described above for SeqIO. See the Bio::AlignIO manpage and the Bio::SimpleAlign manpage for more information.
Manipulating sequence data with Seq methods
Seq provides multiple methods for performing many common (and some not-so-common) tasks of sequence manipulation and data retrieval. Here are some of the most useful:
These methods return strings or may be used to set values:
$seqobj->display_id();       # the human-readable id of the sequence
$seqobj->seq();              # string of sequence
$seqobj->subseq(5,10);       # part of the sequence as a string
$seqobj->accession_number(); # when there, the accession number
$seqobj->alphabet();         # one of 'dna','rna','protein'
$seqobj->primary_id();       # a unique id for this sequence regardless
                             # of its display_id or accession number
$seqobj->description();      # a description of the sequence
It is worth mentioning that some of these values correspond to specific fields of given formats. For example, the display_id() method returns the LOCUS name of a Genbank entry, the (\S+) following the > character in a Fasta file, the ID from a SwissProt file, and so on. The description() method will return the DEFINITION line of a Genbank file, the line following the display_id in a Fasta file, and the DE field in a SwissProt file.
The following methods return an array of Bio::SeqFeature objects:
$seqobj->get_SeqFeatures;     # The 'top level' sequence features
$seqobj->get_all_SeqFeatures; # All sequence features, including
                              # sub-seq features
For a comment annotation, you can use:
use Bio::Annotation::Comment;
$seq->annotation->add_Annotation('comment',
    Bio::Annotation::Comment->new(-text => 'some description'));
For a reference annotation, you can use:
use Bio::Annotation::Reference;
$seq->annotation->add_Annotation('reference',
    Bio::Annotation::Reference->new(-authors  => 'author1,author2',
                                    -title    => 'title line',
                                    -location => 'location line',
                                    -medline  => 998122 ));
A general description of the object can be found in the Bio::SeqFeature::Generic manpage, and a description of related, top-level annotation is found in the Bio::Annotation::Collection manpage. There's also a HOWTO on features and annotations ( http://bioperl.org/HOWTOs/html/Feature-Annotation.html ) and there's a section on features in the FAQ ( http://bioperl.org/Core/Latest/faq.html#5 ).
The following methods return new sequence objects, but do not transfer the features from the starting object to the resulting object:
$seqobj->trunc(5,10);  # truncation from 5 to 10 as new object
$seqobj->revcom;       # reverse complements sequence
$seqobj->translate;    # translation of the sequence
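To see these three methods side by side, the sketch below applies them to a short made-up DNA sequence that carries one hypothetical feature. It assumes BioPerl is installed; the feature coordinates and tag are invented for the example.

```perl
use strict;
use warnings;
use Bio::Seq;
use Bio::SeqFeature::Generic;

my $seqobj = Bio::Seq->new(-seq        => 'actgtggcgtcaact',
                           -display_id => 'example',
                           -alphabet   => 'dna');

# Attach one hypothetical feature so we can see what happens to it.
$seqobj->add_SeqFeature(
    Bio::SeqFeature::Generic->new(-start       => 2,
                                  -end         => 8,
                                  -primary_tag => 'misc_feature'));

# Each call returns a new sequence object, not a string.
my $sub  = $seqobj->trunc(5,10);
my $rev  = $seqobj->revcom;
my $prot = $seqobj->translate;

print $sub->seq,  "\n";   # tggcgt
print $rev->seq,  "\n";   # agttgacgccacagt
print $prot->seq, "\n";   # TVAST

# The original keeps its feature; the truncated copy has none.
my @orig_feats = $seqobj->get_SeqFeatures;
my @sub_feats  = $sub->get_SeqFeatures;
print "features on original: ", scalar(@orig_feats), "\n";
print "features on trunc:    ", scalar(@sub_feats),  "\n";
```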
**Note that some methods return strings, some return arrays and some return objects. See the Bio::Seq manpage for more information.
Many of these methods are self-explanatory. However, the flexible translate() method needs some explanation. Translation in bioinformatics can mean two slightly different things:
1. Translating a nucleotide sequence from start to end.
2. Translating the actual coding regions in mRNAs or cDNAs.
The Bioperl implementation of sequence translation does the first of these tasks easily. Any sequence object which is not of alphabet 'protein' can be translated by simply calling the method, which returns a protein sequence object:
$prot_obj = $my_seq_object->translate;
All codons will be translated, including those before and after any initiation and termination codons. For example, ttttttatgccctaggggg will be translated to FFMP*G.
However, the translate() method can also be passed several optional parameters to modify its behavior. For example, you can tell translate() to modify the characters used to represent terminator (default '*') and unknown amino acids (default 'X').
$prot_obj = $my_seq_object->translate(-terminator => '-');
$prot_obj = $my_seq_object->translate(-unknown => '_');
You can also set the frame of the translation. The default frame starts at the first nucleotide (frame 0). To get translation in the next frame, we would write:
$prot_obj = $my_seq_object->translate(-frame => 1);
The codontable_id argument to translate() makes it possible to use alternative genetic codes. There are currently 16 codon tables defined, including 'Standard', 'Vertebrate Mitochondrial', 'Bacterial', 'Alternative Yeast Nuclear' and 'Ciliate, Dasycladacean and Hexamita Nuclear'. All these tables can be seen in the Bio::Tools::CodonTable manpage. For example, for mitochondrial translation:
$prot_obj = $seq_obj->translate(-codontable_id => 2);
If we want to translate full coding regions (CDS) the way major nucleotide databanks EMBL, GenBank and DDBJ do it, the translate() method has to perform more checks. Specifically, translate() needs to confirm that the sequence has appropriate start and terminator codons at the very beginning and the very end of the sequence and that there are no terminator codons present within the sequence in frame 0. In addition, if the genetic code being used has an atypical (non-ATG) start codon, the translate() method needs to convert the initial amino acid to methionine. These checks and conversions are triggered by setting ``complete'' to 1:
$prot_obj = $my_seq_object->translate(-complete => 1);
If ``complete'' is set to true and the criteria for a proper CDS are not met, the method, by default, issues a warning. By setting ``throw'' to 1, one can instead instruct the program to die if an improper CDS is found, e.g.
$prot_obj = $my_seq_object->translate(-complete => 1,
-throw => 1);
You can also create a custom codon table and pass this object to translate:
$prot_obj = $my_seq_object->translate(-codontable => $table_obj);
translate() can also find the open reading frame (ORF) starting at the 1st initiation codon in the nucleotide sequence, regardless of its frame, and translate that:
$prot_obj = $my_seq_object->translate(-orf => 1);
Most of the codon tables used by translate() have initiation codons in addition to ATG, including the default codon table, NCBI ``Standard''. To tell translate() to use only ATG, or atg, as the initiation codon set -start to ``atg'':
$prot_obj = $my_seq_object->translate(-orf => 1,
-start => "atg" );
The -start argument only applies when -orf is set to 1.
Last trick. By default translate() will translate the termination codon to some special character (the default is *, but this can be reset using the -terminator argument). When -complete is set to 1 this character is removed. So, with this:
$prot_obj = $my_seq_object->translate(-orf => 1,
-complete => 1);
the sequence tttttatgccctaggggg will be translated to MP, not MP*.
See the Bio::Tools::CodonTable manpage and the Bio::PrimarySeqI manpage for more information on translation.
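The options described in this section can be tried on the two example sequences from the text. The sketch below assumes a BioPerl version that accepts named arguments to translate(); the display ids are invented, and the expected results in the comments are either the ones quoted in the text or worked out by hand from the standard genetic code.

```perl
use strict;
use warnings;
use Bio::Seq;

# First example sequence from the text (19 nucleotides).
my $seq = Bio::Seq->new(-seq        => 'ttttttatgccctaggggg',
                        -display_id => 'example1',
                        -alphabet   => 'dna');

my $default    = $seq->translate->seq;                      # FFMP*G
my $dash_stop  = $seq->translate(-terminator => '-')->seq;  # FFMP-G
my $next_frame = $seq->translate(-frame => 1)->seq;         # FLCPRG

print "default:    $default\n";
print "terminator: $dash_stop\n";
print "frame 1:    $next_frame\n";

# Second example sequence from the text (18 nucleotides).
my $orf_seq = Bio::Seq->new(-seq        => 'tttttatgccctaggggg',
                            -display_id => 'example2',
                            -alphabet   => 'dna');

# Translate the first ORF, then the same ORF treated as a complete CDS,
# which drops the terminator character (MP rather than MP*, per the text).
my $orf_only = $orf_seq->translate(-orf => 1, -start => 'atg')->seq;
my $orf_cds  = $orf_seq->translate(-orf => 1, -complete => 1)->seq;

print "orf:        $orf_only\n";
print "orf + cds:  $orf_cds\n";
```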
Obtaining basic sequence statistics (SeqStats, SeqWords)
In addition to the methods directly available in the Seq object, bioperl provides various helper objects to determine additional information about a sequence. For example, the SeqStats object provides methods for obtaining the molecular weight of the sequence as well as the number of occurrences of each of the component residues (bases for a nucleic acid or amino acids for a protein). For nucleic acids, SeqStats also returns counts of the number of codons used. For example:
use Bio::Tools::SeqStats;
$seq_stats = Bio::Tools::SeqStats->new(-seq => $seqobj);
$weight = $seq_stats->get_mol_wt();
$monomer_ref = $seq_stats->count_monomers();
$codon_ref = $seq_stats->count_codons(); # for nucleic acid sequence
Note: sometimes sequences will contain ambiguous codes. For this reason, get_mol_wt() returns a reference to a two element array containing a greatest lower bound and a least upper bound of the molecular weight.
The SeqWords object is similar to SeqStats and provides methods for calculating frequencies of ``words'' (e.g. tetramers or hexamers) within the sequence. See the Bio::Tools::SeqStats manpage and the Bio::Tools::SeqWords manpage for more information.
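Because these methods return references rather than plain values, a small sketch of unpacking them may help. It assumes BioPerl is installed and uses a made-up 15-base DNA sequence; the variable names are illustrative.

```perl
use strict;
use warnings;
use Bio::Seq;
use Bio::Tools::SeqStats;

# A short hypothetical DNA sequence.
my $seqobj = Bio::Seq->new(-seq        => 'actgtggcgtcaact',
                           -display_id => 'example',
                           -alphabet   => 'dna');

my $seq_stats = Bio::Tools::SeqStats->new(-seq => $seqobj);

# get_mol_wt() returns a reference to a two-element array:
# the lower and upper bound of the molecular weight.
my $weight = $seq_stats->get_mol_wt();
print "Molecular weight between $weight->[0] and $weight->[1]\n";

# count_monomers() returns a reference to a hash of residue => count.
my $monomer_ref = $seq_stats->count_monomers();
my $total = 0;
for my $base (sort keys %$monomer_ref) {
    print "$base: $monomer_ref->{$base}\n";
    $total += $monomer_ref->{$base};
}
print "Total residues counted: $total\n";

# count_codons() returns a reference to a hash of codon => count.
my $codon_ref = $seq_stats->count_codons();
print "$_: $codon_ref->{$_}\n" for sort keys %$codon_ref;
```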
Identifying restriction enzyme sites (Bio::Restriction)
Another common sequence manipulation task for nucleic acid sequences is locating restriction enzyme cutting sites. Bioperl provides the Bio::Restriction::Enzyme, Bio::Restriction::EnzymeCollection, and Bio::Restriction::Analysis objects for this purpose. A new collection of enzyme objects would be defined like this:
use Bio::Restriction::EnzymeCollection;
my $all_collection = Bio::Restriction::EnzymeCollection->new();
Bioperl's default Restriction::EnzymeCollection object comes with data for more than 500 different Type II restriction enzymes. A list of the available enzyme names can be accessed using the available_list() method, but these are just the names, not the functional objects. You also have access to enzyme subsets. For example, to select all available Enzyme objects with recognition sites that are six bases long one could write:
my $six_cutter_collection = $all_collection->cutters(6);
# each_enzyme returns the Enzyme objects held in the collection
for my $enz ($six_cutter_collection->each_enzyme){
    print $enz->name,"\t",$enz->site,"\t",$enz->overhang_seq,"\n";
    # prints name, recognition site, overhang
}
There are other methods that can be used to select sets of enzyme objects, such as unique_cutters() and blunt_enzymes(). You can also select an Enzyme object by name, like so:
my $ecori_enzyme = $all_collection->get_enzyme('EcoRI');
Once an appropriate enzyme has been selected, the sites for that enzyme on a given nucleic acid sequence can be obtained using the fragments() method. The syntax for performing this task is:
use Bio::Restriction::Analysis;
my $analysis = Bio::Restriction::Analysis->new(-seq => $seq);
# where $seq is the Bio::Seq object for the DNA to be cut
my @fragments = $analysis->fragments($ecori_enzyme);
# $ecori_enzyme is the Enzyme object selected above,
# and @fragments will be an array of strings
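Putting the pieces of this section together, the sketch below cuts a short made-up sequence containing a single EcoRI site (GAATTC). It assumes BioPerl's default enzyme collection is available; the sequence and its id are invented for the example.

```perl
use strict;
use warnings;
use Bio::Seq;
use Bio::Restriction::EnzymeCollection;
use Bio::Restriction::Analysis;

# Hypothetical linear DNA with exactly one EcoRI site (GAATTC).
my $seq = Bio::Seq->new(-seq        => 'ACGTACGTAGGAATTCTTGCATGCAT',
                        -display_id => 'example',
                        -alphabet   => 'dna');

# Load the default enzyme collection and pick EcoRI by name.
my $all_collection = Bio::Restriction::EnzymeCollection->new();
my $ecori_enzyme   = $all_collection->get_enzyme('EcoRI');
print $ecori_enzyme->name, " recognizes ", $ecori_enzyme->site, "\n";

# Cut the sequence and list the resulting fragments.
my $analysis  = Bio::Restriction::Analysis->new(-seq => $seq);
my @fragments = $analysis->fragments($ecori_enzyme);

print scalar(@fragments), " fragments:\n";
print "  $_\n" for @fragments;
```

EcoRI cuts between the G and the first A of its site, so the two fragments should be ACGTACGTAGG and AATTCTTGCATGCAT.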
For more information, including creating your own RE database (REBASE), see the Bio::Restriction::Enzyme manpage, the Bio::Restriction::EnzymeCollection manpage, the Bio::Restriction::Analysis manpage, and the Bio::Restriction::IO manpage.
Identifying amino acid cleavage sites (Sigcleave)
Sigcleave predicts amino acid cleavage sites. Please see the Bio::Tools::Sigcleave manpage for details.
Miscellaneous sequence utilities: OddCodes, SeqPattern
OddCodes: produces a listing of an amino acid sequence showing where amino acids with particular functional properties are located, for example where the positively charged ones are. See the documentation in the Bio::Tools::OddCodes manpage for further details.
SeqPattern: used to manipulate sequences using Perl regular expressions. More detail can be found in the Bio::Tools::SeqPattern manpage.
Converting coordinate systems (Coordinate::Pair, RelSegment)
Coordinate system conversion is a common requirement, for example, when one wants to look at the relative positions of sequence features to one another and convert those relative positions to absolute coordinates along a chromosome or contig. Although coordinate conversion sounds pretty trivial, it can get fairly tricky when one includes the possibilities of switching to coordinates on negative (i.e. Crick) strands and/or having a coordinate system terminate because you have reached the end of a clone or contig.
For more details on coordinate transformations and other GFF-related capabilities in Bioperl see the Bio::DB::GFF::RelSegment manpage and the Bio::DB::GFF manpage.
Searching for similar sequences
Running BLAST (using RemoteBlast.pm)
A skeleton script to run a remote blast might look as follows:
$remote_blast = Bio::Tools::Run::RemoteBlast->new (
-prog => 'blastp', -data => 'ecoli', -expect => '1e-10' );
$r = $remote_blast->submit_blast("t/data/ecolitst.fa");
while (@rids = $remote_blast->each_rid ) {
for $rid ( @rids ) {$rc = $remote_blast->retrieve_blast($rid);}
}
You may want to change some parameters of the remote job. This example shows how to change the matrix:
$Bio::Tools::Run::RemoteBlast::HEADER{'MATRIX_NAME'} = 'BLOSUM25';
For a description of the many CGI parameters see: http://www.ncbi.nlm.nih.gov/BLAST/Doc/urlapi.html
Note that the script has to be broken into two parts: the actual Blast submission and the subsequent retrieval of the results. At times when the NCBI Blast is being heavily used, the interval between when a Blast submission is made and when the results are available can be substantial.
The object $rc would contain the blast report that could then be parsed with Bio::Tools::BPlite or Bio::SearchIO. The default object returned is SearchIO after version 1.0. The object type can be changed using the -readmethod parameter, but bear in mind that the favored Blast parser is Bio::SearchIO; others won't be supported in later versions.
**Note that to make this script actually useful, one should add details such as checking return codes from the Blast to see if it succeeded and a ``sleep'' loop to wait between consecutive requests to the NCBI server. See the Bio::Tools::Run::RemoteBlast manpage for details.
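A fuller version of the skeleton, with the return-code check and the ``sleep'' loop added, might look like the sketch below. It follows the usage pattern shown in the Bio::Tools::Run::RemoteBlast documentation, but the five-second delay, the subroutine name and the command-line handling are assumptions made for illustration, and the database name is simply the one used in the skeleton above. It needs the RemoteBlast module to be installed and only contacts NCBI when a FASTA file name is given.

```perl
use strict;
use warnings;
use Bio::Tools::Run::RemoteBlast;

# Submit every sequence in $fasta_file and return the parsed results.
sub run_remote_blast {
    my ($fasta_file) = @_;
    my $factory = Bio::Tools::Run::RemoteBlast->new(
        -prog       => 'blastp',
        -data       => 'ecoli',     # database name from the skeleton above
        -expect     => '1e-10',
        -readmethod => 'SearchIO',
    );

    # Part one: submission.
    $factory->submit_blast($fasta_file);

    # Part two: poll until every request id (RID) has been dealt with.
    my @results;
    while (my @rids = $factory->each_rid) {
        for my $rid (@rids) {
            my $rc = $factory->retrieve_blast($rid);
            if (!ref $rc) {
                # 0 means "not ready yet"; a negative value means failure.
                $factory->remove_rid($rid) if $rc < 0;
                sleep 5;    # hypothetical delay between requests
            } else {
                # $rc is a SearchIO object; take its first result.
                push @results, $rc->next_result;
                $factory->remove_rid($rid);
            }
        }
    }
    return @results;
}

# Only contact NCBI when a FASTA file name is given on the command line.
if (@ARGV) {
    for my $result (run_remote_blast($ARGV[0])) {
        print "Query: ", $result->query_name, "\n";
        while (my $hit = $result->next_hit) {
            print "\t", $hit->name, "\n";
        }
    }
}
```

Removing each RID once it has been retrieved or has failed is what lets the outer while loop terminate.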
It should also be noted that the syntax for creating a remote blast factory is slightly different from that used in creating StandAloneBlast, Clustalw, and T-Coffee factories. Specifically, RemoteBlast requires parameters to be passed with a leading hyphen, as in '-prog' => 'blastp', while the other programs do not pass parameters with a leading hyphen.
Parsing BLAST and FASTA reports with Search and SearchIO
No matter how Blast searches are run (locally or remotely, with or without a Perl interface), they return large quantities of data that are tedious to sift through. Bioperl offers several different objects - Search.pm/SearchIO.pm, and BPlite.pm (along with its minor modifications, BPpsilite and BPbl2seq) - for parsing Blast reports. Search and SearchIO, which are the principal Bioperl interfaces for Blast and FASTA report parsing, are described in this section. The older BPlite is described in section III.4.3. We recommend you use SearchIO; it's certain to be supported in future releases.
The Search and SearchIO modules provide a uniform interface for parsing sequence-similarity-search reports generated by BLAST (in standard and BLAST XML formats), PSI-BLAST, RPS-BLAST, bl2seq and FASTA. The SearchIO modules also provide a parser for HMMER reports and in the future, it is envisioned that the Search/SearchIO syntax will be extended to provide a uniform interface to an even wider range of report parsers including parsers for Genscan.
Parsing sequence-similarity reports with Search and SearchIO is straightforward. Initially a SearchIO object specifies a file containing the report(s). The method next_result() reads the next report into a Search object in just the same way that the next_seq() method of SeqIO reads in the next sequence in a file into a Seq object.
Once a report (i.e. a Search result object) has been read in and is available to the script, the report's overall attributes (e.g. the query) can be determined and its individual hits can be accessed with the next_hit() method. Individual high-scoring segment pairs for each hit can then be accessed with the next_hsp() method. Except for the additional syntax required to enable the reading of multiple reports in a single file, the remainder of the Search/SearchIO parsing syntax is very similar to that of the BPlite object it is intended to replace. Sample code to read a BLAST report might look like this:
use Bio::SearchIO;
# Get the report
$searchio = Bio::SearchIO->new(-format => 'blast',
                               -file   => $blast_report);
$result = $searchio->next_result;
# Get info about the entire report
$database_name  = $result->database_name;
$algorithm_type = $result->algorithm;
# get info about the first hit
$hit = $result->next_hit;
$hit_name = $hit->name ;
# get info about the first hsp of the first hit
$hsp = $hit->next_hsp;
$hsp_start = $hsp->query->start;
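The sample above looks only at the first result, hit and HSP. A sketch of walking through a whole report with nested next_result(), next_hit() and next_hsp() loops is given below; the subroutine name, the tab-separated output layout and the command-line handling are assumptions made for illustration.

```perl
use strict;
use warnings;
use Bio::SearchIO;

# Print one line per HSP in a BLAST report and return how many were seen.
sub summarize_report {
    my ($blast_report) = @_;
    my $searchio = Bio::SearchIO->new(-format => 'blast',
                                      -file   => $blast_report);
    my $hsp_count = 0;
    # A single file may hold several reports, one per query.
    while (my $result = $searchio->next_result) {
        print "Query: ", $result->query_name, "\n";
        while (my $hit = $result->next_hit) {
            while (my $hsp = $hit->next_hsp) {
                # hit name, query start, query end, e-value
                print join("\t", $hit->name,
                                 $hsp->query->start,
                                 $hsp->query->end,
                                 $hsp->evalue), "\n";
                $hsp_count++;
            }
        }
    }
    return $hsp_count;
}

# Run only when a report file name is given on the command line.
if (@ARGV) {
    my $n = summarize_report($ARGV[0]);
    print "$n HSPs in total\n";
}
```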
For more details there is a good description of how to use SearchIO at http://www.bioperl.org/HOWTOs/html/SearchIO.html or in the docs/howto subdirectory of the distribution. Additional documentation can be found in the Bio::SearchIO::blast manpage, the Bio::SearchIO::psiblast manpage, the Bio::SearchIO::blastxml manpage, the Bio::SearchIO::fasta manpage, and the Bio::SearchIO manpage. There is also sample code in the examples/searchio directory which illustrates how to use SearchIO. And finally, there's a section with SearchIO questions in the FAQ ( http://bioperl.org/Core/Latest/faq.html#3 ).
Parsing BLAST reports with BPlite, BPpsilite, and BPbl2seq
Bioperl's older BLAST report parsers - BPlite, BPpsilite, BPbl2seq and Blast.pm - are no longer supported, but since legacy Bioperl scripts have been written which use these objects, they are likely to remain within Bioperl for some time. A complete description of the module can be found in the Bio::Tools::BPlite manpage.
RefSeq Accession abbreviation key: http://www.ncbi.nlm.nih.gov/projects/RefSeq/key.html - accessions