BioPerl Tutorial

This document provides an overview of BioPerl sequence objects that can be used to represent and manipulate biological sequence data. It describes common sequence object types like Seq, PrimarySeq, and RichSeq, explaining what each stores and when they would be used. It also outlines less common object types like LocatableSeq, RelSegment, and LiveSeq that provide additional functionality for aligned sequences, sequences with changing coordinates, and more. The goal of BioPerl sequence objects is to make it easier to retrieve, analyze, and manipulate sequence data through Perl.

Uploaded by

Murali Mohan Reddy

BIOPERL

TUTORIAL (ABREV.)

Getting started in Bio::Perl

1) Simple script to get a sequence by Id and write to specified format

use Bio::Perl;
# this script will only work if you have an internet connection on the
# computer you're using; the databases you can get sequences from
# are 'swiss', 'genbank', 'genpept', 'embl', and 'refseq'.
# ROA1_HUMAN is a Swiss-Prot entry name, so it is fetched from 'swiss'.
$seq_object = get_sequence('swiss',"ROA1_HUMAN");
write_sequence(">roa1.fasta",'fasta',$seq_object);

That second argument of write_sequence, 'fasta', is the sequence format. You
can choose among all the formats supported by SeqIO. Try writing the sequence file
in 'genbank' format.
2) Another example is the ability to blast a sequence using the facilities at NCBI. Please
be careful not to abuse the resources that NCBI provides and use this only for
individual searches. If you want to do a large number of BLAST searches, please
download the blast package and install it locally.
use Bio::Perl;
# ROA1_HUMAN is a Swiss-Prot entry name, so it is fetched from 'swiss'
$seq = get_sequence('swiss',"ROA1_HUMAN");
# uses the default database - nr in this case
$blast_result = blast_sequence($seq);
write_blast(">roa1.blast",$blast_result);

Bio::Perl has a limited number of functions to retrieve and manipulate sequence
data. (Bio::Perl manpage)

Sequence objects

(Seq, PrimarySeq, RichSeq, LargeSeq, LocatableSeq, RelSegment, LiveSeq,
SeqWithQuality, SeqI)

This section describes various Bioperl sequence objects. Typically you don't need to
know the type of Sequence object because SeqIO assesses and creates the right type
of object when given a file, filehandle or string.

Seq

The central sequence object in bioperl. When in doubt this is probably the object
that you want to use to describe a DNA, RNA or protein sequence in bioperl. Most
common sequence manipulations can be performed with Seq. (Bio::Seq manpage).
Seq objects may be created for you automatically when you read in a file containing
sequence data using the SeqIO object (see below). In addition to storing its
identification labels and the sequence itself, a Seq object can store multiple
annotations and associated ``sequence features'', such as those contained in most
Genbank and EMBL sequence files. This capability can be very useful - especially in
development of automated genome annotation systems.

PrimarySeq

A stripped-down version of Seq. It contains just the sequence data itself and a few
identifying labels (id, accession number, alphabet = dna, rna, or protein), and no
features. For applications with hundreds or thousands of sequences, using
PrimarySeq objects can significantly speed up program execution and decrease the
amount of RAM the program requires. (Bio::PrimarySeq manpage).

RichSeq

Stores additional annotations beyond those used by standard Seq objects. If you are
using sources with very rich sequence annotation, you may want to consider using
these objects. RichSeq objects are created automatically when Genbank, EMBL, or
Swissprot format files are read by SeqIO. (Bio::Seq::RichSeqI manpage)

LargeSeq

Used for handling very long sequences (e.g. > 100 MB). (Bio::Seq::LargeSeq
manpage).

LocatableSeq

Might be more appropriately called an ``AlignedSeq'' object. It is a Seq object which
is part of a multiple sequence alignment. It has start and end positions indicating
from where in a larger sequence it may have been extracted. It also may have gap
symbols corresponding to the alignment to which it belongs. It is used by the
alignment object SimpleAlign and other modules that use SimpleAlign objects (e.g.
AlignIO.pm, pSW.pm).
LocatableSeq objects will be made for you automatically when you create an
alignment (using pSW, Clustalw, Tcoffee, Lagan, or bl2seq) or when you input an
alignment data file using AlignIO. However if you need to input a sequence
alignment by hand (e.g. to build a SimpleAlign object), you will need to input the
sequences as LocatableSeqs. Other sources of information include the
Bio::LocatableSeq manpage, the Bio::SimpleAlign manpage, the Bio::AlignIO
manpage, and the Bio::Tools::pSW manpage.
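
Building an alignment by hand, as described above, might look roughly like the following sketch. The two sequences, their ids, and their coordinates are made up for illustration. The -end value counts residues only, so the gapped sequence ends at 8 rather than 9.

```perl
use strict;
use warnings;
use Bio::LocatableSeq;
use Bio::SimpleAlign;

# Two made-up aligned sequences; '-' marks a gap in the alignment.
my $aln = Bio::SimpleAlign->new();
$aln->add_seq( Bio::LocatableSeq->new( -display_id => 'seq1',
                                       -seq        => 'ACGT-ACGT',
                                       -start      => 1,
                                       -end        => 8 ) );  # 8 residues, gap not counted
$aln->add_seq( Bio::LocatableSeq->new( -display_id => 'seq2',
                                       -seq        => 'ACGTTACGT',
                                       -start      => 1,
                                       -end        => 9 ) );

# each_seq returns the LocatableSeq members of the alignment
my @members = $aln->each_seq;
print "sequences: ", scalar(@members), "\n";
print "alignment length: ", $aln->length, "\n";
for my $member (@members) {
    print $member->display_id, "\t",
          $member->start, "-", $member->end, "\t",
          $member->seq, "\n";
}
```

The resulting $aln is an ordinary SimpleAlign object, so it can be handed to AlignIO for output like any alignment read from a file.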

SeqWithQuality

Used to manipulate sequences with quality data, like those produced by phred.
(Bio::Seq::SeqWithQuality manpage).

RelSegment

Useful when you want to be able to manipulate the origin of the genomic coordinate
system. This situation may occur when looking at a sub-sequence (e.g. an exon)
which is located on a longer underlying sequence such as a chromosome
or a contig. Such manipulations may be important, for example when designing a
graphical genome browser. If your code may need such a capability, look at the
documentation in the Bio::DB::GFF::RelSegment manpage which describes this feature
in detail.

LiveSeq

Addresses the problem of features whose location on a sequence changes over time.
This can happen, for example, when sequence feature objects are used to store gene
locations on newly sequenced genomes - locations which can change as higher
quality sequencing data becomes available. Although a LiveSeq object is not
implemented in the same way as a Seq object, LiveSeq does implement the SeqI
interface (see below). Consequently, most methods available for Seq objects will
work fine with LiveSeq objects. Section III.7.4 of the full Bioperl tutorial and the
Bio::LiveSeq manpage contain further discussion of LiveSeq objects.

SeqI

Seq ``interface objects''. They are used to ensure bioperl's compatibility with other
software packages. SeqI and other interface objects are not likely to be relevant to
the casual Bioperl user. (Bio::SeqI manpage)

Accessing sequence data from local and remote databases

Example: create a simple Seq object. Can you make it print the accession number,
alphabet, and sequence each on a separate line?
use Bio::Seq;
$seq = Bio::Seq->new(-seq              => 'actgtggcgtcaact',
                     -description      => 'Sample Bio::Seq object',
                     -display_id       => 'something',
                     -accession_number => 'BIOL_5310',
                     -alphabet         => 'dna' );
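
One possible answer to the exercise is sketched here. It rebuilds the same sample object and uses only the accessor methods of Bio::Seq; nothing is fetched over the network.

```perl
use strict;
use warnings;
use Bio::Seq;

my $seq = Bio::Seq->new(-seq              => 'actgtggcgtcaact',
                        -description      => 'Sample Bio::Seq object',
                        -display_id       => 'something',
                        -accession_number => 'BIOL_5310',
                        -alphabet         => 'dna' );

# Each accessor returns a plain string, so print can be used directly.
print $seq->accession_number, "\n";
print $seq->alphabet,         "\n";
print $seq->seq,              "\n";
```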

In most cases, you will probably be accessing sequence data from some online data
file or database.

Accessing remote databases

Example: The following code example will get 3 different sequences from GenBank.

use Bio::DB::GenBank;
$gb = Bio::DB::GenBank->new();

# this returns a Seq object :
$seq1 = $gb->get_Seq_by_id('MUSIGHBA1');
# this also returns a Seq object :
$seq2 = $gb->get_Seq_by_acc('AF303112');
# this returns a SeqIO object, which can be used to get a Seq object :
$seqio = $gb->get_Stream_by_id(["J00522","AF303112","2981014"]);
$seq3 = $seqio->next_seq;

Can you make it print all the sequence descriptions?

Transforming sequence files (SeqIO)

A common - and tedious - bioinformatics task is that of converting sequence data
among the many widely used data formats. Bioperl's SeqIO object, however, makes
this chore a breeze. SeqIO can read a stream of sequences - located in a single or in
multiple files - in a number of formats including Fasta, EMBL, GenBank, Swissprot,
SCF, phd/phred, Ace, fastq, exp, chado, or raw (plain sequence). SeqIO can also parse
tracefiles in alf, ztr, abi, ctf, and ctr format. Once the sequence data has been read in
with SeqIO, it is available to Bioperl in the form of Seq, PrimarySeq, or RichSeq
objects, depending on what the sequence source is. Moreover, the sequence objects
can then be written to another file using SeqIO in any of the supported data formats
making data converters simple to implement, for example:
use Bio::SeqIO;
$in = Bio::SeqIO->new(-file => "inputfilename",
-format => 'Fasta');
$out = Bio::SeqIO->new(-file => ">outputfilename",
-format => 'EMBL');
while ( my $seq = $in->next_seq() ) {$out->write_seq($seq); }
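
The same converter pattern can be tried without any files on disk. The sketch below is only an illustration: it assumes SeqIO's -fh option combined with Perl's in-memory filehandles, and the two FASTA records are made up.

```perl
use strict;
use warnings;
use Bio::SeqIO;

# Made-up FASTA records held in a string instead of a file on disk.
my $fasta = ">seq1 first example record\nactgtggcgtcaact\n"
          . ">seq2 second example record\nttttttatgccctaggggg\n";

# Perl can open a filehandle on a scalar reference.
open my $in_fh, '<', \$fasta or die "cannot read from string: $!";
my $genbank = '';
open my $out_fh, '>', \$genbank or die "cannot write to string: $!";

my $in  = Bio::SeqIO->new(-fh => $in_fh,  -format => 'fasta');
my $out = Bio::SeqIO->new(-fh => $out_fh, -format => 'genbank');

# Same loop as the file-based converter: read one, write one.
my $count = 0;
while ( my $seq = $in->next_seq ) {
    $out->write_seq($seq);
    $count++;
}
$out->close;    # make sure everything has been written to $genbank

print "converted $count records\n";
print $genbank;
```

Swapping the -fh arguments for -file arguments turns this back into the file-to-file converter shown above.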

In addition, the Perl ``tied filehandle'' syntax is available to SeqIO, allowing you to
use the standard <> and print operations to read and write sequence objects, eg:
$in  = Bio::SeqIO->newFh(-file   => "inputfilename",
                         -format => 'fasta');
$out = Bio::SeqIO->newFh(-format => 'embl');
print $out $_ while <$in>;

If the ``-format'' argument isn't used then Bioperl will try to determine the format based
on the file's suffix, in a case-insensitive manner. If there's no suffix available then SeqIO
will attempt to guess the format based on actual content. If it can't determine the format
then it will assume ``fasta''. A complete list of formats and suffixes can be found in the
SeqIO HOWTO (http://www.bioperl.org/wiki/HOWTO:SeqIO).
Practice: Get the sequences AJ312413, NP_001073624, and XM_001807534 by accession
number from GenBank and write them to a file in fasta format.

Transforming alignment files (AlignIO)

Data files storing multiple sequence alignments also appear in varied formats. AlignIO is
the Bioperl object for conversion of alignment files. AlignIO is patterned on the SeqIO
object and its commands have many of the same names as the commands in SeqIO. Just as
in SeqIO the AlignIO object can be created with ``-file'' and ``-format'' options:
use Bio::AlignIO;
my $io = Bio::AlignIO->new(-file   => "example.aln",
                           -format => "clustalw" );

If the ``-format'' argument isn't used then Bioperl will try and determine the format based
on the file's suffix, in a case-insensitive manner. Here is the current set of suffixes:
Format       Suffixes    Comment
bl2seq
clustalw
emboss*
fasta
maf
mase
mega
meme
metafasta
msf                      GCG
nexus
pfam
phylip                   interleaved
po                       POA
prodom
psi                      PSI-BLAST
selex                    HMMER
stockholm

*water, needle, matcher, stretcher, merger, and supermatcher.
Unlike SeqIO, AlignIO cannot create output files in every format. AlignIO currently
supports output in these 7 formats: fasta, mase, selex, clustalw, msf/gcg, phylip
(interleaved), and po.
Another significant difference between AlignIO and SeqIO is that AlignIO handles IO for
only a single alignment at a time but SeqIO.pm handles IO for multiple sequences in a
single stream. Syntax for AlignIO is almost identical to that of SeqIO:
use Bio::AlignIO;
$in = Bio::AlignIO->new(-file => "inputfilename" ,
-format => 'clustalw');
$out = Bio::AlignIO->new(-file => ">outputfilename",
-format => 'fasta');
while ( my $aln = $in->next_aln() ) { $out->write_aln($aln); }

The only difference is that the returned object reference, $aln, is to a SimpleAlign object
rather than to a Seq object.

AlignIO also supports the tied filehandle syntax described above for SeqIO. See the
Bio::AlignIO manpage and the Bio::SimpleAlign manpage.

Manipulating sequence data with Seq methods

Seq provides multiple methods for performing many common (and some not-so-
common) tasks of sequence manipulation and data retrieval. Here are some of the
most useful:

These methods return strings or may be used to set values:

$seqobj->display_id();       # the human read-able id of the sequence
$seqobj->seq();              # string of sequence
$seqobj->subseq(5,10);       # part of the sequence as a string
$seqobj->accession_number(); # when there, the accession number
$seqobj->alphabet();         # one of 'dna','rna','protein'
$seqobj->primary_id();       # a unique id for this sequence regardless
                             # of its display_id or accession number
$seqobj->description();      # a description of the sequence

It is worth mentioning that some of these values correspond to specific fields of
given formats. For example, the display_id() method returns the LOCUS name of a
Genbank entry, the (\S+) following the > character in a Fasta file, the ID from a
SwissProt file, and so on. The description() method will return the DEFINITION line
of a Genbank file, the line following the display_id in a Fasta file, and the DE field in a
SwissProt file.

The following methods return an array of Bio::SeqFeature objects:

$seqobj->get_SeqFeatures;     # The 'top level' sequence features
$seqobj->get_all_SeqFeatures; # All sequence features, including sub-
                              # seq features

For a comment annotation, you can use:

use Bio::Annotation::Comment;
$seq->annotation->add_Annotation('comment',
    Bio::Annotation::Comment->new(-text => 'some description'));

For a reference annotation, you can use:

use Bio::Annotation::Reference;
$seq->annotation->add_Annotation('reference',
    Bio::Annotation::Reference->new(-authors  => 'author1,author2',
                                    -title    => 'title line',
                                    -location => 'location line',
                                    -medline  => 998122 ));

A general description of the feature object can be found in the Bio::SeqFeature::Generic
manpage, and a description of related, top-level annotation is found in the
Bio::Annotation::Collection manpage. There's also a HOWTO on features and
annotations ( http://bioperl.org/HOWTOs/html/Feature-Annotation.html ) and there's
a section on features in the FAQ ( http://bioperl.org/Core/Latest/faq.html#5 ).
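
As a rough illustration of how features and annotations hang off a Seq object, the sketch below attaches one made-up feature and one comment to a sample sequence and reads them back. The 'exon' tag and the coordinates 5..10 are hypothetical values chosen only for the example.

```perl
use strict;
use warnings;
use Bio::Seq;
use Bio::SeqFeature::Generic;
use Bio::Annotation::Comment;

my $seq = Bio::Seq->new(-seq        => 'actgtggcgtcaact',
                        -display_id => 'something',
                        -alphabet   => 'dna');

# A hypothetical feature covering bases 5..10 on the forward strand.
my $feat = Bio::SeqFeature::Generic->new(-start   => 5,
                                         -end     => 10,
                                         -strand  => 1,
                                         -primary => 'exon');
$seq->add_SeqFeature($feat);

# A free-text comment stored in the annotation collection.
$seq->annotation->add_Annotation('comment',
    Bio::Annotation::Comment->new(-text => 'some description'));

# Read both back.
my @features = $seq->get_SeqFeatures;
for my $f (@features) {
    print $f->primary_tag, "\t", $f->start, "..", $f->end, "\n";
}
my @comments = $seq->annotation->get_Annotations('comment');
for my $c (@comments) {
    print "comment: ", $c->text, "\n";
}
```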

The following methods return new sequence objects, but do not transfer the features
from the starting object to the resulting object:
$seqobj->trunc(5,10);  # truncation from 5 to 10 as new object
$seqobj->revcom;       # reverse complements sequence
$seqobj->translate;    # translation of the sequence

**Note that some methods return strings, some return arrays and some return
objects. See the Bio::Seq manpage for more information.
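
The difference between methods that return strings and methods that return objects can be seen in a small sketch like the one below, which reuses the sample sequence from earlier. The variable names are arbitrary.

```perl
use strict;
use warnings;
use Bio::Seq;

my $seqobj = Bio::Seq->new(-seq        => 'actgtggcgtcaact',
                           -display_id => 'something',
                           -alphabet   => 'dna');

my $sub  = $seqobj->subseq(5,10);   # plain string
my $part = $seqobj->trunc(5,10);    # new sequence object
my $rev  = $seqobj->revcom;         # new sequence object
my $prot = $seqobj->translate;      # new protein sequence object

print "subseq:    $sub\n";
print "trunc:     ", $part->seq, "\n";  # objects need ->seq to get the string
print "revcom:    ", $rev->seq,  "\n";
print "translate: ", $prot->seq, " (", $prot->alphabet, ")\n";
```

Because trunc(), revcom() and translate() return objects, their results can be chained, for example $seqobj->revcom->translate.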
Many of these methods are self-explanatory. However, the flexible translate() method
needs some explanation. Translation in bioinformatics can mean two slightly different
things:
1. Translating a nucleotide sequence from start to end.
2. Translating the actual coding regions in mRNAs or cDNAs.
The Bioperl implementation of sequence translation does the first of these tasks easily.
Any sequence object which is not of alphabet 'protein' can be translated by simply calling
the method which returns a protein sequence object:
$prot_obj = $my_seq_object->translate;

All codons will be translated, including those before and after any initiation and
termination codons. For example, ttttttatgccctaggggg will be translated to FFMP*G.
However, the translate() method can also be passed several optional parameters to
modify its behavior. For example, you can tell translate() to modify the characters used
to represent terminator (default '*') and unknown amino acids (default 'X').
$prot_obj = $my_seq_object->translate(-terminator => '-');
$prot_obj = $my_seq_object->translate(-unknown => '_');

You can also determine the frame of the translation. The default frame starts at the first
nucleotide (frame 0). To get translation in the next frame, we would write:
$prot_obj = $my_seq_object->translate(-frame => 1);
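
The options described so far can be compared side by side on the example sequence from the text. This is a small sketch; the variable names are arbitrary and the sequence id is made up.

```perl
use strict;
use warnings;
use Bio::Seq;

my $my_seq_object = Bio::Seq->new(-seq        => 'ttttttatgccctaggggg',
                                  -display_id => 'example',
                                  -alphabet   => 'dna');

# Frame 0, default terminator character '*'
my $default = $my_seq_object->translate->seq;

# Same frame, but with '-' standing in for the terminator
my $dashed  = $my_seq_object->translate(-terminator => '-')->seq;

# Frame 1 skips the first nucleotide before reading codons
my $frame1  = $my_seq_object->translate(-frame => 1)->seq;

print "frame 0:          $default\n";
print "terminator '-':   $dashed\n";
print "frame 1:          $frame1\n";
```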

The codontable_id argument to translate() makes it possible to use alternative genetic
codes. There are currently 16 codon tables defined, including 'Standard', 'Vertebrate
Mitochondrial', 'Bacterial', 'Alternative Yeast Nuclear' and 'Ciliate, Dasycladacean and
Hexamita Nuclear'. All these tables can be seen in the Bio::Tools::CodonTable manpage.
For example, for mitochondrial translation:
$prot_obj = $seq_obj->translate(-codontable_id => 2);
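
To see what a different codon table changes, the following sketch translates one short, made-up sequence with the default table and with table 2 (Vertebrate Mitochondrial). The codon tga is a terminator in the standard code but encodes tryptophan in the vertebrate mitochondrial code.

```perl
use strict;
use warnings;
use Bio::Seq;

# atg tga taa : start codon, then tga, then a terminator
my $seq_obj = Bio::Seq->new(-seq        => 'atgtgataa',
                            -display_id => 'example',
                            -alphabet   => 'dna');

my $standard = $seq_obj->translate->seq;                      # default table
my $mito     = $seq_obj->translate(-codontable_id => 2)->seq; # vertebrate mitochondrial

print "standard:      $standard\n";
print "mitochondrial: $mito\n";
```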

If we want to translate full coding regions (CDS) the way major nucleotide databanks
EMBL, GenBank and DDBJ do it, the translate() method has to perform more checks.
Specifically, translate() needs to confirm that the sequence has appropriate start and
terminator codons at the very beginning and the very end of the sequence and that there
are no terminator codons present within the sequence in frame 0. In addition, if the
genetic code being used has an atypical (non-ATG) start codon, the translate() method
needs to convert the initial amino acid to methionine. These checks and conversions are
triggered by setting ``complete'' to 1:
$prot_obj = $my_seq_object->translate(-complete => 1);
If ``complete'' is set to true and the criteria for a proper CDS are not met, the method, by
default, issues a warning. By setting ``throw'' to 1, one can instead instruct the program to
die if an improper CDS is found, e.g.
$prot_obj = $my_seq_object->translate(-complete => 1,
-throw => 1);

You can also create a custom codon table and pass this object to translate:
$prot_obj = $my_seq_object->translate(-codontable => $table_obj);

translate() can also find the open reading frame (ORF) starting at the 1st initiation
codon in the nucleotide sequence, regardless of its frame, and translate that:
$prot_obj = $my_seq_object->translate(-orf => 1);

Most of the codon tables used by translate() have initiation codons in addition to ATG,
including the default codon table, NCBI ``Standard''. To tell translate() to use only ATG,
or atg, as the initiation codon set -start to ``atg'':
$prot_obj = $my_seq_object->translate(-orf   => 1,
                                      -start => "atg" );

The -start argument only applies when -orf is set to 1.

Last trick. By default translate() will translate the termination codon to some special
character (the default is *, but this can be reset using the -terminator argument).
When -complete is set to 1 this character is removed. So, with this:
$prot_obj = $my_seq_object->translate(-orf      => 1,
                                      -complete => 1);

the sequence tttttatgccctaggggg will be translated to MP, not MP*.

See the Bio::Tools::CodonTable manpage and the Bio::PrimarySeqI manpage for more
information on translation.

Obtaining basic sequence statistics (SeqStats,SeqWord)

In addition to the methods directly available in the Seq object, bioperl provides
various helper objects to determine additional information about a sequence. For
example, the SeqStats object provides methods for obtaining the molecular weight of
the sequence as well as the number of occurrences of each of the component residues
(bases for a nucleic acid or amino acids for a protein). For nucleic acids, SeqStats
also returns counts of the number of codons used. For example:
use Bio::Tools::SeqStats;
$seq_stats = Bio::Tools::SeqStats->new(-seq => $seqobj);
$weight = $seq_stats->get_mol_wt();
$monomer_ref = $seq_stats->count_monomers();
$codon_ref = $seq_stats->count_codons(); # for nucleic acid sequence

Note: sometimes sequences will contain ambiguous codes. For this reason,
get_mol_wt() returns a reference to a two element array containing a greatest
lower bound and a least upper bound of the molecular weight.
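
Since all three methods return references, a little dereferencing is needed to print the results. A minimal sketch, using a made-up uppercase DNA sequence held in a PrimarySeq object:

```perl
use strict;
use warnings;
use Bio::PrimarySeq;
use Bio::Tools::SeqStats;

my $seqobj = Bio::PrimarySeq->new(-seq      => 'ACTGTGGCGTCAACT',
                                  -id       => 'example',
                                  -alphabet => 'dna');
my $seq_stats = Bio::Tools::SeqStats->new(-seq => $seqobj);

# Array reference: [ lower bound, upper bound ]
my $weight = $seq_stats->get_mol_wt();
print "molecular weight between $weight->[0] and $weight->[1]\n";

# Hash reference: residue => count
my $monomer_ref = $seq_stats->count_monomers();
for my $base (sort keys %$monomer_ref) {
    print "$base\t$monomer_ref->{$base}\n";
}

# Hash reference: codon => count (nucleic acid sequences only)
my $codon_ref = $seq_stats->count_codons();
for my $codon (sort keys %$codon_ref) {
    print "$codon\t$codon_ref->{$codon}\n";
}
```

For a sequence without ambiguity codes the two molecular weight bounds are expected to be equal.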
The SeqWords object is similar to SeqStats and provides methods for calculating
frequencies of ``words'' (e.g. tetramers or hexamers) within the sequence. See the
Bio::Tools::SeqStats manpage and the Bio::Tools::SeqWords manpage for more
information.

Identifying restriction enzyme sites (Bio::Restriction)

Another common sequence manipulation task for nucleic acid sequences is locating
restriction enzyme cutting sites. Bioperl provides the Bio::Restriction::Enzyme,
Bio::Restriction::EnzymeCollection, and Bio::Restriction::Analysis objects for this
purpose.
A new collection of enzyme objects would be defined like this:
use Bio::Restriction::EnzymeCollection;
my $all_collection = Bio::Restriction::EnzymeCollection->new();

Bioperl's default Restriction::EnzymeCollection object comes with data for more
than 500 different Type II restriction enzymes. A list of the available enzyme names
can be accessed using the available_list() method, but these are just the names,
not the functional objects. You also have access to enzyme subsets. For example to
select all available Enzyme objects with recognition sites that are six bases long one
could write:
my $six_cutter_collection = $all_collection->cutters(6);
for my $enz ($six_cutter_collection->each_enzyme()){
    print $enz->name,"\t",$enz->site,"\t",$enz->overhang_seq,"\n";
    # prints name, recognition site, overhang
}

There are other methods that can be used to select sets of enzyme objects, such as
unique_cutters() and blunt_enzymes(). You can also select an Enzyme object by
name, like so:
my $ecori_enzyme = $all_collection->get_enzyme('EcoRI');

Once an appropriate enzyme has been selected, the sites for that enzyme on a given
nucleic acid sequence can be obtained using the fragments() method. The syntax
for performing this task is:
use Bio::Restriction::Analysis;

my $analysis = Bio::Restriction::Analysis->new(-seq => $seq);

# where $seq is the Bio::Seq object for the DNA to be cut
@fragments = $analysis->fragments($enzyme);
# and @fragments will be an array of strings
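
Putting the pieces together, a complete digest might look roughly like the sketch below. The 16-base sequence is made up so that it contains exactly one EcoRI site (GAATTC). The enzyme is passed to fragments() by name here, which is a choice made for this example rather than a requirement stated in the text.

```perl
use strict;
use warnings;
use Bio::Seq;
use Bio::Restriction::EnzymeCollection;
use Bio::Restriction::Analysis;

# Made-up linear DNA with a single EcoRI site in the middle.
my $seq = Bio::Seq->new(-seq        => 'AAAAAGAATTCTTTTT',
                        -display_id => 'example',
                        -alphabet   => 'dna');

my $all_collection = Bio::Restriction::EnzymeCollection->new();
my $ecori_enzyme   = $all_collection->get_enzyme('EcoRI');
print $ecori_enzyme->name, "\t", $ecori_enzyme->site, "\n";

my $analysis = Bio::Restriction::Analysis->new(-seq     => $seq,
                                               -enzymes => $all_collection);

# One cut in a linear sequence should give two fragments.
my @fragments = $analysis->fragments($ecori_enzyme->name);
print length($_), "\t$_\n" for @fragments;
```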

For more information, including creating your own RE database (REBASE), see the
Bio::Restriction::Enzyme manpage, the Bio::Restriction::EnzymeCollection manpage,
the Bio::Restriction::Analysis manpage, and the Bio::Restriction::IO manpage.

Identifying amino acid cleavage sites (Sigcleave)
Sigcleave predicts signal peptide cleavage sites in amino acid sequences. Please see the
Bio::Tools::Sigcleave manpage for details.

Miscellaneous sequence utilities: OddCodes, SeqPattern

OddCodes: listing of an amino acid sequence showing where the functional aspects of
amino acids are located or where the positively charged ones are. See the documentation
in the Bio::Tools::OddCodes manpage for further details.
SeqPattern: used to manipulate sequences using Perl regular expressions. More detail can
be found in the Bio::Tools::SeqPattern manpage.

Converting coordinate systems (Coordinate::Pair, RelSegment)

Coordinate system conversion is a common requirement, for example, when one
wants to look at the relative positions of sequence features to one another and
convert those relative positions to absolute coordinates along a chromosome or
contig. Although coordinate conversion sounds pretty trivial it can get fairly tricky
when one includes the possibilities of switching to coordinates on negative (i.e.
Crick) strands and/or having a coordinate system terminate because you have
reached the end of a clone or contig.
For more details on coordinate transformations and other GFF-related capabilities in
Bioperl see the Bio::DB::GFF::RelSegment manpage and the Bio::DB::GFF manpage.

Searching for similar sequences

Running BLAST (using RemoteBlast.pm)
A skeleton script to run a remote blast might look as follows:
use Bio::Tools::Run::RemoteBlast;
$remote_blast = Bio::Tools::Run::RemoteBlast->new(
    -prog => 'blastp', -data => 'ecoli', -expect => '1e-10' );
$r = $remote_blast->submit_blast("t/data/ecolitst.fa");
while ( @rids = $remote_blast->each_rid ) {
    for $rid ( @rids ) {
        $rc = $remote_blast->retrieve_blast($rid);
        # a working script would check $rc here, call remove_rid($rid)
        # once the report has arrived, and sleep between polls
    }
}

You may want to change some parameter of the remote job and this example shows
how to change the matrix:
$Bio::Tools::Run::RemoteBlast::HEADER{'MATRIX_NAME'} = 'BLOSUM25';

For a description of the many CGI parameters see:

http://www.ncbi.nlm.nih.gov/BLAST/Doc/urlapi.html
Note that the script has to be broken into two parts: the actual Blast submission and
the subsequent retrieval of the results. At times when the NCBI Blast is being heavily
used, the interval between when a Blast submission is made and when the results
are available can be substantial.
The object $rc would contain the blast report that could then be parsed with
Bio::Tools::BPlite or Bio::SearchIO. The default object returned is SearchIO after
version 1.0. The object type can be changed using the -readmethod parameter but
bear in mind that the favored Blast parser is Bio::SearchIO; others won't be
supported in later versions.
**Note that to make this script actually useful, one should add details such as
checking return codes from the Blast to see if it succeeded and a ``sleep'' loop to
wait between consecutive requests to the NCBI server. See the
Bio::Tools::Run::RemoteBlast manpage for details.
It should also be noted that the syntax for creating a remote blast factory is slightly
different from that used in creating StandAloneBlast, Clustalw, and T-Coffee
factories. Specifically RemoteBlast requires parameters to be passed with a leading
hyphen, as in '-prog' => 'blastp', while the other programs do not pass
parameters with a leading hyphen.

Parsing BLAST and FASTA reports with Search and SearchIO

No matter how Blast searches are run (locally or remotely, with or without a Perl
interface), they return large quantities of data that are tedious to sift through.
Bioperl offers several different objects - Search.pm/SearchIO.pm, and BPlite.pm
(along with its minor modifications, BPpsilite and BPbl2seq) - for parsing Blast
reports. Search and SearchIO, which are the principal Bioperl interfaces for Blast and
FASTA report parsing, are described in this section. The older BPlite is described in
the next section. We recommend you use SearchIO; it's certain to be supported in
future releases.
The Search and SearchIO modules provide a uniform interface for parsing sequence-
similarity-search reports generated by BLAST (in standard and BLAST XML
formats), PSI-BLAST, RPS-BLAST, bl2seq and FASTA. The SearchIO modules also
provide a parser for HMMER reports and in the future, it is envisioned that the
Search/SearchIO syntax will be extended to provide a uniform interface to an even
wider range of report parsers including parsers for Genscan.
Parsing sequence-similarity reports with Search and SearchIO is straightforward.
Initially a SearchIO object specifies a file containing the report(s). The method
next_result() reads the next report into a Search object in just the same way that
the next_seq() method of SeqIO reads in the next sequence in a file into a Seq
object.
Once a report (i.e. a Search result object) has been read in and is available to the script,
the report's overall attributes (e.g. the query) can be determined and its individual
hits can be accessed with the next_hit() method. Individual high-scoring segment
pairs for each hit can then be accessed with the next_hsp() method. Except for the
additional syntax required to enable the reading of multiple reports in a single file,
the remainder of the Search/SearchIO parsing syntax is very similar to that of the
BPlite object it is intended to replace. Sample code to read a BLAST report might
look like this:
# Get the report
use Bio::SearchIO;
$searchio = Bio::SearchIO->new(-format => 'blast',
                               -file   => $blast_report);
$result = $searchio->next_result;
# Get info about the entire report
$db_name = $result->database_name;
$algorithm_type = $result->algorithm;
# get info about the first hit
$hit = $result->next_hit;
$hit_name = $hit->name;
# get info about the first hsp of the first hit
$hsp = $hit->next_hsp;
$hsp_start = $hsp->query->start;

For more details there is a good description of how to use SearchIO at
http://www.bioperl.org/HOWTOs/html/SearchIO.html or in the docs/howto
subdirectory of the distribution. Additional documentation can be found in the
Bio::SearchIO::blast manpage, the Bio::SearchIO::psiblast manpage, the
Bio::SearchIO::blastxml manpage, the Bio::SearchIO::fasta manpage, and the
Bio::SearchIO manpage. There is also sample code in the examples/searchio
directory which illustrates how to use SearchIO. And finally, there's a section with
SearchIO questions in the FAQ ( http://bioperl.org/Core/Latest/faq.html#3 ).

Parsing BLAST reports with BPlite, BPpsilite, and BPbl2seq

Bioperl's older BLAST report parsers - BPlite, BPpsilite, BPbl2seq and Blast.pm - are
no longer supported but since legacy Bioperl scripts have been written which use
these objects, they are likely to remain within Bioperl for some time.
A complete description of the module can be found in the Bio::Tools::BPlite
manpage.

RefSeq Accession abbreviation key:
http://www.ncbi.nlm.nih.gov/projects/RefSeq/key.html - accessions

Common questions

Seq objects in Bioperl are designed to store sequence data along with associated annotations and sequence features, making them suitable for detailed analyses involving annotated genomic data. In contrast, PrimarySeq objects are a stripped-down version that contain only the basic sequence data and a few identification labels without annotations or features. This makes PrimarySeq objects more appropriate for handling large datasets where speed and reduced memory usage are priorities, such as batch processing large numbers of sequences without needing detailed annotations .

The Bio::Restriction module in Bioperl facilitates the analysis of nucleic acid sequences by providing tools to identify restriction enzyme cutting sites. It includes objects like Bio::Restriction::Enzyme, EnzymeCollection, and Bio::Restriction::Analysis which allow users to select specific enzyme objects, compute restriction sites, and obtain fragment patterns. This functionality is crucial in cloning, sequence mapping, and other molecular biology applications as it allows precise manipulation and characterization of DNA molecules based on enzyme recognition sites .

Custom codon tables in Bioperl's translate method allow users to tailor the translation of nucleotide sequences according to specific genetic code variants, which is crucial for accurately representing translation in non-standard organisms or experimental conditions. This flexibility ensures that researchers can accommodate unique codon usages, such as in mitochondrial genomes or synthetic biology applications, providing more accurate protein sequence predictions that reflect organism-specific nuances .

Bioperl enhances genome annotation system development by providing Seq objects that store not only sequence data but also multiple annotations and associated sequence features. This capability allows for the automated extraction and manipulation of functional and structural annotations from Genbank and EMBL files, which are essential in genome annotation processes. The system's ability to manage vast arrays of biological data efficiently supports sophisticated tasks in genome annotation including feature extraction, result in analysis, and database integrations .

AlignIO in Bioperl is used to transform alignment files between different formats, similar to how SeqIO works with sequences. It reads alignment data from a file or stream in one format, such as clustalw, and writes it to another format like fasta. AlignIO supports many common alignment formats but is limited to handling only a single alignment at a time, unlike SeqIO which can handle multiple sequences in a stream. Additionally, AlignIO supports fewer output formats compared to SeqIO, which limits its versatility .

The main advantage of using the SeqIO object in Bioperl is its ability to simplify the conversion between various sequence formats, which is typically a tedious bioinformatics task. SeqIO can read a stream of sequences from different formats such as Fasta, EMBL, GenBank, Swissprot, among others, and write them to another format, thus making data converters easy to implement. This flexibility allows bioinformaticians to manipulate and transform data without manually handling different file types .

The Bio::Tools::SeqStats module in Bioperl provides functionalities such as calculating the molecular weight of a sequence, counting the occurrences of component residues, and counting codons in nucleic acid sequences. These functionalities are important because they allow researchers to perform basic sequence analyses that inform experimental design, data interpretation, and subsequent biological research. By providing methods to quantify sequence characteristics, SeqStats aids in understanding the chemical and functional properties of sequences .

The Bio::Tools::Run::RemoteBlast module in Bioperl performs sequence similarity searches by allowing users to submit sequences to NCBI's BLAST for analysis. This involves setting up a RemoteBlast object with desired parameters like program type, database selection, and expectation threshold, and then submitting the sequence for BLASTing. After the submission, results are retrieved and parsed using tools like Bio::SearchIO. The advantage of RemoteBlast is that it leverages NCBI's computational resources, reducing the need for local installation and maintenance of BLAST software, and provides access to up-to-date databases .

Bioperl parses BLAST reports with the Bio::SearchIO module, which provides a uniform interface to sequence similarity search results across BLAST output formats, including XML. next_result() returns each top-level result in turn, and next_hit() and next_hsp() drill down into the individual hits and their alignments. Because the same three-level iteration applies to every supported format, report processing is straightforward to automate in scripts.
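The three-level iteration over results, hits, and HSPs might be written as below. This is a minimal sketch, assuming Bioperl is installed; the helper name print_hsp_table and the choice of columns are hypothetical, and the report file is supplied by the user.

```perl
use strict;
use warnings;
use Bio::SearchIO;

# Hypothetical helper: print one tab-separated line per HSP in a report
# and return the number of HSPs seen.
sub print_hsp_table {
    my ($file, $format) = @_;

    my $searchio = Bio::SearchIO->new(-file => $file, -format => $format);
    my $hsp_count = 0;

    while (my $result = $searchio->next_result) {      # one per query
        while (my $hit = $result->next_hit) {          # one per database match
            while (my $hsp = $hit->next_hsp) {         # one per aligned segment
                print join("\t",
                    $result->query_name,
                    $hit->name,
                    $hsp->evalue,
                    sprintf("%.1f", $hsp->percent_identity),
                ), "\n";
                $hsp_count++;
            }
        }
    }
    return $hsp_count;
}

# Command-line use: perl parse_blast.pl report.bls
if (@ARGV == 1) {
    my $n = print_hsp_table($ARGV[0], 'blast');
    print STDERR "Printed $n HSPs\n";
}
```

Passing 'blastxml' instead of 'blast' as the format should let the same loop read XML output, since only the constructor argument depends on the report format.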

Bio::SeqFeature::Generic aids researchers in annotating sequence data by serving as a general-purpose feature object that can represent annotations such as gene locations, regulatory motifs, and other DNA elements. It accepts custom tag/value annotations and can be used inside larger automated annotation pipelines. Its generality is also its limitation: it may not capture highly specific biological contexts or complex relationships between features, so comprehensive functional annotation can require additional customization or more specialized tools and databases.
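Creating a feature, attaching it to a sequence, and adding custom tags could look like this. This is a sketch assuming Bioperl is installed; the sequence, the gene name, and the tag values are invented for illustration.

```perl
use strict;
use warnings;
use Bio::Seq;
use Bio::SeqFeature::Generic;

# Invented 21-base sequence, used only for illustration
my $seq = Bio::Seq->new(
    -seq      => 'ATGGCCTTTAAGGGCCCATAG',
    -id       => 'example',
    -alphabet => 'dna',
);

# A feature covering the whole sequence on the forward strand
my $feat = Bio::SeqFeature::Generic->new(
    -start       => 1,
    -end         => 21,
    -strand      => 1,
    -primary_tag => 'CDS',
    -tag         => { gene => 'exampleGene' },   # invented tag value
);

# Custom annotations are added as further tag/value pairs
$feat->add_tag_value('note', 'hypothetical annotation');

# Attach the feature to the sequence object
$seq->add_SeqFeature($feat);

# List every feature on the sequence together with its tags
foreach my $f ($seq->get_SeqFeatures) {
    print $f->primary_tag, ": ", $f->start, "..", $f->end,
          " (strand ", $f->strand, ")\n";
    foreach my $tag ($f->get_all_tags) {
        print "  $tag = ", join(", ", $f->get_tag_values($tag)), "\n";
    }
}
```

Tags are free-form strings, which is what makes the object flexible, but nothing in the object itself checks that a tag is biologically meaningful.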


Common questions

What is the difference between Seq and PrimarySeq objects in Bioperl, and when would it be appropriate to use each?

How does the use of Bio::Restriction in Bioperl facilitate the analysis of nucleic acid sequences for restriction enzyme sites?

Discuss the role and importance of custom codon tables in the translation of sequences using Bioperl's translate method.

How does Bioperl's method for handling sequence feature annotations enhance genome annotation system development?

Explain how the Bioperl module AlignIO is used to transform alignment files and discuss its limitations compared to SeqIO.

What is the main advantage of using the SeqIO object in Bioperl for data transformation between different sequence formats?

What are the functionalities of the Bio::Tools::SeqStats module in Bioperl, and why are they important in bioinformatics analysis?

Describe the process and advantages of using the Bio::Tools::Run::RemoteBlast module in Bioperl to perform sequence similarity searches.

What strategies does Bioperl employ to ensure efficient parsing and handling of large sequence similarity search reports from BLAST?

In what ways can Bioperl's Bio::SeqFeature::Generic aid researchers in annotating sequence data, and what are its limitations?
