Lecture1 BIOF242 Shuvadeep
BIOF 242
N-MET-PRO-ARG-TYR-LEU-C Protein
The initial conversion of RNA to DNA — going in reverse of the central dogma — is
called reverse transcription, and viruses that use this mechanism are classified as
retroviruses.
• The format also allows for sequence names and comments to precede the sequences.
• The format originates from the FASTA software package (a program for fast alignment
by W.R. Pearson) but has now become a near-universal standard in the field of
bioinformatics.
• The simplicity of FASTA format makes it easy to manipulate and parse sequences using
text-processing tools and scripting languages like the R programming
language, Python, Ruby, Haskell, and Perl.
FASTA
In this format, the sequence name and any additional information are provided on one line,
called the header (or identifier) line. It begins with '>', gives a name and/or a unique
identifier for the sequence, and may also contain additional information.
The sequence itself is represented in the following lines, either in interleaved format, which
means a fixed number of characters per line, or non-interleaved.
Interleaved FASTA (three sequences):
>human
ACCGTGATGT
AGAGACCACG
GGCCC
>mouse
CCCAGTGTGT
AACA
>cat
AGTGTGTGTT
GTGCCCG
Non-interleaved FASTA (of the above example):
>human
ACCGTGATGTAGAGACCACGGGCCC
>mouse
CCCAGTGTGTAACA
>cat
AGTGTGTGTTGTGCCCG
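The two layouts carry the same information, which a short parser can make concrete. The following is a minimal sketch in Python, not a reference implementation: the function name parse_fasta and the shortened two-record inputs (modelled on the human and mouse records of the example) are illustrative assumptions.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {identifier: sequence} from the lines of a FASTA file."""
    records: Dict[str, str] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # Header line: the first word is the identifier,
            # anything after it is free-text description.
            parts = line[1:].split(None, 1)
            current = parts[0] if parts else ""
            records[current] = ""
        elif current is not None:
            # Sequence line: append, so wrapped lines are joined back together.
            records[current] += line
    return records


# Illustrative inputs: the same two records, wrapped and unwrapped.
interleaved = """>human
ACCGTGATGT
AGAGACCACG
GGCCC
>mouse
CCCAGTGTGT
AACA
"""
flat = """>human
ACCGTGATGTAGAGACCACGGGCCC
>mouse
CCCAGTGTGTAACA
"""

same = parse_fasta(interleaved.splitlines()) == parse_fasta(flat.splitlines())
print(same)  # True: line wrapping does not change the records
```

Because sequence lines are simply concatenated until the next '>' line, the parser does not need to know which layout it is reading.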
Identifier
NCBI identifiers
The NCBI defined a standard for the unique identifier used for the sequence
(SeqID) in the header line. This allows a sequence that was obtained from a
database to be labelled with a reference to its database record.
FASTA files can be opened in any text editor.
Sequence Identifiers
Many sequences have two types of identification numbers, GI and VERSION.
The two identifier types differ in format, and were implemented at different times.
GI numbers
A GI number (for GenInfo Identifier, sometimes written in lower case, "gi") is a simple
series of digits that are assigned consecutively to each sequence record processed by
NCBI.
The GI number bears no resemblance to the Version number of the sequence record. Each
time a sequence record is changed, it is assigned a new GI number.
Sequence Versions
A sequence Version groups all of the gi numbers for a specific sequence into an ordered
series.
A sequence version number consists of a base Accession number, a dot, and a version
suffix that starts with 1. (This identifier is often referred to as an "accession dot
version".)
The base Accession number identifies the sequence record, and the version suffixes form
the series of versions, starting with 1. A sequence Accession number without a version
suffix always refers to the latest version of the sequence.
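The "accession dot version" layout can be split mechanically. The sketch below is only an illustration under stated assumptions: the function name split_accession_version and the identifier AB000001.2 are made up for the example and do not refer to a particular database record.

```python
from typing import Optional, Tuple


def split_accession_version(seq_id: str) -> Tuple[str, Optional[int]]:
    """Split 'ACCESSION.VERSION' into (accession, version).

    A bare accession (no dot) returns version None, which by the rule
    described here means "the latest version".
    """
    base, dot, suffix = seq_id.strip().partition(".")
    if not dot:
        return base, None
    # Version suffixes are whole numbers that start at 1.
    if not suffix.isdigit() or int(suffix) < 1:
        raise ValueError(f"bad version suffix in {seq_id!r}")
    return base, int(suffix)


print(split_accession_version("AB000001.2"))  # ('AB000001', 2)
print(split_accession_version("AB000001"))    # ('AB000001', None)
```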
RefSeq accession prefixes mark the reference chromosome (NC_), transcript (NM_) and
protein (NP_) records for your gene.
NZ_ marks incomplete records.
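As a small illustration, the prefixes listed here can be kept in a lookup table. This is a sketch only: the names REFSEQ_PREFIXES and record_type, and the accession strings in the usage lines, are hypothetical, and the table covers just the four prefixes mentioned in these notes.

```python
# Prefix meanings as given in the notes; other prefixes exist
# but are not covered by this sketch.
REFSEQ_PREFIXES = {
    "NC_": "reference chromosome",
    "NM_": "transcript",
    "NP_": "protein",
    "NZ_": "incomplete",
}


def record_type(accession: str) -> str:
    """Return the record type implied by the first three characters."""
    return REFSEQ_PREFIXES.get(accession[:3], "unknown")


print(record_type("NM_000000.1"))  # transcript (made-up accession)
print(record_type("XX_000000.1"))  # unknown
```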
Fasta = FA
Vs
PHYLIP
It is a plain text format containing exactly two sections: a header describing the
dimensions of the alignment, followed by the multiple sequence alignment itself.
The first line contains the number of sequences and their length – since sequences
are aligned, they will all have the same length.
The following lines each contain the name of the sequence followed by one or more
spaces and the sequence.
In interleaved format the sequence is represented across several lines along with
the name as well.
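The two-section layout described above (a dimensions line, then one name and sequence per line) can be read with a few lines of Python. This is a minimal sketch of the non-interleaved case only. The function name parse_phylip_sequential and the three-row alignment are invented for illustration, and names are split on whitespace as described here, whereas strict PHYLIP reserves a fixed 10-character name field.

```python
from typing import Dict, Iterable


def parse_phylip_sequential(lines: Iterable[str]) -> Dict[str, str]:
    """Read a non-interleaved PHYLIP alignment into {name: sequence}."""
    rows = [ln.strip() for ln in lines if ln.strip()]
    # Header: number of sequences, then alignment length.
    n_seqs, length = (int(x) for x in rows[0].split())
    alignment: Dict[str, str] = {}
    for row in rows[1:]:
        # Name, one or more spaces, then the sequence.
        name, seq = row.split(None, 1)
        alignment[name] = seq.replace(" ", "")
    # Check the body against the dimensions promised by the header.
    if len(alignment) != n_seqs:
        raise ValueError(f"expected {n_seqs} sequences, found {len(alignment)}")
    for name, seq in alignment.items():
        if len(seq) != length:
            raise ValueError(f"{name}: expected length {length}, found {len(seq)}")
    return alignment


# Made-up toy alignment, used only to exercise the parser.
text = """3 10
human     ACCGTGATGT
mouse     CCCAGTGTGT
cat       AGTGTGTGTT
"""
alignment = parse_phylip_sequential(text.splitlines())
print(sorted(alignment))  # ['cat', 'human', 'mouse']
```

The length check is what distinguishes this from a FASTA reader: since the sequences are aligned, a row of the wrong length indicates a damaged file.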
Categories of Database
1. Data Type (Data heterogeneity)
2. Maintainer Status
3. Technical Design
4. Data Source
5. Data Access
6. And/or other parameters
Databases: Varieties
• Taxonomy Database
• Genome Database
• Sequence Database
• Structure Database
• Proteomic Database
• Micro-array Database
• Enzyme Database
• Disease Database
• Pathway Database
• Literature Database… Many More
Entrez is a window into the molecular biology subset.
It is a molecular biology database system that provides integrated access to
nucleotide and protein sequence data, gene-centered and genomic mapping
information, 3D structure data, PubMed MEDLINE, and more.
Entrez covers over 20 databases including the complete protein sequence data
from PIR-International, PRF, Swiss-Prot, and PDB and nucleotide sequence data
from GenBank that includes information from EMBL and DDBJ.
ENTREZ
(It has tables with rows (tuples or records) and columns (fields), and is supported by
SQL-based RDBMSs such as Oracle and DB2.)
• Object-oriented database management system
• ASN.1 (abstract syntax notation)
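To make the idea of tables with rows (records) and columns (fields) concrete, here is a small sketch using Python's built-in sqlite3 module. SQLite is chosen only because it needs no server; it is not one of the systems named above. The table name, column names, and DEMO_ accessions are all invented for the example.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# One table: each column is a field, each row is a record (tuple).
conn.execute(
    "CREATE TABLE sequence ("
    "accession TEXT PRIMARY KEY, organism TEXT, residues TEXT)"
)
conn.executemany(
    "INSERT INTO sequence VALUES (?, ?, ?)",
    [
        ("DEMO_0001", "human", "ACCGTGATGTAGAGACCACGGGCCC"),
        ("DEMO_0002", "mouse", "CCCAGTGTGTAACA"),
        ("DEMO_0003", "cat", "AGTGTGTGTTGTGCCCG"),
    ],
)

# Look a record up by its key and compute a value from one field.
row = conn.execute(
    "SELECT organism, length(residues) FROM sequence WHERE accession = ?",
    ("DEMO_0002",),
).fetchone()
print(row)  # ('mouse', 14)
```

Declaring the accession as the primary key lets the database find a record directly by its identifier rather than scanning every row.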
Comparison
Structure: Flat File
Advantages: Fast data retrieval, simple structure, easy programming
Disadvantages: Difficult to process multiple values, adding new data requires reprogramming, slow without the key
Structure: Hierarchical
Advantages: Addition and deletion easy, fast retrieval through higher-level records, multiple associations with like records
Disadvantages: Pointers require large computer storage, pointer path restricts access, each association requires repetitive data