
Lecture1 BIOF242 Shuvadeep

The document provides an introduction to bioinformatics, covering key concepts such as the central dogma of molecular biology, reverse transcription, and the FASTA format for sequence representation. It discusses various database types, including GenBank and Entrez, and highlights the importance of sequence identifiers like GI and version numbers. Additionally, it outlines the structure and advantages of different database designs used in bioinformatics.


Introduction to Bioinformatics

BIOF 242

Date: 23/03/23 Dr. Shuvadeep Maity


Department of Biological Sciences
Central dogma
The flow of genetic information within a biological system. It is often stated as
"DNA makes RNA, and RNA makes protein".

DNA (cDNA/complementary DNA)
5’-ATGCCTAGGTACCTATGA-3’
3’-TACGGATCCATGGATACT-5’

Transcription (DNA to mRNA); reverse transcription (mRNA to DNA, the reverse of the
central dogma)

mRNA
5’-AUGCCUAGGUACCUAUGA-3’
decoded as
5’-AUG CCU AGG UAC CUA UGA-3’

Translation (mRNA to protein)

Protein
N-MET-PRO-ARG-TYR-LEU-C

The initial conversion of RNA to DNA — going in reverse of the central dogma — is
called reverse transcription, and viruses that use this mechanism are classified as
retroviruses.
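
The flow shown on this slide can be sketched in a few lines of Python. This is a minimal illustration, not a production tool: the function names are made up for this example, the input is assumed to be the coding (5’ to 3’) strand, and reverse transcription is reduced to swapping U back to T, which recovers the coding-strand sequence rather than modelling the enzyme.

```python
# Standard genetic code, laid out with the bases in the order U, C, A, G.
BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def transcribe(coding_dna: str) -> str:
    """Coding-strand DNA to mRNA: same sequence with T replaced by U."""
    return coding_dna.upper().replace("T", "U")


def reverse_transcribe(mrna: str) -> str:
    """mRNA back to the coding-strand DNA sequence: U replaced by T."""
    return mrna.upper().replace("U", "T")


def translate(mrna: str) -> str:
    """Read codons from the first base and stop at the first stop codon."""
    protein = []
    for pos in range(0, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[pos:pos + 3]]
        if residue == "*":  # stop codon (UAA, UAG or UGA)
            break
        protein.append(residue)
    return "".join(protein)


dna = "ATGCCTAGGTACCTATGA"  # sequence from the slide
mrna = transcribe(dna)
print(mrna)                             # AUGCCUAGGUACCUAUGA
print(translate(mrna))                  # MPRYL (Met-Pro-Arg-Tyr-Leu)
print(reverse_transcribe(mrna) == dna)  # True
```

The one-letter output MPRYL corresponds to the MET-PRO-ARG-TYR-LEU chain on the slide; the final UGA codon is a stop signal and adds no residue.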

Sequences are written in a particular manner: FORMAT


FASTA

• In bioinformatics and biochemistry, the FASTA format is a text-based format for
representing either nucleotide sequences or amino acid (protein) sequences, in which
nucleotides or amino acids are represented using single-letter codes.

• The format also allows for sequence names and comments to precede the sequences.

• The format originates from the FASTA software package (a program for fast alignment
by W.R. Pearson) but has now become a near-universal standard in the field of
bioinformatics.

• The simplicity of the FASTA format makes it easy to manipulate and parse sequences
using text-processing tools and scripting languages such as R, Python, Ruby, Haskell,
and Perl.
FASTA

In this format the sequence name and additional information are provided on one line,
called the header (or identifier) line. It begins with '>', gives a name and/or a unique
identifier for the sequence, and may also contain additional information.

The sequence itself follows on the next lines, either in interleaved format, which
means it is wrapped at a fixed number of characters per line, or non-interleaved, which
means the whole sequence is written on a single line.

Interleaved FASTA (three sequences):
>human
ACCGTGATGT
AGAGACCACG
GGCCC
>mouse
CCCAGTGTGT
AACA
>cat
AGTGTGTGTT
GTGCCCG

Non-interleaved FASTA (of the above example):
>human
ACCGTGATGTAGAGACCACGGGCCC
>mouse
CCCAGTGTGTAACA
>cat
AGTGTGTGTTGTGCCCG
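
As a rough sketch of how little code the format needs, the following Python reads FASTA text into a dictionary and writes it back either wrapped at a fixed width or with one line per sequence. The function names and the choice to use the whole header line as the record name are assumptions made for this example; real files may carry extra information after the name.

```python
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Collect sequences by header, joining wrapped lines back together."""
    chunks: Dict[str, List[str]] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            name = line[1:].strip()  # whole header line is used as the name
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line)
    return {n: "".join(parts) for n, parts in chunks.items()}


def format_fasta(records: Dict[str, str], width: int = 0) -> str:
    """Write records; width > 0 wraps each sequence at that many characters."""
    out: List[str] = []
    for name, seq in records.items():
        out.append(">" + name)
        if width > 0:
            out.extend(seq[i:i + width] for i in range(0, len(seq), width))
        else:
            out.append(seq)
    return "\n".join(out) + "\n"


single_line = """>human
ACCGTGATGTAGAGACCACGGGCCC
>mouse
CCCAGTGTGTAACA
>cat
AGTGTGTGTTGTGCCCG
"""

records = parse_fasta(single_line.splitlines())
wrapped = format_fasta(records, width=10)
print(wrapped)

# Parsing the wrapped text gives back exactly the same records.
print(parse_fasta(wrapped.splitlines()) == records)
```

Because the parser simply concatenates every line between two headers, the same function reads both layouts without needing to know which one it was given.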
Identifier

NCBI identifiers
The NCBI defined a standard for the unique identifier used for the sequence
(SeqID) in the header line. This allows a sequence that was obtained from a
database to be labelled with a reference to its database record.
A FASTA file can be opened in any text editor.
Sequence Identifiers
Many sequences have two types of identification numbers, GI and VERSION.

The two identifier types differ in format, and were implemented at different times.

GI numbers
A GI number (for GenInfo Identifier, sometimes written in lower case, "gi") is a simple
series of digits that are assigned consecutively to each sequence record processed by
NCBI.
The GI number bears no resemblance to the Version number of the sequence record. Each
time a sequence record is changed, it is assigned a new GI number.

Sequence Versions
A sequence Version groups all of the gi numbers for a specific sequence into an ordered
series.
A sequence version number consists of a base Accession number, a dot, and a version
suffix that starts with 1. (This identifier is often referred to as an "accession dot
version".)
The base Accession number identifies the sequence record, and the version suffixes form
the series of versions, starting with 1. A sequence Accession number without a version
suffix always refers to the latest version of the sequence.
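NCBI RefSeq accession prefixes indicate the record type: look for the reference
chromosome (NC_), transcript (NM_) and protein (NP_) records for your gene.

The NZ_ prefix marks incomplete genomic records.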
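
The "accession dot version" layout and the prefixes can be pulled apart with ordinary string handling. The sketch below is illustrative only: the helper name, the `SeqVersion` tuple and the identifier `NM_000000.2` are made up for this example and do not refer to a real record, and the prefix table covers just the four prefixes mentioned on the slide.

```python
from typing import NamedTuple, Optional

# Only the prefixes named on the slide; descriptions are paraphrased.
RECORD_TYPES = {
    "NC": "reference chromosome",
    "NM": "transcript",
    "NP": "protein",
    "NZ": "incomplete genomic record",
}


class SeqVersion(NamedTuple):
    accession: str           # base Accession number, e.g. "NM_000000"
    version: Optional[int]   # None means "latest version"
    record_type: str


def parse_seq_version(identifier: str) -> SeqVersion:
    """Split an identifier into base accession, version suffix and type."""
    accession, _, suffix = identifier.strip().partition(".")
    version = int(suffix) if suffix else None
    prefix = accession.split("_", 1)[0] if "_" in accession else ""
    return SeqVersion(accession, version, RECORD_TYPES.get(prefix, "unknown"))


# Hypothetical identifiers, used only to show the format.
print(parse_seq_version("NM_000000.2"))
print(parse_seq_version("NP_000000"))
```

An identifier without a suffix is returned with `version=None`, mirroring the rule that a bare Accession number refers to the latest version.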
FASTA (FA) vs PHYLIP

A popular format that is used in phylogeny (evolutionary tree) reconstruction is
PHYLIP.

It is a plain text format containing exactly two sections: a header describing the
dimensions of the alignment, followed by the multiple sequence alignment itself.

The first line contains the number of sequences and their length; since the sequences
are aligned, they all have the same length.
Each of the following lines contains the name of a sequence followed by one or more
spaces and the sequence.
In interleaved format the sequence is spread across several lines, along with
the name.
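
A small sketch of a writer for the non-interleaved layout described above: a header line with the two dimensions, then one "name, spaces, sequence" line per record. The function name and the three aligned sequences are invented for illustration. Note also that this follows the relaxed layout described here (any name followed by spaces); strict PHYLIP instead reserves a fixed 10-character name field.

```python
from typing import Dict


def to_phylip(alignment: Dict[str, str]) -> str:
    """Write an alignment as non-interleaved (sequential) PHYLIP text."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    (length,) = lengths
    # Pad names so the sequences line up, leaving at least two spaces.
    pad = max(len(name) for name in alignment) + 2
    lines = [f"{len(alignment)} {length}"]
    for name, seq in alignment.items():
        lines.append(name.ljust(pad) + seq)
    return "\n".join(lines) + "\n"


# Made-up aligned sequences; '-' marks a gap.
alignment = {
    "human": "ACCGTGATGT",
    "mouse": "ACC-TGATGT",
    "cat": "ACCGTGA-GT",
}
print(to_phylip(alignment))
```

The length check is the useful part: unaligned sequences of different lengths cannot be written as PHYLIP, so the function refuses them instead of producing an invalid header.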

Non-interleaved PHYLIP (three sequences)


3 44
Databases: features

Categories of Database
1. Data Type (Data heterogeneity)
2. Maintainer Status
3. Technical Design
4. Data Source
5. Data Access
6. And/or other parameters

Databases: Varieties
• Taxonomy Database
• Genome Database
• Sequence Database
• Structure Database
• Proteomic Database
• Micro-array Database
• Enzyme Database
• Disease Database
• Pathway Database
• Literature Database… Many More
Entrez is a window into the molecular biology subset.
It is a molecular biology database system that provides integrated access to
nucleotide and protein sequence data, gene-centered and genomic mapping
information, 3D structure data, PubMed MEDLINE, and more.

Entrez covers over 20 databases including the complete protein sequence data
from PIR-International, PRF, Swiss-Prot, and PDB and nucleotide sequence data
from GenBank that includes information from EMBL and DDBJ.
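
Entrez records can also be requested programmatically through NCBI's E-utilities web service. The sketch below only builds a request URL for the `efetch` endpoint and does not contact the network. The function name and the identifier `NM_000000.2` are hypothetical, and the choice of the `nuccore` database with FASTA text output is an assumption for this example; NCBI's usage guidelines should be checked before sending real requests.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_fasta_url(identifier: str, db: str = "nuccore") -> str:
    """Build an efetch URL asking for one record as plain-text FASTA."""
    params = {
        "db": db,            # Entrez database to query
        "id": identifier,    # accession.version or other record id
        "rettype": "fasta",
        "retmode": "text",
    }
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


# Hypothetical identifier, used only to show the URL layout.
print(efetch_fasta_url("NM_000000.2"))
```

Keeping URL construction separate from the actual download makes the request easy to inspect and test before any data is fetched.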
ENTREZ

Databases of different kinds are merged together and become global hubs of
knowledge.
OMIM (Online Mendelian Inheritance in Man) is a comprehensive, authoritative
compendium of human genes and genetic phenotypes that is freely available and
updated daily.

OMIM is authored and edited at the McKusick-Nathans Institute of Genetic
Medicine, Johns Hopkins University School of Medicine, under the direction of Dr.
Ada Hamosh. Its official home is omim.org.

PubMed is a free search engine accessing primarily the MEDLINE database of
references and abstracts on life sciences and biomedical topics. The United
States National Library of Medicine at the National Institutes of Health maintains
the database as part of the Entrez system of information retrieval.
Nucleotide Databases

GenBank® is the NIH genetic sequence database, an annotated collection of all
publicly available DNA sequences (Nucleic Acids Research, 2013 Jan;41(D1):D36-42).
GenBank is part of the International Nucleotide Sequence Database
Collaboration, which comprises the DNA Data Bank of Japan (DDBJ), the European
Nucleotide Archive (ENA), and GenBank at NCBI.

These three organizations exchange data on a daily basis.


Categories of Databases: Maintainer Status
• NCBI (Federal Govt. agency of USA)
(http://www.ncbi.nlm.nih.gov/)
• EBI/EMBL (Non-profit academic organization)
(http://www.ebi.ac.uk/)
• SIB, Swiss Institute of Bioinformatics (Quasi-academic non-profit foundation)
(http://www.isb-sib.ch)

Categories of Databases: Technical Design

• Flat file (information stored in text files)
• XML (Extensible Markup Language; hierarchical, semi-structured model)
• Relational model (highly structured model; it has tables with rows (tuples or
records) and columns (fields), and is supported by an RDBMS such as Oracle or DB2
and queried with SQL)
• Object-oriented database management system
• ASN.1 (Abstract Syntax Notation One)
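
To make the difference between a flat file and the relational model concrete, the sketch below stores the same three records both ways: once as delimited text that has to be scanned line by line, and once as a table in an in-memory SQLite database. The `|` delimiter, the table and column names, and the helper functions are all choices made for this illustration; SQLite stands in for the larger RDBMS products named above.

```python
import sqlite3
from typing import List, Optional, Tuple

# Flat file: one record per line, fields separated by '|'.
flat_file = """GenBank|DNA/RNA seq|Text file/ASN.1
PDB|Structure|Oracle
Medline|Literature|ASN.1
"""


def flat_lookup(text: str, name: str) -> Optional[Tuple[str, ...]]:
    """Scan every line until the first field matches the requested name."""
    for line in text.splitlines():
        fields = tuple(line.split("|"))
        if fields[0] == name:
            return fields
    return None


# Relational model: rows (records) and columns (fields) with a key.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE databases (name TEXT PRIMARY KEY, data TEXT, data_format TEXT)"
)
conn.executemany(
    "INSERT INTO databases VALUES (?, ?, ?)",
    [tuple(line.split("|")) for line in flat_file.splitlines()],
)


def relational_lookup(name: str) -> Optional[Tuple[str, ...]]:
    """Fetch one row through the primary key."""
    query = "SELECT name, data, data_format FROM databases WHERE name = ?"
    return conn.execute(query, (name,)).fetchone()


def names_with_format(fragment: str) -> List[str]:
    """An ad hoc question answered with a query instead of new parsing code."""
    query = "SELECT name FROM databases WHERE data_format LIKE ? ORDER BY name"
    return [row[0] for row in conn.execute(query, (f"%{fragment}%",))]


print(flat_lookup(flat_file, "PDB"))
print(relational_lookup("PDB"))
print(names_with_format("ASN.1"))
```

Both lookups return the same record, but the last function shows the relational advantage: a question nobody planned for is a new query rather than a new parser.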
Comparison of database structures

Flat File
  Advantages: fast data retrieval, simple structure, easy programming.
  Disadvantages: difficult to process multiple values, adding new data requires
  reprogramming, slow without the key.

Hierarchical
  Advantages: addition and deletion are easy, fast retrieval through higher-level
  records, multiple associations with like records.
  Disadvantages: pointers require large computer storage, the pointer path restricts
  access, each association requires repetitive data.

Relational
  Advantages: easy access, minimal training for users, flexible for unforeseen
  enquiries, easy modification, physical storage of data can be changed without
  affecting the relationships.
  Disadvantages: sequential access is slow, prone to logical mistakes, the method of
  storage impacts processing time, new relations require considerable processing.
Database     Data                    Data format          Data type
GenBank      DNA/RNA seq             Text file/ASN.1      Text, Numeric
OMIM         Phenotype, genotype     Text file/ASN.1      Text file
GDB          Genetic map             Relational/MySQL     Text, Numeric
AceDB                                Object oriented      Text, Numeric
Medline      Literature              ASN.1                Text
NCBI         Seq, str, literature    ASN.1                Text, Numeric
PDB          Structure               Oracle               3D Image
BLAST        Seq, Analysis           FASTA                Text, Numeric
ClustalW     Seq, Analysis           FASTA                Text, Numeric
KEGG         Metabolic path          HTML text, binary    Images, text
Microarray   Microarray data         RDBMS, Excel         Images, text
