
Module_3_Reference Course content


MIT School of Bioengineering Sciences and Research

(A Constituent unit of MIT ADT University)

Basic Concepts In Bioinformatics / BI301

Module 03

Performing data storage methods and various formats.

Course Coordinator: Dr. Sanket P. Bapat / Dr. Priyanka Nath


Mail ID: sanket.bapat@mituniversity.edu.in | priyanka.nath@mituniversity.edu.in
Disclaimer:

The content delivered here should be considered of utmost importance. However, note
that this material is not stand-alone material for fulfilling the course syllabus. The
content in this presentation should be used only as an aid to learning.
Refer to the books and other resources provided for an exhaustive understanding.

MITBIO/MITADT University
Syllabus:

Module 3:
Data storage, retrieval and interoperability

Flat files, relational and object-oriented databases, and controlled
vocabularies. File formats (GenBank, DDBJ, FASTA, PDB, SwissProt).
Introduction to metadata and search: indices, Boolean, fuzzy and
neighbouring search. The challenges of data exchange and integration.
Ontologies, interchange languages and standardization efforts.

Objective/Learning Outcome:

CO1 Understanding the basics of bioinformatics and its Applications

CO2 Difference between databases and various biological databases

CO3 Performing data storage methods and various formats.

CO4 Understanding sequence alignment and types of sequence alignment

CO5 Discuss the basics of gene expression and understand the difference between pattern finding and regular expressions

CO6 Deduce the evolutionary relationships between the sequences by generating a phylogenetic tree.

Common File Formats

IG/Stanford Fitch Plain/Raw

GenBank/GB Fasta/Pearson PIR/CODATA

NBRF Zuker MSF

EMBL Olsen ASN.1

GCG Phylip 3.2 PAUP/NEXUS

DNAStrider Phylip Pretty

Sanket Bapat
PLAIN SEQUENCE FORMAT

A sequence in plain format may contain only IUPAC characters and spaces (no numbers!).
Note: A file in plain sequence format may contain only one sequence,
while most other formats accept several sequences in one file.

An example sequence in plain format is:

AACCTGCGGAAGGATCATTACCGAGTGCGGGTCCTTTGGGCCCAA
CCTCCCATCCGTGTCTATTGTACCCTGTTGCTTCGGCGGGCCCGC
CGCTTGTCGGCCGCCGGGGGGGCGCCTCTGCCCCCCGGGCCCGTG
CCCGCCGGAGACCCCAACACGAACACTGTCTGAAAGCGTGCAGTC
TGAGTTGATTGAATGCAATCAGTTAAAACTTTCAACAATGGATCT
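
The rule above (IUPAC characters and whitespace only, one sequence per file) can be checked with a few lines of Python. This is a minimal sketch, not a reference implementation: the function name read_plain_sequence is hypothetical, and it assumes a nucleotide sequence, since the slide does not say which IUPAC alphabet is meant.

```python
# IUPAC nucleotide codes. Assumption: nucleotide data; a protein file
# would need the IUPAC amino-acid alphabet instead.
IUPAC_NUCLEOTIDES = set("ACGTURYSWKMBDHVN")


def read_plain_sequence(text):
    """Return the sequence with all whitespace removed.

    Raises ValueError if the text contains anything other than IUPAC
    nucleotide codes and whitespace (digits, '>' headers, etc.).
    """
    sequence = "".join(text.split()).upper()
    invalid = sorted(set(sequence) - IUPAC_NUCLEOTIDES)
    if invalid:
        raise ValueError(f"not plain format, unexpected characters: {invalid}")
    return sequence


# First two lines of the example sequence shown above.
example = """
AACCTGCGGAAGGATCATTACCGAGTGCGGGTCCTTTGGGCCCAA
CCTCCCATCCGTGTCTATTGTACCCTGTTGCTTCGGCGGGCCCGC
"""
print(len(read_plain_sequence(example)))
```

Because the whole file is treated as one sequence, line breaks and spaces are simply discarded; any digit or header character makes the check fail.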

EMBL FORMAT

A sequence file in EMBL format can contain several sequences.

One sequence entry starts with an identifier line ("ID "), followed by further
annotation lines. The start of the sequence is marked by a line starting with "SQ"
and the end of the sequence is marked by two slashes ("//").

An example sequence in EMBL format is:


ID   AA03518    standard; DNA; FUN; 237 BP.
XX
AC   U03518;
XX
DE   Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
DE   rRNA and 5.8S rRNA genes, partial sequence.
XX
SQ   Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other;
     aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc        60
     tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg       120
     ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc       180
     tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc          237
//
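
The three markers described above ("ID" opens an entry, "SQ" starts the sequence, "//" ends it) are enough to split an EMBL file into entries. The following is a rough sketch under that assumption only; parse_embl is a hypothetical helper, it ignores every annotation line except "AC", and real work would normally use an established parser.

```python
def parse_embl(text):
    """Return a list of {'id', 'accession', 'sequence'} dicts, one per entry."""
    entries = []
    entry = None          # entry currently being read, if any
    in_sequence = False   # True between the "SQ" line and "//"
    for line in text.splitlines():
        if entry is not None and in_sequence:
            if line.startswith("//"):
                # End-of-entry marker: store the finished entry.
                entries.append(entry)
                entry, in_sequence = None, False
            else:
                # Keep letters only: drops the spaces and the position counter.
                entry["sequence"] += "".join(ch for ch in line if ch.isalpha())
        elif line.startswith("ID"):
            # Identifier line: the first token after "ID" is the entry name.
            entry = {"id": line.split()[1], "accession": None, "sequence": ""}
        elif entry is not None and line.startswith("AC"):
            entry["accession"] = line.split()[1].rstrip(";")
        elif entry is not None and line.startswith("SQ"):
            in_sequence = True
    return entries


EMBL_TEXT = """\
ID   AA03518    standard; DNA; FUN; 237 BP.
XX
AC   U03518;
XX
DE   Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
DE   rRNA and 5.8S rRNA genes, partial sequence.
XX
SQ   Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other;
     aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc        60
     tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg       120
     ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc       180
     tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc          237
//
"""

entries = parse_embl(EMBL_TEXT)
print(entries[0]["id"], entries[0]["accession"], len(entries[0]["sequence"]))
```

Since every entry is closed by "//", the same loop handles a file with several entries without any change.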
GENBANK FORMAT

A sequence file in GenBank format can contain several sequences.

One sequence in GenBank format starts with a line containing the word LOCUS
and a number of annotation lines. The start of the sequence is marked by a line
containing "ORIGIN" and the end of the sequence is marked by two slashes ("//").

An example sequence in GenBank format is:

LOCUS       AAU03518      237 bp    DNA             PLN       04-FEB-1995
DEFINITION  Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
            rRNA and 5.8S rRNA genes, partial sequence.
ACCESSION   U03518
BASE COUNT       41 a     77 c     67 g     52 t
ORIGIN
        1 aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc
       61 tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg
      121 ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc
      181 tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc
//
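
One practical use of the annotation lines is a consistency check: the bases read after "ORIGIN" should add up to the figures on the "BASE COUNT" line. The sketch below illustrates this for a single record; parse_genbank_record is a hypothetical name, and the code assumes the older record layout shown above, which still carries a "BASE COUNT" line.

```python
from collections import Counter


def parse_genbank_record(text):
    """Read one GenBank-style record up to the first '//' line."""
    record = {"locus": None, "declared_counts": {}, "sequence": ""}
    in_origin = False
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "//":
            break  # end of the record
        if in_origin:
            # Sequence line: skip the leading position number, join the blocks.
            record["sequence"] += "".join(f for f in fields if f.isalpha())
        elif fields[0] == "LOCUS":
            record["locus"] = fields[1]
        elif fields[:2] == ["BASE", "COUNT"]:
            # Remaining fields alternate: number, base letter, number, ...
            numbers, letters = fields[2::2], fields[3::2]
            record["declared_counts"] = {b: int(n) for n, b in zip(numbers, letters)}
        elif fields[0] == "ORIGIN":
            in_origin = True
    return record


GENBANK_TEXT = """\
LOCUS       AAU03518      237 bp    DNA             PLN       04-FEB-1995
DEFINITION  Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
            rRNA and 5.8S rRNA genes, partial sequence.
ACCESSION   U03518
BASE COUNT       41 a     77 c     67 g     52 t
ORIGIN
        1 aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc
       61 tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg
      121 ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc
      181 tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc
//
"""

record = parse_genbank_record(GENBANK_TEXT)
actual_counts = dict(Counter(record["sequence"]))
print(record["locus"], len(record["sequence"]))
print("counts match BASE COUNT line:", actual_counts == record["declared_counts"])
```

For the example record the computed counts agree with the declared ones (41 a, 77 c, 67 g, 52 t, 237 bases in total).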

Information storage and retrieval

• Information storage and retrieval is the systematic process of
collecting and cataloging data so that they can be located and
displayed on request. Computers and data processing techniques have
made possible high-speed access to large amounts of information for
government, commercial, and academic purposes.
• It is also a branch of computer or library science concerned with
storing, locating, searching and selecting, upon demand, relevant data
on a given subject.

Information Retrieval

• Major components of IR: information retrieval can be divided into
several major constituents, which include:
1. Database
2. Search mechanism
3. Language
4. Interface

Metadata
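• Metadata, in general, is data about data; it provides basic
information about the object it describes.
• Metadata exists for almost every conceivable object
or group of objects, whether stored in electronic form
or not.

Types of metadata:
A. Descriptive metadata: identifies and describes a resource so that it can be found (e.g. title, author, keywords).
B. Administrative metadata: supports the management of a resource (e.g. creation date, file type, access rights).
C. Structural metadata: describes how the parts of a resource are organized and related.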

Interoperability
• Interoperability is the ability of two or more
components or systems to exchange information and
to use the information that has been exchanged.
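
In sequence analysis, interoperability often comes down to a shared file format: one program writes a record and another reads it back without loss. The sketch below illustrates this with FASTA, one of the formats named in the syllabus, where a header line starting with ">" is followed by the sequence lines. The function names to_fasta and from_fasta are hypothetical, and the 60-character line width is a common convention rather than a requirement.

```python
def to_fasta(identifier, description, sequence, width=60):
    """Write one record as FASTA text: a '>' header, then wrapped sequence."""
    header = f">{identifier} {description}".rstrip()
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + lines) + "\n"


def from_fasta(text):
    """Read FASTA text back into a list of record dicts."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header: the identifier is everything up to the first space.
            identifier, _, description = line[1:].partition(" ")
            records.append(
                {"id": identifier, "description": description, "sequence": ""}
            )
        elif records:
            records[-1]["sequence"] += line
    return records


# "System A" exports a record; "system B" imports it again.
sequence = "aacctgcgga" * 8
fasta_text = to_fasta("U03518", "Aspergillus awamori ITS1", sequence)
imported = from_fasta(fasta_text)
print(fasta_text)
print(imported[0]["sequence"] == sequence)
```

The exchange succeeds only because both sides agree on what the header and sequence lines mean, which is the point of the definition above: the information must be usable after it has been exchanged.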

References:

Author              Book Name                                           Library

Jin Xiong           Essential Bioinformatics                            Ebook / Present in Library
Rastogi, S. C.      Bioinformatics: Concepts, Skills and Applications   Present in Library
Bosu and Thukral    Bioinformatics: Databases, Tools and Algorithms     Present in Library
Neelam Yadav        Handbook to Bioinformatics                          Present in Library

The content is intended for internal use only, and the ownership belongs to the coordinator. It
should not be uploaded on any platform without proper authorization.
