Lecture1 BIOF242 Shuvadeep

The document provides an introduction to bioinformatics, covering key concepts such as the central dogma of molecular biology, reverse transcription, and the FASTA format for sequence representation. It discusses various database types, including GenBank and Entrez, and highlights the importance of sequence identifiers like GI and version numbers. Additionally, it outlines the structure and advantages of different database designs used in bioinformatics.

Uploaded by

f20230976

Introduction to Bioinformatics

BIOF 242

Date: 23/03/23 Dr. Shuvadeep Maity

Department of Biological Sciences
Central dogma

The central dogma describes the flow of genetic information within a biological system.
It is often stated as "DNA makes RNA, and RNA makes protein".

DNA       5’-ATGCCTAGGTACCTATGA-3’
          3’-TACGGATCCATGGATACT-5’

    Transcription (DNA to mRNA). Reverse transcription (the reverse of the
    central dogma) produces cDNA/complementary DNA from RNA.

mRNA      5’-AUGCCUAGGUACCUAUGA-3’
decoded as
          5’-AUG CCU AGG UAC CUA UGA-3’

    Translation

Protein   N-MET-PRO-ARG-TYR-LEU-C

The initial conversion of RNA to DNA — going in reverse of the central dogma — is
called reverse transcription, and viruses that use this mechanism are classified as
retroviruses.
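
The flow DNA, mRNA, protein in the example above can be traced step by step in code. The following is a minimal Python sketch and is not part of the original slides. The function names and the partial codon table (only the six codons that occur in the example) are assumptions made for illustration.

```python
# Partial codon table: only the codons that occur in the slide's example.
CODON_TABLE = {
    "AUG": "MET", "CCU": "PRO", "AGG": "ARG",
    "UAC": "TYR", "CUA": "LEU", "UGA": None,  # UGA is a stop codon
}

def transcribe(coding_strand):
    """Return the mRNA for a 5'->3' coding strand (T is replaced by U)."""
    return coding_strand.replace("T", "U")

def reverse_transcribe(mrna):
    """Return the cDNA strand complementary to the mRNA, written 5'->3'."""
    pairs = {"A": "T", "U": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(mrna))

def translate(mrna):
    """Translate codon by codon until a stop codon is reached."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[i:i + 3]]
        if residue is None:
            break
        protein.append(residue)
    return "-".join(protein)

dna = "ATGCCTAGGTACCTATGA"
mrna = transcribe(dna)
print(mrna)                      # mRNA sequence
print(translate(mrna))           # residues joined by '-'
print(reverse_transcribe(mrna))  # cDNA, 5'->3'
```

A real translation routine would need the full 64-codon table; the reduced table here only keeps the sketch short.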

Sequences are written in a particular manner: FORMAT

FASTA

• In bioinformatics and biochemistry, the FASTA format is a text-based format for
representing either nucleotide sequences or amino acid (protein) sequences, in which
nucleotides or amino acids are represented using single-letter codes.

• The format also allows for sequence names and comments to precede the sequences.

• The format originates from the FASTA software package (a program for fast alignment
by W. R. Pearson) but has now become a near-universal standard in the field of
bioinformatics.

• The simplicity of the FASTA format makes it easy to manipulate and parse sequences using
text-processing tools and scripting languages such as R, Python, Ruby, Haskell, and Perl.
FASTA

In this format the sequence name and additional information are provided on one line, called
the header/identifier line. It begins with '>', gives a name and/or a unique identifier for
the sequence, and may also contain additional information.

The sequence itself is given in the following lines, either in interleaved format, which
means a fixed number of characters per line, or non-interleaved, with the whole sequence on
one line.

Interleaved FASTA (three sequences):
>human
ACCGTGATGT
AGAGACCACG
GGCCC
>mouse
CCCAGTGTGT
AACA
>cat
AGTGTGTGTT
GTGCCCG

Non-interleaved FASTA (of the above example):
>human
ACCGTGATGTAGAGACCACGGGCCC
>mouse
CCCAGTGTGTAACA
>cat
AGTGTGTGTTGTGCCCG
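
Because both layouts carry the same information, a parser only has to join the sequence lines that follow each header. The following is a minimal Python sketch, not part of the original slides; `parse_fasta` is a hypothetical helper name, and the sample text reuses two of the sequences wrapped at 10 characters per line.

```python
def parse_fasta(text):
    """Return a dict mapping each header (without '>') to its full sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:]      # header/identifier line starts a new record
            records[name] = ""
        elif name is not None:
            records[name] += line  # sequence lines are concatenated
    return records

interleaved = """>human
ACCGTGATGT
AGAGACCACG
GGCCC
>mouse
CCCAGTGTGT
AACA
"""

non_interleaved = """>human
ACCGTGATGTAGAGACCACGGGCCC
>mouse
CCCAGTGTGTAACA
"""

# Both layouts should yield identical records.
print(parse_fasta(interleaved) == parse_fasta(non_interleaved))
```

For real work an established library parser is usually a safer choice than a hand-written one; this sketch only illustrates the structure of the format.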
Identifier

NCBI identifiers
The NCBI defined a standard for the unique identifier used for the sequence
(SeqID) in the header line. This allows a sequence that was obtained from a
database to be labelled with a reference to its database record.
A FASTA file can be opened in any text editor.
Sequence Identifiers
Many sequences have two types of identification numbers, GI and VERSION.

The two identifier types differ in format, and were implemented at different times.

GI numbers
A GI number (for GenInfo Identifier, sometimes written in lower case, "gi") is a simple
series of digits that are assigned consecutively to each sequence record processed by
NCBI.
The GI number bears no resemblance to the Version number of the sequence record. Each
time a sequence record is changed, it is assigned a new GI number.

Sequence Versions
A sequence Version groups all of the gi numbers for a specific sequence into an ordered
series.
A sequence version number consists of a base Accession number, a dot, and a version
suffix that starts with 1. (This identifier is often referred to as an "accession dot
version".)
The base Accession number identifies the sequence record, and the version suffixes form
the series of versions, starting with .1. A sequence Accession number without a version
suffix always refers to the latest version of the sequence.
RefSeq accession prefixes indicate the record type: look for the reference chromosome (NC_),
transcript (NM_), and protein (NP_) records for your gene.

The NZ_ prefix marks incomplete records.
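
The "accession dot version" layout and the record-type prefixes described above can be taken apart mechanically. The following is a minimal Python sketch, not part of the original slides; the helper names are hypothetical and `NM_000000.2` is a made-up identifier used only for illustration.

```python
# Prefix meanings taken from the text above.
PREFIX_MEANING = {
    "NC_": "reference chromosome",
    "NM_": "transcript",
    "NP_": "protein",
    "NZ_": "incomplete",
}

def split_accession_version(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version).

    The version is None when there is no suffix, which by convention
    refers to the latest version of the record.
    """
    accession, dot, suffix = identifier.partition(".")
    version = int(suffix) if dot else None
    return accession, version

def record_type(accession):
    """Look up the record type from the first three characters."""
    return PREFIX_MEANING.get(accession[:3], "unknown")

# "NM_000000.2" is a made-up identifier used only for illustration.
accession, version = split_accession_version("NM_000000.2")
print(accession, version, record_type(accession))
```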
FASTA (FA) vs. PHYLIP

A popular format that is used in phylogeny (evolutionary tree) reconstruction is PHYLIP.

It is a plain text format containing exactly two sections: a header describing the
dimensions of the alignment, followed by the multiple sequence alignment itself.

The first line contains the number of sequences and their length – since sequences
are aligned, they will all have the same length.
The following lines each contain the name of the sequence followed by one or more
spaces and the sequence.
In interleaved format the sequence is represented across several lines along with
the name as well.

Non-interleaved PHYLIP (three sequences)

3 44
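
The two-section layout described above (a header with the dimensions, then one line per sequence) is simple to generate. The following is a minimal Python sketch, not part of the original slides; `to_phylip` is a hypothetical helper and the aligned sequences are made up for illustration. It follows the relaxed description given here (name, one or more spaces, sequence); strict PHYLIP readers expect names padded to a fixed width of 10 characters.

```python
def to_phylip(alignment):
    """Format an alignment (dict of name -> aligned sequence) as PHYLIP text."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    # Header: number of sequences, then alignment length.
    lines = [f"{len(alignment)} {lengths.pop()}"]
    # One line per sequence: name, spaces, sequence.
    width = max(len(name) for name in alignment) + 2
    for name, seq in alignment.items():
        lines.append(name.ljust(width) + seq)
    return "\n".join(lines)

# Made-up aligned sequences; '-' marks a gap.
alignment = {
    "human": "ACCGTGATGT",
    "mouse": "CCCAGTG-GT",
    "cat": "AGTGTGTGTT",
}
print(to_phylip(alignment))
```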
Databases: Features

Categories of Database
1. Data Type (Data heterogeneity)
2. Maintainer Status
3. Technical Design
4. Data Source
5. Data Access
6. And/or other parameters

Databases: Varieties
• Taxonomy Database
• Genome Database
• Sequence Database
• Structure Database
• Proteomic Database
• Micro-array Database
• Enzyme Database
• Disease Database
• Pathway Database
• Literature Database … and many more
Entrez is a window into the molecular biology subset.
It is a molecular biology database system that provides integrated access to
nucleotide and protein sequence data, gene-centered and genomic mapping
information, 3D structure data, PubMed MEDLINE, and more.

Entrez covers over 20 databases including the complete protein sequence data
from PIR-International, PRF, Swiss-Prot, and PDB and nucleotide sequence data
from GenBank that includes information from EMBL and DDBJ.
ENTREZ

Databases of different kinds have merged together and become global hubs of knowledge.
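
Entrez can also be queried programmatically. The following is a minimal Python sketch, not part of the original slides: it assumes NCBI's E-utilities `efetch` endpoint and its `db`, `id`, `rettype`, and `retmode` parameters, and it only builds the request URL without contacting the server. `NM_000000.2` is a made-up identifier used only for illustration.

```python
from urllib.parse import urlencode

# Base address of NCBI's E-utilities efetch service (not mentioned in the slides).
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def efetch_fasta_url(database, identifier):
    """Build a URL that asks Entrez for one record in FASTA format."""
    params = {
        "db": database,
        "id": identifier,
        "rettype": "fasta",
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urlencode(params)

# "NM_000000.2" is a made-up identifier used only for illustration.
print(efetch_fasta_url("nucleotide", "NM_000000.2"))
```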
OMIM (Online Mendelian Inheritance in Man) is a comprehensive, authoritative compendium
of human genes and genetic phenotypes that is freely available and updated daily.

OMIM is authored and edited at the McKusick-Nathans Institute of Genetic Medicine,
Johns Hopkins University School of Medicine, under the direction of Dr. Ada Hamosh.
Its official home is omim.org.

PubMed is a free search engine accessing primarily the MEDLINE database of references
and abstracts on life sciences and biomedical topics. The United States National Library
of Medicine at the National Institutes of Health maintains the database as part of the
Entrez system of information retrieval.
Nucleotide Databases

GenBank® is the NIH genetic sequence database, an annotated collection of all publicly
available DNA sequences (Nucleic Acids Research, 2013 Jan;41(D1):D36-42).
GenBank is part of the International Nucleotide Sequence Database
Collaboration, which comprises the DNA Data Bank of Japan (DDBJ), the European
Nucleotide Archive (ENA), and GenBank at NCBI.

These three organizations exchange data on a daily basis.

Categories of Databases: Maintainer Status
• NCBI (Federal Govt. agency of USA)
(http://www.ncbi.nlm.nih.gov/)
• EBI/EMBL (Non-profit academic organization)
(http://www.ebi.ac.uk/)
• SIB, Swiss Institute of Bioinformatics (Quasi-academic non-profit foundation)
(http://www.isb-sib.ch)

Categories of Databases: Technical Design

• Flat file (information stored in text files)
• XML (Extensible Markup Language; hierarchical, semi-structured model)
• Relational model (highly structured model): tables with rows (tuples or records) and
columns (fields), supported by SQL-based RDBMSs such as Oracle and DB2
• Object-oriented database management system
• ASN.1 (Abstract Syntax Notation One)
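
The relational model in the list above can be made concrete with a tiny table. The following is a minimal Python sketch, not part of the original slides; it uses the standard-library `sqlite3` module with an in-memory database, and the table and column names are assumptions chosen for illustration.

```python
import sqlite3

# In-memory database; table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sequences (name TEXT PRIMARY KEY, residues TEXT)")

# Each row (tuple/record) describes one sequence; each column is a field.
rows = [
    ("human", "ACCGTGATGTAGAGACCACGGGCCC"),
    ("mouse", "CCCAGTGTGTAACA"),
    ("cat", "AGTGTGTGTTGTGCCCG"),
]
conn.executemany("INSERT INTO sequences VALUES (?, ?)", rows)

# A query selects fields from the rows that match a condition.
query = (
    "SELECT name, LENGTH(residues) FROM sequences "
    "WHERE LENGTH(residues) > ? ORDER BY name"
)
result = conn.execute(query, (15,)).fetchall()
print(result)
conn.close()
```

The same data kept in a flat file would need custom code for every new kind of question, whereas here a new question only needs a new query.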
Comparison of structures

Flat file
  Advantages: fast data retrieval; simple structure; easy programming
  Disadvantages: difficult to process multiple values; adding new data requires
  reprogramming; slow without the key

Hierarchical
  Advantages: addition and deletion are easy; fast retrieval through higher-level
  records; multiple associations with like records
  Disadvantages: pointers require large computer storage; the pointer path restricts
  access; each association requires repetitive data

Relational
  Advantages: easy access; minimal training for users; flexible for unforeseen
  enquiries; easy modification; physical storage of data can be changed without
  affecting the relationships
  Disadvantages: sequential access is slow; prone to logical mistakes; the method of
  storage impacts processing time; new relations require considerable processing
Database    Data                  Data format        Data type
GenBank     DNA/RNA seq           Text file/ASN.1    Text, Numeric
OMIM        Phenotype, genotype   Text file/ASN.1    Text file
GDB         Genetic map           Relational/MySQL   Text, Numeric
AceDB                             Object oriented    Text, Numeric
Medline     Literature            ASN.1              Text
NCBI        Seq, str, literature  ASN.1              Text, Numeric
PDB         Structure             Oracle             3D Image
BLAST       Seq, Analysis         Fasta              Text, Numeric
ClustalW    Seq, Analysis         Fasta              Text, Numeric
KEGG        Metabolic path        HTML text, binary  Images, text
Microarray  Microarray data       RDBMS, Excel       Images, text
