Bioinformatics Intro

Uploaded by

NICHOLAS BARASA

Introduction to bioinformatics

What is bioinformatics?

• an emerging interdisciplinary research area

• deals with the computational management and analysis of biological information: genes, genomes, proteins, cells, ecological systems, medical information, robots, artificial intelligence...
The Core of Bioinformatics to date
•Relationships between sequence, 3D structure, and protein functions

[Figure: an example amino acid sequence mapped to its 3D structure and protein functions]
TDQAAFDTNIVTLTRFVMEQGRKARGTGEMTQLLNSLCTAVKAISTAVRKAGIAHLYGIAGSTNVTGDQVKKLDVLSNDLVINVLKSSFATCVLVTEEDKNAIIVEPEKRGKYVVCFDPLDGSSNIDCLVSIGTIFGIYRKNSTDEPSEKDALQPGRNLVAAGYALYGSATMLV

•Properties and evolution of genes, genomes, proteins, metabolic pathways in cells

•Use of this knowledge for prediction, modelling, and design

“The holy grail of bioinformatics”
GCTCCTCACTGTCTGTGTTTATTCTTTTAGCTTCTTCAGA
TCTTTTAGTCTGAGGAAGCCTGGCATGTGCAAATGAAG
TTAACCTAA...

> 500,000 genes sequenced to date

Expected number of unique protein structures: ~700-1,000
Basic concepts

• conceptual foundations of bioinformatics: evolution, protein folding, protein function

• bioinformatics builds mathematical models of these processes to infer relationships between components of complex biological systems
Information processing in cells

[Figure: information flows from nucleic acids (coding regions, regulatory sites) through transcripts to proteins]

One-to-many mappings!
Context-dependence!
Global approaches: Toward a new Systems Biology

Global cell state
• Genome
• Genome activation patterns: transcriptomics
• Protein population: proteomics
• Organisation (tissue, cells, molecular complexes): imaging, EM, X-ray, NMR

•How does the spatial and temporal organisation of living matter give rise to biological processes?
Global approaches: Toward a new Systems Biology

[Figure: Perturbation → Living cell → Dynamic response]

[Figure: computerised biological knowledge, sequence information, and structural information feed into bioinformatics (mathematical modelling, simulation) to build a “virtual cell”, which yields basic principles and practical applications]
We do not yet know whether the information in the genome is sufficient to reconstruct an entire biological system. Information on the building blocks is not enough; information on their interactions is essential.

External environment

Internal environment

Metabolic net

Genetic networks

DNA hRNA mRNAs proteins

Bioinformatics in context

[Figure: bioinformatics at the centre, surrounded by related fields]
• Genomics
• Mathematics/computer science
• Molecular biology
• Biophysics
• Molecular evolution
• Ethical, legal, and social implications
Current challenges to users
• Potential hurdles:
Methods are in flux and not fully developed; resources are scattered and heterogeneous

• Remedies: Web resources
navigation guides
integration of tools and databanks

http://www.biochem.ucl.ac.uk/~nagl/bioinformatics.html
Biological databases
The challenge

(Boguski, 1999)

In 1995, the number of genes in the database started to exceed the number of papers
on molecular biology and genetics in the literature!
Data types
• primary data: sequence (primary database)
DNA: AATGCGTATAGGC
amino acid: DMPVERILEALAVE

• secondary data: “motifs” (regular expressions, blocks, profiles, fingerprints); secondary protein structure, e.g., alpha-helices, beta-strands (secondary db)

• tertiary data: atomic co-ordinates; tertiary protein structure: domains, folding units (tertiary db)
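A motif stored as a regular expression can be matched directly against a primary sequence. The sketch below is an illustration only: it takes the PROSITE N-glycosylation pattern N-{P}-[ST]-{P} as an assumed example, and the helper names `prosite_to_regex` and `find_motif` as well as the test sequence are hypothetical. Only the basic pattern syntax is handled.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Convert a simple PROSITE-style pattern to a Python regular expression.

    Supported syntax: '-' separators, x (any residue), [..] (allowed residues),
    {..} (forbidden residues), and (n) or (n,m) repeat counts.
    """
    parts = []
    for element in pattern.split("-"):
        # Separate an optional repeat count such as x(2) or x(2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = m.group(1), m.group(2)
        if core == "x":
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)

def find_motif(pattern: str, sequence: str) -> list[int]:
    """Return 0-based start positions of all matches, including overlapping ones."""
    # Wrapping the expression in a lookahead reports overlapping occurrences too.
    compiled = re.compile("(?=" + prosite_to_regex(pattern) + ")")
    return [m.start() for m in compiled.finditer(sequence)]

if __name__ == "__main__":
    print(prosite_to_regex("N-{P}-[ST]-{P}"))
    print(find_motif("N-{P}-[ST]-{P}", "MKNGSANPTLNVTQ"))
```

A regular expression either matches or does not, which is why the other motif types listed here (blocks, profiles, fingerprints) exist for weaker, graded similarity.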

Primary biological databases

• Nucleic acid: EMBL, GenBank, DDBJ (DNA Data Bank of Japan)

• Protein: PIR, MIPS, SWISS-PROT, TrEMBL, NRL-3D
International nucleotide data banks

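• GenBank (USA): NCBI, NLM
• EMBL (Europe): EBI
• DDBJ (Japan): NIG, CIB

The three data banks coordinate through an International Advisory Meeting and a Collaborative Meeting. The figure also shows TrEMBL on the EMBL side and NRDB on the GenBank side.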
GenBank file format
Swiss-Prot
SWISS-PROT file format
Other primary protein databases

• TrEMBL (translated EMBL), in SWISS-PROT format
rapid access to sequence data from genome projects
computer-annotated supplement to SWISS-PROT
translations of all coding sequences (CDS) in EMBL

• SP-TrEMBL

• REM-TrEMBL: immunoglobulins, T-cell receptors, short fragments, synthetic and patented sequences
Other primary protein databases

The Protein Information Resource (PIR)

• integrated system of protein sequence databases and derived related databases, e.g., alignment databases
• rapid searching, comparison, and pattern matching of protein sequences
• retrieval of descriptive, bibliographic, feature, and concurrent cross-reference information
• aims to be comprehensive and consistently annotated
PIR: related databases

NRL-3D Sequence-Structure Database

• produced by PIR from sequence and annotation information extracted from three-dimensional structures in the Protein Databank (PDB)

• allows keyword and similarity searches

PIR: related databases

PATCHX integrated with PIR

• a non-redundant database of protein sequences produced by MIPS, the European branch of PIR-International

The PIR Protein Sequence Database and PATCHX together provide the
most complete collection of protein sequence data currently available in
the public domain.
Composite protein sequence dbs
Each composite database and its source databases:
• NRDB: PIR, SP, PDB, GenPept
• OWL: PIR, SP, GenBank, NRL-3D
• MIPSX (PIR+PATCHX): PIR, SP, MIPSOwn, NRL-3D, MIPSH, PIRMOD, MIPSTrn, EMTrans, GBTrans, Kabat, PseqIP
• SP+TrEMBL: TrEMBL, SP
OWL composite database

• By accession number
• By database code
• By text
• By sequence
• By title
• By author
• By query language
• By regular expression
Direct OWL access:
OWL only released every 6-8 weeks

OWL Blast server

Two other useful sites
INFOBIOGEN-The Public Catalog of Databases
http://www.infobiogen.fr/services/dbcat/

KEGG-Kyoto Encyclopedia of Genes and Genomes

http://www.genome.ad.jp/kegg/
Kyoto Encyclopedia of Genes and Genomes (KEGG) is an effort to
computerize current knowledge of molecular and cellular biology in
terms of the information pathways that consist of interacting molecules
or genes and to provide links from the gene catalogs produced by
genome sequencing projects.
Sequence Retrieval System (SRS)

Database browser that allows users to retrieve, link, and access entries from all interconnected resources.
Users can formulate queries across a range of different database types.
Sequence Alignment
What is sequence alignment?
➢ Sequence alignment is a way of arranging DNA, RNA, or protein sequences to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences.
➢ Comparing two sequences (pairwise alignment) or more than two (multiple alignment) means searching for a series of individual characters or patterns that occur in the same order in the sequences.
➢ There are two types of alignment: local and global.
Global alignment vs Local alignment

➢ Global alignment attempts to match as much of each sequence as possible. Tools for global alignment are based on the Needleman-Wunsch algorithm.

➢ Local alignment tries to find the regions with the highest density of matches. Tools for local alignment are based on the Smith-Waterman algorithm.

➢ Both algorithms are derived from the basic dynamic programming algorithm.

Global alignment:
L G P S S K Q T G K G S - S R I W D N
L N - I T K S A G K G A I M R L G D A

Local alignment:
- - - - - - - T G K G - - - - - - - -
- - - - - - - A G K G - - - - - - - -
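The difference between the two recurrences can be sketched in a few lines. This is a minimal illustration, not a production aligner: the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for simplicity rather than a substitution matrix, and only the scores are computed, without traceback.

```python
def alignment_scores(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Return (global_score, local_score) for sequences a and b.

    Global: Needleman-Wunsch recurrence; the result is the bottom-right cell.
    Local: Smith-Waterman recurrence; cells are floored at 0 and the result
    is the best cell anywhere in the table.
    """
    rows, cols = len(a) + 1, len(b) + 1
    glob = [[0] * cols for _ in range(rows)]
    loc = [[0] * cols for _ in range(rows)]
    # Global alignment pays for leading gaps; local alignment does not.
    for i in range(1, rows):
        glob[i][0] = i * gap
    for j in range(1, cols):
        glob[0][j] = j * gap
    best_local = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            glob[i][j] = max(glob[i - 1][j - 1] + s,
                             glob[i - 1][j] + gap,
                             glob[i][j - 1] + gap)
            loc[i][j] = max(0,
                            loc[i - 1][j - 1] + s,
                            loc[i - 1][j] + gap,
                            loc[i][j - 1] + gap)
            best_local = max(best_local, loc[i][j])
    return glob[-1][-1], best_local

if __name__ == "__main__":
    print(alignment_scores("XXXTGKGYYY", "QQTGKGWW"))
```

The only structural differences are the initialisation of the first row and column, the floor at zero, and where the final score is read from.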
Why do sequence alignment?
➢ Sequence alignment is useful for discovering structural, functional, and evolutionary information in biological sequences.
➢ Sequences that are very much alike may have similar secondary and 3D structure, similar function, and likely a common ancestral sequence. It is extremely unlikely that such sequences became similar by chance.
-- For two DNA molecules of n nucleotides, the probability of being identical by chance is very low: $P = 4^{-n}$.
-- For two proteins of n amino acids, the probability is even lower: $P = 20^{-n}$.
➢ Sequence alignment supports the following tasks: 1. annotation of new sequences; 2. modelling of protein structures; 3. design and analysis of gene expression experiments
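To give a feel for how quickly these probabilities shrink, here is a small sketch. It assumes that every letter is equally likely and that positions are independent, which real sequences only approximate; the function name is hypothetical.

```python
def chance_identity_probability(n: int, alphabet_size: int) -> float:
    """Probability that two random sequences of length n are identical.

    Assumes uniform letter frequencies and independent positions.
    """
    return float(alphabet_size) ** (-n)

if __name__ == "__main__":
    for n in (5, 10, 20):
        dna = chance_identity_probability(n, 4)       # 4 nucleotides
        protein = chance_identity_probability(n, 20)  # 20 amino acids
        print(f"n={n}: DNA {dna:.3e}, protein {protein:.3e}")
```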
Methods of pairwise alignment

➢ Dot matrix analysis

➢ The dynamic programming (DP) algorithm
➢ Word methods
What is Dot matrix analysis
➢ Dot matrix analysis is a method for comparing two sequences to look for possible alignments (Gibbs and McIntyre 1970)
➢ The algorithm for a dot matrix:
1. One sequence (A) is listed across the top of the matrix and the other (B) is listed down the left side
2. Starting from the first character in B, one moves across the page, keeping in the first row and placing a dot in every column where the character in A is the same
3. The process is continued, row by row, until all possible comparisons between A and B are made
4. Any region of similarity is revealed by a diagonal row of dots
5. Isolated dots not on a diagonal represent random matches
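The steps above can be sketched as follows. The function names and the text rendering ("*" for a dot, "." for an empty cell) are illustrative choices, not part of the original method description.

```python
def dot_matrix(a: str, b: str) -> list[list[bool]]:
    """Build the dot matrix: one row per character of B, one column per character of A."""
    return [[char_a == char_b for char_a in a] for char_b in b]

def render(matrix: list[list[bool]], a: str, b: str) -> str:
    """Draw the matrix as text with A across the top and B down the left side."""
    lines = ["  " + a]
    for char_b, row in zip(b, matrix):
        lines.append(char_b + " " + "".join("*" if dot else "." for dot in row))
    return "\n".join(lines)

if __name__ == "__main__":
    seq_a, seq_b = "GATTACA", "GATTACA"
    print(render(dot_matrix(seq_a, seq_b), seq_a, seq_b))
```

Comparing a sequence with itself, as in the usage lines, gives a full main diagonal plus off-diagonal dots wherever a character recurs.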
What can Dot matrix analysis do?

➢ Detection of matching regions can be improved by filtering out random matches; this can be achieved by using a sliding window
➢ It can be used to assess repetitiveness in a single sequence, such as direct and inverted repeats within the sequence
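One way to apply a sliding window is sketched below: a dot is kept only if enough characters match along the diagonal window that starts at that cell. The window length and the stringency (`window`, `min_matches`) are assumed parameters for illustration.

```python
def windowed_dot_matrix(a: str, b: str, window: int = 3, min_matches: int = 3) -> list[list[bool]]:
    """Dot at (row i, column j) only if the diagonal window starting there has enough matches.

    Rows follow sequence B and columns follow sequence A.
    """
    n_rows = max(len(b) - window + 1, 0)
    n_cols = max(len(a) - window + 1, 0)
    dots = []
    for i in range(n_rows):
        row = []
        for j in range(n_cols):
            # Count identical characters along the diagonal of length `window`.
            matches = sum(b[i + k] == a[j + k] for k in range(window))
            row.append(matches >= min_matches)
        dots.append(row)
    return dots

if __name__ == "__main__":
    for row in windowed_dot_matrix("GGACGTGG", "TTACGTCC", window=4, min_matches=4):
        print("".join("*" if dot else "." for dot in row))
```

Passing the same sequence as both arguments highlights direct repeats as off-diagonal runs of dots.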
Dynamic programming algorithm

➢ The approach compares every pair of characters in the two sequences and generates an alignment that is optimal under the chosen scoring system.
➢ The method can be useful in aligning nucleotide to protein sequences.
➢ The method is computationally demanding: because every pair of characters is compared, the work grows with the product of the two sequence lengths.
➢ New algorithmic improvements as well as increasing computer capacity make it possible to align a query sequence against a large DB in a few minutes.
➢ There are two approaches to dynamic programming: top-down (recursion with stored intermediate results) and bottom-up (filling a table iteratively).
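The two approaches can be compared on the same global alignment score. This is a sketch under assumed scores (match +1, mismatch -1, gap -1); both functions compute the same recurrence and differ only in the order in which subproblems are solved.

```python
from functools import lru_cache

MATCH, MISMATCH, GAP = 1, -1, -1  # assumed illustrative scores

def score_top_down(a: str, b: str) -> int:
    """Top-down: recursion with memoisation of already solved subproblems."""
    @lru_cache(maxsize=None)
    def best(i: int, j: int) -> int:
        # Best global alignment score of a[:i] against b[:j].
        if i == 0:
            return j * GAP
        if j == 0:
            return i * GAP
        s = MATCH if a[i - 1] == b[j - 1] else MISMATCH
        return max(best(i - 1, j - 1) + s, best(i - 1, j) + GAP, best(i, j - 1) + GAP)
    return best(len(a), len(b))

def score_bottom_up(a: str, b: str) -> int:
    """Bottom-up: fill the table row by row, keeping only the previous row."""
    prev = [j * GAP for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * GAP]
        for j in range(1, len(b) + 1):
            s = MATCH if a[i - 1] == b[j - 1] else MISMATCH
            curr.append(max(prev[j - 1] + s, prev[j] + GAP, curr[j - 1] + GAP))
        prev = curr
    return prev[-1]

if __name__ == "__main__":
    print(score_top_down("GATTACA", "GCATGCU"), score_bottom_up("GATTACA", "GCATGCU"))
```

The recursive version is limited by Python's recursion depth for long sequences, which is one practical reason alignment tools usually fill the table iteratively.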
The procedure of the dynamic programming algorithm
➢ The alignment procedure depends upon a scoring system based on the probability that:
1) a particular amino acid pair is found in alignments of related proteins (p_xy);
2) the same amino acid pair is aligned by chance (p_x p_y);
3) introducing a gap would be a better choice because it increases the score.
➢ A substitution matrix is built from the ratio of the first two probabilities. There are many such matrices; two of them, PAM and BLOSUM, are discussed in the next few slides.
➢ The scores for introducing a gap and for extending it are chosen to suit the matrix in use and represent prior knowledge and some assumptions. One assumption is quite simple: if the gap penalty is too high, a reasonable alignment between slightly different sequences will never be achieved, but if it is too low, an optimal alignment is hardly possible. Other assumptions are based on sophisticated statistical procedures.
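As a sketch of how the ratio of the first two probabilities becomes a matrix entry, the function below computes a rounded log-odds score. The probabilities in the usage lines are made-up numbers, and the scaling to half-bit units (`scale=2.0`) is an assumption for illustration.

```python
import math

def log_odds_score(p_xy: float, p_x: float, p_y: float, scale: float = 2.0) -> int:
    """Rounded log-odds score: scale * log2(p_xy / (p_x * p_y)).

    Positive when the pair is seen in related proteins more often than
    expected by chance, negative when it is seen less often.
    """
    return round(scale * math.log2(p_xy / (p_x * p_y)))

if __name__ == "__main__":
    # Hypothetical frequencies: both residues occur with probability 0.1.
    print(log_odds_score(0.04, 0.1, 0.1))    # pair over-represented
    print(log_odds_score(0.0025, 0.1, 0.1))  # pair under-represented
```

Taking the logarithm lets the scores of aligned pairs be added along an alignment instead of multiplying probabilities.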
Word methods
➢Word methods, also known as k-tuple methods, are heuristic
methods that are not guaranteed to find an optimal alignment
solution, but are significantly more efficient than dynamic
programming.

➢ The typical tools that use this method are BLAST and FASTA.
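The core idea of a word method can be sketched as an index of k-letter words in the database sequence, which is then probed with each word of the query. This simplified version finds exact word matches only; it does not reproduce the scoring, extension, or statistics of BLAST or FASTA, and the function names are hypothetical.

```python
from collections import defaultdict

def build_word_index(sequence: str, k: int) -> dict[str, list[int]]:
    """Map every k-letter word in the sequence to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index

def find_seeds(query: str, subject: str, k: int = 3) -> list[tuple[int, int]]:
    """Return (query_pos, subject_pos) pairs where an identical k-letter word occurs."""
    index = build_word_index(subject, k)
    seeds = []
    for q in range(len(query) - k + 1):
        for s in index.get(query[q:q + k], []):
            seeds.append((q, s))
    return seeds

if __name__ == "__main__":
    print(find_seeds("TGKGS", "AGKGA", k=3))
```

Only the regions around such seeds need to be examined further, which is where the speed advantage over full dynamic programming comes from.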
Why do we want to compare sequences?

Evolutionary relationships
• Phylogenetic trees can be constructed based on comparison of the sequences of a molecule (example: 16S rRNA) taken from different species
• Residues conserved during evolution play an important role

Prediction of protein structure and function

• Proteins which are very similar in sequence generally have similar 3D structure and function as well
• By searching a sequence of unknown structure against a database of known proteins, the structure and/or function can in many cases be predicted
BLAST

BLAST (Basic Local Alignment Search Tool) allows rapid sequence comparison of a query sequence against a database.

The BLAST algorithm is fast, accurate, and web-accessible.
Why use BLAST?

BLAST searching is fundamental to understanding the relatedness of any favorite query sequence to other known proteins or DNA sequences.

Applications include
• identifying orthologs and paralogs
• discovering new genes or proteins
• discovering variants of genes or proteins
• investigating expressed sequence tags (ESTs)
• exploring protein structure and function
Four components to a BLAST search

(1) Choose the sequence (query)

(2) Select the BLAST program

(3) Choose the database to search

(4) Choose optional parameters

Then click “BLAST”

Types of Blast searching

• blastp compares an amino acid query sequence against a protein sequence database

• blastn compares a nucleotide query sequence against a nucleotide sequence database

• blastx compares the six-frame conceptual protein translation products of a nucleotide query sequence against a protein sequence database

• tblastn compares a protein query sequence against a nucleotide sequence database translated in six reading frames

• tblastx compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.

by Bob Friedman
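The six-frame translation used by blastx, tblastn, and tblastx can be sketched as three reading frames on the given strand plus three on its reverse complement. The sketch assumes the standard genetic code and plain A/C/G/T input; the frame labels and function names are illustrative, not BLAST's own.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for first, second, and third base.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(dna: str) -> str:
    """Translate complete codons; '*' marks a stop, 'X' an unrecognised codon."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frame_translations(dna: str) -> dict[str, str]:
    """Return translations for frames +1..+3 and -1..-3."""
    dna = dna.upper()
    reverse_complement = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse_complement[offset:])
    return frames

if __name__ == "__main__":
    for frame, protein in six_frame_translations("ATGGCCTAA").items():
        print(frame, protein)
```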
Routine BlastP search

• Query: FASTA formatted text or GenBank ID#
• Protein database
• Run
E value Threshold
• Alignments will be reported with E-values less than or equal to the expect value threshold
• Setting a larger E threshold will result in more reported hits
• Setting a smaller E threshold will result in fewer reported hits

Kerfeld and Scott, PLoS Biology 2011
BlastP parameters

• Restrict by taxonomic group
• Filter repetitive regions
• Statistical cut-off
• Size of words in look-up table
• Similarity matrix (cost of gaps)
BLAST as an Experiment:
Parameters to manipulate in a BLAST search

• Expect
• Word size
• Matrix
• Gap costs
• Filter
• Mask
Kerfeld and Scott, PLoS Biology 2011
Blast databases
• EST - Expressed Sequence Tags; cDNA
• wgs - whole genome shotgun reads
• Reference genome sequences
• NR - non-redundant DNA or amino acid sequence database
• NT - NR database excluding EST, STS, GSS, HTGS
• PDB - DNA or amino acid sequences accompanied by 3D structures
• STS - Sequence Tagged Sites; short genomic markers for mapping
• Swissprot - well-annotated amino acid sequences

• Also, to obtain organism-specific sequence set:

ftp://ftp.ncbi.nih.gov/genomes/
Example of web based BLAST

program: BLASTP
sequence: vma1
gi:137464

BLink provides similar information
E – Value of a Blast Hit
Helps us to determine whether or not an alignment occurs by chance.
Blast Hit

• A match between a word and a database entry.

• A word is a short stretch of the query (3 residues for proteins, 11 nucleotides for DNA).
• Scores are tracked using a scoring matrix (for example, BLOSUM62 or a PAM matrix).
Blast Hit
• The E-value is used as a cut-off to define a BLAST hit.
• The lower the E-value, the more significant the hit.
• Choose a higher E-value cut-off if you expect greater divergence between your sequence and the database.
• The E-value depends on the size of the database.
• An observed E-value of 1 indicates that, in a database of the current size, you could expect to see 1 match with a similar score (S) simply by chance.
• E-value computations also take into account the length of the query sequence, so nearly identical short alignments have inherently high E-values: short query sequences have a greater chance of spurious hits than longer sequences.
Blast Hit
• E-values have to be interpreted in the specific context of the search that generated them. For any given query sequence and database, the lower the E-value, the less likely it is that a score as high as the observed one arose by chance. So people adjust their E-value cutoff based on what they are seeing, and how stringent (or relaxed) they feel they need to be to extract the best information for their particular question of interest.
• Just as with a statistical p-value, there are no hard and fast rules, and cutoffs or thresholds are inherently arbitrary, as one tries to balance stringency with the need to return enough information to be useful.
Blast Hit
• There is an old rule of thumb that an E-value less than 0.1 means that the hit is significant. This was true some 15 years ago, but it no longer holds when you search against a very large databank like the NCBI nr.
• A remark: when you search a small oligonucleotide against a complete genome or transcriptome in order to test the specificity of a primer or siRNA, it is recommended to set the cutoff high (100 or even 1000, de facto not using a cutoff). The reason is that here you really want to find all sequences that give a 100% or close to 100% match; all of them matter, so the probability of finding a hit when searching a random databank of the same size is not relevant.
Blast Hit
• It depends on your sample. If your sample is fully sequenced and available in the database, you can choose 1e-4 to 1e-10 (lower means more confidence). If it is not in the database, you can choose 1e-30. In this case you increase the number of hits and then you can select the best hit based on what you are seeing.
The hit list

• BLAST lists the best matches (hits)

• For each hit, BLAST provides:
• Accession number – links to Genbank flatfile
• Description
• “G” = genome link
• E-value
• An indicator of how good a match to the query sequence
• Score
• Link to an alignment
What is an E-value?

• E-value
• The number of matches with a score this good that would be expected by chance

• The lower the E-value, the more significant the match

• $E = 10^{-4}$ is often used as the cutoff point
• E = 0 is reported when the E-value is too small to display, as for identical or nearly identical sequences
E values: $E = K m N e^{-\lambda S}$
m = query size
K = minor constant
N = database size
λ = constant to adjust for the scoring matrix
S = score of the high-scoring segment pair (HSP)

E (expect) value: Expectation value. The number of alignments with scores equivalent to or better than S that are expected to occur in a database search by chance. The lower the E value, the more significant the score.
• The E value decreases exponentially as the Score (S) that is assigned to a match between
two sequences increases.
• The E value depends on the size of database and the scoring system in use.
• When the Expect value threshold is increased from the default value of 10, more hits can
be reported.

Bit score: The bit score is calculated from the raw score by normalizing with the
statistical variables that define a given scoring system. Therefore, bit scores from
different alignments, even those employing different scoring matrices can be
compared.
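The relationship between raw score, bit score, and E value can be worked through numerically. The sketch below uses the formula E = K m N e^(-λS) and the usual normalisation of the bit score, S' = (λS - ln K) / ln 2, so that E = m N 2^(-S'). The values of λ and K in the usage lines are assumed, illustrative numbers; real searches take them from the scoring system in use and apply length corrections that are omitted here.

```python
import math

def expect_value(raw_score: float, query_len: float, db_len: float,
                 lam: float, k: float) -> float:
    """E = K * m * N * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)

def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Normalised score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)

def expect_from_bits(bits: float, query_len: float, db_len: float) -> float:
    """E = m * N * 2^(-S'); no scoring-system constants are needed."""
    return query_len * db_len * 2.0 ** (-bits)

if __name__ == "__main__":
    lam, k = 0.267, 0.041  # assumed illustrative constants
    m, n = 250, 1e9        # hypothetical query and database sizes
    for s in (50, 100, 150):
        print(s, round(bit_score(s, lam, k), 1), expect_value(s, m, n, lam, k))
```

Doubling the database size doubles E for the same score, which is why the same alignment can look less significant against a larger database.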

Tips:
• Repeated amino acid stretches (e.g. polyglutamine) are unlikely to reflect meaningful similarity between the query and the match.
• If these are present, use BLAST filters to mask low-complexity regions.
• RepeatMasker can be used to mask repeats before running BLAST.
Establishing a significant “hit”

BLAST’s E-value indicates the statistical significance of a sequence match

Karlin S, Altschul SF (1990) Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. PNAS 87:2264-8

The E-value is the expected number of matches (HSPs) found by chance in a database of n sequences
• database size is arbitrary
• multiple testing problem
• E-value calculated from many assumptions
• E-value depends on size of data bank.

Examples:
E-value = 1: expect the match to occur in the database by chance once

E-value = 0.05: expect roughly a 5% chance of the match occurring by chance

E-value = $1 \times 10^{-20}$: strict match between protein domains

BLAST search output: tabular output

High scores
low E values

Cut-off: 0.05? $10^{-10}$?
Why set the E value to 20,000?

Suppose you perform a search with a short query (e.g. 9 amino acids). There are not enough residues to accumulate a big score (or a small E value).

Indeed, a match of 9 out of 9 residues could yield a small score with an E value of 100 or 200. And yet, this result could be “real” and of interest to you.

By setting the E value cutoff to 20,000 you do not change the way the search was done, but you do change which results are reported to you.
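The point that the cutoff only changes what is reported can be shown with a small filter. The hit records below are invented (identifier, E value) pairs, and `report_hits` is a hypothetical helper, not a BLAST function; the default threshold of 10 follows the default mentioned earlier.

```python
def report_hits(hits, e_threshold: float = 10.0):
    """Return hits with E value <= threshold, most significant first.

    The hits themselves are not recomputed; only the reporting changes.
    """
    return sorted((hit for hit in hits if hit[1] <= e_threshold),
                  key=lambda hit: hit[1])

if __name__ == "__main__":
    # Hypothetical results for a short query: the exact 9/9 match has E = 150.
    hits = [("hitA", 1e-50), ("hitB", 150.0), ("hitC", 0.5)]
    print(report_hits(hits))          # default threshold hides hitB
    print(report_hits(hits, 20000))   # relaxed threshold reports it
```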
Sometimes a real match has an E value > 1

real
match?

…try a reciprocal BLAST to confirm
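A reciprocal check can be sketched as follows: the best hit of a query is searched back against the query's own sequence set, and the match is kept only if the original query comes back as the best hit. The two dictionaries stand in for the results of the forward and reverse searches; their contents and the function name are hypothetical.

```python
def reciprocal_best_hits(a_to_b: dict[str, str], b_to_a: dict[str, str]) -> list[tuple[str, str]]:
    """Return (a, b) pairs where b is the best hit of a and a is the best hit of b.

    Each dictionary maps a query identifier to the identifier of its best hit
    in the other sequence set.
    """
    return sorted((a, b) for a, b in a_to_b.items() if b_to_a.get(b) == a)

if __name__ == "__main__":
    forward = {"a1": "b1", "a2": "b2", "a3": "b1"}  # invented forward best hits
    reverse = {"b1": "a1", "b2": "a9"}              # invented reverse best hits
    print(reciprocal_best_hits(forward, reverse))
```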

Database searching: E-values in BLAST

BLAST uses precomputed extreme value distributions to calculate E-values from alignment scores

For this reason BLAST only allows certain combinations of substitution matrices and gap penalties.

This also means that the fit is based on a different data set than the one you are working on.

A word of caution: BLAST tends to overestimate the significance of its matches

E-values from BLAST are fine for identifying sure hits

One should be careful using BLAST’s E-values to judge whether a marginal hit can be trusted (e.g., you may want to use E-value cutoffs of $10^{-4}$ to $10^{-5}$).
