Introduction to Bioinformatics online course: IBT

Bioinformatics resources and databases:


Lecture 3: DNA sequence analysis
Nicola Mulder

Learning Objectives

• Objective: basic DNA sequence analysis, i.e. finding sequence features
• Sub objectives:
– Understand how to extract a DNA sequence from the database
– Use online or local tools for simple DNA sequence analysis: finding features on the sequence and their applications

Learning Outcomes

• Understand how to find a DNA sequence and save it in the correct format
• Identify features on the sequence such as coding regions, restriction enzyme sites, etc.
• Design primers for amplification of a DNA sequence
• Interpret sequence analysis results and understand the biological impact of functional regions

Two major components to Bioinformatics
• Storing and retrieving data:
– Biological databases
– Querying these to retrieve data
• Manipulating the data with tools, e.g.:
– Finding features on sequences
– Sequence similarity searches
– Protein families and function prediction
– Comparing sequences (phylogenetics)
– Etc.

Aspects of sequence analysis

[Figure: a DNA sequence annotated with a regulatory region, promoter, transcription start, protein coding region (CDS) and stop codon, with the analyses that apply at each level]
• DNA sequence: gene and promoter prediction; restriction mapping for cloning; primer design for PCR
• RNA sequence: RNA secondary structure, gene expression
• Protein sequence: protein sequence analysis

Overview
• Assume sequence is retrieved from the database
• General text/format manipulation and accession numbers
• DNA sequences
– Restriction analysis
– Primer design
– Finding features: coding and non-coding
– Gene prediction
• RNA sequence analysis
– Summary of kinds of analyses possible

Sequence formats: Fasta

> [title]
[sequence]

>seq1
GGAAAATTAGATGCATGGGAAAAAATTA
GGAAAATTAGACAAATGGGAAAAAATTA
>seq2
AAGTCCCTGGATTTACCCAATGCAGTCGA
CATCGCATTT
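
To make the layout concrete, here is a minimal sketch of reading FASTA text in Python. It assumes only what the slide shows: a title line starting with ">" followed by one or more sequence lines. The function name parse_fasta is chosen for illustration and does not come from any particular library.

```python
def parse_fasta(text):
    """Return a dict mapping each FASTA title to its joined sequence."""
    records = {}
    title = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record begins: everything after ">" is the title.
            title = line[1:].strip()
            records[title] = []
        elif title is not None:
            # Sequence lines are collected and joined at the end.
            records[title].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


example = """>seq1
GGAAAATTAGATGCATGGGAAAAAATTA
GGAAAATTAGACAAATGGGAAAAAATTA
>seq2
AAGTCCCTGGATTTACCCAATGCAGTCGA
CATCGCATTT
"""

records = parse_fasta(example)
for name, sequence in records.items():
    print(name, len(sequence), "bp")
```

Sequence lines that belong to the same record are simply concatenated, so line wrapping in the file does not affect the sequence.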

Sequence formats: GenBank
LOCUS 525-42 1588 bp
DEFINITION 525-42 1588 bp
TITLE 525-42
FEATURES Location/Qualifiers
exon 39..70
/note="exon1 is believed to have an alternative splice donor site"
ORIGIN

1 ATGTT AAGAG GGGGA AAATT AGATG CATGG GAAAA AATTA GGTTA AGGCC
51 AGGGG GAAAG AAATG CTATA NGATA AAACA CCTAG TATGG GCAAG CAGGG
101 AGCTG GAAAG ATTTG CACTT AACCC TGGCC TTTTA GAGAC ATCAG ANGGC
151 TGTAA ACAAA TAATG NAACA GATAC AACCA GCTCT TCAGA CAGGA ACAGA

Converting between sequence formats (save options)
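
As a rough illustration of such a conversion, the sketch below pulls the sequence out of the ORIGIN block of a GenBank-style record and writes it as FASTA. It is a simplification under stated assumptions: it ignores all annotation, treats every letter after ORIGIN as sequence, and stops at the "//" terminator if one is present. The function name and the 60-character line width are illustrative choices, not part of any tool mentioned in the lecture.

```python
def genbank_origin_to_fasta(text, title, width=60):
    """Convert the ORIGIN block of a GenBank-style record to FASTA text."""
    in_origin = False
    residues = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("ORIGIN"):
            in_origin = True
            continue
        if stripped.startswith("//"):
            break  # end-of-record marker
        if in_origin:
            # Keep letters only: drops position numbers and spacing.
            residues.extend(ch for ch in stripped if ch.isalpha())
    sequence = "".join(residues).upper()
    wrapped = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + title + "\n" + "\n".join(wrapped) + "\n"


record = """LOCUS 525-42 1588 bp
ORIGIN
1 ATGTT AAGAG GGGGA AAATT AGATG CATGG GAAAA AATTA GGTTA AGGCC
51 AGGGG GAAAG AAATG CTATA NGATA AAACA CCTAG TATGG GCAAG CAGGG
//
"""

fasta = genbank_origin_to_fasta(record, "525-42")
print(fasta)
```

For real records an established parser is a safer choice, since the feature table and qualifiers are lost in this shortcut.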

DNA sequence composition

• Nucleotide composition (% GC vs AT content)
• GC base pairs (three hydrogen bonds) are more stable than AT base pairs (two hydrogen bonds)
• Applications:
– Horizontal gene transfer analysis
– Gene prediction
– Primer design
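
GC content is simple to compute directly. The sketch below is one possible way to do it; it assumes that ambiguous bases such as N should be left out of the percentage, and the sliding-window helper (window and step sizes are arbitrary here) only hints at how local composition could be profiled along a sequence.

```python
def gc_content(sequence):
    """Percentage of G and C among the unambiguous bases (A, C, G, T)."""
    counted = [base for base in sequence.upper() if base in "ACGT"]
    if not counted:
        return 0.0
    gc = sum(1 for base in counted if base in "GC")
    return 100.0 * gc / len(counted)


def windowed_gc(sequence, window=20, step=10):
    """GC percentage in successive windows, as (start, percent) pairs."""
    return [
        (start, gc_content(sequence[start:start + window]))
        for start in range(0, max(len(sequence) - window + 1, 1), step)
    ]


print(gc_content("GGAAAATTAGATGCATGGGAAAAAATTA"))
print(windowed_gc("GGAAAATTAGATGCATGGGAAAAAATTAGGAAAATTAGACAAATGGGAAAAAATTA"))
```

A region whose GC content differs markedly from the rest of a genome is one of the signals used when looking for horizontally transferred genes.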

Accession numbers

• GenBank/EMBL/DDBJ: 1 letter & 5 digits, e.g.: U12345, or 2 letters & 6 digits, e.g.: AY123456
• GenPept sequence records: 3 letters & 5 digits, e.g.: AAA12345
• UniProt: 6 characters in total: [A,B,O,P,Q] [0-9] [A-Z,0-9] [A-Z,0-9] [A-Z,0-9] [0-9], e.g.: P12345 and Q9JJS7
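
The formats listed above can be written as regular expressions. The sketch below encodes only the patterns on this slide; the databases accept further formats that are not covered here, and the labels and function name are illustrative.

```python
import re

# Patterns transcribed from the slide; not a complete description of
# every accession format the databases use.
PATTERNS = {
    "GenBank/EMBL/DDBJ nucleotide": re.compile(r"[A-Z]\d{5}|[A-Z]{2}\d{6}"),
    "GenPept protein": re.compile(r"[A-Z]{3}\d{5}"),
    "UniProt": re.compile(r"[ABOPQ]\d[A-Z0-9]{3}\d"),
}


def classify_accession(accession):
    """Return the names of all patterns that the accession fully matches."""
    return [
        name for name, pattern in PATTERNS.items()
        if pattern.fullmatch(accession)
    ]


for acc in ["U12345", "AY123456", "AAA12345", "P12345", "Q9JJS7"]:
    print(acc, classify_accession(acc))
```

Note that P12345 fits both the one-letter nucleotide pattern and the UniProt pattern, so the format alone does not always identify the source database.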

Cross-referencing identifiers

• So many different IDs for the same thing, e.g. Ensembl, EMBL, HGNC, UniGene, UniProt, Affy ID, etc.
• Need mapping files to move between them to avoid having to parse every entry
• UniProt website mapper (www.uniprot.org)
• PICR (http://www.ebi.ac.uk/Tools/picr/) enables mapping between IDs

Example conversion

DNA sequence analysis

• Restriction analysis, e.g. for cloning: looks for recognition sites
• Primer design
• Finding features on a sequence
• Gene prediction:
– Translation
– Promoter prediction

Bioinformatics and cloning

• Retrieving sequence of interest
• Identifying restriction enzyme (RE) sites
• Matching these to RE sites in cloning vector

Restriction enzyme analysis

• Restriction enzymes recognize specific, defined 4 to 8 base pair sequences on DNA and cut the DNA at those sites

Microorganism             Enzyme  Recognition site (space = cut)  Ends produced
Haemophilus aegyptius     HaeIII  5'...GG CC...3'                 Blunt end
                                  3'...CC GG...5'
Haemophilus haemolyticus  HhaI    5'...GCG C...3'                 3' single-strand overhang
                                  3'...C GCG...5'
Escherichia coli          EcoRI   5'...G AATTC...3'               5' single-strand overhang
                                  3'...CTTAA G...5'
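
Finding recognition sites is a plain string search. The sketch below is a minimal illustration for the three enzymes in the table; the cut offsets (number of bases to the left of the cut on the top strand) follow the cut positions shown there. Because all three sites are palindromic, searching one strand is enough. The function names and the test sequence are made up for this example.

```python
# Recognition site and top-strand cut offset for each enzyme in the table.
ENZYMES = {
    "HaeIII": ("GGCC", 2),    # GG^CC, blunt ends
    "HhaI": ("GCGC", 3),      # GCG^C, 3' overhang
    "EcoRI": ("GAATTC", 1),   # G^AATTC, 5' overhang
}


def find_cut_positions(sequence, enzymes=ENZYMES):
    """Map each enzyme name to the 0-based cut positions on the top strand."""
    seq = sequence.upper()
    hits = {}
    for name, (site, offset) in enzymes.items():
        positions = []
        start = seq.find(site)
        while start != -1:
            positions.append(start + offset)
            start = seq.find(site, start + 1)  # allow overlapping sites
        hits[name] = positions
    return hits


def digest(sequence, cut_positions):
    """Split the top strand at the given cut positions."""
    bounds = [0] + sorted(cut_positions) + [len(sequence)]
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:])]


dna = "AAGAATTCTTGGCCAA"  # hypothetical test sequence
cuts = find_cut_positions(dna)
print(cuts)
print(digest(dna, cuts["EcoRI"] + cuts["HaeIII"]))
```

A restriction map is essentially this list of cut positions drawn along the sequence.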

Restriction map

Removing vector sequence

• Vector contamination can be identified by searching your sequence against a database of vector sequences (UniVec), e.g. with VecScreen (http://www.ncbi.nlm.nih.gov/VecScreen/VecScreen.html), which uses BLASTN
• Vector sequence should occur only at the extremities of your sequence; a match inside the insert indicates contamination

PCR and primer design

• Primers can be engineered to include restriction sites
• Primers should be similar in length and Tm (melting temperature)
• Should amplify only the required piece from the genome
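
A quick check of the "similar length and Tm" rule can be sketched as follows. The melting temperature here uses the simple Wallace rule, Tm = 2(A+T) + 4(G+C) in degrees Celsius, which is only a rough estimate for short oligonucleotides; primer design tools such as Primer-BLAST use more detailed models. The tolerance values and the two primer sequences are arbitrary assumptions for illustration.

```python
def wallace_tm(primer):
    """Rough melting temperature in degrees C: 2 per A/T plus 4 per G/C."""
    p = primer.upper()
    at = p.count("A") + p.count("T")
    gc = p.count("G") + p.count("C")
    return 2 * at + 4 * gc


def primers_compatible(forward, reverse, max_len_diff=3, max_tm_diff=5):
    """True if the two primers are similar in length and estimated Tm.

    The default tolerances are illustrative, not recommended values.
    """
    similar_length = abs(len(forward) - len(reverse)) <= max_len_diff
    similar_tm = abs(wallace_tm(forward) - wallace_tm(reverse)) <= max_tm_diff
    return similar_length and similar_tm


fwd = "ATGCATGCATGCATGCAT"   # hypothetical forward primer
rev = "GCATGCATGCATGCATGC"   # hypothetical reverse primer
print(wallace_tm(fwd), wallace_tm(rev), primers_compatible(fwd, rev))
```

Specificity, the third point on the slide, cannot be judged from the primers alone; it requires searching them against the genome.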

Example with Primer BLAST

Gene Prediction
Wikipedia: A gene is a locatable region of genomic sequence, corresponding to a unit of inheritance, which is associated with regulatory regions, transcribed regions and/or other functional sequence regions

• Look for gene structures
• Move along sequence looking for coding regions and intergenic regions
• Check reading frame (translate)
• Look for promoters and poly-adenylation signals
• In eukaryotes look for introns and exons
• Use EST or BLAST support (reduces pseudogene predictions)

Translation

• Can choose frame if you know it
• Otherwise 6-frame translation:
– Choose start codon ATG
– Otherwise lists all codons between stop codons
• Results: usually the longest ORF starting with Met and ending in a stop codon, with no stop codons inside
• Can confirm this with promoter prediction
• Should use appropriate codon usage table
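
Translation in a known frame can be sketched in a few lines. The example assumes the standard genetic code (built here from the usual TCAG-ordered amino acid string); organisms or organelles with a different code would need a different table. Stop codons are written as "*", and any codon containing an ambiguous base is written as "X". The function name is illustrative.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(sequence, frame=1):
    """Translate a DNA sequence in forward frame 1, 2 or 3."""
    seq = sequence.upper()
    protein = []
    for i in range(frame - 1, len(seq) - 2, 3):
        protein.append(CODON_TABLE.get(seq[i:i + 3], "X"))
    return "".join(protein)


dna = "AGTCGGCTGACTGCGTTTACGAATGCGATTACTCCCTT"
for frame in (1, 2, 3):
    print(frame, translate(dna, frame))
```

Changing the frame by one base gives a completely different protein, which is why the frame has to be known or all frames have to be tried.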

Open reading frame

• String of in-frame triplets of bases (codons), each specifying an amino acid
• Starts with ATG (Met) or, less commonly, an alternative start codon such as GTG (normally Val)
• Ends with a stop codon
• One base insertion or deletion puts the sequence out of frame (frameshift)

Genetic code

• Each amino acid is specified by a triplet of 3 bases (a codon)
• 4 bases (A, C, G, T) give 4 x 4 x 4 = 64 possible codons: 61 amino acid codons + 3 stop codons
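
The codon count can be checked by enumeration. This short sketch lists every triplet of the four bases and separates out the three stop codons of the standard genetic code (TAA, TAG and TGA).

```python
from itertools import product

# All ordered triplets of the four DNA bases.
codons = ["".join(triplet) for triplet in product("ACGT", repeat=3)]

# Stop codons in the standard genetic code.
stop_codons = {"TAA", "TAG", "TGA"}
sense_codons = [codon for codon in codons if codon not in stop_codons]

print(len(codons), "codons in total")
print(len(sense_codons), "codons specify amino acids")
print(len(stop_codons), "stop codons")
```

Since 61 codons encode only 20 amino acids, most amino acids are specified by more than one codon.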

Translating sequences

• 6 possible reading frames, 3 in each direction

Forward strand (frames +1, +2, +3):
AGTCGGCTGACTGCGTTTACGAATGCGATTACTCCCTT
+1: Ser Arg Leu Thr ...
+2: Val Gly Stop ...
+3: Ser Ala Asp ...

Reverse complement (frames -1, -2, -3):
AAGGGAGTAATCGCATTCGTAAACGCAGTCAGCCGACT
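
The six reading frames can be generated programmatically. The sketch below assumes the standard genetic code and uses one-letter amino acid codes with "*" for a stop codon; the function names are illustrative. It reproduces the reverse complement shown on the slides.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product("TCAG", repeat=3), AMINO_ACIDS)
}


def reverse_complement(sequence):
    """Complement each base, then reverse the strand."""
    return sequence.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(sequence, offset):
    """Translate starting at a 0-based offset; unknown codons become X."""
    return "".join(
        CODON_TABLE.get(sequence[i:i + 3], "X")
        for i in range(offset, len(sequence) - 2, 3)
    )


def six_frame_translation(sequence):
    """Return a dict with keys +1, +2, +3, -1, -2, -3."""
    forward = sequence.upper()
    frames = {}
    for sign, strand in (("+", forward), ("-", reverse_complement(forward))):
        for offset in range(3):
            frames[sign + str(offset + 1)] = translate(strand, offset)
    return frames


dna = "AGTCGGCTGACTGCGTTTACGAATGCGATTACTCCCTT"
frames = six_frame_translation(dna)
for name, protein in frames.items():
    print(name, protein)
```

In one-letter code, frame +1 begins SRL (Ser Arg Leu), frame +2 begins VG* (Val Gly Stop) and frame +3 begins SAD (Ser Ala Asp), matching the slides.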

Getting the final protein

• Six-frame translation
• Find the longest ORF with an initiation site and start codon, ending with a stop codon
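
One possible way to pick the longest ORF from a six-frame translation is sketched below. It assumes the standard genetic code, accepts only ATG (Met) as a start and requires a terminating stop codon. It does not look for an initiation site such as a ribosome binding site, so it is a simplification of what ORF-finding tools do. The function names and the test sequence are made up for this example.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product("TCAG", repeat=3), AMINO_ACIDS)
}


def reverse_complement(sequence):
    return sequence.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(sequence, offset):
    return "".join(
        CODON_TABLE.get(sequence[i:i + 3], "X")
        for i in range(offset, len(sequence) - 2, 3)
    )


def longest_orf(sequence):
    """Longest protein starting with M and ending at a stop, over 6 frames."""
    forward = sequence.upper()
    best = ""
    for strand in (forward, reverse_complement(forward)):
        for offset in range(3):
            # Every segment except the last is followed by a stop codon.
            segments = translate(strand, offset).split("*")[:-1]
            for segment in segments:
                start = segment.find("M")
                if start != -1 and len(segment) - start > len(best):
                    best = segment[start:]
    return best


print(longest_orf("CCATGAAATTTGGGTAACC"))  # hypothetical test sequence
```

An empty string is returned when no frame contains an ATG followed by an in-frame stop codon.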

Gene prediction: bacteria

Promoter -> Start codon -> CDS -> Stop codon

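Complex Eukaryotic systems

• Promoter region: many transcription factor binding sites (TFBS), found with pattern matching
• Gene structure: Exon 1, Intron 1, Exon 2, Intron 2, Exon 3, with a splice junction at each exon/intron boundary
• Alternative splicing: the same gene can give different transcripts, e.g. Exon 1 + Exon 2 + Exon 3 or Exon 2 + Exon 3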

Human introns and exons

Introns are much larger than exons; introns can represent up to 95% of a gene

Gene prediction in eukaryotes
• Identifying features (sometimes by position-specific scoring matrices, PSSMs):
– splice sites
– start and stop sites
• Predict exons based on these signals
• Score exons based on signals and exon characteristics (coding sequences may have compositional biases)
• Use composition and homology information
• Assemble components into predicted gene structure
• Some methods use hidden Markov models (HMMs), where features are states
• Use EST info
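
To show how a PSSM scores a signal, here is a toy sketch. It builds a log-odds matrix from three 9-base windows around exon/intron boundaries (3 exon bases followed by 6 intron bases) taken from the genomic sequence in this lecture's EST alignment example, then slides the matrix along a sequence. Three sites are far too few for a real model; the pseudocount of 1, the uniform background of 0.25 and the function names are assumptions made for illustration.

```python
import math

# Windows spanning three exon/intron boundaries in the lecture's
# EST-versus-genomic alignment example (3 exon + 6 intron bases each).
DONOR_SITES = ["AAGGTGAGT", "AAGGTACTT", "AAGGTAAAC"]


def build_pssm(aligned_sites, pseudocount=1.0, background=0.25):
    """One dict of log2-odds scores per alignment column."""
    pssm = []
    for position in range(len(aligned_sites[0])):
        column = [site[position] for site in aligned_sites]
        total = len(column) + 4 * pseudocount
        scores = {}
        for base in "ACGT":
            frequency = (column.count(base) + pseudocount) / total
            scores[base] = math.log2(frequency / background)
        pssm.append(scores)
    return pssm


def score_window(pssm, window):
    """Sum of column scores; bases other than A, C, G, T score 0."""
    return sum(column.get(base, 0.0) for column, base in zip(pssm, window))


def best_hit(pssm, sequence):
    """Return (score, start) of the highest-scoring window."""
    seq = sequence.upper()
    width = len(pssm)
    return max(
        (score_window(pssm, seq[i:i + width]), i)
        for i in range(len(seq) - width + 1)
    )


pssm = build_pssm(DONOR_SITES)
genomic = "GACGGAAAGGTGAGTTCAGTTTC"
score, start = best_hit(pssm, genomic)
print(round(score, 2), start, genomic[start:start + len(pssm)])
```

Gene predictors combine scores like this for several signals with composition and homology evidence rather than relying on one matrix.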

Using EST data: mRNA against genomic sequence
exon
CONTIG --------------------------------------------------------------------------------------CGANGGCCTATCAACAATGAAAGGTCGAAACCTG
Genomic AGCTACAAACAGATCCTTGATAATTGTCGTTGATTTTACTTTATCCTAAATTTATCTCAAAAATGTTGAAATTCAGATTCGTCAAGCGAGGGCCTATCAACAATG-AAGGTCGAAACCTG

exon *** ************ ** * **************

CONTIG CGTTTACTCCGGATACAAGATCCACCCAGGACACGGNAAAGAGACTTGTCCGTACTGACGGAAAG-------------------------------------------------------
Genomic CGTTTACTCCGGATACAAGATCCACCCAGGACACGG-AAAGAGACTTGTCCGTACTGACGGAAAGGTGAGTTCAGTTTCTCTTTGAAAGGCGTTAGCATGCTGTTAGAGCTCGTAAGGTA

intron
************************************ ****************************

CONTIG ------------------------------------------------------------------------------------------------------------------------
Genomic TATTGTAATTTTACGAGTGTTGAAGTATTGCAAAAGTAAAGCATAATCACCTTATGTATGTGTTGGTGCTATATCTTCTAGTTTTTAGAAGTTATACCATCGTTAAGCATGCCACGTGTT

CONTIG ----------------------------------------------GTCCAAATCTTCCTCAGTGGAAAGGCACTCAAGGGAGCCAAGCTTCGCCGTAACCCACGTGACATCAGATGGAC
Genomic GAGTGCGACAAACTACCGTTTCATGATTTATTTATTCAAATTTCAGGTCCAAATCTTCCTCAGTGGAAAGGCACTCAAGGGAGCCAAGCTTCGCCGTAACCCACGTGACATCAGATGGAC
exon **************************************************************************
intron

exon
CONTIG TGTCCTCTACAGAATCAAGAACAAGAAG---------------------------------------------GGAACCCACGGACAAGAGCAAGTCACCAGAAAGAAGACCAAGAAGTC
Genomic TGTCCTCTACAGAATCAAGAACAAGAAGGTACTTGAGATCCTTAAACGCAGTTGAAAATTGGTAATTTTACAGGGAACCCACGGACAAGAGCAAGTCACCAGAAAGAAGACCAAGAAGTC
**************************** ***********************************************

CONTIG CGTCCAGGTTGTTAACCGCGCCGTCGCTGGACTTTCCCTTGATGCTATCCTTGCCAAGAGAAACCAGACCGAAGACTTCCGTCGCCAACAGCGTGAACAAGCCGCTAAGATCGCCAAGGA
Genomic CGTCCAGGTTGTTAACCGCGCCGTCGCTGGACTTTCCCTTGATGCTATCCTTGCCAAGAGAAACCAGACCGAAGACTTCCGTCGCCAACAGCGTGAACAAGCCGCTAAGATCGCCAAGGA
************************************************************************************************************************

CONTIG TGCCAACAAGGCTGTCCGTGCCGCCAAGGCTGCTNCCAACAAG-----------------------------------------------------------------------------
Genomic TGCCAACAAGGCTGTCCGTGCCGCCAAGGCTGCTGCCAACAAGGTAAACTTTCTACAATATTTATTATAAACTTTAGCATGCTGTTAGAGCTTGTAAGGTATATGTGATTTTACGAGTGT
********************************** ********

intron
CONTIG -------------------------------------------------------------------------------------------------------------------GNAAA
Genomic GTTATTTGAAGCTGTAATATCAATAAGCATGTCTCGTGTGAAGTCCGACAATTTACCATATGCATGAAATTTAAAAACAAGTTAATTTTGTCAATTCTTTATCATTGGTTTTCAGGAAAA
exon * ***

CONTIG GAAGGCCTCTCAGCCAAAGACCCAGCAAAAGACCGCCAAGAATNTNAAGACTGCTGCTCCNCGTGTCGGNGGAAANCGA TAAACGTTCTCGGNCCCGTTATTGTAATAAATTTTGTTGAC
Genomic GAAGGCCTCTCAGCCAAAGACCCAGCAAAAGACCGCCAAGAATGTGAAGACTGCTGCTCCACGTGTCGGAGGAAAGCGA TAAACGTTCTCGGTCCCGTTATTGTAATAAATTTTGTTGAC
******************************************* * ************** ******** ***** **** * *********** ***************************

CONTIG C-----------------------------------------------------------------------------------------------------------------------
Genomic CGTTAAAGTTTTAATGCAAGACATCCAACAAGAAAAGTATTCTCAAATTATTATTTTAACAGAACTATCCGAATCTGTTCATTTGAGTTTGTTTAGAATGAGGACTCTTCGAATAGCCCA
*

Gene Prediction software

• GeneMark: gene prediction for prokaryotes, eukaryotes and viruses (http://opal.biology.gatech.edu/GeneMark/)
• GENSCAN: for vertebrate, maize and Arabidopsis sequences (http://genes.mit.edu/GENSCAN.html)
• Microbial Gene Prediction System (http://compbio.ornl.gov/generation/)
• Glimmer: bacteria, archaea and viruses (http://www.tigr.org/software/glimmer/)
• GRAIL: for eukaryotes, includes splice info, homology, etc. (http://compbio.ornl.gov/grailexp/)

Other translators and promoter prediction
• NCBI ORF Finder (http://www.ncbi.nlm.nih.gov/gorf/gorf.htm)
• Promoter 2.0 Prediction Server (http://www.cbs.dtu.dk/services/Promoter/)
• McPromoter MM:II (http://genes.mit.edu/McPromoter.html)
• BPROM: prediction of bacterial promoters, etc.

RNA sequence analysis

• Many different types of RNA, e.g. tRNA, rRNA, mRNA, etc.
• Some have catalytic activities, e.g. ribozymes
• Many new programs for identification of non-coding RNAs, miRNAs, etc. and their targets
• Secondary structure of RNA is important for stability and often function
• RNA levels are important for final protein levels; they are measured to estimate gene expression, e.g. with ESTs and microarrays

Summary and conclusions

• Basic sequence analysis is finding features on a sequence
• This could be small features
– Restriction sites -> cloning
– Primer sites -> PCR
• Or combinations of features:
– Gene signals -> gene prediction
• Features are found by the nature of their “conservation” or by pattern matching
