Bioinformatics: Intended Learning Outcomes


Bioinformatics

Intended Learning Outcomes:


1. Define Bioinformatics
2. Explain the importance of Bioinformatics
3. Describe the major nucleotide and protein databases
4. Design a primer using Primer3 and Primer-BLAST
5. Describe how to search for the sequence of a given gene

Introduction
- Millions of nucleic acid sequences have been stored in data banks
-- where they’re organized by applying computer and information technology.
- Thus, the field of Bioinformatics emerged.

- biological informatics
a) application of information technology to the field of molecular biology.
b) storing, retrieving, and analyzing large amounts of biological information.
- Like an archive of biological information
- involves the establishment of databases, algorithms, computational and statistical
techniques.
- highly interdisciplinary field
- strives to advance our knowledge of the biological system
-- enable us to interpret biological processes
-- utilize such knowledge in various applications.

Why Do We Need Bioinformatics?


- provides tools to help molecular biologists interpret data scientifically.
- helps understanding of biological processes.
-- tools help interpret data into meaningful biological information.
-- information is correlated to principles that are common in biological systems
- in silico
-- simulated experiment on a computer
-- saves time and resources
-- Based on the results, a real experiment can be designed to validate the computational results.

Applications of Bioinformatics
- various applications and fields:
FAR EASTERN UNIVERSITY - NICANOR REYES MEDICAL FOUNDATION
SCHOOL OF MEDICAL LABORATORY SCIENCE
Joeperl C. Verdadero
MOL BIO
1. Sequence alignment and analysis
2. Mapping and analyzing DNA, RNA, protein, amino acid, and lipid sequences
3. Creation and visualization of 3-D structure models for biological molecules of significance, e.g., proteins
4. Genome annotation
5. Genetic diseases
6. Designer / Personalized Medicine
7. Drug development
8. Microbial genome applications
9. Molecular medicine
10. Gene therapy
11. Antibiotic resistance
12. Evolutionary studies
13. Waste cleanup
14. Biotechnology
15. Climate change studies
16. Alternative energy sources
17. Crop improvement
18. Forensic analysis
19. Bio-weapon creation
20. Insect resistance
21. Improve nutritional quality
22. Veterinary science

MAJOR BIOINFORMATICS DATABASES


- repositories, place for storage
- to collect, archive, visualize, and organize data
-- enable intelligent data description, interpretation, discovery, or retrieval.
Nucleic Acid Database
- primary databases that belong to the International Nucleotide Sequence Database Collaboration (INSDC).
- have annotated collection of all publicly available DNA sequence data from all organisms.
1. GenBank
- genetic sequence database (NIH, U.S.A.)
2. EMBL
- European Molecular Biology Laboratory
- leading laboratory for research in molecular biology of Europe
3. DDBJ
- DNA Data Bank of Japan

Protein Database
- contain information about experimentally determined 3D data of macromolecules
-- proteins and nucleic acids
- data obtained by
1. X-ray crystallography
2. NMR spectroscopy
3. cryo-electron microscopy
- Worldwide Protein Data Bank (wwPDB) is the internationally recognized sole repository
- manages and ensures that the PDB is freely and publicly available
1. PDBj (Japan)
2. PDBe (Europe)
3. RCSB PDB (USA)

Other Database
1. Gene Expression Databases
- quantity of gene transcription products
- mostly microarray data.
2. Phenotypic databases
- genetic variants and their associated observable characteristics
-- disease-causing mutations
-- resulting diseases or phenotypic abnormalities.
3. RNA db
4. Amino Acid/Protein db
5. Protein-Protein and Other Molecular Interactions
6. Signal Transduction Pathway db
7. Metabolic Pathway and Protein Function Databases
8. Bacterial DNA Databases

Bioinformatics activities

1. Gene Mapping and analyzing DNA, RNA, and protein sequences.


- locate the position of genes
- shows details of gene location relative to another in the
chromosome.

2. Sequence alignment and analysis
- way of arranging DNA, RNA, or protein sequences to identify regions of similarity
-- suggests functional, structural, or evolutionary relationships between the sequences.
- helps analyze an amplified nucleotide sequence or the sequence of an isolated protein
-- by comparing it with previously published sequences
A) BLAST by NIH
B) MUSCLE or MAFFT - nucleic acids
C) Clustal Omega - proteins.

3. Creation and visualization of 3-D structure models for biological molecules of significance, e.g., proteins.
- an isolated unknown protein’s structure or function may be predicted by comparing it with an experimental 3D structure of a related homologous protein by Homology Modeling
A) PyMOL by Warren Lyford DeLano

4. Genome annotation
- identifying locations of genes & all coding regions in a genome
- determine gene function
- Gene: consists of enough DNA to code for 1 protein
- Genome: the sum total of an organism’s DNA.
- Annotation
-- note added, explanation or commentary
- feature information:
1. CDS - coding sequence
2. Coding region intervals, includes start & stop codon (if present)
3. Protein name
4. Gene name
5. Amino acid sequence
E.g., Escherichia coli CFT073 (organism & gene of interest)
1. CDS
2. Coding region intervals: 190..255
3. Protein name: Threonine, leader peptide
4. Gene name: thrL
5. Amino acid sequence: "MKRISTTITTTITITTGNGAG"
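
The thrL annotation above can be sanity-checked with a little arithmetic: an interval that includes the stop codon should span three nucleotides per amino acid plus three more for the stop. The sketch below is only an illustration; the function name `cds_length_matches` and the `has_stop` flag are hypothetical helpers, not part of any database tool.

```python
def cds_length_matches(start, end, protein, has_stop=True):
    """Check that a 1-based, inclusive coding interval fits a protein length.

    Assumes the interval is a single uninterrupted block (no introns).
    """
    nt_len = end - start + 1          # both interval ends are counted
    if nt_len % 3 != 0:               # a coding region is a whole number of codons
        return False
    codons = nt_len // 3
    expected_aa = codons - 1 if has_stop else codons  # stop codon adds no residue
    return expected_aa == len(protein)


# thrL example from the notes: 190..255 is 66 nt = 22 codons = 21 residues + stop
print(cds_length_matches(190, 255, "MKRISTTITTTITITTGNGAG"))
```

The same check fails if the interval is off by even one base, which makes it a quick way to catch a mistyped coordinate.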

FINDING GENE SEQUENCES USING ONLINE DATABASES
1. Access the GenBank website by NIH (USA).
2. Choose “nucleotide” in the drop-down menu.
3. If a sequence identifier is available (e.g., GI: 26111730 for Escherichia coli CFT073, complete genome), it can be typed in the Search box.

Complete information about the Escherichia coli CFT073 genome is given:
- 5,231,428 bp in a circular DNA.
- Accession Number: AE014075.1
- GI (GenInfo Identifier): 26111730
- Annotations:
-- source and related publications
-- list of gene names
-- start and end sites (numbered) within the genome
-- gene function
-- cds (coding DNA sequence)

4. Search for your gene of interest and click on the gene link; a new page will appear that contains the gene sequence and other information.
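
The same lookup can also be done without the web page, through NCBI's E-utilities `efetch` service. The sketch below is a minimal illustration, not a replacement for the steps above. It only builds the request URL for a region of the CFT073 record. The helper names `build_efetch_url` and `fetch_fasta` are hypothetical, and `fetch_fasta` needs an internet connection, so it is defined but not called here.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, start=None, stop=None):
    """Build an efetch URL that returns a nucleotide record as FASTA text."""
    params = {
        "db": "nuccore",       # the nucleotide database
        "id": accession,
        "rettype": "fasta",
        "retmode": "text",
    }
    if start is not None and stop is not None:
        params["seq_start"] = start   # 1-based first base to return
        params["seq_stop"] = stop     # 1-based last base to return
    return EFETCH + "?" + urlencode(params)


def fetch_fasta(accession, start=None, stop=None):
    """Download the FASTA text (requires network access; not run in this sketch)."""
    with urlopen(build_efetch_url(accession, start, stop)) as resp:
        return resp.read().decode("utf-8")


# Region 190..255 of the E. coli CFT073 genome (the thrL coding interval)
print(build_efetch_url("AE014075.1", 190, 255))
```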
BLAST
- (Basic Local Alignment Search Tool)
- used if nucleotide sequence is available but without annotation
- finds regions of similarity between biological sequences
- compares sequences (query sequence to database sequence)
-- calculates the statistical significance
- designed to identify local regions of sequence similarity.
- may report multiple discrete regions of sequence similarity

1. Go to NCBI
2. Click on “Resources”, choose “DNA & RNA” at the drop-down menu, and choose “BLAST”.
3. On the page that appears, click on “Nucleotide BLAST”.
4. Paste your sequence at the box for Enter Query Sequence. FASTA sequence or Accession number may be entered.
FASTA: text-based format for a nucleic acid or protein sequence
- nucleotides or amino acids are represented using single-letter codes.
Accession number: unique identifier given to a DNA or protein sequence record
- allows tracking of different versions of that sequence record and the associated sequence over time in a single data repository
5. Click ‘BLAST’ box below
-- Some parameters may be modified or specified.

Default BLAST Report
A. Graphical Summary
- color of the boxes corresponds to the score (S) of the alignment
- colored boxes represent alignments in the database that match the Query sequence
-- Red bar: highest alignment scores
-- the higher the alignment score, the more significant the hit.
B. Table of BLAST hits
- summary table of all the sequences in the RefSeq (reference sequence) database that show significant sequence homology
C. Corresponding alignments
- shows all of the alignment blocks for each BLAST hit.
- sequence alignments show how well the query sequence matches with the subject sequence

Important Features of a Typical BLAST alignment
A) Score:
- number used to assess the biological relevance of a finding
- numerical value that describes the overall quality of an alignment.
- higher numbers correspond to higher similarity
B) Expect value / E-value
- describes the number of hits one can “expect” to see by chance when searching a database of a particular size.
- The lower the E-value, or the closer it is to “0”, the higher is the “significance” of the match.
C) Gap.
- A space introduced into an alignment to compensate for insertions and deletions
D) Identity.
- The extent to which two (nucleotide or amino acid) sequences have the same residues at the same positions in an alignment, often expressed as a percentage.
-- line between the bases of the two sequences indicates identity.
E) Query sequence
- input sequence to which all of the entries in a database are to be compared.
F) Subject or Matching sequence.
- A subject or matching sequence is the sequence present in the database.
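
To see why a higher score means a lower E-value, the standard relation between a bit score S' and the expect value, E = m × n × 2^(−S'), can be tried with a few numbers. This is a simplified sketch: it takes m as the query length and n as the total database length, whereas BLAST itself applies corrected "effective" lengths, so real reports will differ somewhat. The function name `expect_value` and the lengths used are illustrative assumptions.

```python
def expect_value(bit_score, query_len, db_len):
    """Approximate E-value from a bit score: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical search: a 600-base query against a 1,000,000-base database
for score in (20, 40, 60):
    print(score, expect_value(score, 600, 1_000_000))
```

Each additional bit of score halves the E-value, and doubling the database size doubles it, which is why the E-value depends on "a database of a particular size."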

How do we interpret this?


- top five hits show much more significant alignments compared to the other hits

-- eg. our unknown hemoglobin beta genomic DNA has sequence homology to the hemoglobin
beta gene in Homo sapiens.
-- Homo sapiens hit has an accession number that begins with ‘NM’
-- other hits (Gorilla gorilla, Pan troglodytes and Pan paniscus) begin with ‘XM’.
- primary difference between the two prefixes is the type of information available to
support each of the Refseq mRNAs.
-- NM: confirmed by experimental evidence
-- XM: based solely on computational predictions
- NM > XM: more favorable as it's based on experimental evidence
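
The NM/XM rule above is easy to apply to a list of hits. The following sketch sorts accessions so that experimentally supported records come first. The accession strings are placeholders made up for illustration, and the helper names `refseq_support` and `prefer_curated` are hypothetical.

```python
# Evidence labels follow the notes: NM = experimental evidence, XM = predicted.
REFSEQ_MRNA_SUPPORT = {
    "NM": "mRNA confirmed by experimental evidence",
    "XM": "mRNA based solely on computational prediction",
}


def refseq_support(accession):
    """Describe the evidence behind a RefSeq mRNA accession from its prefix."""
    prefix = accession.split("_", 1)[0].upper()
    return REFSEQ_MRNA_SUPPORT.get(prefix, "prefix not covered in these notes")


def prefer_curated(accessions):
    """Order hits so NM_ records come before all others (original order kept otherwise)."""
    return sorted(accessions, key=lambda acc: 0 if acc.upper().startswith("NM_") else 1)


# Placeholder accessions, not real records
hits = ["XM_000000001.1", "NM_000000001.1", "XM_000000002.1"]
for acc in prefer_curated(hits):
    print(acc, "->", refseq_support(acc))
```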
OTHER SEQUENCE ALIGNMENT TOOLS

Pairwise Sequence Alignment
- identify regions of similarity between 2 biological sequences
EMBOSS Water
- eg. compare the DNA sequence of normal beta hemoglobin gene (Hgb A) with the sickle cell hemoglobin beta gene (Hgb S)
1. Access EMBOSS Water Pairwise Sequence Alignment using a browser.
Note:
-- Enter the GI number of normal Homo sapiens hemoglobin beta, preceded by the character “>”, then paste the DNA sequence on the next line.
- GI (GenInfo Identifier) number: a simple series of digits assigned consecutively to each sequence record processed by NCBI.
-- Next, enter the GI number of sickle cell Hgb-beta (>Anemic Homo sapiens hemoglobin, beta (HBB), mRNA) in another Notepad file.
- “>” precedes the name of your sequence
-- sequence is in FASTA format
2. Enter the pair of sequences in the boxes indicated.
3. Set your pairwise alignment options.
4. Submit your job.
5. The result will show the characteristics of the alignment, such as:
- length of the sequences (626 bp)
- percent identity (99.8%)
- percent similarity (99.8%)
- gaps (0.0%)

- line (|): indicates identity
- dot (•): indicates a difference (non-identical bases at that position)
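
The figures in such a result (length, percent identity, gaps) and the line/dot markup can be reproduced from any pair of already aligned sequences. The sketch below is an illustration only: it does not perform the alignment itself, it just summarizes one. The function name `alignment_summary` is hypothetical, "." stands in for the dot symbol, and the two 15-base strings are a short stretch around codon 6 of the beta-globin gene (GAG in Hgb A, GTG in Hgb S).

```python
def alignment_summary(seq_a, seq_b):
    """Summarize two aligned sequences of equal length ('-' marks a gap)."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("aligned sequences must be non-empty and equally long")
    markup = []
    identical = gaps = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            gaps += 1
            markup.append(" ")        # gap column: no symbol
        elif a == b:
            identical += 1
            markup.append("|")        # line: identical bases
        else:
            markup.append(".")        # dot: bases differ
    length = len(seq_a)
    return {
        "length": length,
        "identity_pct": round(100 * identical / length, 1),
        "gaps_pct": round(100 * gaps / length, 1),
        "markup": "".join(markup),
    }


hgb_a = "ACTCCTGAGGAGAAG"   # normal: codon 6 is GAG
hgb_s = "ACTCCTGTGGAGAAG"   # sickle cell: codon 6 is GTG
result = alignment_summary(hgb_a, hgb_s)
print(hgb_a)
print(result["markup"])
print(hgb_s)
print(result["identity_pct"], "% identity,", result["gaps_pct"], "% gaps")
```

A single mismatch across 626 aligned bases gives 625/626, or 99.8% identity with 0.0% gaps, which is consistent with the result listed above.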

Multiple Sequence Alignment
- alignment of 3 or more biological sequences of similar length
- homology can be inferred & the evolutionary relationships between the sequences analyzed
- tools:
-- MUSCLE (MUltiple Sequence Comparison by Log-Expectation): DNA alignments
-- MAFFT (Multiple Alignment using Fast Fourier Transform): DNA alignments
-- Clustal Omega: protein alignments

Some Applications of DNA sequence alignments:


a. PCR products
b. sequences obtained from database in order to differentiate them
e.g. 4 serotypes of dengue virus

MUSCLE

Prepare the sequences you wish to align in a Notepad file.


• Example: 4 serotypes of Dengue virus
• Each sequence is given an identifier (gi) preceded by “>”. For example: >gi|224383591|gb|FJ744702.1|
Dengue virus 1 isolate DENV-1/KH/BID-V2003/2006, complete genome
• You may combine the 4 sequences in one Notepad file. Provide a space after each sequence.
• Save as text file or fasta.
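
Instead of assembling the file by hand in Notepad, the combined FASTA text can be generated with a few lines of code. This is a minimal sketch: the function names `to_fasta` and `parse_fasta` are hypothetical, the first header is the one quoted above, and every sequence string and the second header are short made-up placeholders rather than real dengue genomes.

```python
def to_fasta(records, width=60):
    """Turn (header, sequence) pairs into multi-FASTA text, wrapping long sequences."""
    lines = []
    for header, seq in records:
        lines.append(">" + header)                 # ">" precedes each identifier
        for i in range(0, len(seq), width):
            lines.append(seq[i:i + width])
    return "\n".join(lines) + "\n"


def parse_fasta(text):
    """Read multi-FASTA text back into (header, sequence) pairs."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                               # blank lines between records are ignored
        if line.startswith(">"):
            records.append([line[1:], ""])
        elif records:
            records[-1][1] += line
    return [(header, seq) for header, seq in records]


records = [
    ("gi|224383591|gb|FJ744702.1| Dengue virus 1 isolate DENV-1/KH/BID-V2003/2006, complete genome",
     "AGTTGTTAGTCTACGTGGAC"),                      # placeholder sequence
    ("DENV-2 placeholder header", "AGTTGTTAGTCTGTGTGGAC"),  # placeholder record
]
text = to_fasta(records)
print(text)
# To save it for upload: open("dengue.fasta", "w").write(text)
```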
1. Launch MUSCLE (Multiple Sequence Comparison by Log-Expectation)
2. Enter your input sequences.
• 3 or more sequences can be entered directly
3. Set your parameters
• Output Format: ClustalW
• The default settings will fulfil the needs of most users. You may change the default.
4. Submit your job.
• You may identify the tool result by giving it a name, for example, 4 Dengue Serotypes.
5. The result will appear after a few seconds or minutes
- shows the identifiers of the 4 dengue virus serotypes
- Dash lines: Gaps
- Asterisks: identical bases
• eg. several bases that are non-identical among the 4 serotypes
• A Phylogenetic Tree may also be obtained.
- also called evolutionary tree: branching diagram or "tree" showing the evolutionary relationships
among organisms.
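
The asterisk line in a ClustalW-style result can be understood as a column-by-column check: a column earns an asterisk only when every sequence has the same base there. The sketch below reproduces that marking for sequences that are already aligned. The function name `conservation_line` and the four short sequences are made up for illustration and are not real dengue data.

```python
def conservation_line(aligned):
    """Return '*' under columns where all aligned sequences share the same base."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("need at least one sequence, all of the same aligned length")
    marks = []
    for column in zip(*aligned):                   # walk the alignment column by column
        same = len(set(column)) == 1 and column[0] != "-"
        marks.append("*" if same else " ")         # gaps and mismatches get a blank
    return "".join(marks)


# Four made-up aligned sequences standing in for the 4 serotypes
alignment = ["ATG-C", "ATGAC", "ATGTC", "ATG-C"]
for seq in alignment:
    print(seq)
print(conservation_line(alignment))
```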
