Bioinformatics: Nadiya Akmal Binti Baharum (PHD)

This document provides information about the BSM 4301 Bioinformatics course at UPM, including:
- The course outcomes, which are to choose suitable bioinformatics tools for DNA/protein analysis, apply skills to analyze sequences online, and describe bioinformatics applications.
- The course is 3 credit hours and taught by two instructors.
- Assessments include tests, lab and project assessments evaluating different professional and learning outcomes.
- References include two textbooks on bioinformatics fundamentals and databases.

Uploaded by

Nur Razinah

BSM 4301

BIOINFORMATICS

Nadiya Akmal binti Baharum (PhD)


[email protected]
010-5754311
COURSE OUTCOMES (CO)
• To choose databases and bioinformatics software suitable for the analysis of DNA and protein sequences (C4)

• To apply appropriate bioinformatics skills/techniques for analyzing DNA and protein sequences online (P4, LS)

• To describe bioinformatics applications in daily life (A4, LL, TS)


COURSE INFORMATION
• Credit hour: 3 (2+1)
• Instructors:

1. Assoc. Prof. Dr. Adam Leow Thean Chor (Co-Ordinator)


2. Dr. Nadiya Akmal Baharum
COURSE OUTLINE
ASSESSMENTS
PO1 (Knowledge): Test 1 (15%), Test 2 (15%), Final (30%)
PO2 (Practical skills/Psychomotor skills): Lab Assessment (10%)
PO5 (Social skills and responsibilities): Reflective journal from interview with NSG-related industry/research institutes (10%) (TBC)
PO7 (Information management and lifelong learning skills): Leader’s lab report (10%)
PO9 (Leadership): Peer assessment (5%)
PO10 (Numerical skills): Group assignment (5%)

TOTAL: 100%
REFERENCES & TEXTBOOKS

• Pevsner, J. (2015). Bioinformatics and Functional Genomics. 3rd Edition. Wiley Blackwell Inc.

• Choudhuri, S. (2014). Bioinformatics for Beginners: Genes, Genomes, Molecular Evolution, Databases and Analytical Tools. 1st Edition. Oxford: Academic Press.

• Bioinformatics websites.
Introduction to Bioinformatics
Lecture 01-A

• What is bioinformatics?
• Bioinformatics: The BIG Picture
• Aims of bioinformatics
Learning Outcomes (LO):

By the end of this lecture, students should be able to:

1. Define bioinformatics as a field of science.
2. Summarise the three perspectives of bioinformatics.
3. Explain the final aims of bioinformatics to complement the study of biological sciences.
What is bioinformatics?
THREE WORDS TO DESCRIBE BIOINFORMATICS

www.menti.com
Code: 1163 7806
https://www.menti.com/al2ac9orq9fo
A. What is bioinformatics?
A field of study that uses computation to extract knowledge from biological data.

• It includes the collection, storage, retrieval, manipulation and modelling of data for analysis, visualization or prediction through the development of algorithms and software.
A. What is Bioinformatics?
• Management of information systems (databases) built from experimentally acquired molecular biology data, to complement many practical applications.

[Diagram: computer databases (software) and experiments are integrated to fill the gaps and answer biological questions]

http://www.youtube.com/watch?v=dJrpSvsFXFI
A. What is Bioinformatics?
Another definition adopted by Luscombe et al.:

• a union of biology and informatics: bioinformatics involves the technology that uses computers for storage, retrieval, manipulation, and distribution of information related to biological macromolecules such as DNA, RNA, and proteins.
BIOINFORMATICS
http://www.youtube.com/watch?v=42DJPDb-hRU
Scopes of Bioinformatics
1. Development of computational tools and databases.
2. Application of these tools and databases in generating biological knowledge to better understand living systems.
Why use bioinformatics?
• Pre-bioinformatics: in vivo, in vitro; post-bioinformatics: in silico
• Systematic organization of huge amounts of data
• Collect and integrate data to make it accessible and usable
• Effective utilization of all data
• Faster analysis through prediction and simulation
• Shorter time, since analyses can run simultaneously (automation)
• Drawback:
- The quality of bioinformatics predictions depends on the quality of the data and the sophistication of the tools being used.
- Bioinformatics and experimental biology are independent but complementary activities.
Current status of genomics data
1st genome: 1995 (H. influenzae, 1.83 million bases)
Human genome: >3 billion bases (Gb), ~20,500 coding genes
Rice genome: 400-430 million bases (Mb), ~38,000 coding genes
Chicken genome: ~1.21 billion bases, ~20,000 coding genes
Bovine genome: ~2.7 billion bases, ~21,880 coding genes

[Figures: growth of reference sequences; growth of GenBank]

C. Bioinformatics: The BIG Picture

• The cell: DNA, RNA, protein. Central dogma of molecular biology.
• The organism: genome-wide analysis of RNA and protein. Changes in an organism across different developmental stages, and across different regions of the body (multicellular organisms).
• The tree of life: genome analysis. Three major branches: bacteria, archaea and eukaryotes.
• Central dogma of molecular biology: DNA > RNA > Protein > cellular phenotype
• Bioinformatics: complete collection and utilisation of DNA (genome), RNA (transcriptome), and protein sequences (proteome) to elucidate protein and gene function.
• Application of computer algorithms and computer databases to molecular and cell biology.

• Broadening perspective from cell-level to organism-level phenotype.
• Gene and protein expression changes throughout different developmental stages or different regions of an organism, in response to intrinsic or environmental signals.
• Utilize a collection of genes/protein products (e.g. DNA microarrays) to explain changes through developmental time, changes across body regions, and changes in a variety of physiological or pathological states.

• DNA sequence analysis data in bioinformatic databases has accumulated for over 150,000 different organisms.
• Complete genome sequences: help categorize organisms into three major branches in the Tree of Life: bacteria, archaea, and eukaryotes.
• Fundamental unity of life and comparative genomics: learn how chromosomes evolved through duplications, deletions, and rearrangements.
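
The central dogma (DNA > RNA > Protein) can be sketched in a few lines of Python. This is only an illustration: it assumes the input is the coding strand of a DNA sequence, uses the standard genetic code, and reads codons from the first base. The function names and the example sequence are made up for this sketch.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def transcribe(dna: str) -> str:
    """Coding-strand DNA -> mRNA (every T becomes U)."""
    return dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """mRNA -> protein, reading codons from position 0 until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3].upper().replace("U", "T")
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":  # stop codon ends translation
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGGCCTAA"  # hypothetical example sequence
mrna = transcribe(dna)
print(mrna, translate(mrna))  # AUGGCCUAA MA
```

Real tools also handle reverse strands, alternative reading frames and ambiguous bases, which this sketch deliberately ignores.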
Food for thought...
The cow genome comprises a sequence of 2.86 billion letters (2,860,000,000 bases), enough to fill millions of pages of a normal book.
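
To get a feel for that number, here is a rough back-of-the-envelope calculation in Python. The figure of 1,500 letters per page is an assumption made for this sketch, not a value from the slide; a denser or sparser page changes the result proportionally.

```python
GENOME_BASES = 2_860_000_000  # cow genome size quoted above
LETTERS_PER_PAGE = 1_500      # assumed page capacity (hypothetical)


def pages_needed(bases: int, per_page: int) -> int:
    """Number of pages needed to print `bases` letters, rounded up."""
    return -(-bases // per_page)  # ceiling division with integers


print(pages_needed(GENOME_BASES, LETTERS_PER_PAGE))  # 1906667
```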

How can you detect anomalies (at the gene level) between cows, or between cows possessing defective phenotypes?

How can you make sense of the letters?
What gene corresponds to a particular protein?
What sequence corresponds to a specific gene?
D. Aims of bioinformatics
1. To store the biological data organized in a database
2. To develop tools and resources to aid in analysis of data.
3. To analyze and interpret accumulated data in a biologically
meaningful manner.
1. Store the biological data organized in a database
• A database is used to store and organize data
• Allows easy access to existing data and submission of new entries
• Data are annotated to assign their functional characteristics
• Prevents redundancy and multiplicity of similar data
• Identify genes and proteins by sequence and structure similarity
- Orthologs – genes in different species that evolved from a common ancestral gene. They usually retain the same function.
- Paralogs – genes arising from duplication within a genome, which evolved into distinct proteins with related (though sometimes unrelated) functions.
- Analogs – different protein sequences, but similar structures.
• Databases must be able to correlate between different hierarchies of information, e.g.:
- GenBank – gene and protein sequences
- Protein Data Bank – 3D macromolecules
2. Develop tools and resources to aid in analysis of data.

Homology searching (BLAST)
Sequence alignment (ClustalW)
Primer design (Primer3)
Phylogenetic tree (MEGA 5.0)
RNA structure modeling (mfold)
Protein structure modeling (PSIPRED, SWISS-MODEL)
Signal peptide prediction (SignalP)
Physicochemical properties (ProtParam)
Transmembrane prediction (TMHMM)
Promoter prediction (Neural Network Promoter Prediction)
Many more…
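
As a taste of what such tools compute, the sketch below calculates the GC content of a primer and a rough melting temperature using the Wallace rule, Tm = 2(A+T) + 4(G+C) °C. This is a simple rule of thumb for short oligonucleotides and is not claimed to be the method Primer3 uses; the function names and the primer sequence are made up for this example.

```python
def gc_content(seq: str) -> float:
    """Percentage of G and C bases in a DNA sequence."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(seq: str) -> int:
    """Approximate melting temperature (deg C) by the Wallace rule."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


primer = "ATGCGTACGTTAGC"  # hypothetical 14-nt primer
print(gc_content(primer), wallace_tm(primer))  # 50.0 42
```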
3. Analyze and interpret the accumulated data in a biologically meaningful manner

Structure analysis:
• Nucleic acid structure prediction
• Protein structure prediction
• Protein structure classification
• Protein structure comparison

Sequence analysis:
• Genome comparison
• Phylogeny
• Gene & promoter prediction
• Motif discovery
• Sequence database searching
• Sequence alignment

Function analysis:
• Metabolic pathway modeling
• Gene expression profiling
• Protein interaction prediction
• Protein subcellular localization prediction
General flow of bioinformatic information
[Diagram: Store, Access, Manipulate, Analyze]
E. Application of bioinformatics

Rational drug design
Medical therapy
Forensic DNA analysis
Agricultural biotechnology
Rational drug design
How bioinformatics helps to develop drugs/inhibitors that can preferentially bind to specific proteins.
Medical therapy
Genome sequences are used to detect potentially harmful mutations for early diagnosis and effective treatment.

The right medicine can be tailored to the right patient based on biomarker-based diagnosis.
Forensic DNA analysis
DNA sequencing for legal and investigative purposes.
Molecular phylogenetic analysis as evidence in criminal courts.

Yang et al. (2014). Genomics Proteomics Bioinformatics, 12:190-197.

Agricultural biotechnology
Development of new crop varieties with higher productivity.

The deployment of genomic selection breeding will help in achieving higher genetic gains in less time.
Pandey et al. (2016). Front. Plant Sci. 7:455.
Access to Sequence Data and Literature Information
Lecture 01-B

• Publicly accessible databases
• Database operators
• Access to information
• Access to biomedical literature
Learning Outcomes (LO)

• To identify the types of data stored in biological databases.
• To analyze the main features of NCBI and other biological databases.
• To access sequence data information in biological databases, and biomedical literature.
General flow of bioinformatic information
[Diagram: Store, Access, Manipulate, Analyze]
WEBSITES FOR BIOINFORMATIC-BASED REPOSITORIES

• GenBank - National Center for Biotechnology Information (NCBI)
• The European Molecular Biology Laboratory database (EMBL) - European Bioinformatics Institute (EBI)
• DNA Database of Japan (DDBJ) - National Institute of Genetics (NIG)

∴ Collectively known as the International Nucleotide Sequence Database (INSD)
Types of available data
• DNA/RNA sequences
• Protein sequence and structure
• Protein function database
• Organism-specific databases
• Molecular pathway
• Scientific literature
• Genomes
The most sequenced organisms in GenBank (August 2018)

Note: Bacteria, archaea and viruses are absent from the list because of their relatively small genomes.
Types of data in GenBank
• A part of a large fragment of DNA:
- bacterial artificial chromosomes (BACs)
- yeast artificial chromosomes (YACs)
• A gene:
- Prokaryote: non-coding and coding regions
- Eukaryote: regulatory regions, protein-coding exons and introns
• cDNA databases
- RNA converted to more stable cDNA.
• Expressed Sequence Tags (ESTs)
- Partial DNA sequence of a cDNA clone.
- Assume 220,000 human genes ➠ 300 ESTs to each gene.

[Diagram: DNA, RNA, Protein, with genomic DNA databases; cDNA, ESTs and UniGene; and *non-redundant (NR) databases]
National Center for Biotechnology Information (NCBI)

https://www.ncbi.nlm.nih.gov/
NCBI key features: ① PubMed
• National Library of Medicine’s search service.
• >20 million citations in MEDLINE (Medical Literature Analysis and Retrieval System Online).
• Links to participating online journals.

https://pubmed.ncbi.nlm.nih.gov
NCBI key features: ② Entrez
Also known as the Entrez Global Query Cross-Database Search System.

A search and retrieval system that integrates:

• Scientific literature;
• DNA and protein sequence databases;
• 3D protein structure data;
• Population study datasets; and
• Assemblies of complete genomes.
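
Entrez can also be queried programmatically through NCBI's E-utilities web service. The sketch below only builds an ESearch request URL and does not contact NCBI; the helper name `esearch_url` and the example search term are illustrative assumptions.

```python
from urllib.parse import urlencode

# Base address of the E-utilities ESearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch URL for an Entrez database and a search term."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


print(esearch_url("pubmed", "lipocalin AND disease"))
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return the matching record identifiers for the chosen database.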
NCBI key features: ③ BLAST

• Basic Local Alignment Search Tool.
• NCBI’s sequence similarity search tool for the analysis of DNA and protein databases.
• Handles approximately 200,000-300,000 searches daily.
• Comprises the programs blastn (nucleotide), blastp (protein), blastx, tblastx and tblastn.
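
Which BLAST program to use depends on whether the query and the database are nucleotide or protein sequences. The lookup below is a small sketch of that choice; the function name and the `translate_both` flag are made up for illustration and are not part of any BLAST interface.

```python
# (query type, database type) -> BLAST program
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide"): "blastn",
    ("protein", "protein"): "blastp",
    ("nucleotide", "protein"): "blastx",   # query is translated in six frames
    ("protein", "nucleotide"): "tblastn",  # database is translated in six frames
}


def choose_blast(query: str, database: str, translate_both: bool = False) -> str:
    """Pick a BLAST program from the query and database sequence types."""
    if translate_both:
        # tblastx compares a translated nucleotide query
        # against a translated nucleotide database.
        if (query, database) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and database")
        return "tblastx"
    return BLAST_PROGRAMS[(query, database)]


print(choose_blast("nucleotide", "protein"))  # blastx
```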
NCBI key features: ④ OMIM

• Online Mendelian Inheritance in Man.
• A catalog of human genes and the genetic disorders linked to them.
• Comprehensive characterization of entries: mode of inheritance (autosomal dominant, autosomal recessive, X-linked), phenotype, etc.
How do you start looking?
You begin with a query search:
• Name of a specific sequence, or;
• Information from literature (accession number).

Example: retinol-binding protein, RBP4

Molecule  Accession   Description
DNA       X02775      GenBank genomic DNA sequence
DNA       NT_030059   Genomic contig
DNA       rs7079946   dbSNP (single nucleotide polymorphism)
RNA       N91759.1    An expressed sequence tag (1 of 170)
RNA       NM_006744   RefSeq DNA sequence (from a transcript)
Protein   NP_007635   RefSeq protein
Protein   AAC02945    GenBank protein
Protein   Q28369      SwissProt protein
Protein   1KT7        Protein Data Bank structure record
Access to information:
Accession numbers for sequence identification

• DNA and protein sequences are tagged with accession numbers.
• Example from the literature: AY260764.3 - T1 lipase, 3rd version.
• Accession number: a string of 4-12 digits and/or letters associated with a molecular sequence record. Like a barcode!
• Can tell whether an entry contains nucleotide or protein data.
• One molecule can have many accession numbers, e.g. ESTs and DNA fragments matching that particular molecule.
• Accession numbers have different formats in different databases.
• NCBI also assigns GI numbers: unique sequence identification numbers for a sequence within a record.
E.g. NM_000518.4 = human β-globin DNA sequence, GI:28302128. The suffix .4 is the version number. The earlier version NM_000518.3 has a different GI: 13788565.
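
The accession.version pattern can be split apart with a short regular expression. This is a minimal sketch that covers the identifiers shown in these slides; the pattern and the function name are assumptions for illustration and do not cover every accession format in use.

```python
import re

# Letters, an optional underscore, optional further letters, digits,
# then a dot and the version number (e.g. NM_000518.4 or AY260764.3).
ACCESSION_VERSION = re.compile(r"^([A-Z]+_?[A-Z]*\d+)\.(\d+)$")


def split_accession(identifier: str) -> tuple[str, int]:
    """Return (accession, version) for an 'accession.version' identifier."""
    match = ACCESSION_VERSION.match(identifier)
    if match is None:
        raise ValueError(f"not an accession.version identifier: {identifier}")
    return match.group(1), int(match.group(2))


print(split_accession("NM_000518.4"))  # ('NM_000518', 4)
print(split_accession("AY260764.3"))   # ('AY260764', 3)
```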

• Many sequence entries may contain errors or discrepancies, for example those revealed by comparing mRNA and genomic data.

• So how do you assess the quality of a sequence or entry?
A sequence file in GenBank/GenPept format
[Figure: example record with three labelled parts: HEADER, FEATURE, SEQUENCE]
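
The three parts of a GenBank-format record can be pulled apart with a few lines of Python. The sketch below relies on the flat-file keywords FEATURES and ORIGIN and the closing // line; the record itself is a made-up miniature example, not a real GenBank entry, and real parsing is better left to an established library.

```python
RECORD = """\
LOCUS       EXAMPLE1      12 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Hypothetical example record.
FEATURES             Location/Qualifiers
     source          1..12
ORIGIN
        1 atggcctaag ct
//
"""


def split_genbank(record: str) -> tuple[str, str, str]:
    """Split a flat-file record into (header, features, sequence)."""
    header, features, sequence = [], [], []
    section = "header"
    for line in record.splitlines():
        if line.startswith("FEATURES"):
            section = "features"
        elif line.startswith("ORIGIN"):
            section = "sequence"
            continue  # the ORIGIN line itself carries no sequence
        elif line.startswith("//"):
            break     # end of record
        if section == "header":
            header.append(line)
        elif section == "features":
            features.append(line)
        else:
            # drop position numbers and spaces, keep only the bases
            sequence.append("".join(ch for ch in line if ch.isalpha()))
    return "\n".join(header), "\n".join(features), "".join(sequence)


header, features, sequence = split_genbank(RECORD)
print(sequence)  # atggcctaagct
```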
The Reference Sequence (RefSeq) Project
(Accessible through NCBI main page)

• Goal: to provide the best representative sequence for each normal, non-mutated transcript produced by a gene, and for each normal protein product.
• One RefSeq entry per given gene or gene product, OR several RefSeq entries for splice variants or distinct loci.
• RefSeq best representative sequences: provide an expertly curated accession number that corresponds to the most stable, agreed-upon “reference” version of a sequence.

• Formats:
Complete genome       NC_######
Complete chromosome   NC_######
Genomic contig        NT_######
mRNA (DNA format)     NM_######
Protein               NP_######
NCBI’s RefSeq project: many accession number formats for genomic, mRNA and protein sequences

Accession         Molecule  Method           Note
AC_123456         Genomic   Mixed            Alternate complete genomic
AP_123456         Protein   Mixed            Protein products; alternate
NC_123456         Genomic   Mixed            Complete genomic molecules
NG_123456         Genomic   Mixed            Incomplete genomic regions
NM_123456         mRNA      Mixed            Transcript products; mRNA
NM_123456789      mRNA      Mixed            Transcript products; 9-digit
NP_123456         Protein   Mixed            Protein products
NP_123456789      Protein   Curation         Protein products; 9-digit
NR_123456         RNA       Mixed            Non-coding transcripts
NT_123456         Genomic   Automated        Genomic assemblies
NW_123456         Genomic   Automated        Genomic assemblies
NZ_ABCD12345678   Genomic   Automated        Whole genome shotgun data
XM_123456         mRNA      Automated        Transcript products
XP_123456         Protein   Automated        Protein products
XR_123456         RNA       Automated        Transcript products
YP_123456         Protein   Auto. & Curated  Protein products
ZP_12345678       Protein   Automated        Protein products
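
Because the two-letter prefix encodes the molecule type, a RefSeq accession can be classified with a simple lookup. The mapping below is copied from the table above; the function name and the "unknown" fallback are assumptions made for this sketch.

```python
# RefSeq prefix -> molecule type, as listed in the table above.
REFSEQ_PREFIXES = {
    "AC": "Genomic", "AP": "Protein", "NC": "Genomic", "NG": "Genomic",
    "NM": "mRNA", "NP": "Protein", "NR": "RNA", "NT": "Genomic",
    "NW": "Genomic", "NZ": "Genomic", "XM": "mRNA", "XP": "Protein",
    "XR": "RNA", "YP": "Protein", "ZP": "Protein",
}


def refseq_molecule(accession: str) -> str:
    """Return the molecule type for a RefSeq accession, or 'unknown'."""
    prefix, underscore, _ = accession.partition("_")
    if not underscore:
        return "unknown"  # RefSeq accessions always contain an underscore
    return REFSEQ_PREFIXES.get(prefix, "unknown")


print(refseq_molecule("NM_000518.4"))  # mRNA
print(refseq_molecule("NP_007635"))    # Protein
print(refseq_molecule("X02775"))       # unknown (a GenBank accession)
```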
UniGene: an NCBI-organized resource that describes where genes are expressed (i.e. from which library) and how abundantly

[Diagram: DNA, RNA, protein; complementary DNA (cDNA); a cluster of sequences in UniGene corresponds to one gene]

HomoloGene: an excellent NCBI resource that groups homologous eukaryotic genes

https://www.ncbi.nlm.nih.gov/homologene/?term=68066
Access to Biomedical Literature

• PubMed - NCBI gateway to MEDLINE.
• MEDLINE contains bibliographic citations and author abstracts from over 4,600 journals published in the US and in 70 other countries.
• Has >20 million records dating back to the 1950s.
PubMed search strategies
• Use Boolean queries (capitalize AND, OR, NOT), e.g. lipocalin AND disease
• Try using limits (see Advanced search)
• There are links to find Entrez entries and external resources
Query     Example                  Results
1 AND 2   lipocalin AND disease    504
1 OR 2    lipocalin OR disease     2,500,000
1 NOT 2   lipocalin NOT disease    2,370
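
The three Boolean operators behave like set operations on the records matching each term. The sketch below uses Python sets with made-up record IDs to show the idea; it does not query PubMed.

```python
# Hypothetical record IDs matching each search term.
lipocalin = {101, 102, 103, 104}
disease = {103, 104, 105, 106, 107}

both = lipocalin & disease          # lipocalin AND disease
either = lipocalin | disease        # lipocalin OR disease
only_lipocalin = lipocalin - disease  # lipocalin NOT disease

print(sorted(both))            # [103, 104]
print(sorted(either))          # [101, 102, 103, 104, 105, 106, 107]
print(sorted(only_lipocalin))  # [101, 102]
```

Since every lipocalin record either mentions disease or does not, the AND and NOT counts quoted above add up to roughly 504 + 2,370 = 2,874 records for lipocalin alone.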
