Lec 01

Bioinformatics is the computational branch of molecular biology, focusing on the management, integration, and analysis of biological data driven by high-throughput technologies. It encompasses various applications, including the creation of databases, sequence analysis, and modeling dynamic life processes, and is crucial for understanding genome evolution, disease susceptibility, and protein interactions. The document also discusses the structure and function of proteins, DNA, RNA, and the significance of the Human Genome Project in genomic research.

Uploaded by

istiake zahan

Bioinformatics

Md. Fazlul Karim Patwary


Molecular Biology
Molecular biology is the study of biology at the molecular level. The
field overlaps with other areas of
• Biology
• Chemistry
• Genetics
• Biochemistry

It is chiefly concerned with the interactions between the various systems of a cell, including
• DNA
• RNA
• protein biosynthesis

and with how these interactions are regulated.


Introduction
Organic chemistry is the chemistry of carbon
compounds. Biochemistry is the study of carbon
compounds that crawl. -- Mike Adam

Many fields carry the prefix "bio":
Bio-chemistry
Bio-metrics
Bio-physics
Bio-technology
Bio-hazards
Bio-terrorism, etc.
Now: Bio-informatics
Definition
Brief definition:
• Bioinformatics is the computational branch of
molecular biology.

• Bioinformatics is the application of computers
in molecular biology.
Definition
Definition:
Bioinformatics is the interface between biological
and computational sciences driven by advances in
high throughput technologies that result in an
ever increasing variety and volume of
experimental data to be managed, integrated,
and analyzed.

Technology -> large data sets -> manage, integrate, analyze
Bioinformatics has become a part of modern
biology and often:
• dictates new fashions
• enables new approaches
• drives further biological developments.

Using bioinformatics as a toolkit without
understanding the main computational ideas
is not very different from
using a CT scan without knowing how it works.
Computation in bioinformatics is based on
algorithms. These algorithms are being developed
rapidly, and computational biology is now a
faster-growing source of new algorithms than any
other computational science.

To develop and use these algorithms efficiently,
computer science students need to know
bioinformatics.
Use of Bioinformatics
Many novel algorithmic techniques have recently been
developed in the field that promise to provide answers
to key challenges in post-genomic biomedical sciences.
These algorithms are used to:

• understand mechanisms of genome evolution
• understand the structure of regulatory and
protein-interaction networks
• determine the genetic basis of disease susceptibility
• elucidate historical patterns of population
migration.
Areas of Bioinformatics
• Creation and Maintenance of Databases

• Analysis of Sequence Information

• Prediction of Three-Dimensional Structure

• Expression Analysis

• Modelling Dynamic Life Processes


Creation and Maintenance of Databases

The huge magnitude and complexity of the data being
collected has led to the creation of large relational
databases to store, organize, and index such data.

At the moment, DNA sequences and the protein
sequences derived from them make up the
majority of such catalogues.
Creation and Maintenance of Databases

Some well-known examples are:

• GenBank - a database that contains the totality of
public DNA and protein sequence data

• SWISS-PROT - a protein sequence database

• PDB - a database of three-dimensional biological
macromolecular structure data.
Analysis of Sequence Information

In parallel with the development of large sequence
databases, specialized tools (e.g., BLAST) are
being devised to efficiently search, view, and
analyze the data in these databases.
Analysis of Sequence Information

• Development of methods for finding the genes in
the DNA sequences of various organisms

• Clustering sequences into families of related
sequences

• Aligning similar genes and proteins

• Examining evolutionary relationships.


Prediction of Three-Dimensional Structure

Knowledge of physics and chemistry, and
information gathered from similar molecules, is
being used to deduce the three-dimensional
structure of proteins and other large molecules.
Expression Analysis
Pattern analysis of gene expression data using
statistical and data mining tools is a major effort
in bioinformatics.
Modelling Dynamic Life Processes
The ultimate challenge in bioinformatics is to
develop ways of putting together the information
gathered from all the diverse areas of research in
order to understand fundamental life processes.
Molecular Biology for Bioinformatics
• Amino Acid
• Protein
• Gene
• DNA
• Nucleotide
Covalent Bond

A covalent bond is a chemical bond that involves the
sharing of pairs of electrons between atoms.

When the sharing is equal, the bond is called nonpolar (e.g., H2);
otherwise it is polar (e.g., HCl).
Peptide Bond
• Chemical bond formed between two molecules when the
carboxyl group of one reacts with the amino group of the other:
carboxyl group + amino group -> CO-NH + H2O.
• The CO-NH bond is called a peptide bond.
• This group is called the peptide group.
• The resulting molecule is an amide.
• This is a dehydration synthesis reaction,
• also known as a condensation reaction.
Amino Acid

• A central carbon atom bonded to:
– an amino group (NH2),
– a carboxyl group (COOH),
– a side chain (R), and
– a single hydrogen (H) atom
• About 500 amino acids are known
Amino Acid
• Only twenty of them serve as the standard building
blocks of proteins.
• Some of them have an affinity for water and some do
not.
– Hydrophilic (P): polar amino acids, which can
establish hydrogen bonds with water.
– Hydrophobic (H): the rest, which are nonpolar amino
acids.
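The H/P labelling above can be sketched as a simple lookup. This is only an illustration: the grouping of the twenty standard amino acids below is one common textbook convention (an assumption, since sources differ on borderline residues such as G, C and Y), and the function name `hp_string` is hypothetical.

```python
# Assumed grouping of the 20 standard amino acids (one-letter codes).
# Sources differ on borderline residues, so treat these sets as illustrative.
HYDROPHOBIC = set("AVLIMFWPG")   # nonpolar side chains -> "H"
POLAR = set("STCYNQDEKRH")       # polar or charged side chains -> "P"


def hp_string(protein: str) -> str:
    """Map a one-letter amino acid sequence to a string of H and P labels."""
    labels = []
    for aa in protein.upper():
        if aa in HYDROPHOBIC:
            labels.append("H")
        elif aa in POLAR:
            labels.append("P")
        else:
            raise ValueError(f"not a standard amino acid: {aa!r}")
    return "".join(labels)


print(hp_string("MLIDAES"))
```

Reducing a sequence to H and P labels discards most chemical detail, but it is enough to show where water-avoiding and water-seeking residues sit along the chain.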
Amino Acid
[Figures: 18.2 The α-Amino Acids, grouped as hydrophobic, polar neutral, and charged amino acids]
Polypeptide Bond
Peptide bonds lead to a linear ordering of the amino
acids, forming a polypeptide chain. The backbone of
this chain follows the repeating pattern:
-N-Cα-C-N-Cα-C-

One or more such polypeptide chains of amino acids form a protein.


Protein
• Among the most important classes of molecules in living
organisms.
• Their functions include:
– catalysis of metabolic processes, in the form of
enzymes
– signal transmission
– defence mechanisms
– molecule transportation
– building material (for example, in hair).
Protein
• Polypeptide chains (proteins) have a tendency to fold up
into complex three-dimensional structures.
• A protein’s particular function in the cell is determined
not only by its amino acid sequence but also by the
specific structure into which it folds.
• Its function is also likely to be affected by other proteins
present in the same cell at the same time.
Thus proteins are much harder to study.
Primary Structure of Proteins
• Primary structure is the amino acid sequence of the
polypeptide chain
– A result of covalent bonding between the amino acids –
the peptide bonds
• Each protein has a different primary structure with
different amino acids in different places along the
chain
Secondary Structure of Proteins
• When the primary sequence of the polypeptide folds
into regularly repeating structures, secondary
structure is formed
• Secondary structure results from hydrogen bonding
between the amide hydrogens and carbonyl oxygens
of the peptide bonds
• Not all regions have a clearly defined secondary
structure; some are random or nonregular
Tertiary Structure
• The three-dimensional structure, which is distinct
from secondary structure, is classified as tertiary
structure
• Globular tertiary structure forms spontaneously and
is maintained by interactions among the side chains
or R groups
• Tertiary structure defines the biological function of
proteins
Quaternary Structure of Proteins
• The functional form of many proteins is not that of a
single polypeptide chain, but actually an aggregate
of several globular peptides.
• Quaternary structure: the arrangement of subunits or
peptides that form a larger protein.
• Subunit: a polypeptide chain having primary,
secondary, and tertiary structural features that is a
part of a larger protein.
• Quaternary structure is maintained by the same
forces which are active in maintaining tertiary
structure.
Protein
• The set of all proteins generated by the genome of
an organism is called its proteome.
• The study of protein structure and behaviour is called
proteomics.
• Proteomics encompasses:
– Identification of proteins in tissues,
– Characterization of their physicochemical properties
– Description of their behaviour:
• what functions they perform
• how they interact with one another
• what their environment is.
DNA
DNA is built from four bases (nucleobases), which make up the
chromosomes in the human cell:
– adenine (A)
– cytosine (C)
– guanine (G)
– thymine (T)
DNA
• DNA bases pair up with each other:
– A with T and
– C with G
• Each base is also attached to
– a sugar molecule and
– a phosphate molecule.
• Together, a base, sugar, and phosphate are called a
nucleotide.
• Nucleotides are arranged in two long strands that
form a spiral called a double helix.
DNA
• DNA is the organic molecule that carries the information
used by a cell to build the proteins that carry out most of
the biological processes in a cell.
• Double helix
• Pairs: G ≡ C, A = T
• Example sequence: ATGCTGATCGATGCAGAATCGATC
• Length of human DNA is about 3 × 10^9 base pairs (bp)
• Between any two people, DNA is 99.9% the same
• Nearly every cell in a person’s body has the
same DNA
• Our DNA is about 99% the same as that of chimpanzees.
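As a small illustration of the pairing rules on this slide, the sketch below derives the partner strand of the example sequence and counts hydrogen bonds, assuming three bonds per G-C pair and two per A-T pair (the usual reading of the "G ≡ C, A = T" notation). The function names are hypothetical.

```python
# Watson-Crick pairing: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

# Hydrogen bonds per base pair, as suggested by the notation G ≡ C, A = T.
BONDS = {"A": 2, "T": 2, "C": 3, "G": 3}

EXAMPLE = "ATGCTGATCGATGCAGAATCGATC"  # example sequence from the slide


def complement(strand: str) -> str:
    """Return the base-by-base partner of a DNA strand."""
    return "".join(COMPLEMENT[base] for base in strand)


def reverse_complement(strand: str) -> str:
    """Return the partner strand written in its own 5'->3' direction."""
    return complement(strand)[::-1]


def hydrogen_bonds(strand: str) -> int:
    """Count hydrogen bonds holding the strand to its partner."""
    return sum(BONDS[base] for base in strand)


print(complement(EXAMPLE))
print(reverse_complement(EXAMPLE))
print(hydrogen_bonds(EXAMPLE))
```

Because the two strands run in opposite directions, sequence databases conventionally report the partner strand as the reverse complement rather than the plain complement.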
Gene
• The full DNA sequence of an organism is called its genome.
• A gene is a segment of DNA that specifies the sequence of a protein.
• Length: 1000-3000 bases
• The human genome contains approximately 20,000-25,000 genes
• Every person has two copies of each gene, one inherited
from each parent.
Chromosome
Chromosome
• In the nucleus of each cell, the DNA molecule is packaged
into thread-like structures called chromosomes.
• Each chromosome is made up of DNA tightly coiled many
times around proteins called histones that support its
structure.
• Each chromosome has a constriction point called the
centromere, which divides the chromosome into two
sections.
• The location of the centromere on each chromosome gives
the chromosome its characteristic shape, and can be used to
help describe the location of specific genes.
• In a human cell, there are 23 pairs of chromosomes.
RNA
• Ribonucleic acid.
• RNA is similar to DNA.
• RNA is usually single-stranded.
• Two types of nucleic acids are found in all cells: DNA and RNA
• RNA transmits genetic information from DNA to the proteins
produced by the cell.
• The bases of RNA are adenine (A), uracil (U), cytosine
(C), and guanine (G).
• Different types of RNA exist in the cell:
– messenger RNA (mRNA),
– ribosomal RNA (rRNA),
– transfer RNA (tRNA).
Translation: DNA to Protein

• Introns: non-coding regions within a gene, removed from the RNA before translation


Translation: DNA to Protein
• If you know the DNA sequence of a gene, it is
easy to work out which protein will be produced.
• RNA polymerase - an enzyme which reads DNA and
makes a complementary messenger RNA strand (mRNA)
during transcription.
• mRNA - the RNA product of transcription.
• ribosome - reads the mRNA message and aids in the synthesis of
protein.
• tRNA - used in the synthesis of protein; carries amino acids.
• Promoter - region of DNA where RNA polymerase
attaches and initiates transcription, just upstream of the
transcription start point. (AUG is the start codon at which
translation of the mRNA begins.)
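The two steps described above can be sketched in a few lines. This is a simplified illustration, not a model of the cellular machinery: it assumes the input is the coding strand (so transcription reduces to replacing T with U), uses the standard genetic code, ignores introns, and starts translating at the first AUG. The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, listed in UCAG order; "*" marks a stop codon.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def transcribe(coding_dna: str) -> str:
    """Return the mRNA for a coding-strand DNA sequence (T becomes U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate from the first AUG until a stop codon or the end of the mRNA."""
    start = mrna.find("AUG")
    if start < 0:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGCTGATCGATGCAGAATCGATC")
print(mrna)
print(translate(mrna))
```

In a real gene the reading frame, introns and the choice of strand all have to be established first, which is why gene finding is treated as a separate problem from translation.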
Nucleic Acids to Protein
Nucleic Acids to Protein
The Human Genome Project
International research effort to characterize the

– genome of humans
– genomes of selected model organisms

through complete mapping and sequencing of their
DNA.
The Human Genome Project
Purposes:
• to develop technologies for genomic analysis,
• to examine the ethical, legal, and social
implications of human genetics research, and
• to train scientists who will be able to utilize the
tools and resources developed through the HGP
• to pursue biological studies that will improve
human health
The Human Genome Project
• Started in 1988

• National Human Genome Research Institute,
National Institutes of Health (USA)
• Now also Europe and Japan
• Many national projects
Data Banks
International DNA Sequence Database
Collaboration

– NCBI (GenBank) – USA (1982)
– EMBL – Europe (1982)
– DDBJ – Japan (1988)

• Began in 1982; growth was slow for the first 10 years.
• First complete genome in 1995
• First billion base pairs in 1997
• Current: more than 10 billion bp
• Doubles every 6 months.
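To get a feel for what a six-month doubling time means, the sketch below projects the size of the data bank forward from the 10 billion bp figure on the slide. It assumes the doubling time stays constant, which is an assumption made only for this illustration; the function name is hypothetical.

```python
def projected_size(current_bp: float, years: float,
                   doubling_time_years: float = 0.5) -> float:
    """Size after `years`, assuming exponential growth with a fixed doubling time."""
    return current_bp * 2 ** (years / doubling_time_years)


# Starting point taken from the slide: about 10 billion base pairs.
for years in (0.5, 1, 2, 5):
    size = projected_size(10e9, years)
    print(f"after {years} years: {size:.3g} bp")
```

Under this assumption the data bank grows fourfold every year, so storage and search methods have to scale much faster than linearly with time.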
NCBI
National Center for Biotechnology Information
– Fully established in the USA in 1988 as a national resource
for molecular biology information,
– NCBI creates public databases,
– conducts research in computational biology,
– develops software tools for analyzing genome data
– disseminates biomedical information

all for the better understanding of molecular
processes affecting human health and disease.
GenBank
A data bank of NCBI
• PubMed – publications in the life sciences
• Taxonomy – Tree of Life
• Structure – 3D structures of proteins
• Entrez – databank
A Genome
• Entrez
– Genome
– Bacteria
• Haemophilus influenzae
– Complete genome
A Gene
• Haemophilus influenzae

–First contig
–First gene
Protein Data Bank
• PDB homepage: www.rcsb.org/pdb/
• Search for protein
glyceraldehyde-3-phosphate dehydrogenase
– Result 1A7K
• View
LINKS
• Human Genome Project - www.nhgri.nih.gov/HGP/
• The three main DNA banks:
– GenBank - www.ncbi.nlm.nih.gov
– EMBL - www.embl-heidelberg.de
– DDBJ - www.ddbj.nig.ac.jp
• Important protein database - www.expasy.ch/sprot/sprot-top.html
• PDB: Protein Data Bank (USA) - www.rcsb.org/pdb/
Definition
• Index: An index is a set of pointers to information
in a database.
• When searching the entire World Wide Web, or a
specialized database in molecular biology, you
submit one or more search terms, and a program
checks for them in its tables of indices.
• Information retrieval software identifies entries
with contents relevant to your interest.
• Example: if you submit the term 'horse', the
program returns a list of entries that contain the
term 'horse'.
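The idea of an index as a set of pointers can be sketched as an inverted index: a table from each word to the entries that contain it. The entry identifiers and descriptions below are invented for illustration, and real retrieval systems add ranking, synonyms and field-specific indices on top of this.

```python
from collections import defaultdict

# Hypothetical database entries: identifier -> free-text description.
ENTRIES = {
    "E1": "Horse liver alcohol dehydrogenase",
    "E2": "Bovine pancreatic trypsin inhibitor",
    "E3": "Horse heart cytochrome c",
}


def build_index(entries: dict) -> dict:
    """Map each lower-cased word to the set of entry ids that contain it."""
    index = defaultdict(set)
    for entry_id, text in entries.items():
        for word in text.lower().split():
            index[word.strip(".,;:()")].add(entry_id)
    return index


def search(index: dict, *terms: str) -> set:
    """Return the ids of entries containing every search term."""
    if not terms:
        return set()
    hits = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*hits)


index = build_index(ENTRIES)
print(sorted(search(index, "horse")))
print(sorted(search(index, "horse", "liver")))
```

The search never scans the entries themselves; it only looks up the table, which is what makes queries fast on very large databases.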
Contents of Databank
Primary data collections related to biological
macromolecules include:
• Nucleic acid sequences, including whole-genome
projects
• Amino acid sequences of proteins
• Protein and nucleic acid structures
• Small-molecule crystal structures
• Protein functions
• Expression patterns of genes
• Publications
Data Bank
• NCBI: National Center for Biotechnology Information
(USA)
• EMBL: EMBL Data Library (European Bioinformatics
Institute, UK)
• DDBJ: DNA Data Bank of Japan (National Institute of
Genetics, Japan).

• The groups exchange data daily. As a result the raw data are
identical, although the format in which they are stored, and
the nature of the annotation, vary slightly among them.

• Similar conditions apply to amino acid sequences, and to
nucleic acid and protein structures.
Searching a gene from EMBL
• Bovine pancreatic trypsin inhibitor (BPTI) gene
• http://www.ebi.ac.uk/ena/data/view/X03365
Definition
• Authors: Anderson S., Kingston I.B.
• Length: 3,998 bp
• Accession number: X03365.1
• Organism: Bos taurus (cattle)
• First published: 18-NOV-1986
• Last Update: 17-NOV-2004
• Publications:
– Isolation of a genomic clone for bovine pancreatic trypsin inhibitor by using
a unique-sequence synthetic DNA probe.
– Sequences encoding two trypsin inhibitors occur in strikingly similar
genomic environments.
Definition
• Protein name: Pancreatic trypsin inhibitor
• Sequence Length: 100 AA

• Let's see the results on the web


A portion of searching
Portion of Gene
misc_feature
• A miscellaneous feature is a component of the
annotation of an entry that reports properties of
specific regions.
• The feature table may indicate regions that
– perform or affect function
– interact with other molecules
– affect replication
– are involved in recombination
– are a repeated unit
– have secondary or tertiary structure
– are revised or corrected
Protein sequence databases
• SWISS-PROT
• PIR International:
– National Biomedical Research Foundation,
Georgetown University, Washington, DC, USA.
– Munich Information Center for Protein Sequences
(MIPS), Munich, Germany.
– Japan International Protein Information Database,
Tsukuba, Japan.
SWISS-PROT
• Swiss Institute of Bioinformatics (SIB) - an independent
non-profit foundation recognised as being of public utility.
• 31 research and service groups
• coordinates bioinformatics research and education activities
• Includes
– major Swiss Universities
– Swiss Federal Institutes of Technology
– independent research institutes
• Ludwig Institute for Cancer Research
• Friedrich Miescher Institute for Biomedical Research.
The Swiss Institute of Bioinformatics collaborates with the
EMBL Data Library to provide an annotated database of
amino acid sequences called SWISS-PROT.
SWISS-PROT
• Annotated sequence database established in 1986
• Consists of sequence entries of different line
formats
• Similar format to European Bioinformatics Institute
Nucleotide Sequence Database (EMBL)
• http://us.expasy.org/sprot/sprot-top.html
Distinguishing Features of Swiss-Prot
• Annotation
• Minimal Redundancy
• Integration with other databases
• Documentation
Annotation: Core Data
• The sequence data
• The citation information (bibliographical references)
• The taxonomic data (description of the biological
source of the protein)
Annotation: Additional Data
Descriptions include:
• Functions of the protein
• Posttranslational modifications: carbohydrates,
phosphorylation, acetylation and GPI-anchor
• Domains and sites: for example, calcium-binding
regions, ATP-binding sites, zinc fingers,
homeoboxes, and SH2 and SH3 domains
• Secondary structure: alpha helix, beta sheet
• Quaternary structure: homodimer, heterotrimer, etc.
• Similarities to other proteins
• Diseases associated with any number of deficiencies
in the protein
• Sequence conflicts, variants, etc.
Minimal Redundancy
• Much of the data comes from more than one literature
report
• Data are condensed and merged to be more
concise and coherent
• Conflicts in the data are listed for each entry
Integration with other databases
• 50+ databases for cross-reference
• Nucleic acid sequences, protein tertiary structure,
protein 3-D models, etc.
• Allows Swiss-Prot to play a major role as the focal
point for biomolecular interconnectivity

• http://swissmodel.expasy.org/
Documentation
• All files documented and indexed
• Documentation kept up-to-date

• Applications of Swiss-Prot:
– Provides highly organized data and information on a
wide variety of proteins
– Can be used as a starting point for protein research
– Allows searches to be conducted starting with various
search strings
– Biochemical encyclopedia
SWISS-PROT
UniProt
• Mission: Provide the scientific community with a
comprehensive, high quality and freely accessible
resource of protein sequence and functional
information.
Comprised of four components:
• UniProt Knowledgebase (UniProtKB)
• UniProt Reference Clusters (UniRef)
• UniProt Archive (UniParc)
• UniProt Metagenomic and Environmental
Sequences (UniMES)
UniProt Knowledgebase
• UniProt Knowledgebase (UniProtKB) - Central
access point for extensive curated protein
information, including function, classification, and
cross-reference.
– UniProtKB/Swiss-Prot - manually annotated and
reviewed
– UniProtKB/TrEMBL - automatically annotated and
not reviewed.
UniProt Reference Clusters
• UniProt Reference Clusters (UniRef) - databases
provide clustered sets of sequences from the
UniProtKB and selected UniProt Archive records to
obtain complete coverage of sequence space at
several resolutions while hiding redundant
sequences.
• UniProt Archive (UniParc) - comprehensive
repository, used to keep track of sequences and their
identifiers.
• UniProt Metagenomic and Environmental
Sequences (UniMES) - database is a repository
specifically developed for metagenomic and
environmental data.

• http://pir.georgetown.edu/
• http://www.uniprot.org/uniprot/P00974
• http://www.rcsb.org
Protein Data Bank (PDB)
The best-established database for biological macromolecular
structures.
• Contains:
– Structures of proteins
– Nucleic acids
– a few carbohydrates
• Founder: Walter Hamilton,
Brookhaven National Laboratory,
Long Island, New York, USA
• Founded: 1971
Protein Data Bank (PDB)
• Current manager: Research Collaboratory for Structural Bioinformatics (RCSB),
with partners, all in the USA:
– Rutgers University, New Jersey
– San Diego Supercomputer Center, California
– National Institute of Standards and Technology, Maryland

• Parent web site: http://www.rcsb.org
• Official mirror sites: Europe, Singapore, Japan and Brazil; others are
distributed around the world.
Protein Data Bank (PDB)
Each entry includes:
• What protein is the subject of the entry, and what species it
came from
• Who solved the structure, and references to publications
describing the structure determination
• Experimental details about the structure determination,
including information related to the general quality of the
result: resolution of an X-ray structure determination,
stereochemical statistics
• The amino acid sequence
• What additional molecules appear in the structure, including
cofactors, inhibitors, and water molecules
• Assignments of secondary structure: helix, sheet
• Disulphide bridges
E. coli thioredoxin
• Click on http://www.rcsb.org
• Search: E. coli thioredoxin
• Click on the different links
• In particular, click on BLAST
E. coli thioredoxin
Searching from Different databases
• Click on http://srs.ebi.ac.uk/
• Open an SRS session -> select Protein -> Enter HUMAN
ELASTASE -> Click SEARCH
BLAST
• BLAST - Basic Local Alignment Search Tool
• How it works: the program uses a strategy based on
matching sequence fragments, employing a powerful
statistical model developed by Samuel Karlin and
Stephen Altschul.

• BLASTP: an NCBI BLAST program to compare a protein
query sequence to a protein database.
ftp://ftp.ncbi.nih.gov/blast/executables/

• WU-BLAST: available from Washington University.
http://blast.wustl.edu/
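The "matching sequence fragments" idea can be illustrated with a toy seeding step: index every word of length k in the database sequence, then look up each word of the query. This is only a sketch of the first stage under simplifying assumptions (exact matches, one subject sequence); it does not include the extension of hits or the statistical model that BLAST relies on, and the function names are hypothetical.

```python
from collections import defaultdict


def word_index(sequence: str, k: int) -> dict:
    """Map every length-k word in the sequence to its start positions."""
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index


def seed_hits(query: str, subject: str, k: int = 3) -> list:
    """Return (query_pos, subject_pos) pairs where a length-k word matches exactly."""
    index = word_index(subject, k)
    return [
        (q, s)
        for q in range(len(query) - k + 1)
        for s in index.get(query[q:q + k], [])
    ]


print(seed_hits("ACGT", "TTACGTT", k=3))
```

Hits that share the same offset between query and subject positions lie on one diagonal and are natural candidates for extension into a longer local alignment.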
BLAST
• BLASTN: Nucleotide vs. Nucleotide Sequence Similarity
Search
• BLASTX: Nucleotide vs. Protein Sequence Similarity Search
• PSIBLAST: Protein vs. Protein Iterative Sequence Similarity
Search
• FASTA: Protein vs. Protein Sequence Similarity Search
• SSEARCH: Protein vs. Protein Sequence Similarity Search
• MPsrch: a sequence comparison tool that
implements the true Smith-Waterman algorithm. It
does protein-protein database searches.
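For reference, a minimal sketch of the Smith-Waterman local alignment algorithm follows. The scoring values (match 2, mismatch -1, linear gap -2) are assumptions chosen for readability; tools for protein searches normally use a substitution matrix and affine gap penalties, and nothing here reflects how MPsrch itself is implemented.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1,
                   gap: int = -2) -> tuple:
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("TTTACGTTT", "GGACGGG"))
```

Resetting negative scores to zero is what makes the alignment local: the best-scoring stretch is reported even when the rest of the two sequences is unrelated. The price is a running time proportional to the product of the two lengths, which is why heuristic tools such as BLAST and FASTA are used for large database searches.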
Protein structure
Procedures for determining protein structure are:

– X-ray crystal structure analysis: gives estimates of the
positions of the atoms in a molecule and of their effective
sizes, known as B-factors.

– Nuclear Magnetic Resonance (NMR): produces
structures that are generally correct in topology but not
as precise as a good X-ray structure determination, and
therefore less useful for the study of fine structural
details.
Classification of Protein structure
Hierarchical classifications of proteins according to their folding
patterns include:
– SCOP: Structural Classification of Proteins
– CATH: Class/Architecture/Topology/Homology
– DALI: Based on extraction of similar structures from distance
matrices
– CE: A database of structural alignments

• Structural classification of proteins:
– http://scop.mrc-lmb.cam.ac.uk/scop/
Gateways to archives
Databases of nucleic acid and protein sequences maintain
facilities for a very wide variety of information retrieval and
analysis operations:

1. Retrieval of sequences from the database
2. Sequence comparison
3. Translation of DNA sequences to protein sequences
4. Simple types of structure analysis and prediction
5. Pattern recognition
6. Molecular graphics
Pattern recognition
It is possible to search for all sequences containing
• the same pattern
• the same combination of patterns
• certain sets of residues at consecutive positions.

Short and localized patterns sometimes identify molecules that
share a common function even if there is no obvious overall
relationship between their sequences.
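A short pattern search of this kind can be sketched with a regular expression. The example uses the N-glycosylation site motif, written N-{P}-[ST]-{P} in PROSITE notation (N, then any residue except P, then S or T, then any residue except P); the three sequences are invented for illustration and the function name is hypothetical.

```python
import re

# N-{P}-[ST]-{P} as a regular expression. The lookahead lets overlapping
# occurrences be reported as separate hits.
N_GLYCOSYLATION = re.compile(r"(?=N[^P][ST][^P])")

# Toy sequences, made up for this example.
SEQUENCES = {
    "seqA": "MKNGSAL",
    "seqB": "MKNPSAL",
    "seqC": "AANATQNCTW",
}


def find_motif(sequences: dict, pattern=N_GLYCOSYLATION) -> dict:
    """Return {name: [1-based start positions]} for sequences with a match."""
    hits = {}
    for name, seq in sequences.items():
        positions = [m.start() + 1 for m in pattern.finditer(seq)]
        if positions:
            hits[name] = positions
    return hits


print(find_motif(SEQUENCES))
```

A match only says that the residues fit the pattern; whether the site is actually used in the cell has to be established by other evidence.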
Molecular graphics
Typical applications of molecular graphics include:
– mapping residues believed to be involved in function, onto the
three dimensional framework of a protein.
– classifying and comparing the folding patterns of proteins,
– analysing changes between closely-related structures, or between
two conformational states of a single molecule
– studying the interaction of a small molecule with a protein, in
order to attempt to assign function, or for drug development
– interactive fitting of a model to the noisy and fuzzy image of the
molecule that arises initially from the measurements in solving
protein structures by X-ray crystallography
– design and modelling of new structures.
