Introduction to Bioinformatics - Notes
BioInformatics
Branches in BioInformatics
Questions
Introduction
The word “bioinformatics” is a shortened form of “biological informatics”. The huge demand
for the analysis and interpretation of the biological data is being managed by the evolving
https://canvas.instructure.com/courses/4675110/assignments/29562991 1/18
5/5/22, 8:42 PM Unit I - Introduction to Bioinformatics - Notes
Bioinformatics is often focused on obtaining biologically oriented data such as nucleic acid
(DNA/RNA) and protein sequences, structures, functions, pathways, and interactions;
organizing these data into databases; developing methods to get useful information from
these databases; and devising methods to integrate the related data from disparate
sources. These computer databases and algorithms are developed to speed up and enhance
biological research.
Data-intensive, large-scale biological problems are addressed from a computational point
of view.
The most common problems are modelling biological processes at the molecular level and
making inferences from collected data.
DNA (Deoxyribonucleic Acid): It is the hereditary material in humans and almost all
other organisms. Nearly every cell in a person’s body has the same DNA.
RNA (Ribonucleic Acid): It is a molecule similar to DNA. RNA is single-stranded. An
RNA strand has a backbone made of alternating sugar (ribose) and phosphate groups.
Gene: A gene is the basic physical and functional unit of heredity. Genes are made up
of DNA.
Amino Acid: Amino acids are molecules that combine to form proteins. Amino acids
and proteins are the building blocks of life.
DNA, the carrier of hereditary information, is written in an alphabet of only four
letters: A, T, G, and C.
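The four-letter alphabet makes DNA easy to handle as plain text. The following is a minimal sketch, assuming a sequence is held as an ordinary Python string; the function name and the example sequence are made up for illustration.

```python
from collections import Counter

DNA_ALPHABET = set("ATGC")


def base_composition(sequence: str) -> dict:
    """Count how often each of the four DNA letters occurs in a sequence."""
    seq = sequence.upper()
    unknown = set(seq) - DNA_ALPHABET
    if unknown:
        # Reject anything outside the four-letter alphabet described above.
        raise ValueError(f"unexpected symbols: {sorted(unknown)}")
    counts = Counter(seq)
    return {base: counts.get(base, 0) for base in "ATGC"}


example = "ATGCGCTA"  # hypothetical sequence, for illustration only
print(base_composition(example))
```

Real sequence files may also contain ambiguity codes such as N; this sketch deliberately rejects them to keep the example small.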
The human genome contains roughly 20,000 protein-coding genes, distributed across the 23
pairs of chromosomes in a cell.
The genes are the recipes for proteins, the building blocks and workers in the body.
Different genes are active in different types of cells, e.g., a liver cell does not express the
same genes as a brain cell.
Some proteins are vital for the survival of a cell and their corresponding genes are therefore
active in all cell types and are known as “Housekeeping Genes”.
Each gene also has a regulatory segment, which contains structures involved in the
initiation and regulation of transcription.
Every cell must contain genetic information, so the DNA is duplicated before a cell divides;
this process is known as Replication.
In all eukaryotic cells, DNA never leaves the nucleus; instead, the genetic recipe (the genes)
is copied into RNA, which in turn is decoded (translated) into proteins in the cytoplasm.
The DNA itself is not translated into proteins directly for several reasons:
Security: Using the DNA itself as the daily working template for protein synthesis would
risk damaging it, and the DNA has to stay intact to maintain life.
Regulation of the rate of protein synthesis: The RNA intermediate lets the cell control
the speed at which genes are converted into proteins.
The journey from gene to protein is complex and tightly controlled within each cell.
It consists of two major steps:
Transcription
Translation
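During transcription, the information stored in a gene's DNA is copied into RNA in the
nucleus. This RNA is called messenger RNA (mRNA) because it carries the information, or
message, from the DNA out of the nucleus into the cytoplasm.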
Translation, the second step in getting from a gene to a protein, takes place in the
cytoplasm.
The mRNA interacts with a specialized complex called a ribosome, which "reads"
the sequence of mRNA nucleotides.
Each sequence of three nucleotides, called a codon, usually codes for one particular amino
acid.
Amino acids are the building blocks of proteins.
A type of RNA called transfer RNA (tRNA) assembles the protein, one amino acid at a time.
Protein assembly continues until the ribosome encounters a “stop” codon (a sequence of
three nucleotides that does not code for an amino acid).
The flow of information from DNA to RNA to proteins is one of the fundamental principles
of molecular biology. It is so important that it is sometimes called the “central dogma.”
DNA makes RNA makes protein.
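The two steps can be sketched in a few lines of Python. This is a simplified illustration, not a full implementation: it assumes the input is the coding strand of the gene (so transcription reduces to replacing T with U), and the codon table holds only the handful of standard-genetic-code entries needed for the made-up example sequence.

```python
# Partial codon table (standard genetic code); only the codons used below.
CODON_TABLE = {
    "AUG": "Met", "UUU": "Phe", "GGC": "Gly", "GCU": "Ala",
    "UAA": "Stop", "UAG": "Stop", "UGA": "Stop",
}


def transcribe(coding_strand: str) -> str:
    """Return the mRNA for a DNA coding strand (T is replaced by U)."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> list:
    """Read the mRNA three nucleotides at a time until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "Stop":
            break  # stop codons do not code for an amino acid
        protein.append(amino_acid)
    return protein


dna = "ATGTTTGGCGCTTAA"  # hypothetical gene fragment
mrna = transcribe(dna)
print(mrna)             # AUGUUUGGCGCUUAA
print(translate(mrna))  # ['Met', 'Phe', 'Gly', 'Ala']
```

A complete translator would need all 64 codons; a codon missing from this partial table raises a KeyError.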
Amino Acids:
Amino acids are molecules that combine to form proteins. Amino acids and proteins are
the building blocks of life.
When proteins are digested or broken down, amino acids are left. The human body uses
amino acids to make proteins to help the body:
Nonessential means that our bodies can produce the amino acid, even if we do not get it
from the food we eat.
Nonessential amino acids include: alanine, arginine, asparagine, aspartic acid, cysteine,
glutamic acid, glutamine, glycine, proline, serine, and tyrosine.
Conditional amino acids are usually not essential, except in times of illness and stress.
Conditional amino acids include: arginine, cysteine, glutamine, tyrosine, glycine,
ornithine, proline, and serine.
Branches in BioInformatics
A living cell is a system in which cellular components such as the genome, the gene
transcripts, and the proteins interact with each other, and these interactions determine
the fate of the cell, e.g., whether a stem cell is going to become a liver cell or a cancer
cell. The characterization of these three types of components and the associated
development of analytical methods led to the establishment of the three closely related
branches of bioinformatics: genomics, transcriptomics, and proteomics.
Genomics:
Genomics is the study of all of a person's genes (the genome), including interactions of
those genes with each other and with the person's environment.
It studies the mapping of the nucleotide sequences of all the chromosomes of an organism,
thereby determining the location of the different genes and their sequences.
This involves extensive analysis of the nucleic acids through molecular biology techniques
before the data are ready for processing by computers.
It is a science that attempts to describe a living organism in terms of the sequence of its
genome (its constituent genetic material).
Genomics uses the techniques of molecular biology and bioinformatics to identify cellular
components such as proteins, rRNA, tRNA, etc., and analyse the sequences attributed to
the structural genes, regulatory sequences, and even non-coding sequences.
Genomics is closely related to, and sometimes considered a branch of, genetics, the study
of genes and heredity.
The first automatic DNA sequencer was developed in 1986 by Leroy Hood. This paved the
way for the official beginning of the Human Genome Project (HGP) in 1990, which gave a
boost to genomics.
A large number of bacterial genomes have already been fully sequenced and put in the
public domain.
Haemophilus influenzae was the first bacterium to be sequenced in 1995. The sequencing
of bacterial genomes was followed by the first sequenced eukaryotic organism, the
unicellular genetic model system Saccharomyces cerevisiae (commonly known as baker’s
yeast).
In December 1998, the first multicellular organism was added to the list, the nematode
Caenorhabditis elegans, which is now considered as a model organism to provide us with
information about unique functions in organisms of greater complexity.
The sum of all this information is enormous, and its potential for our understanding of
life processes can be explored with the help of genomics, which is almost synonymous with
bioinformatics.
Transcriptomics:
It is the study of the transcriptome - the complete set of RNA transcripts that are produced
by the genome, under specific circumstances or in a specific cell - using high-throughput
methods, such as microarray analysis.
Transcriptomics is the study of the transcriptome, which includes the whole set of mRNA
molecules (or transcripts) in one or a population of biological cells for a given set of
environmental circumstances.
This study helps us to depict the expression levels of genes, often using techniques such
as DNA microarrays, which are capable of sampling tens of thousands of different mRNAs at
a time.
Limitation of Transcriptomics:
Proteomics:
It is the systematic, large-scale analysis of proteins. It is based on the concept of the
proteome as a complete set of proteins produced by a given cell or organism under a
defined set of conditions.
Proteomics represents the earliest attempt to identify a major sub-class of cellular
components - the proteins - and their interactions.
Proteomics involves the sequencing of amino acids in a protein, determining its 3D
structure and relating it to the function of the protein.
Metabolic proteins such as haemoglobin and insulin have been subjected to intensive
proteomic investigation.
The term ‘proteomics’ was coined to make an analogy with genomics, and while it is often
viewed as the ‘next step’, proteomics is much more complicated than genomics.
A single organism has radically different protein expressions in different parts of its body,
in different stages of its life cycle and in different environmental conditions.
The complete set of proteins existing in an organism throughout its life cycle or, on a
smaller scale, the set of proteins found in a particular cell type under a particular type of
stimulation, is referred to as the proteome of the organism or cell type, respectively.
Data Acquisition
Tool and Database Development
Data Analysis
Data Integration
Data Acquisition:
Data acquisition is primarily concerned with accessing and storing data generated
directly from the biological experiments.
The data generated by various sequencing projects have to be retrieved in the
appropriate format, and be capable of being linked to all the information related to the
DNA samples, such as the species, tissue type, and quality parameters used in the
experiments.
The data are organized in different databases so that the researchers can access
existing information and submit new entries as and when they are produced.
Examples of such databases are Entrez Genome at NCBI (for genome data) and the
Protein Data Bank (for 3D macromolecular structure data).
Tool and Database Development:
Many laboratories generate large volumes of data such as DNA sequences, gene
expression information, 3D molecular structures, and high-throughput screening results.
Consequently, they must develop effective databases for storing and quickly accessing
data.
The second aim is to develop tools and resources that aid in the analysis of data.
For example, having sequenced a particular protein, it is of interest to compare it with
previously characterized sequences.
Programs such as FASTA and PSI-BLAST must consider what comprises a biologically
significant match.
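To give a feel for what comparing a new sequence against previously characterized ones involves, here is a toy sketch that ranks database entries by the number of short words (k-mers) they share with the query. This is only an illustration: it is not the scoring used by FASTA or PSI-BLAST, it says nothing about statistical or biological significance, and the sequences and names are made up.

```python
def kmers(sequence: str, k: int = 3) -> set:
    """Return the set of all overlapping words of length k in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def shared_kmer_score(query: str, subject: str, k: int = 3) -> int:
    """Count the distinct k-mers that occur in both sequences."""
    return len(kmers(query, k) & kmers(subject, k))


def rank_matches(query: str, database: dict, k: int = 3) -> list:
    """Rank database entries from most to fewest shared k-mers."""
    scores = [(name, shared_kmer_score(query, seq, k))
              for name, seq in database.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical protein sequences, for illustration only.
database = {"seqA": "MKTAYIAKQR", "seqB": "GGGSSSPPPL"}
print(rank_matches("MKTAYIAK", database))
```

A raw count like this cannot tell a meaningful match from a chance one, which is exactly the question real search programs have to answer.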
Data Analysis:
The third aim is to use these tools to analyze the data and interpret the results in a
biologically meaningful manner.
Traditionally, biological studies examined individual systems in detail, and compared
those with a few related systems.
In bioinformatics, we can now conduct a global analysis of all the available data with
the aim of unveiling common principles that apply across many systems and highlighting
novel features.
Efficient analysis requires an efficiently designed database.
It must allow researchers to place their query effectively and provide them with all the
information they need to begin their data analysis.
Data Integration:
Once information has been analyzed, a researcher must often associate or integrate it
with the related data from other databases.
For example, a scientist may run a series of gene expression analysis experiments and
observe that a particular set of 100 genes is more highly expressed in a cancerous lung
tissue than in a normal lung tissue.
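The scenario above can be sketched as follows: first find the genes that are more highly expressed in the tumour sample, then look them up in a second data source. Everything here is an assumption made for illustration: the gene names, expression values, annotation table, the pseudocount, and the log2 fold-change threshold of 1.

```python
import math


def log2_fold_change(tumor: float, normal: float, pseudocount: float = 1.0) -> float:
    """Log2 ratio of tumour to normal expression (pseudocount avoids division by zero)."""
    return math.log2((tumor + pseudocount) / (normal + pseudocount))


def upregulated_genes(tumor_expr: dict, normal_expr: dict, threshold: float = 1.0) -> list:
    """Genes measured in both samples whose log2 fold change meets the threshold."""
    return sorted(
        gene for gene in tumor_expr
        if gene in normal_expr
        and log2_fold_change(tumor_expr[gene], normal_expr[gene]) >= threshold
    )


def integrate(genes: list, annotations: dict) -> dict:
    """Attach annotations from another source; unknown genes map to None."""
    return {gene: annotations.get(gene) for gene in genes}


# Hypothetical expression values and annotations.
tumor = {"geneA": 31.0, "geneB": 7.0, "geneC": 3.0}
normal = {"geneA": 7.0, "geneB": 7.0, "geneC": 7.0}
annotations = {"geneA": "hypothetical pathway X"}

hits = upregulated_genes(tumor, normal)
print(integrate(hits, annotations))
```

In practice the second source would be an external database rather than an in-memory dictionary, but the join on a shared gene identifier is the same idea.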
GenBank flatfile (GBF) format is one of the most popular sequence file formats because
of its detailed sequence features and ease of readability.
For a computer to use the data in the file, a parsing step is required; it is
performed according to a given grammar for the sequence and the description in a GBF.
Each GenBank entry includes a concise description of the sequence, the scientific name
and taxonomy of the source organism, and a table of features that identifies the coding
regions and other sites of biological significance (such as transcription units and sites
of mutations, modifications, or repeats).
GenBank flat file format has three sections:
Header
Features
Sequence
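As a minimal sketch of the parsing step, the function below splits one record into its three sections. It relies on the facts that the feature table starts at the line beginning with FEATURES, the sequence starts after the ORIGIN line, and the record ends with "//". The record itself is a made-up, heavily abbreviated example, and a real parser would handle many more fields.

```python
RECORD = """\
LOCUS       EXAMPLE01       12 bp    DNA     linear   UNA 01-JAN-2000
DEFINITION  Hypothetical example record.
FEATURES             Location/Qualifiers
     source          1..12
ORIGIN
        1 atgcgtacgt ta
//
"""


def split_genbank_record(text: str) -> dict:
    """Split a single GenBank flat file record into header, features, sequence."""
    header, features, sequence_lines = [], [], []
    current = header
    for line in text.splitlines():
        if line.startswith("//"):
            break                      # end of the record
        if line.startswith("FEATURES"):
            current = features         # feature table begins here
        elif line.startswith("ORIGIN"):
            current = sequence_lines   # sequence lines follow
            continue
        current.append(line)
    # Sequence lines carry position numbers and spaces; keep letters only.
    sequence = "".join(ch for ch in "".join(sequence_lines) if ch.isalpha())
    return {"header": header, "features": features, "sequence": sequence}


sections = split_genbank_record(RECORD)
print(sections["sequence"])
```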
FASTA Format:
Multi-FASTA Format:
A multi-FASTA file is a text file containing several DNA sequences in FASTA format. Every
FASTA entry has two fundamental blocks.
The first is a single text line starting with a '>' character followed by a sequence
description. The second block is the sequence itself and may span several lines.
For Example:
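As a minimal sketch with made-up sequences, a multi-FASTA text can be read into a dictionary that maps each description line to its sequence. The function name and sample data are hypothetical.

```python
SAMPLE = """\
>seq1 hypothetical first sequence
ATGCGT
ACGT
>seq2 hypothetical second sequence
GGCTA
"""


def parse_multi_fasta(text: str) -> dict:
    """Map each description line (without '>') to its joined sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()    # first block: the description line
            records[name] = []
        elif name is not None:
            records[name].append(line) # second block: one or more sequence lines
    return {key: "".join(parts) for key, parts in records.items()}


for description, sequence in parse_multi_fasta(SAMPLE).items():
    print(description, sequence)
```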
GCG-MSF Format:
We can combine multiple sequences in a single file, called a Multiple Sequence Format
(MSF) file. MSF files include not only the sequence name but also the sequence itself,
which is usually aligned with the other sequences in the file.
We can specify a single sequence within an MSF file, a subset of sequences, or all
sequences. Like other sequences, those in an MSF file can be used with other GCG
programs.
For Example:
EMBL Format:
European Molecular Biology Laboratory (EMBL) File Format stores sequence and its
annotation together.
The start of the annotation section is marked by a line beginning with the word “ID”.
The start of sequence section is marked by a line beginning with the word “SQ”.
The “//” (terminator) line contains no data or comments and designates the end of an
entry.
For Example:
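The three markers described above (ID, SQ, and "//") are enough to split an entry into its annotation and sequence parts. The following is a sketch under that assumption only; the entry is made up and abbreviated, and real EMBL entries contain many more line types.

```python
ENTRY = """\
ID   EXAMPLE01; SV 1; linear; genomic DNA; STD; UNC; 12 BP.
DE   Hypothetical example entry.
SQ   Sequence 12 BP;
     atgcgtacgt ta                                                          12
//
"""


def parse_embl_entry(text: str) -> dict:
    """Split one EMBL entry into its identifier, annotation lines, and sequence."""
    entry_id = None
    annotation = []
    sequence_parts = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):
            break                      # terminator line: end of entry
        if line.startswith("SQ"):
            in_sequence = True         # sequence section starts after this line
            continue
        if in_sequence:
            # Drop spaces and the trailing position count; keep letters only.
            sequence_parts.append("".join(ch for ch in line if ch.isalpha()))
        else:
            if line.startswith("ID"):
                entry_id = line[2:].split(";")[0].strip()
            annotation.append(line)
    return {"id": entry_id, "annotation": annotation,
            "sequence": "".join(sequence_parts)}


print(parse_embl_entry(ENTRY)["sequence"])
```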
Clustal Format:
A clustal-formatted file is a plain text format. It can optionally have a header, which
states the clustal version number.
This is followed by the multiple sequence alignment, and optional information about the
degree of conservation at each position in the alignment.
Each sequence in the alignment is divided into subsequences each at most 60
characters long.
The sequence identifier for each sequence precedes each subsequence.
Each subsequence can optionally be followed by the cumulative number of non-gap
characters up to that point in the full sequence.
ClustalW is a widely used system for aligning any number of homologous nucleotide or
protein sequences.
For multi-sequence alignments, ClustalW uses progressive alignment methods. In
these, the most similar sequences, that is, those with the best alignment score are
aligned first.
Then progressively more distant groups of sequences are aligned until a global
alignment is obtained.
This heuristic approach is necessary because finding the global optimal solution is
prohibitive in both memory and time requirements.
ClustalW performs very well in practice. The algorithm starts by computing a rough
distance matrix between each pair of sequences based on pairwise sequence alignment
scores.
These scores are computed using the pairwise alignment parameters for DNA and
protein sequences.
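The first stage of a progressive method, scoring every pair and picking the most similar pair to align first, can be sketched as below. This is an illustration under simplifying assumptions: it uses a basic global alignment score with made-up match, mismatch, and gap values rather than ClustalW's actual parameters, and it stops before building a guide tree or the multiple alignment itself.

```python
from itertools import combinations


def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Score of the best global alignment of a and b (dynamic programming)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def pairwise_scores(sequences: dict) -> dict:
    """Alignment score for every pair of named sequences."""
    return {(x, y): global_alignment_score(sequences[x], sequences[y])
            for x, y in combinations(sorted(sequences), 2)}


def first_pair_to_align(sequences: dict) -> tuple:
    """The most similar pair, which a progressive method would align first."""
    scores = pairwise_scores(sequences)
    return max(scores, key=scores.get)


# Hypothetical sequences, for illustration only.
sequences = {"s1": "ACGTAC", "s2": "ACGTTC", "s3": "TTTTGG"}
print(first_pair_to_align(sequences))  # ('s1', 's2')
```

Only two rows of the scoring matrix are kept in memory at a time, which is enough when the score alone is needed and not the alignment.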
Phylip Format:
PHYLIP format is a plain text format containing exactly two sections: a header
describing the dimensions of the alignment, followed by the multiple sequence
alignment itself.
PHYLIP requires that each sequence identifier is exactly 10 characters long.
The header consists of a single line describing the dimensions of the alignment. It must
be the first line in the file.
The header consists of optional spaces, followed by two positive integers (n and m)
separated by one or more spaces.
The first integer (n) specifies the number of sequences (i.e., the number of rows) in the
alignment.
The second integer (m) specifies the length of the sequences (i.e., the number of
columns) in the alignment.
The smallest supported alignment dimensions are 1*1.
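The header rules translate directly into a small validator. The sketch below assumes the strict layout described here, with each identifier occupying exactly the first 10 characters of its line and one line per sequence; the function names and the sample alignment are made up.

```python
SAMPLE = """\
 2 8
seq_one   ACGTACGT
seq_two   ACGTTCGT
"""


def parse_phylip_header(line: str) -> tuple:
    """Return (n, m): number of sequences and alignment length."""
    fields = line.split()              # tolerates leading and repeated spaces
    if len(fields) != 2:
        raise ValueError("header must contain exactly two integers")
    n, m = int(fields[0]), int(fields[1])
    if n < 1 or m < 1:
        raise ValueError("smallest supported dimensions are 1 x 1")
    return n, m


def parse_phylip(text: str) -> dict:
    """Read a strict PHYLIP alignment and check it against its header."""
    lines = [line for line in text.splitlines() if line.strip()]
    n, m = parse_phylip_header(lines[0])
    alignment = {}
    for line in lines[1:]:
        identifier = line[:10].strip()           # fixed 10-character field
        sequence = line[10:].replace(" ", "")
        if len(sequence) != m:
            raise ValueError(f"{identifier}: expected {m} columns")
        alignment[identifier] = sequence
    if len(alignment) != n:
        raise ValueError(f"expected {n} sequences, found {len(alignment)}")
    return alignment


print(parse_phylip(SAMPLE))
```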
Nexus Format:
NEXUS is the file format used by many popular programs like GDA, PAUP*, Mesquite,
ModelTest, MrBayes, and MacClade. NEXUS file names often have a .nxs or .nex extension.
The NEXUS format conveys data organized according to the character state data model, in
which the features of operational taxonomic units (OTUs) (e.g., species, individuals, genes,
genomes, etc.) are observable states of underlying homologous characters.
For instance, in a protein sequence alignment, proteins are the OTUs, alignment columns
are characters, and amino acids (or gaps) are states.
Each of the pre-defined types of public blocks may appear only once. The TAXA block is the
only necessary block.
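The two block rules stated above can be checked with a short script. This is a rough sketch, assuming that a file starts with "#NEXUS" and that each block opens with a "BEGIN name;" line; for simplicity it applies the once-only rule to every block name it finds, and the sample file is made up.

```python
import re

SAMPLE = """\
#NEXUS
BEGIN TAXA;
  DIMENSIONS NTAX=2;
  TAXLABELS taxon_a taxon_b;
END;
BEGIN CHARACTERS;
  DIMENSIONS NCHAR=4;
  FORMAT DATATYPE=DNA;
  MATRIX
    taxon_a ACGT
    taxon_b ACGA
  ;
END;
"""


def list_nexus_blocks(text: str) -> list:
    """Return the block names of a NEXUS file, in order of appearance."""
    if not text.lstrip().upper().startswith("#NEXUS"):
        raise ValueError("file does not start with #NEXUS")
    names = re.findall(r"^\s*BEGIN\s+(\w+)\s*;", text,
                       flags=re.IGNORECASE | re.MULTILINE)
    return [name.upper() for name in names]


def check_blocks(text: str) -> list:
    """List violations of the two block rules; an empty list means none found."""
    blocks = list_nexus_blocks(text)
    problems = []
    if "TAXA" not in blocks:
        problems.append("missing TAXA block")
    for name in sorted(set(blocks)):
        if blocks.count(name) > 1:
            problems.append(f"{name} block appears more than once")
    return problems


print(list_nexus_blocks(SAMPLE))
print(check_blocks(SAMPLE))
```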
ReadSeq:
SeqVerter:
SeqVerter can help you to view automatic DNA sequencer chromatogram files. It is a
free sequence file format conversion utility by GeneStudio, Inc.
SeqVerter encapsulates a small subset of the features offered by the GeneStudio Pro
suite of programs.
Advanced Sequence File Format Conversion:
Open sequences from multiple source files simultaneously.
View sequences,
Select a subset of sequences for conversion.
Merge sequences from different source files into one multiple sequence file.
Split sequences from multiple sequence files into individual (single) sequence files.
Trim ends of automatic sequencer-generated files.
Set your favorite default output format.
Enter file headers required by the GenBank sequence submission and update tool,
SequIn.
The Protein Data Bank (PDB) is a structure database that contains experimentally
determined three-dimensional structures of macromolecules. The experimental methods are
X-ray crystallography and NMR spectroscopy, and nowadays cryo-electron microscopy is also
used. The PDB is a key resource in areas of structural biology, such as structural genomics.
Most major scientific journals and some funding agencies now require scientists to submit
their structure data to the PDB. Many other databases use protein structures deposited in
the PDB. For example, SCOP and CATH classify protein structures, while PDBsum provides a
graphic overview of PDB entries using information from other sources, such as Gene
Ontology. The PDB provides access to 3D structure data for large biological molecules
(proteins, DNA, and RNA). These are the molecules of life, found in all organisms on the
planet.
Questions
Uday
Kiran