
Introduction to Bioinformatics - Notes


5/5/22, 8:42 PM Unit I - Introduction to Bioinformatics - Notes

Unit I - Introduction to Bioinformatics - Notes


BioInformatics

Unit I - Introduction to BioInformatics

Topics for Discussion


Introduction

Branches in BioInformatics

Aim and Scope of BioInformatics

Sequence File Formats

Sequence Conversion Tools

Molecular File Formats

Molecular File Format Conversion

Questions

Introduction
The word “bioinformatics” is a shortened form of “biological informatics”. The huge demand
for the analysis and interpretation of biological data is being managed by the evolving
science of bioinformatics. Bioinformatics is defined as the application of computational and
analytical tools to capture and interpret biological data.

Bioinformatics is often focused on obtaining biologically oriented data such as nucleic acid
(DNA/RNA) and protein sequences, structures, functions, pathways, and interactions;
organizing these data into databases; developing methods to get useful information from
these databases; and devising methods to integrate related data from disparate
sources. These computer databases and algorithms are developed to speed up and enhance
biological research.

Bioinformatics is defined as the Application of Tools of Computation and Analysis to the
Capture and Interpretation of Biological Data.

Bioinformatics can be understood as the Study of Biological Information through
Computational Science and Tools. It could also be considered a “Biology as a Data
Science” Course.

Bioinformatics is an interdisciplinary field mainly involving Molecular Biology and
Genetics, Computer Science, Mathematics, and Statistics.

Data Intensive, Large-Scale Biological Problems are addressed from a Computational Point
of View.

The Most Common Problems are Modelling Biological Processes at The Molecular Level And
Making Inferences From Collected Data.

A Bioinformatics Solution usually involves the following Steps:

Collect Statistics From Biological Data.
Build a Computational Model.
Solve a Computational Modelling Problem.
Test and Evaluate a Computational Algorithm.
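
The four steps above can be sketched on a toy problem. The example below is only an illustration under stated assumptions: the training sequences are made-up, the "model" is a simple table of base frequencies, and the function names (`collect_statistics`, `build_model`, `log_likelihood`) are hypothetical names chosen for this sketch.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def collect_statistics(sequences):
    """Step 1: count how often each base occurs in the training sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq.upper())
    return counts


def build_model(counts, pseudocount=1):
    """Step 2: turn counts into base probabilities (the pseudocount avoids zeros)."""
    total = sum(counts[base] for base in ALPHABET) + pseudocount * len(ALPHABET)
    return {base: (counts[base] + pseudocount) / total for base in ALPHABET}


def log_likelihood(model, seq):
    """Step 3: score a sequence under the model."""
    return sum(math.log(model[base]) for base in seq.upper())


# Hypothetical toy data, not real biological sequences.
training = ["ATGCGC", "GGCCGC", "GCGCAT"]
model = build_model(collect_statistics(training))

# Step 4: evaluate the model on sequences it was not trained on.
gc_rich_score = log_likelihood(model, "GCGC")
at_rich_score = log_likelihood(model, "ATAT")
print(gc_rich_score > at_rich_score)
```

Because the toy training data are rich in G and C, the GC-rich test sequence receives the higher score; real projects use far richer models, but the collect, build, solve, and evaluate loop is the same.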

Bioinformatics Work Involves Applying Computer Science Techniques To Biological
Problems, Especially Related To Sequence Analysis, Alignment and Assembly.

Bioinformaticians Are Needed To Perform Tasks Such As:

Modelling: Estimation of Protein Structures and Simulation of Molecular Interactions.


Data Processing: Processing and Analyzing Sequencing Data, For Example, From Next-
generation Sequencing or Single-cell Sequencing.
Virtual Screening: Discovery of Leads (potential New Drugs) using Computational
Methods.
Data Science: Analysis and Interpretation of Data.

A Few Biological Terms to be Familiar With:

DNA (DeoxyriboNucleic Acid): It is the hereditary material in humans and almost all
other organisms. Nearly every cell in a person’s body has the same DNA.
RNA (RiboNucleic Acid): It is a molecule similar to DNA. RNA is usually single-stranded. An
RNA strand has a backbone made of alternating sugar (ribose) and phosphate groups.
Gene: A gene is the basic physical and functional unit of heredity. Genes are made up
of DNA.
Amino Acid: Amino acids are molecules that combine to form proteins. Amino acids
and proteins are the building blocks of life.

Deoxyribonucleic Acid (DNA)

DNA, the carrier of hereditary information, is written in an alphabet of only four letters
(bases): A, T, G, and C.

The human genome contains roughly 20,000 protein-coding genes, distributed among the 23
pairs of chromosomes in a cell.

The genes are the recipes for proteins, the building blocks and workers in the body.

Different genes are active in different types of cells, e.g., a liver cell does not express the
same genes as a brain cell.

Some proteins are vital for the survival of a cell and their corresponding genes are therefore
active in all cell types and are known as “Housekeeping Genes”.

Gene Consists of Three Major Structures:

Gene Regulatory Segment
Exon
Intron

The gene regulatory segment contains structures involved in the initiation and
regulation of transcription.

Exons are the protein-coding parts of the gene.

Introns are the non-coding parts of the gene.
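
The exon/intron distinction can be illustrated with a short sketch that joins the exons of a gene and drops the introns. The gene string, the exon coordinates, and the function name `splice` are all made-up for illustration; coordinates are 0-based with an exclusive end, which is an assumption of this sketch rather than a convention taken from these notes.

```python
def splice(gene, exons):
    """Join the exon segments of a gene, leaving the introns out.

    `exons` is a list of (start, end) pairs, 0-based, end-exclusive.
    """
    return "".join(gene[start:end] for start, end in exons)


# Toy gene: exon (uppercase), intron (lowercase), exon (uppercase).
gene = "ATGGCC" + "gtaagt" + "TTTTAA"
exons = [(0, 6), (12, 18)]

mature = splice(gene, exons)
print(mature)
```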


The flow of information from the genes determines the protein composition and thereby the
functions of the cell.

DNA is situated in the nucleus of the cell, organized into chromosomes.

Every cell must contain genetic information, so the DNA is duplicated before a cell divides;
this process is known as Replication.

In eukaryotic cells, nuclear DNA does not leave the nucleus; instead, the genetic recipe (the
genes) is copied into RNA, which in turn is decoded (translated) into proteins in the
cytoplasm.
The DNA itself is not translated into proteins directly for several reasons:
Security: Using the DNA itself as the day-to-day template for protein synthesis would risk
damaging it, and the DNA has to stay intact to maintain life.
Regulation of the Rate of Protein Synthesis: The RNA intermediate lets the cell control how
fast each protein is produced.

Information Flow from DNA to Protein Through Transcription and Translation

The journey from gene to protein is complex and tightly controlled within each cell.
It Consists of Two Major Steps:

Transcription
Translation

Together, transcription and translation are known as gene expression.


During the process of transcription, the information stored in a gene's DNA is
passed to a similar molecule called RNA (ribonucleic acid) in the cell nucleus.
Both RNA and DNA are made up of a chain of building blocks called nucleotides, but they
have slightly different chemical properties.
The type of RNA that contains the information for making a protein is called messenger
RNA (mRNA) because it carries the information, or message, from the DNA out of the
nucleus into the cytoplasm.
Translation, the second step in getting from a gene to a protein, takes place in the
cytoplasm.
The mRNA interacts with a specialized complex called a ribosome, which "reads"
the sequence of mRNA nucleotides.
Each sequence of three nucleotides, called a codon, usually codes for one particular amino
acid.
Amino acids are the building blocks of proteins.
A type of RNA called transfer RNA (tRNA) assembles the protein, one amino acid at a time.
Protein assembly continues until the ribosome encounters a “stop” codon (a sequence of
three nucleotides that does not code for an amino acid).
The flow of information from DNA to RNA to proteins is one of the fundamental principles
of molecular biology. It is so important that it is sometimes called the “central dogma.”
DNA makes RNA makes protein.
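
The two steps can be sketched in a few lines of code. This is a simplified illustration, not a full implementation: transcription is modelled by rewriting the coding strand with U in place of T, and `CODON_TABLE` is a deliberately partial table (the full genetic code has 64 codons), so sequences containing other codons would raise a `KeyError`. The function names and the toy DNA string are assumptions of this sketch.

```python
# Partial codon table for illustration only; "*" marks a stop codon.
CODON_TABLE = {
    "AUG": "M", "UUU": "F", "GGC": "G", "GCU": "A", "AAA": "K",
    "UAA": "*", "UAG": "*", "UGA": "*",
}


def transcribe(coding_strand):
    """Transcription (simplified): the mRNA matches the coding strand with U for T."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna):
    """Translation: read codons three bases at a time until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGTTTGGCAAATAA"      # toy coding strand
mrna = transcribe(dna)       # AUG UUU GGC AAA UAA
protein = translate(mrna)
print(mrna, protein)
```

Here the codons AUG, UUU, GGC, and AAA are read as M, F, G, and K, and the stop codon UAA ends the protein without adding an amino acid.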

Amino Acids:
Amino acids are molecules that combine to form proteins. Amino acids and proteins are
the building blocks of life.
When proteins are digested or broken down, amino acids are left. The human body uses
amino acids to make proteins to help the body:

Break down food
Grow
Repair body tissue
Perform many other body functions

Amino acids can also be used as a source of energy by the body.

Amino acids are classified into three groups:

Essential Amino Acids
Nonessential Amino Acids
Conditional Amino Acids

Essential Amino Acids:


Essential amino acids cannot be made by the body. As a result, they must come from
food.
The 9 essential amino acids are: histidine, isoleucine, leucine, lysine, methionine,
phenylalanine, threonine, tryptophan, and valine.

Non-Essential Amino Acids:



Nonessential means that our bodies can produce the amino acid, even if we do not get it
from the food we eat.
Nonessential amino acids include: alanine, arginine, asparagine, aspartic acid, cysteine,
glutamic acid, glutamine, glycine, proline, serine, and tyrosine.

Conditional Amino Acids:

Conditional amino acids are usually not essential, except in times of illness and stress.
Conditional amino acids include: arginine, cysteine, glutamine, tyrosine, glycine,
ornithine, proline, and serine.
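
The three groups can be written down as simple lookup sets. The sketch below uses exactly the lists given above; the function name `classify` is a hypothetical helper. Note that, following those lists, some amino acids (for example arginine) appear in both the nonessential and the conditional group.

```python
ESSENTIAL = {
    "histidine", "isoleucine", "leucine", "lysine", "methionine",
    "phenylalanine", "threonine", "tryptophan", "valine",
}
NONESSENTIAL = {
    "alanine", "arginine", "asparagine", "aspartic acid", "cysteine",
    "glutamic acid", "glutamine", "glycine", "proline", "serine", "tyrosine",
}
CONDITIONAL = {
    "arginine", "cysteine", "glutamine", "tyrosine", "glycine",
    "ornithine", "proline", "serine",
}


def classify(name):
    """Return every group (per the lists above) that an amino acid belongs to."""
    key = name.strip().lower()
    groups = []
    if key in ESSENTIAL:
        groups.append("essential")
    if key in NONESSENTIAL:
        groups.append("nonessential")
    if key in CONDITIONAL:
        groups.append("conditional")
    return groups


print(classify("Lysine"))
print(classify("Arginine"))
```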

Branches in BioInformatics
A living cell is a system where cellular components such as the genome, the gene transcripts,
and the proteins interact with each other, and these interactions determine the fate of the
cell, e.g., whether a stem cell is going to become a liver cell or a cancer cell. The
characterization of these three types of components and the associated development of
analytical methods led to the establishment of the three closely related branches of
bioinformatics: genomics, transcriptomics, and proteomics.

Genomics:
Genomics is the study of all of a person's genes (the genome), including interactions of
those genes with each other and with the person's environment.
It involves mapping the nucleotide sequences of all the chromosomes of an organism,
thereby determining the locations of the different genes and their sequences.
This involves extensive analysis of the nucleic acids through molecular biology techniques
before the data are ready for processing by computers.
It is a science that attempts to describe a living organism in terms of the sequence of its
genome (its constituent genetic material).
Genomics uses the techniques of molecular biology and bioinformatics to identify cellular
components such as proteins, rRNA, tRNA, etc., and analyse the sequences attributed to
the structural genes, regulatory sequences, and even non-coding sequences.
Genomics is closely related to, and sometimes considered a branch of, genetics, the study of
genes and heredity.
The first automatic DNA sequencer was developed in 1986 by Leroy Hood. This paved the
way for the official beginning of the Human Genome Project (HGP) in 1990, which gave a
boost to genomics.
A large number of bacterial genomes have already been fully sequenced and put in the
public domain.
In 1995, Haemophilus influenzae became the first bacterium to have its genome sequenced.
The sequencing of bacterial genomes was followed by the first sequenced eukaryotic
organism, the unicellular genetic model system Saccharomyces cerevisiae (commonly known
as baker’s yeast).
In December 1998, the first multicellular organism was added to the list, the nematode
Caenorhabditis elegans, which is now considered a model organism that provides
information about unique functions in organisms of greater complexity.
The sum of all this information is enormous, and its potential for our understanding of life
processes can be explored with the help of genomics, a field sometimes treated as almost
synonymous with bioinformatics.

Transcriptomics:
It is the study of the transcriptome - the complete set of RNA transcripts that are produced
by the genome, under specific circumstances or in a specific cell - using high-throughput
methods, such as microarray analysis.
Transcriptomics is the study of the transcriptome, which includes the whole set of mRNA
molecules (or transcripts) in one or a population of biological cells for a given set of
environmental circumstances.
This study helps us to depict the expression levels of genes, often using techniques such as
DNA microarrays, which are capable of sampling tens of thousands of different mRNAs at a
time.
Limitation of Transcriptomics:

The relative abundance of transcripts as characterized by serial analysis of
gene expression (SAGE) or microarray experiments is not always a good predictor of the
relative abundance of proteins.

Proteomics:
It is the systematic, large-scale analysis of proteins. It is based on the concept of the
proteome as a complete set of proteins produced by a given cell or organism under a
defined set of conditions.
Proteomics represents the earliest attempt to identify a major sub-class of cellular
components - the proteins - and their interactions.
Proteomics involves the sequencing of amino acids in a protein, determining its 3D
structure and relating it to the function of the protein.
Metabolic proteins such as haemoglobin and insulin have been subjected to intensive
proteomic investigation.
The term ‘proteomics’ was coined to make an analogy with genomics, and while it is often
viewed as the ‘next step’, proteomics is much more complicated than genomics.
A single organism has radically different protein expressions in different parts of its body,
in different stages of its life cycle and in different environmental conditions.
The complete set of proteins existing in an organism throughout its life cycle or, on a
smaller scale, the set of proteins found in a particular cell type under a particular type of
stimulation, is referred to as the proteome of the organism or cell type, respectively.

Aim and Scope of BioInformatics


The aim of bioinformatics is fourfold and includes:

Data Acquisition
Tool and Database Development
Data Analysis
Data Integration

Data Acquisition:

Data acquisition is primarily concerned with accessing and storing data generated
directly from the biological experiments.
The data generated by various sequencing projects have to be retrieved in the
appropriate format, and be capable of being linked to all the information related to the
DNA samples, such as the species, tissue type, and quality parameters used in the
experiments.
The data are organized in different databases so that the researchers can access
existing information and submit new entries as and when they are produced.
Examples of such databases are the Entrez Genome of NCBI (for genome data) and the
Protein Data Bank (for 3D macromolecular structure data).

Tool and Database Development:

Many laboratories generate large volumes of data such as DNA sequences, gene
expression information, 3D molecular structure, and high-throughput screening.
Consequently, they must develop effective databases for storing and quickly accessing
data.
The other aim is to develop tools and resources that aid in the analysis of data.
For example, having sequenced a particular protein, it is of interest to compare it with
previously characterized sequences.
Programs such as FASTA and PSI-BLAST must consider what constitutes a biologically
significant match.

Data Analysis:

The third aim is to use these tools to analyze the data and interpret the results in a
biologically meaningful manner.
Traditionally, biological studies examined individual systems in detail, and compared
those with a few related systems.
In bioinformatics, we can now conduct a global analysis of all the available data with
the aim of unveiling common principles that apply across many systems and highlighting
novel features.
Efficient analysis requires an efficiently designed database.
It must allow researchers to place their query effectively and provide them with all the
information they need to begin their data analysis.

Data Integration:

Once information has been analyzed, a researcher must often associate or integrate it
with the related data from other databases.
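For example, a scientist may run a series of gene expression analysis experiments and
observe that a particular set of 100 genes is more highly expressed in cancerous lung
tissue than in normal lung tissue. To interpret this result, the scientist would then
relate those genes to entries in other databases, such as sequence, structure, or pathway
annotations.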
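
A minimal sketch of such an integration step is shown below. Everything in it is hypothetical: the gene identifiers, the annotation table standing in for "another database", and the function name `annotate` are placeholders chosen for illustration, not real data or a real database API.

```python
# Hypothetical result of an expression experiment: genes that are more highly
# expressed in the cancerous tissue (placeholder identifiers).
upregulated = {"GENE_A", "GENE_B", "GENE_C"}

# Hypothetical annotation table standing in for a second database.
pathway_db = {
    "GENE_A": "cell cycle",
    "GENE_C": "apoptosis",
    "GENE_D": "metabolism",
}


def annotate(genes, annotation_db):
    """Attach an annotation to every gene, marking the ones the database lacks."""
    return {gene: annotation_db.get(gene, "unannotated") for gene in sorted(genes)}


annotated = annotate(upregulated, pathway_db)
for gene, pathway in annotated.items():
    print(gene, pathway)
```

In practice the lookup would be a query against one or more external databases, but the idea is the same: the experimental result becomes interpretable only once it is joined with existing annotation.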

Sequence File Formats


The biological data stored in databases are broadly represented either as sequences or as
molecular coordinates.
Each Database has its Own File Format for Storing Data.

File Formats are Categorized as:

Sequence File Format
Molecular File Format

Sequence File Format:

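A sequence file is a flat (plain-text) file that stores one or more nucleotide or protein
sequences together with their identifiers and, in richer formats, their annotations.
It should not be confused with the Hadoop SequenceFile, a flat file of binary key/value
pairs that is used as a MapReduce input/output format and for storing the temporary
outputs of maps; that format is unrelated to biological sequence data.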

GenBank Flat File Format:

GenBank flatfile (GBF) format is one of the most popular sequence file formats because
of its detailed sequence features and ease of readability.
For a computer to use the data in the file, the file must be parsed according to the
grammar that defines the sequence and the description in a GBF.
Each GenBank entry includes a concise description of the sequence, the scientific name and
taxonomy of the source organism, and a table of features that identifies the coding regions
and other sites of biological significance (such as transcription units, sites of mutations
or modifications, or repeats).
GenBank Flat File Format has Three Sections:
Header
Features
Sequence
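
The three sections can be separated with a very small parser. The sketch below assumes the usual GenBank keywords: the header starts at `LOCUS`, the feature table starts at `FEATURES`, the sequence follows `ORIGIN`, and `//` ends the entry. The record itself is made-up for illustration, and `split_genbank` is a hypothetical helper, not part of any library.

```python
# Made-up miniature record, for illustration only.
RECORD = """\
LOCUS       DEMO0001                  12 bp    DNA     linear   UNK 01-JAN-2000
DEFINITION  Made-up record for illustration.
FEATURES             Location/Qualifiers
     source          1..12
ORIGIN
        1 atggcctttt aa
//
"""


def split_genbank(text):
    """Split one GenBank entry into header lines, feature lines, and the sequence."""
    header, features, seq_lines = [], [], []
    section = header
    for line in text.splitlines():
        if line.startswith("//"):          # end of the entry
            break
        if line.startswith("FEATURES"):    # feature table begins
            section = features
        elif line.startswith("ORIGIN"):    # sequence begins on the next line
            section = seq_lines
            continue
        section.append(line)
    # Sequence lines carry position numbers and spaces; keep only the letters.
    sequence = "".join(ch for ch in "".join(seq_lines) if ch.isalpha()).upper()
    return header, features, sequence


header, features, sequence = split_genbank(RECORD)
print(len(header), len(features), sequence)
```

A production parser would also interpret the feature table; this sketch only shows where one section ends and the next begins.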


FASTA Format:

FASTA format is a text-based format for representing either nucleotide sequences or
peptide sequences, in which nucleotides or amino acids are represented using
single-letter codes.
A sequence in FASTA format begins with a single-line description, followed by lines of
sequence data.
The description line is distinguished from the sequence data by a greater-than (">")
symbol in the first column.
It is recommended that all lines of text be shorter than 80 characters in length.
The description line consists of the ">" sign, followed by a sequence identification code,
optionally followed by a textual description of the sequence.
For Example:
>gi|129295|sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED)
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSF
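
Reading such a record takes only a few lines. The sketch below parses the example above; the function name `parse_fasta_record` is a hypothetical helper, and the split of the description line at the first space is an assumption that follows the layout described in this section.

```python
FASTA_TEXT = """\
>gi|129295|sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED)
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSF
"""


def parse_fasta_record(text):
    """Return (identifier, description, sequence) for a single FASTA record."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("a FASTA record must start with a '>' description line")
    # The identifier runs up to the first space; the rest is free-text description.
    identifier, _, description = lines[0][1:].partition(" ")
    sequence = "".join(lines[1:])
    return identifier, description, sequence


identifier, description, sequence = parse_fasta_record(FASTA_TEXT)
print(identifier)
print(description)
print(len(sequence))
```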

Multi-FASTA Format:

A multi-FASTA file is a text file containing several DNA sequences in FASTA format. Every
FASTA entry has 2 fundamental blocks.

The first one is a single text line starting with a '>' character followed by a sequence
description. The second block is the sequence and may span several lines.
For Example:
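
As an illustrative sketch, a multi-FASTA file can be read record by record as shown below. The two records are made-up toy data, and the function name `parse_multi_fasta` is an assumption of this sketch.

```python
# Made-up toy records, for illustration only.
MULTI_FASTA = """\
>seq1 first toy sequence
ACGTACGT
ACGT
>seq2 second toy sequence
TTGGCCAA
"""


def parse_multi_fasta(text):
    """Return a list of (description, sequence) pairs, one per FASTA entry."""
    records = []
    description, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new description line closes the previous entry, if there is one.
            if description is not None:
                records.append((description, "".join(chunks)))
            description, chunks = line[1:], []
        elif description is not None:
            chunks.append(line)
    if description is not None:
        records.append((description, "".join(chunks)))
    return records


records = parse_multi_fasta(MULTI_FASTA)
for description, sequence in records:
    print(description, len(sequence))
```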


GCG-MSF Format:

We can combine multiple sequences in a single file, called a Multiple Sequence Format
(MSF) file. MSF files include not only the sequence name but also the sequence itself,
which is usually aligned with the other sequences in the file.
We can specify a single sequence within an MSF file, a subset of sequences, or all
sequences. Like other sequences, those in an MSF file can be used with other GCG
programs.
For Example:

EMBL Format:

The European Molecular Biology Laboratory (EMBL) file format stores a sequence and its
annotation together.

The start of the annotation section is marked by a line beginning with the word “ID”.

The start of the sequence section is marked by a line beginning with the word “SQ”.

The “//” (terminator) line contains no data or comments and designates the end of an
entry.

For Example:
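
As an illustrative sketch, the three markers described above (`ID`, `SQ`, and `//`) are enough to separate the annotation from the sequence. The record below is made-up for illustration and is not a real database entry; `parse_embl` is a hypothetical helper name.

```python
# Made-up miniature record, for illustration only.
EMBL_RECORD = """\
ID   DEMO0001; SV 1; linear; genomic DNA; STD; UNC; 12 BP.
DE   Made-up record for illustration.
SQ   Sequence 12 BP; 3 A; 2 C; 2 G; 5 T; 0 other;
     atggcctttt aa                                                            12
//
"""


def parse_embl(text):
    """Return (annotation_lines, sequence) for one EMBL-style entry."""
    annotation, seq_chars = [], []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):      # terminator: end of the entry
            break
        if line.startswith("SQ"):      # the sequence starts on the next line
            in_sequence = True
            continue
        if in_sequence:
            # Drop spaces and the running position count; keep the letters.
            seq_chars.extend(ch for ch in line if ch.isalpha())
        elif line.strip():
            annotation.append(line)
    return annotation, "".join(seq_chars).upper()


annotation, sequence = parse_embl(EMBL_RECORD)
print(annotation[0][:13], sequence)
```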


Clustal Format:

A Clustal-formatted file is a plain text file. It can optionally have a header, which
states the Clustal version number.
This is followed by the multiple sequence alignment, and optional information about the
degree of conservation at each position in the alignment.
Each sequence in the alignment is divided into subsequences, each at most 60
characters long.
The sequence identifier for each sequence precedes each subsequence.
Each subsequence can optionally be followed by the cumulative number of non-gap
characters up to that point in the full sequence.
ClustalW is a widely used system for aligning any number of homologous nucleotide or
protein sequences.
For multi-sequence alignments, ClustalW uses progressive alignment methods. In
these, the most similar sequences, that is, those with the best alignment score are
aligned first.
Then progressively more distant groups of sequences are aligned until a global
alignment is obtained.
This heuristic approach is necessary because finding the global optimal solution is
prohibitive in both memory and time requirements.
ClustalW performs very well in practice. The algorithm starts by computing a rough
distance matrix between each pair of sequences based on pairwise sequence alignment
scores.
These scores are computed using the pairwise alignment parameters for DNA and
protein sequences.
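
The first step of a progressive alignment, building a rough distance matrix and picking the most similar pair, can be sketched as follows. This is not ClustalW's actual scoring: as a stand-in for pairwise alignment scores, the sketch uses the fraction of differing positions between equal-length toy sequences. The sequences and the function names are assumptions made for illustration.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("this sketch only handles sequences of equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def closest_pair(seqs):
    """Compute all pairwise distances and return the closest pair of names."""
    distances = {
        (name1, name2): p_distance(seqs[name1], seqs[name2])
        for name1, name2 in combinations(sorted(seqs), 2)
    }
    return min(distances, key=distances.get), distances


# Made-up toy sequences.
seqs = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "TTGTACCA"}
pair, distances = closest_pair(seqs)
print(pair, distances[pair])
```

In a progressive method the closest pair (here `s1` and `s2`) would be aligned first, and the more distant sequence would be added afterwards.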

Phylip Format:

PHYLIP format is a plain text format containing exactly two sections: a header
describing the dimensions of the alignment, followed by the multiple sequence
alignment itself.
PHYLIP requires that each sequence identifier is exactly 10 characters long.

The header consists of a single line describing the dimensions of the alignment. It must
be the first line in the file.
The header consists of optional spaces, followed by two positive integers (n and m)
separated by one or more spaces.
The first integer (n) specifies the number of sequences (i.e., the number of rows) in the
alignment.
The second integer (m) specifies the length of the sequences (i.e., the number of
columns) in the alignment.
The smallest supported alignment dimensions are 1*1.
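
The header and the fixed-width identifiers can be checked with a short parser. The sketch below follows the rules stated above (a header with n and m, then identifiers occupying exactly 10 characters) and assumes one sequence per line; it does not handle interleaved files. The alignment is made-up, and `parse_phylip` is a hypothetical helper name.

```python
def parse_phylip(text):
    """Parse a simple PHYLIP alignment with one sequence per line.

    Returns (n, m, records), where records maps each identifier to its sequence.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    n, m = (int(token) for token in lines[0].split())
    records = {}
    for line in lines[1:n + 1]:
        name = line[:10].strip()            # identifier: first 10 characters
        seq = line[10:].replace(" ", "")    # the rest of the line is sequence
        if len(seq) != m:
            raise ValueError(f"{name}: expected {m} columns, found {len(seq)}")
        records[name] = seq
    if len(records) != n:
        raise ValueError(f"expected {n} sequences, found {len(records)}")
    return n, m, records


# Made-up toy alignment; ljust(10) pads each identifier to exactly 10 characters.
phylip_text = "\n".join([
    " 2 8",
    "seq_one".ljust(10) + "ACGTACGT",
    "seq_two".ljust(10) + "ACGTACGA",
])

n, m, records = parse_phylip(phylip_text)
print(n, m, records)
```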

Nexus Format:

NEXUS is the file format used by many popular programs like GDA, PAUP*, Mesquite,
ModelTest, MrBayes, and MacClade. NEXUS file names often have a .nxs or .nex extension.

The NEXUS format conveys data organized according to the character state data model, in
which the features of operational taxonomic units (OTUs) (e.g., species, individuals, genes,
genomes, etc.) are observable states of underlying homologous characters.

For instance, in a protein sequence alignment, proteins are the OTUs, alignment columns
are characters, and amino acids (or gaps) are states.

In evolutionary analysis, it is typical to consider differences as the result of state
transitions that take place on branches of a tree; therefore, the NEXUS file provides a
means to represent a tree (in the standard Newick (a.k.a. New Hampshire) format).

The syntactic structure of a NEXUS file is as follows:


Each of the pre-defined types of public blocks may appear only once. The TAXA block is the
only necessary block.
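
A minimal sketch of the block structure is given below, assuming the commonly used NEXUS conventions: a `#NEXUS` first line, followed by blocks of the form `BEGIN <name>; ... END;`. The function `to_nexus`, the taxon names, and the sequences are made-up for illustration, and only a TAXA block and a CHARACTERS block are written.

```python
def to_nexus(alignment, datatype="DNA"):
    """Write an alignment (name -> sequence) as a minimal NEXUS file."""
    names = list(alignment)
    nchar = len(alignment[names[0]])
    if any(len(seq) != nchar for seq in alignment.values()):
        raise ValueError("all sequences must have the same length")
    width = max(len(name) for name in names) + 2
    lines = [
        "#NEXUS",
        "",
        "BEGIN TAXA;",
        f"    DIMENSIONS NTAX={len(names)};",
        "    TAXLABELS " + " ".join(names) + ";",
        "END;",
        "",
        "BEGIN CHARACTERS;",
        f"    DIMENSIONS NCHAR={nchar};",
        f"    FORMAT DATATYPE={datatype} MISSING=? GAP=-;",
        "    MATRIX",
    ]
    # One row per OTU: the taxon name, padding, then its character states.
    lines += [f"    {name.ljust(width)}{alignment[name]}" for name in names]
    lines += ["    ;", "END;"]
    return "\n".join(lines) + "\n"


nexus_text = to_nexus({"taxon_A": "ACGT", "taxon_B": "ACGA"})
print(nexus_text)
```

In this output the taxa are the OTUs, the four alignment columns are the characters, and the letters in the matrix are the states.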

Sequence Conversion Tools


GCG:

ReadSeq:

SeqVerter:

SeqVerter can help you to view automatic DNA sequencer chromatogram files. It is a
free sequence file format conversion utility by GeneStudio, Inc.
SeqVerter encapsulates a small subset of the features offered by the GeneStudio Pro
suite of programs.
Advanced Sequence File Format Conversion:
Open sequences from multiple source files simultaneously.
View sequences.
Select a subset of sequences for conversion.
Merge sequences from different source files into one multiple sequence file.
Split sequences from multiple sequence files into individual (single) sequence files.
Trim ends of automatic sequencer-generated files.
Set your favorite default output format.
Enter file headers required by the GenBank sequence submission and update tool,
SequIn.
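
What a sequence format conversion does can be illustrated with a small FASTA-to-PHYLIP converter. This is a generic sketch and not how SeqVerter or any other tool listed here is implemented; the function name `fasta_to_phylip` and the toy input are assumptions. It splits the input on ">" for brevity, so it would misread a description line that itself contains ">".

```python
def fasta_to_phylip(fasta_text):
    """Convert aligned FASTA records into a simple one-line-per-sequence PHYLIP file."""
    records = []
    for block in fasta_text.split(">")[1:]:
        lines = block.strip().splitlines()
        name = lines[0].split()[0]                       # identifier only
        seq = "".join(line.strip() for line in lines[1:])
        records.append((name, seq))
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("PHYLIP needs at least one sequence, all of equal length")
    out = [f" {len(records)} {lengths.pop()}"]
    for name, seq in records:
        # PHYLIP identifiers occupy exactly 10 characters: truncate, then pad.
        out.append(name[:10].ljust(10) + seq)
    return "\n".join(out) + "\n"


# Made-up toy input.
phylip_out = fasta_to_phylip(">a first\nACGT\n>b second\nAC\nGA\n")
print(phylip_out)
```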

Molecular File Formats


The 3D Structures of Proteins Obtained from X-Ray Crystallography and NMR Methods are
represented by their atomic or molecular coordinates.

Some File Formats are:

Protein Data Bank.
Tripos Alchemy and SYBYL Mol2 Format.
MacroMolecular Crystallographic Information File (mmCIF).

Protein Data Bank:

The PDB is a structure database that contains the experimentally determined
three-dimensional structures of macromolecules. The experimental methods are X-ray
crystallography and NMR spectroscopy; nowadays cryo-electron microscopy is also
used. The PDB is a key resource in areas of structural biology, such as structural genomics.
Most major scientific journals and some funding agencies now require scientists to submit
their structure data to the PDB. Many other databases use protein structures deposited in the
PDB. For example, SCOP and CATH classify protein structures, while PDBsum provides a
graphic overview of PDB entries using information from other sources, such as Gene
Ontology. The PDB provides access to 3D structure data for large biological molecules
(proteins, DNA, and RNA). These are the molecules of life, found in all organisms on the planet.
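
In the PDB file format, atomic coordinates are stored in fixed-width `ATOM` records, so a field is read by its column position rather than by splitting on spaces. The sketch below reads a few fields from one such line using the column layout of the PDB format (for example, x, y, and z in columns 31-38, 39-46, and 47-54). The line and its values are illustrative, and `parse_atom_line` is a hypothetical helper name.

```python
# Illustrative ATOM record; the values are for demonstration only.
ATOM_LINE = (
    "ATOM      1  N   MET A   1      27.340  24.430   2.614"
    "  1.00  9.67           N"
)


def parse_atom_line(line):
    """Extract selected fixed-column fields from a PDB ATOM/HETATM record."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM or HETATM record")
    return {
        "serial": int(line[6:11]),         # columns 7-11: atom serial number
        "name": line[12:16].strip(),       # columns 13-16: atom name
        "res_name": line[17:20].strip(),   # columns 18-20: residue name
        "chain": line[21],                 # column 22: chain identifier
        "res_seq": int(line[22:26]),       # columns 23-26: residue number
        "x": float(line[30:38]),           # columns 31-38: x coordinate
        "y": float(line[38:46]),           # columns 39-46: y coordinate
        "z": float(line[46:54]),           # columns 47-54: z coordinate
    }


atom = parse_atom_line(ATOM_LINE)
print(atom)
```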


Molecular File Format Conversion


Pdb2cif: Converts a PDB File to an mmCIF File.
Cif2pdb: A Program to Convert mmCIF to Pseudo-PDB Format.
Babel: A popular program that is designed to inter-convert a number of file formats
used in molecular modeling.
Mol2Mol: A popular molecular file conversion tool, which supports read and write
operations for a range of molecular file formats.

Questions
Uday

Kiran
