Bioinformatics Notes
What is Bioinformatics?
A marriage between Biology and Computers
The term bioinformatics was coined by Paulien Hogeweg, a Dutch theoretical biologist, in
conversations with her colleague Ben Hesper at the beginning of the 1970s.
Margaret Oakley Dayhoff has been called the “mother and father of bioinformatics” as she was
a pioneer in applying mathematics and computational methods to biochemistry.
• It is an interdisciplinary field that develops methods and software tools for
understanding biological data.
• Bioinformatics is the application of computer technology to the management of
biological information.
• Computers are used to gather, store, analyze and integrate biological and genetic
information which can then be applied to gene-based drug discovery and development.
• It includes biology, computer science, information engineering, mathematics and
statistics to analyze and interpret biological data.
• Bioinformatics analyses are described as in silico ("in silicon", alluding to the mass use
of silicon for computer chips), meaning "performed on a computer".
• The term parallels in vivo, in vitro, and in situ, which are commonly used in biology and
refer to experiments done in living organisms, outside living organisms, and where they are
found in nature, respectively.
History of bioinformatics
• 1951 – Pauling and Corey propose the structure for the alpha-helix and beta-sheet.
• 1953 – Watson & Crick propose the double helix model for DNA based on X-ray data
obtained by Franklin & Wilkins.
• 1955 – The sequence of the first protein to be analysed, bovine insulin, is announced by
F. Sanger.
• 1970 – The Needleman-Wunsch algorithm for sequence comparison is published.
• 1973 – The Brookhaven Protein Data Bank is announced.
• 1981 – The Smith-Waterman algorithm for sequence alignment is published.
• 1985 – The FASTP/FASTN algorithm is published.
• 1988 – National Center for Biotechnology Information (NCBI) is created at NIH/NLM.
• 1990 – The BLAST program (Altschul et al.) is implemented.
• 1990 – Human Genome Project initiated.
• 2003 – Human Genome Project completed (April 2003).
Chapter 2 Development and scope of bioinformatics
Scope of bioinformatics
1. Bioinformatics offers more opportunities than biotechnology alone because it is a
computer-related field and provides many career options, such as software developer,
biological data manager or analyst, drug developer, wet lab analyst, etc.
2. In India, relatively few people work in bioinformatics, so trained candidates may find a
job more quickly than biotechnology students because skilled people are in short supply.
3. The scope of a bioinformatics project (or lab) varies widely depending on the balance of
math/stats, computational/software/engineering, and biological/molecular training of
the researchers involved and the data or subject matter they deal with particularly DNA
sequences, gene expression, protein, epigenetics, or networks and systems biology.
Application of Bioinformatics
Medical
Agriculture
Evolutionary studies
Genome Sequencing
Drug discovery and drug development
Gene therapy- used to treat, cure or even prevent disease by changing the expression of a
person’s genes
Chapter 05- Operating systems
Operating Systems:
An operating system is an integrated set of programs that controls the resources (the
CPU, memory, I/O devices etc.) of a computer system and provides its users with an
interface or virtual machine that is more convenient to use than the bare machine.
An operating system is the interface between the user and the computer architecture.
An Operating System, or OS, is low-level software that enables a user and higher-level
application software to interact with a computer’s hardware and the data and other
programs stored on the computer.
An OS performs basic tasks, such as recognizing input from the keyboard, sending output
to the display screen, keeping track of files and directories on the disk, and controlling
peripheral devices such as printers. An Operating System, or OS, is a software program
that enables the computer hardware to communicate and operate with the computer
software.
Without a computer Operating System, a computer would be useless.
• Following are some of the important activities that an Operating System performs −
• Control over system performance − Recording delays between request for a service and
response from the system.
• Job accounting − Keeping track of time and resources used by various jobs and users.
• Error detecting aids − Production of dumps, traces, error messages, and other
debugging and error detecting aids.
Hardware – any physical device or equipment used in or with a computer system (anything you
can see and touch).
External hardware
External hardware devices (peripherals) – any hardware device that is located outside the
computer.
Input device – a piece of hardware device which is used to enter information to a
computer for processing.
Examples: keyboard, mouse, trackpad (or touchpad), touchscreen, joystick, microphone,
light pen, webcam, speech input, etc.
Output device – a piece of hardware that receives information from a computer.
Examples: monitor, printer, speaker, display screen (tablet, smartphone …),
projector, headphones, etc.
Internal hardware
Internal hardware devices (or internal hardware components) – any piece of hardware device that
is located inside the computer.
Examples: CPU, hard disk drive, ROM, RAM, etc.
Software:
Software – a set of instructions or programs that tells a computer what to do or how to
perform a specific task (computer software runs on hardware).
It includes computer programs and apps on your phone. Video games, photo editors, and
web browsers are just a few examples.
Main types of software – systems software and application software.
System software:
It operates directly on the hardware devices of the computer. It provides a platform to run an
application. It provides and supports user functionality. Examples of system software
include operating systems such as Windows, Linux, Unix, etc.
Application Software
Application software is designed to help users perform one or more tasks.
Examples of application software include Microsoft Word, Excel, PowerPoint, Oracle,
etc.
Chapter 8-9 Internet, www resources, FTP
Internet:
It is a global network of computer networks.
It comprises millions of computing devices that carry and transfer volumes of
information from one device to the other.
The Internet is a massive network of networks.
It connects millions of computers together globally, forming a network in which any
computer can communicate with any other computer as long as they are both connected
to the Internet.
Information that travels over the Internet does so via a variety of languages known as
protocols.
WWW:
The World Wide Web (WWW) is an Internet-based service that uses a common set of
rules, known as protocols, to distribute documents across the Internet in a standard way.
The World Wide Web, or simply the Web, is a massive collection of digital pages used to access
information over the Internet.
The Web uses the HTTP protocol to transmit data, and it allows applications to
communicate in order to exchange data and business logic.
The Web also uses browsers, such as Internet Explorer, Google Chrome, etc., to access
Web pages that are linked to each other via hyperlinks.
FTP:
File Transfer Protocol (FTP) is a standard protocol used to transfer files from
one host computer to another over a TCP-based network, such as the
Internet.
To use an FTP server, users authenticate themselves with a
username and password, but they can connect anonymously if the server is configured to allow
it.
It is an alternative to HTTP for downloading and uploading files.
Chapter 10- 12
Biological Databases and their classification; Primary databases:
Nucleotide sequence databases (GenBank, EMBL)
Database concept
• Data is a collection of facts, such as numbers, words, measurements, observations or just
descriptions of things.
• Data contents include gene sequences, textual descriptions, attributes and classifications,
citations, and tabular data.
• A database is an organized collection of data, generally stored and accessed electronically
from a computer system.
• Biological databases are libraries of life sciences information, collected from scientific
experiments, published literature, high-throughput experiment technology, and
computational analysis.
• Information contained in biological databases includes gene function, structure,
localization (both cellular and chromosomal), clinical effects of mutations as well as
similarities of biological sequences and structures.
Secondary databases
• A secondary database is also called a curated database.
• It performs quality control and sorting of information before making it accessible to the
public.
• Secondary databases contain information derived from primary databases.
• Secondary databases store information such as conserved sequences, active site
residues, and signature sequences.
• They are highly curated, often using a complex combination of computational
algorithms and manual analysis and interpretation to derive new knowledge from the
public record of science.
SWISS-PROT
• SWISS-PROT is a curated protein sequence database which strives to provide a high
level of annotation (such as the description of the function of a protein, its domains
structure, post-translational modifications, variants, etc.), a minimal level of redundancy
and high level of integration with other databases.
• SWISS-PROT is an annotated protein sequence database established in 1986 and
maintained collaboratively, since 1987, by the Department of Medical Biochemistry of
the University of Geneva and the EMBL Data Library.
• SWISS-PROT contains the information about the name and origin of the protein,
protein attributes, general information, sequence annotation, amino acid sequence,
references, cross-references with sequence, structure and interaction databases and entry
information.
• The core data consists of the sequences entered in common single letter amino acid
code, and the related references and bibliography.
• The taxonomy of the organism from which the sequence was obtained also forms part of
this core information.
TrEMBL(Translated EMBL)
• TrEMBL is a very large protein database in SwissProt format generated by computer
translation of the genetic information from the EMBL Nucleotide Sequence Database.
• Computer translation is not entirely perfect, so proteins predicted by the TrEMBL
database can be hypothetical, and many TrEMBL entries are poorly annotated.
• This is in contrast to SwissProt, whose entries are manually curated.
• TrEMBL has been combined with SwissProt and PIR in the UniProt
project.
Pfam
• Pfam is a database of protein families that includes their annotations and multiple
sequence alignments generated using hidden Markov models.
• Pfam 32.0, released in September 2018, contains 17,929
families.
• The general purpose of the Pfam database is to provide a complete and accurate
classification of protein families and domains.
• The Pfam website allows users to submit protein or DNA sequences to search for
matches to families in the database.
• Pfam (Protein families database of alignments and HMMs) is a large collection of
multiple sequence alignments and Hidden Markov Models covering many common
protein domains and families.
• For each family in Pfam it is possible to look at multiple alignments, to view
protein domain architectures, to examine species distribution, to follow links to other
databases, and to view known protein structures or domain organization of proteins.
Differences between primary and secondary databases
Chapter 17
Structure databases: Retrieving information from these databases.
2. Enter the query in the textbox provided by entering PDB ID, molecule name or author
name. Click on the search button (Fig. 6.2). When the protein name is provided in the text
box, results are displayed showing data like Molecule Name, PDB text, Structural
domains, Ontology Terms etc. If one clicks on any of the results, the particular information
about the molecule is obtained. If no information is obtained for the given query, advanced
search can be used. A query can be defined as a request that one uses to get information
from a database.
3. From the summary page click on PDB ID 7LYJ and download the macromolecular 3D
structure in PDB format (Fig. 6.3 and 6.4).
4. Open the structure file in any one of the visualization tools PyMOL, RasMol or Swiss-PDB
Viewer to visualize it. You will learn about these tools in exercise number 8 of this
course.
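The download in step 3 can also be scripted. The following is a minimal sketch, not part of the exercise itself. It assumes the RCSB file server pattern https://files.rcsb.org/download/<ID>.pdb; the function names pdb_file_url and download_pdb are illustrative. Some very large structures are not distributed in PDB format, so a download may fail for them.

```python
import urllib.request


def pdb_file_url(pdb_id):
    """Build the download URL for a 4-character PDB ID (assumed URL pattern)."""
    pdb_id = pdb_id.strip().upper()
    if len(pdb_id) != 4 or not pdb_id.isalnum():
        raise ValueError("a PDB ID is expected to have 4 alphanumeric characters")
    return f"https://files.rcsb.org/download/{pdb_id}.pdb"


def download_pdb(pdb_id, dest=None):
    """Download the structure file and return the local file name (needs network access)."""
    url = pdb_file_url(pdb_id)
    dest = dest or f"{pdb_id.strip().upper()}.pdb"
    urllib.request.urlretrieve(url, dest)
    return dest


if __name__ == "__main__":
    # Only prints the URL; call download_pdb("7LYJ") to fetch the file.
    print(pdb_file_url("7LYJ"))
```

The saved file can then be opened in a viewer as described in step 4.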
Chapter 20-21
Introduction to sequence alignment and its applications: Pair
wise and multiple sequence alignment
Sequence Alignment
• A sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to
identify regions of similarity that may be a consequence of functional, structural, or
evolutionary relationships between the sequences.
• Sequence alignment is a process of aligning two sequences to achieve maximum levels
of identity between them.
• This helps to derive functional, structural and evolutionary relationships between them.
• Aligning sequences helps to assign functions to unknown proteins, determine the
evolutionary relatedness of organisms and make predictions about 3D
structures.
Similarity search is necessary for:
Family assignment
Sequence annotation
Construction of phylogenetic trees
Learn about evolutionary relationships
Classify sequences
Identify functions
Homology Modeling
Pair wise alignment
Pair wise sequence alignment methods are concerned with finding the best-matching
piecewise local or global alignments of protein (amino acid) or DNA (nucleic acid)
sequences.
Pair wise alignments can only be used between two sequences at a time, but they are
efficient to calculate and are often used for methods that do not require extreme precision
(such as searching a database for sequences with high homology to a query).
The three primary methods of producing pair wise alignments are dot-matrix methods,
dynamic programming, and word methods; however, multiple sequence alignment
techniques can also align pairs of sequences.
The purpose of this is to find homologues (relatives) of a gene in a database of known
examples.
This information is useful for answering a variety of biological questions
The identification of sequences of unknown structure or function.
The study of molecular evolution.
Smith-Waterman algorithm
The concept of local alignment was introduced with the Smith-Waterman algorithm (1981).
It determines similar regions between two nucleotide or protein sequences.
This algorithm was designed to sensitively detect similarities in highly diverged
sequences.
The S-W Algorithm implements a technique called dynamic programming, which takes
alignments of any length, at any location, in any sequence, and determines whether an
optimal alignment can be found.
Based on these calculations, scores or weights are assigned to each character-to-character
comparison: positive for exact matches/substitutions, negative for insertions/deletions.
In weight matrices, scores are added together and the highest scoring alignment is
reported.
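The dynamic programming idea described above can be sketched in a few lines of Python. This is an illustrative sketch only: the scoring values (match = 2, mismatch = -1, gap = -2) are assumptions chosen for the example, a simple linear gap penalty is used, and real tools use substitution matrices such as BLOSUM or PAM.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score of a local alignment ending at a[i-1] and b[j-1]
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            score[i][j] = max(0, diag, up, left)  # 0 lets an alignment start anywhere
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest-scoring cell until a cell with score 0 is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("AAAGATTACACCC", "TTGATTACATT"))
```

With these parameters the shared region GATTACA is reported with a score of 14 (seven matches at 2 points each), even though the flanking regions of the two sequences differ.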
Chapter 26-27
Tools of MSA: ClustalW, TCoffee; Use of these tools for MSA of DNA and
protein sequences. Save output file in phylip format.
ClustalW
• ClustalW like the other Clustal tools is used for aligning multiple nucleotide or protein
sequences in an efficient manner.
• Clustal W was introduced by Julie D. Thompson and Toby Gibson of EMBL, EBI.
• Most closely related sequences are aligned first, and then additional sequences and
groups of sequences are added, guided by the initial alignments.
• It uses progressive alignment methods, which align the most similar sequences first and
work their way down to the least similar sequences until a global alignment is created.
• Aligns the sequences sequentially, guided by the phylogenetic relationships indicated by
the tree.
• Gap penalties can be adjusted based on specific amino acid residues, regions of
hydrophobicity, proximity to other gaps, or secondary structure.
• ClustalW is faster than T-Coffee, but T-Coffee is more accurate, especially when
sequences share less than 30% identity.
ClustalW algorithm
• Calculate all possible pairwise alignments; record the score for each pair.
• Calculate a guide tree based on the pairwise distances (algorithm: Neighbor Joining).
• Find the two most closely related sequences
Consensus Symbols:
"*" means that the residues or nucleotides in that column are identical in all sequences, in the
alignment.
":" means that conserved substitutions have been observed, according to the COLOUR table
below.
"." means that semi-conserved substitutions are observed, i.e., amino acids having similar shape.
Conserved means the amino acid is replaced by one having similar characteristics.
TCoffee
What is Phylogeny?
• Phylogeny is the representation of the evolutionary history and relationships between groups of
organisms.
• This results in a phylogenetic tree that provides a visual output of relationships based on shared or
divergent physical and genetic characteristics.
• Biologists estimate that there are about 5 to 100 million species of organisms living on Earth
today.
• Evidence from morphological, biochemical, and gene sequence data suggests that all organisms
on Earth are genetically related, and the genealogical relationships of living things can be
represented by a vast evolutionary tree, the Tree of Life.
• The Tree of Life then represents the phylogeny of organisms, i.e., the history of organism
lineages as they change through time.
• It implies that different species arise from previous forms via descent, and that all organisms,
from the smallest microbe to the largest plants and vertebrates, are connected by the passage of
genes along the branches of the phylogenetic tree that links all of Life.
Terminology:
Node: a node represents a taxonomic unit. This can be a taxon (an existing species) or an ancestor
(unknown species: represents the ancestor of 2 or more species).
Branch: defines the relationship between the taxa in terms of descent and ancestry.
Topology: is the branching pattern.
Branch length: often represents the number of changes that have occurred in that branch.
Root: is the common ancestor of all taxa.
Distance scale : scale which represents the number of differences between sequences (e.g. 0.1
means 10 % differences between two sequences)
Methods of phylogenetic analysis
• There are two major groups of analyses to examine phylogenetic relationships between sequences
• Phenetic methods : trees are calculated by similarities of sequences and are based on distance
methods.
• The resulting tree is called a dendrogram.
• Distance methods compress all of the individual differences between pairs of sequences into a
single number.
• Cladistic methods: trees are calculated by considering the various possible pathways of
evolution and are based on parsimony or likelihood methods.
• The resulting tree is called a cladogram.
• Cladistic methods use each alignment position as evolutionary information to build a tree.
Phenetic methods based on distances:
Starting from an alignment, pairwise distances are calculated between DNA sequences as the
sum of all base pair differences between two sequences (the most similar sequences are assumed
to be closely related). This creates a distance matrix.
All base changes can be considered equally or a matrix of the possible replacements can be used.
Insertions and deletions are given a larger weight than replacements. Insertions or deletions of
multiple bases at one position are given less weight than multiple independent insertions or
deletions.
It is possible to correct for multiple substitutions at a single site.
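A distance matrix of this kind can be sketched as follows. The sketch makes several simplifying assumptions that are not stated in these notes: distances are expressed as the proportion of differing sites (p-distance) rather than a raw sum, columns containing a gap are skipped instead of being weighted, and the Jukes-Cantor formula d = -3/4 ln(1 - 4p/3) is used as one common correction for multiple substitutions.

```python
import math


def p_distance(s1, s2):
    """Proportion of differing sites between two aligned sequences (gap columns skipped)."""
    if len(s1) != len(s2):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(s1, s2) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance; only defined for p < 0.75."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def distance_matrix(seqs, correct=False):
    """Symmetric matrix of pairwise distances for a list of aligned sequences."""
    n = len(seqs)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            p = p_distance(seqs[i], seqs[j])
            d[i][j] = d[j][i] = jukes_cantor(p) if correct else p
    return d


if __name__ == "__main__":
    for row in distance_matrix(["ACGT", "ACGA", "TCGA"]):
        print(row)
```

The corrected distance is always at least as large as the observed proportion, because some substitutions are hidden by later changes at the same site.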
From the obtained distance matrix, a phylogenetic tree is calculated with clustering algorithms.
These cluster methods construct a tree by linking the least distant pair of taxa, followed by
successively more distant taxa.
UPGMA clustering (Unweighted Pair Group Method using Arithmetic averages) : this is the
simplest method
Neighbor joining: this method tries to correct the UPGMA method for its assumption that the
rate of evolution is the same in all taxa.
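The clustering step can be illustrated with a small UPGMA sketch. This is an illustration under stated assumptions, not a replacement for established software: the tree is returned as nested tuples of the form (left, right, height), where height is half the distance at which the two clusters were joined, and the function name upgma is chosen for this example.

```python
def upgma(names, dist):
    """Build a UPGMA tree from taxon names and a symmetric distance matrix."""
    # Each cluster id maps to (subtree, number of taxa in the cluster).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # Distances are stored once per pair, keyed as (smaller id, larger id).
    d = {(i, j): dist[i][j]
         for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        # Link the least distant pair of clusters.
        (i, j), dij = min(d.items(), key=lambda kv: kv[1])
        (tree_i, n_i), (tree_j, n_j) = clusters.pop(i), clusters.pop(j)
        new = {}
        for k in clusters:
            dik = d[(min(i, k), max(i, k))]
            djk = d[(min(j, k), max(j, k))]
            # Average weighted by cluster size: every original taxon counts equally.
            new[(k, next_id)] = (n_i * dik + n_j * djk) / (n_i + n_j)
        d = {pair: v for pair, v in d.items() if i not in pair and j not in pair}
        d.update(new)
        clusters[next_id] = ((tree_i, tree_j, dij / 2), n_i + n_j)
        next_id += 1
    return next(iter(clusters.values()))[0]


if __name__ == "__main__":
    names = ["A", "B", "C"]
    dist = [[0, 2, 6],
            [2, 0, 6],
            [6, 6, 0]]
    print(upgma(names, dist))
```

In this example A and B are joined first at height 1.0, and C joins the (A, B) cluster at height 3.0. Because UPGMA places all tips at the same distance from the root, it implicitly assumes a constant rate of evolution, which is the assumption neighbor joining relaxes.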
Maximum parsimony
• The maximum parsimony method minimizes the number of changes on a phylogenetic tree by
assigning character states to interior nodes on the tree.
• The character (or site) length is the minimum number of changes required for that site, whereas
the tree score is the sum of character lengths over all sites.
• Some sites are not useful for tree comparison by parsimony.
• For example, constant sites, for which the same nucleotide occurs in all species, have a character
length of zero on any tree.
• Singleton sites, at which only one of the species has a distinct nucleotide, whereas all others are
the same, can also be ignored, as the character length is always one.
• The parsimony-informative sites are those at which at least two distinct characters are observed,
each at least twice.
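The site categories described above can be checked mechanically. The sketch below classifies each column of an alignment; the label "other" (a variable site that is neither a singleton nor parsimony-informative) and the function names are additions made for this example.

```python
from collections import Counter


def classify_site(column):
    """Classify one alignment column for parsimony analysis."""
    counts = Counter(column)
    if len(counts) == 1:
        return "constant"
    # Informative: at least two distinct characters, each observed at least twice.
    if sum(1 for c in counts.values() if c >= 2) >= 2:
        return "informative"
    # Singleton: exactly one sequence differs, all others are the same.
    if len(counts) == 2 and min(counts.values()) == 1:
        return "singleton"
    return "other"


def classify_alignment(seqs):
    """Classify every column of a list of equal-length aligned sequences."""
    return [classify_site(col) for col in zip(*seqs)]


if __name__ == "__main__":
    alignment = ["AAGAT",
                 "AAGCT",
                 "ACAGT",
                 "ACTTA"]
    print(classify_alignment(alignment))
```

For this four-sequence alignment only the second column (A, A, C, C) is parsimony-informative; the first is constant and the last is a singleton.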
Phylogenetic tree construction software
Phylogeny.fr - is a simple to use web service dedicated to reconstructing and analysing
phylogenetic relationships between molecular sequences. It includes multiple alignment
(MUSCLE, T-Coffee, ClustalW, ProbCons), tree viewer (Drawgram, Drawtree) and utility
programs (e.g. Gblocks to eliminate poorly aligned positions and divergent regions). It runs and
connects various bioinformatics programs to reconstruct a robust phylogenetic tree from a set of
sequences.
MEGA - is a bioinformatics tool used for the analysis of molecular sequences to measure
evolutionary distance for the construction of phylogenies.
Clustal Omega - is a multiple sequence alignment program for aligning three or more sequences
together in a computationally efficient and accurate manner. It produces biologically meaningful
multiple sequence alignments of divergent sequences. Evolutionary relationships can be seen by
viewing cladograms or phylograms.
Blast-explorer - helps you build datasets for phylogenetic analysis.
• Phylogenetics has many applications in medical and biological fields, including:
• Forensic science
• Conservation biology
• Epidemiology
• Drug discovery and drug design
• Prediction of protein structure and function, and gene function prediction
• Studying the relationship between genomes of different species
• Gene finding (gene prediction), which means locating specific genetic regions along a genome
• Identifying closely related members of a species with pharmacological significance
• Identifying and classifying various microorganisms, including bacteria
Chapter 31-32
Introduction to BLAST and FASTA. Different BLAST Programmes: their
application in terms of nucleic acid and protein sequence. Significance of E
Value.
Introduction to BLAST
• BLAST is one of the most widely used bioinformatics programs for sequence searching.
• The algorithm it uses is much faster than other approaches, such as calculating an optimal
alignment.
• This emphasis on speed is vital to making the algorithm practical on the huge genome
databases currently available.
• Before BLAST, FASTA was developed by David J. Lipman and William R. Pearson in
1985.
• Before fast algorithms such as BLAST and FASTA were developed, doing database
searches for protein or nucleic sequences was very time consuming because a full
alignment procedure (such as Smith-Waterman) was used.
• While BLAST is faster than any Smith-Waterman implementation for most cases, it cannot
"guarantee the optimal alignments of the query and database sequences" as the Smith-Waterman
algorithm does.
• BLAST is more time-efficient than FASTA because it searches only for the more significant
patterns in the sequences, yet with comparable sensitivity.
• Examples of other questions that researchers use BLAST to answer are:
• Which bacterial species have a protein that is related in lineage to a certain protein with
known amino-acid sequence
• What other genes encode proteins that exhibit structures or motifs such as ones that have
just been determined
Program   Query     Database   Comparison level   Typical use
blastn    DNA       DNA        DNA level          Seek identical DNA sequences and splicing patterns
blastp    Protein   Protein    Protein level      Find homologous proteins
blastx    DNA       Protein    Protein level      Analyze new DNA to find genes and seek homologous proteins
tblastn   Protein   DNA        Protein level      Search for genes in unannotated DNA
tblastx   DNA       DNA        Protein level      Discover gene structure
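The choice of program in the list above depends only on the type of the query, the type of the database, and whether DNA should be compared at the protein level. As a small illustration, that choice can be written as a lookup. The function name choose_blast_program and its parameters are hypothetical and are not part of BLAST itself.

```python
def choose_blast_program(query, database, translated=False):
    """Return the BLAST program name for a query type and a database type.

    query and database are "dna" or "protein"; translated only matters for a
    DNA query against a DNA database (compare at the protein level or not).
    """
    key = (query.lower(), database.lower())
    if key == ("dna", "dna"):
        return "tblastx" if translated else "blastn"
    table = {
        ("protein", "protein"): "blastp",
        ("dna", "protein"): "blastx",   # query is translated
        ("protein", "dna"): "tblastn",  # database is translated
    }
    if key not in table:
        raise ValueError("query and database must be 'dna' or 'protein'")
    return table[key]


if __name__ == "__main__":
    print(choose_blast_program("dna", "protein"))
```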
BLAST Programmes application in terms of nucleic acid and protein sequence
• BLAST can be used for several purposes. These include
• Identifying species
• With the use of BLAST, you may be able to identify a species or find homologous
species. This can be useful, for example, when you are working with a DNA sequence
from an unknown species.
• Locating domains
• When working with a protein sequence you can input it into BLAST, to locate known
domains within the sequence of interest.
• Establishing phylogeny
• Using the results received through BLAST you can create a phylogenetic tree using the
BLAST web-page. Phylogenies based on BLAST alone are less reliable than other
purpose-built computational phylogenetic methods, so should only be relied upon for
"first pass" phylogenetic analyses.
• DNA mapping
• When working with a known species, and looking to sequence a gene at an unknown
location, BLAST can compare the chromosomal position of the sequence of interest, to
relevant sequences in the database(s).
• Comparison
• When working with genes, BLAST can locate common genes in two related species, and
can be used to map annotations from one organism to another.
FASTA format
• In bioinformatics, the FASTA format is a text-based format for representing either
nucleotide sequences or amino acid (protein) sequences, in which nucleotides or amino
acids are represented using single-letter codes.
• FASTA can carry out a dynamic sequence similarity search between the Protein and
Nucleotide sequences against the databases.
• The format also allows for sequence names and comments to precede the sequences.
• In the original format, a sequence was represented as a series of lines, each of which was
no longer than 120 characters and usually did not exceed 80 characters.
• FASTA is a pairwise sequence alignment tool which takes input as nucleotide or protein
sequences and compares it with existing databases
• It is a text-based format and can be read and written with the help of text editor or word
processor.
• A FASTA description line starts with the ‘>’ symbol, followed by the gi and accession number
and then the description, all on a single line.
• The sequence starts on the next line, and each row typically holds 60 nucleotides or amino
acids.
• DNA and protein sequences are written in the one-letter IUPAC nucleotide and amino
acid codes.
• It finds the local similarity between the sequences and calculates the statistical
significance of matches.
• It can be also used to find the functional and evolutionary relationship between the
sequences.
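Because FASTA is a plain text format, it can be read and written with a few lines of code. The following sketch parses FASTA text into (description, sequence) pairs and writes a record back with 60 characters per row. The record names used in the example are made up, and established libraries such as Biopython handle more edge cases.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new description line closes the previous record.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def format_fasta(header, sequence, width=60):
    """Format one record with the sequence wrapped at `width` characters per row."""
    lines = [">" + header]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    example = ">seq1 hypothetical example\nACGTACGT\nACGT\n>seq2\nMKVL\n"
    for description, sequence in parse_fasta(example):
        print(description, len(sequence))
```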
Variants of FastA
FASTA- Compares a DNA query sequence to a DNA database, or a protein query to a
protein database, detecting the sequence type automatically. Versions 2 and 3 are in
common use, version 3 having a highly improved score normalization method. It significantly
reduces the overlap between the score distributions.
FASTX- Compares a DNA query to a protein database. It may introduce gaps only
between codons.
FASTY- Compares a DNA query to a protein database, optimizing gap location, even
within codons.
TFASTA- Compares a protein query to a DNA database.
Significance of E Value
The Expect value (E) is a parameter that describes the number of hits one can "expect" to
see by chance when searching a database of a particular size.
It decreases exponentially as the Score (S) of the match increases.
Essentially, the E value describes the random background noise.
In principle, an E-value lower than 0.05 can be considered a statistically significant hit.
In general, a lower E-value indicates a better-quality search, alignment or comparison.
The smaller the E-value, the better the match.
It is preferred over the score value because the E-value is less sensitive to sequence length.
However, in practice one often uses even more stringent E-value cut-offs.
A hit may have very low E-value but still can be a false positive.
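The relationship between score and E-value can be made concrete with a small calculation. The sketch below assumes the commonly quoted bit-score form E = m × n × 2^(-S'), where m is the query length, n is the total database length and S' is the bit score. Actual BLAST output uses effective lengths, so real values differ somewhat; the function name expect_value is chosen for this example.

```python
def expect_value(bit_score, query_len, db_len):
    """Approximate E-value from a bit score: E = m * n * 2 ** (-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Hypothetical search: 100-residue query against a 1,000,000-residue database.
    for s in (20, 30, 40, 50):
        print(s, expect_value(s, 100, 1_000_000))
```

Each additional bit of score halves the E-value, which is the exponential decrease described above, while doubling the database size doubles the number of chance hits expected for the same score.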