Introduction to Bioinformatics
Tolga Can
Abstract
Bioinformatics is an interdisciplinary field mainly involving molecular biology and genetics, computer science,
mathematics, and statistics. Data-intensive, large-scale biological problems are addressed from a computational
point of view. The most common problems are modeling biological processes at the molecular level
and making inferences from collected data. A bioinformatics solution usually involves the following steps:
This chapter gives a brief introduction to bioinformatics by first providing an introduction to biologi-
cal terminology and then discussing some classical bioinformatics problems organized by the types of data
sources. Sequence analysis is the analysis of DNA and protein sequences for clues regarding function and
includes subproblems such as identification of homologs, multiple sequence alignment, searching sequence
patterns, and evolutionary analyses. Protein structures are three-dimensional data and the associated prob-
lems are structure prediction (secondary and tertiary), analysis of protein structures for clues regarding
function, and structural alignment. Gene expression data is usually represented as matrices, and analysis of
microarray data mostly involves statistical analysis, classification, and clustering approaches. Biological
networks such as gene regulatory networks, metabolic pathways, and protein–protein interaction networks are
usually modeled as graphs and graph theoretic approaches are used to solve associated problems such as
construction and analysis of large-scale networks.
Key words Bioinformatics, Sequence analysis, Structure analysis, Microarray data analysis, Biological
networks
Malik Yousef and Jens Allmer (eds.), miRNomics: MicroRNA Biology and Computational Analysis, Methods in Molecular Biology,
vol. 1107, DOI 10.1007/978-1-62703-748-8_4, © Springer Science+Business Media New York 2014
2 Sequence Analysis
Fig. 1 Alignment of two DNA sequences of the same length. The unaligned sequences
are shown at the top and the aligned sequences are shown at the bottom. The
dashes show the deletions, i.e., gaps, in the respective sequence. Likewise, the
nucleotide in the other sequence matched against a gap can be considered an
insertion. As the two events cannot be differentiated easily, they are usually
referred to as indels
Fig. 2 The completed partial scores table for the global alignment of two
sequences. The alignment path is indicated with arrows. The table is filled by
using Eq. 1
Fig. 3 The completed partial scores table for the local alignment of two sequences.
The alignment path is indicated with arrows. The table is filled by using Eq. 2
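The partial scores tables of Figs. 2 and 3 can be reproduced in spirit with a short dynamic programming routine. The following is a minimal sketch that assumes the standard Needleman-Wunsch (global) and Smith-Waterman (local) recurrences with a linear gap penalty. The function names and the scores (match = 1, mismatch = -1, gap = -2) are illustrative assumptions and need not coincide with the scoring used in Eq. 1 and Eq. 2.

```python
def fill_score_table(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Fill the partial scores table for sequences a and b.

    Scores are illustrative assumptions. With local=True every cell is
    clamped at zero, which is the Smith-Waterman variant.
    """
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalized in the first row/column.
        for i in range(1, rows):
            table[i][0] = i * gap
        for j in range(1, cols):
            table[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = table[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = table[i - 1][j] + gap      # gap in b
            left = table[i][j - 1] + gap    # gap in a
            best = max(diagonal, up, left)
            table[i][j] = max(best, 0) if local else best
    return table


def alignment_score(a, b, local=False, **scores):
    """Return the optimal global score (bottom-right cell) or local score (table maximum)."""
    table = fill_score_table(a, b, local=local, **scores)
    if local:
        return max(max(row) for row in table)
    return table[-1][-1]


if __name__ == "__main__":
    for row in fill_score_table("ACGT", "AGT"):
        print(row)
    print("global:", alignment_score("ACGT", "AGT"))
    print("local:", alignment_score("ACGT", "AGT", local=True))
```

Recovering the alignment path shown by the arrows would additionally require storing, for each cell, which of the three neighbors produced the maximum; that traceback is omitted here for brevity.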
2.2 Multiple Sequence Alignment
Quite often biologists need to align multiple related sequences
simultaneously. The problem of aligning three or more DNA or
protein sequences is called the multiple sequence alignment (MSA)
problem. MSA tools are among the most essential tools in molecular
biology. Biologists use MSA tools for finding highly conserved
subregions or embedded patterns within a set of biological
sequences, for estimating evolutionary distance between sequences,
and for predicting protein secondary/tertiary structure. An example
alignment of eight short protein sequences is shown in Fig. 4.
In the example alignment (Fig. 4), conserved regions and patterns
are marked with different shades of gray. Extending the
dynamic programming solution for pairwise sequence alignment
to multiple sequence alignment results in a computationally
expensive algorithm. For three sequences of length n, the running
time is 7n^3, which is O(n^3). For k sequences, the dynamic
programming solution needs a k-dimensional matrix with n^k cells,
each computed from its 2^k − 1 neighboring cells, which results in
a running time of (2^k − 1)(n^k), which is O(2^k n^k). Therefore, the
dynamic programming approach for aligning k sequences is impractical
due to its exponential running time and can only be used for a few
sequences or for very short sequences. Because of this drawback,
several heuristic solutions have been developed to solve the multiple
sequence alignment problem suboptimally in a reasonable amount of time.
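To make the growth rate concrete, the cost expression quoted above can be evaluated directly. This is only a back-of-the-envelope sketch: the function name msa_dp_cost is hypothetical, and the count is the number of cell updates implied by (2^k − 1)(n^k), not a measured running time.

```python
def msa_dp_cost(n: int, k: int) -> int:
    """Number of cell updates, (2^k - 1) * n^k, for exact DP over k sequences of length n."""
    if n < 1 or k < 2:
        raise ValueError("need n >= 1 and k >= 2")
    return (2 ** k - 1) * n ** k


if __name__ == "__main__":
    # Sequence length 100 is an arbitrary illustrative choice.
    for k in (2, 3, 5, 8):
        print(f"k={k}: {msa_dp_cost(100, k):.3e} cell updates")
```

Even for sequences of modest length, each additional sequence multiplies the work by roughly 2n, which is why the heuristics mentioned above are used in practice.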
Fig. 5 Star alignment of four sequences. The center sequence is S2. Each of the
other sequences is aligned pairwise to S2 and these alignments are combined
one by one to get the multiple sequence alignment
Fig. 6 Multiple alignment of four sequences using a guide tree. The guide tree shows that we should align S1
to S3 and S2 to S4 and then align the resulting pairwise alignments to get the multiple sequence alignment
Fig. 7 The suffix tree for the string xabxac. The root of the tree is on the left. Each
suffix can be traced from the root to the corresponding leaf node by a unique
path. The leaf node number indicates the corresponding suffix
2.3 Efficient Pattern Matching Using Suffix Trees
Some problems in sequence analysis involve searching a small
sequence (a pattern) in a large sequence database, finding the longest
common substring of two sequences, and finding an oligonucleotide
sequence specific to a gene sequence in a set of thousands of
genes. Finding a pattern P of length m in a sequence S of length n can
be solved simply with a scan of the string S in O(mn) time. However,
when S is very long and we want to perform many pattern searches,
it would be desirable to have a search algorithm that takes O(m)
time. To achieve this running time we have to preprocess S. The
preprocessing step is especially useful in scenarios where the sequence
is relatively constant over time (e.g., a genome) and when search is
needed for many different patterns. In 1973, Weiner [18] introduced
a data structure named suffix trees that allows searching of patterns in
large sequences in O(m) time. He also proposed an algorithm to construct
the suffix tree in O(n) time. The construction of the suffix tree
is an offline, one-time cost which becomes negligible when many
searches are performed. Below we give a quadratic-time construction
algorithm which is easier to describe and implement. But first we
formally define the suffix tree. Let S be a sequence of length n over a
fixed alphabet Σ. In biological applications the alphabet usually consists
of the four nucleotides for DNA sequences or the 20 amino
acids for protein sequences. A suffix tree for S is a tree with n leaves
(representing n suffixes) and the following properties:
1. Every internal node other than the root has at least two children.
2. Every edge is labeled with a nonempty substring of S.
3. The edges leaving a given node have labels starting with differ-
ent letters.
4. The concatenation of the labels of the path from the root to the
leaf i spells out the ith suffix of S. We denote the ith suffix by Si. Si
is the substring of S from the ith character to the last character of S.
Figure 7 shows an example suffix tree which is constructed for
the sequence xabxac.
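The definition can be illustrated with a naive construction that inserts the suffixes one at a time and splits an edge wherever a new suffix diverges from an existing label. The sketch below is one possible quadratic-time construction, not necessarily the one the chapter describes; the class and function names are hypothetical. It assumes that the last character of S occurs nowhere else in S (as in xabxac), so that no suffix is a prefix of another and every suffix ends at its own leaf.

```python
class Node:
    def __init__(self):
        self.children = {}        # first character of edge label -> (label, child)
        self.suffix_index = None  # 1-based start position, set on leaves only


def build_suffix_tree(s):
    """Naive O(n^2) construction by inserting each suffix in turn."""
    if not s or s[-1] in s[:-1]:
        raise ValueError("last character must be unique, e.g. append a terminator")
    root = Node()
    for i in range(len(s)):
        node, suffix = root, s[i:]
        while True:
            entry = node.children.get(suffix[0])
            if entry is None:
                leaf = Node()
                leaf.suffix_index = i + 1
                node.children[suffix[0]] = (suffix, leaf)
                break
            label, child = entry
            k = 0  # length of the common prefix of the edge label and the suffix
            while k < len(label) and k < len(suffix) and label[k] == suffix[k]:
                k += 1
            if k < len(label):
                # Split the edge: the new internal node keeps the old subtree.
                middle = Node()
                middle.children[label[k]] = (label[k:], child)
                node.children[suffix[0]] = (label[:k], middle)
                child = middle
            node, suffix = child, suffix[k:]
    return root


def leaves_below(node):
    """Collect the suffix indices of all leaves in the subtree of node."""
    if not node.children:
        return [node.suffix_index]
    found = []
    for _, child in node.children.values():
        found.extend(leaves_below(child))
    return found


def find_occurrences(root, pattern):
    """Walk down the tree along pattern; the leaves below give all start positions."""
    node, rest = root, pattern
    while rest:
        entry = node.children.get(rest[0])
        if entry is None:
            return []
        label, child = entry
        k = min(len(label), len(rest))
        if label[:k] != rest[:k]:
            return []
        node, rest = child, rest[k:]
    return sorted(leaves_below(node))


if __name__ == "__main__":
    tree = build_suffix_tree("xabxac")
    print(find_occurrences(tree, "xa"))
```

Matching the pattern touches at most m characters on the way down, which is where the O(m) search time comes from; reporting all occurrences adds time proportional to the number of leaves below the match point.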
2.4 Construction of Phylogenetic Trees
Phylogeny is a term coined by Haeckel in 1866 [20] which means
the line of descent or evolutionary development of any plant or
animal species or the origin and evolution of a division, group or
race of animals or plants. Phylogenetic analyses can be used for
understanding evolutionary history like the origin of species, assist
in epidemiology of infectious diseases or genetic defects, aid in
Fig. 8 The UPGMA method to construct a phylogenetic tree for five sequences. The method works on the distance
matrix on the upper right and constructs the tree on the upper left
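As a rough illustration of how UPGMA turns a distance matrix into a tree, the sketch below repeatedly merges the two closest clusters and averages their distances to the remaining clusters, weighted by cluster size. The function name, the nested-tuple tree representation, and the four-sequence distance matrix are hypothetical; they are not the data of Fig. 8.

```python
def upgma(labels, matrix):
    """Build a UPGMA tree from a symmetric distance matrix.

    Returns nested tuples (left, right, height), where height is half the
    distance at which the two subtrees were merged.
    """
    clusters = {i: (labels[i], 1) for i in range(len(labels))}  # id -> (subtree, size)
    dist = {(i, j): matrix[i][j] for i in clusters for j in clusters if i < j}
    next_id = len(labels)
    while len(clusters) > 1:
        i, j = min(dist, key=dist.get)  # closest pair of active clusters
        height = dist[(i, j)] / 2
        (tree_i, size_i), (tree_j, size_j) = clusters.pop(i), clusters.pop(j)
        merged = {}
        for k in clusters:
            d_ik = dist[(min(i, k), max(i, k))]
            d_jk = dist[(min(j, k), max(j, k))]
            # Size-weighted average keeps the distance an average over all leaf pairs.
            merged[(k, next_id)] = (size_i * d_ik + size_j * d_jk) / (size_i + size_j)
        dist = {pair: d for pair, d in dist.items() if i not in pair and j not in pair}
        dist.update(merged)
        clusters[next_id] = ((tree_i, tree_j, height), size_i + size_j)
        next_id += 1
    (tree, _), = clusters.values()
    return tree


if __name__ == "__main__":
    # Hypothetical distances between four sequences A, B, C, D.
    example = [
        [0, 2, 6, 10],
        [2, 0, 6, 10],
        [6, 6, 0, 10],
        [10, 10, 10, 0],
    ]
    print(upgma(["A", "B", "C", "D"], example))
```

In this made-up example A and B are joined first, then C, then D, because the distances are ultrametric; on real data UPGMA implicitly assumes a constant rate of evolution along all branches.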
Fig. 9 The phylogenetic tree constructed by the neighbor joining method on the
five sequences. The distance matrix is shown on the right and the final NJ tree is
shown on the left
3 Structure Analysis
3.1 Prediction of Protein Structures from Primary Sequence
There are three main approaches for predicting protein structure
from protein sequence: ab initio methods, homology modeling
methods, and threading methods. In ab initio methods, structure
is predicted using pure chemistry and physics knowledge
without the use of other information. These techniques exhaustively
search the fold space and therefore can only predict structures
of small globular proteins (length < 50 amino acids). Homology
modeling, on the other hand, makes use of experimentally
determined structures and uses sequence similarity to predict the
structure of the target protein. The main steps of homology
modeling are as follows:
1. Identify a set of template proteins (with known structures)
related to the target protein. This is based on sequence similar-
ity (BLAST, FASTA) with sequence identity of 30 % or more.
2. Align the target sequence with the template proteins. This is
based on multiple sequence alignment (ClustalW). Identify
conserved regions.
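The 30 % identity criterion in step 1 can be checked on a pairwise alignment with a few lines of code. This is a minimal sketch with hypothetical function names; it assumes that identity is the number of identical aligned residues divided by the alignment length including gap columns, which is only one of several conventions in use.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of alignment columns in which both sequences carry the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap)
    return 100.0 * matches / len(aligned_a)


def is_candidate_template(aligned_target, aligned_template, threshold=30.0):
    """Apply the identity threshold of step 1 to an aligned target/template pair."""
    return percent_identity(aligned_target, aligned_template) >= threshold


if __name__ == "__main__":
    # Toy aligned fragments, made up for illustration.
    print(percent_identity("MKT-AYIAK", "MKTLAYLAR"))
    print(is_candidate_template("MKT-AYIAK", "MKTLAYLAR"))
```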
3.2 The Structural Alignment Problem
Pairwise structural alignment of protein structures is the problem
of finding similar subregions of two given protein structures. The
key problem in structural alignment is to find an optimal
correspondence between the arrangements of atoms in two molecular
structures in order to align them in 3D. Optimality of the
VAST [26], CE [27], LOCK [28], and TOPS [29] developed for
structural alignment, which differ in how they represent the protein
structure, extract structural features, and match those
features. Some algorithms find short but very similar
regions, whereas some algorithms are able to detect larger regions
with less similarity. The statistical theory of structural alignments is
similar to that of BLAST, as many methods assess the likelihood
of a match relative to that of a random match. However, there is less
agreement regarding the score matrix; hence, the z-scores of CE,
DALI, and VAST may not be directly comparable.
Cells containing the same DNA are still different because of
differential gene expression. Only about 40% of human genes are
expressed at any one time. A gene is expressed by transcribing
DNA into single-stranded mRNA. Then, the mRNA is translated
into a protein. This is known as the central dogma of molecular
biology [1]. Microarray technology allows for measuring the
level of mRNA expression in a high-throughput manner. Messenger
RNA expression represents the dynamic aspects of the cell. In
order to measure the mRNA levels, mRNA is isolated and labeled
using a fluorescent material. When the mRNA is hybridized to the
target, the level of hybridization corresponds to the light emission,
which is measured with a laser. Therefore, a stronger signal
means more hybridization, which indicates a higher mRNA concentration.
Microarrays can be used to measure mRNA levels in different
tissues, different developmental stages, different disease states, and
in response to different treatments. One of the main sources of
microarray data is NCBI’s Gene Expression Omnibus (GEO) [30].
As of May 2012, GEO contained about 750,000 microarray samples from
about 30,000 different experiments. A typical microarray sample’s raw
data is about 10–30 Mb.
The main characteristic of microarray data is that they are
extremely high dimensional: the number of genes is usually
on the order of tens of thousands, whereas the number of experiments
is on the order of tens. Microarray data is considered to be noisy
due to imperfect hybridization and off-target hybridization.
Normalization and thresholding are important for interpreting
microarray data (see Chapters 16 and 17). Some mRNA levels may
not be read correctly and may be missing from the final output
because a gene fails to hybridize to the microarray spot. These
characteristics make data mining on microarray data a challenging
task, and having too many genes leads to many false positive
identifications. For exploration purposes, a large set of all relevant
$$
\mathrm{PCC}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}} \qquad (4)
$$
Equation 4: the Pearson’s correlation coefficient between two
genes X and Y with n samples,
where $\bar{X}$ and $\bar{Y}$ are the means of the components of the vectors X and Y.
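Equation 4 translates almost term by term into code. The following sketch uses only the standard library; the function name and the two expression profiles are made up for illustration, and the error raised for constant profiles is a design choice of this sketch rather than part of the definition.

```python
import math


def pearson_correlation(x, y):
    """Pearson's correlation coefficient of two equally long expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length of at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator of Eq. 4: sum of products of deviations from the means.
    covariance = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Denominator of Eq. 4: square roots of the summed squared deviations.
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return covariance / math.sqrt(ss_x * ss_y)


if __name__ == "__main__":
    gene_x = [2.1, 3.4, 1.0, 5.6, 4.2]  # hypothetical expression levels
    gene_y = [1.9, 3.0, 1.2, 5.1, 4.4]
    print(round(pearson_correlation(gene_x, gene_y), 3))
```

Values close to 1 indicate that the two genes rise and fall together across the samples, values close to -1 indicate opposite behavior, and values near 0 indicate no linear relationship.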
[Fig. 10 network diagram; node labels: FANCD2, H2AFX, RBBP8, ATM, TP53BP1, BRIP1, TP53, BRCA2, BRCA1, MDM4, USP7, MDM2, UIMC1, ESR1, SP1, EP300, HIF1A, SRC]
Fig. 10 Functional associations around the BRCA1 gene in human. The snapshot
is created from the interactive network visualization tool from the STRING
Database at string.embl.de
References
1. Zvelebil M, Baum J (2007) Understanding bioinformatics. Garland Science, New York, NY. ISBN 978-0815340249
2. Chatr-aryamontri A, Ceol A, Palazzi LM et al (2007) MINT: the Molecular INTeraction database. Nucleic Acids Res 35(Suppl 1):D572–D574
3. Kerrien S, Aranda B, Breuza L et al (2012) The IntAct molecular interaction database in 2012. Nucleic Acids Res 40(D1):D841–D846
4. Xenarios I, Rice DW, Salwinski L et al (2000) DIP: the database of interacting proteins. Nucleic Acids Res 28:289–291
5. Maglott D, Ostell J, Pruitt KD, Tatusova T (2010) Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res 33(D1):D54–D58
6. Flicek P, Amode MR, Barrell D et al (2012) Ensembl 2012. Nucleic Acids Res 40(D1):D84–D90
7. Kent WJ, Sugnet CW, Furey TS et al (2002) The human genome browser at UCSC. Genome Res 12(6):996–1006
8. Kanehisa M, Goto S, Sato Y et al (2012) KEGG for integration and interpretation of large-scale molecular datasets. Nucleic Acids Res 40(D1):D109–D114
9. The Gene Ontology Consortium (2000) Gene ontology: tool for the unification of biology. Nat Genet 25(1):25–29
10. Needleman SB, Wunsch CD (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48(3):443–453
11. Smith TF, Waterman MS (1981) Identification of common molecular subsequences. J Mol Biol 147:195–197
12. Altschul S, Gish W, Miller W et al (1990) Basic local alignment search tool. J Mol Biol 215(3):403–410
13. Lipman DJ, Pearson WR (1985) Rapid and sensitive protein similarity searches. Science 227(4693):1435–1441
14. Bafna V, Lawler EL, Pevzner PA (1993) Approximation algorithms for multiple sequence alignment. Theor Comput Sci 182(1–2):233–244
15. Chenna R, Sugawara H, Koike T et al (2003) Multiple sequence alignment with the Clustal series of programs. Nucleic Acids Res 31(13):3497–3500
16. Saitou N, Nei M (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 4(4):406–425
17. Chakrabarti S, Lanczycki CJ, Panchenko AR et al (2006) State of the art: refinement of multiple sequence alignments. BMC Bioinformatics 7:499
18. Weiner P (1973) Linear pattern matching algorithm, 14th Annual IEEE Symposium on Switching and Automata Theory, 15–17 October, 1973, USA, pp 1–11
19. Cobbs AL (1995) Fast approximate matching using suffix trees. Combinatorial pattern matching, vol 937, Lecture notes in computer science. Springer, New York, NY, pp 41–54
20. Haeckel E (1868) The history of creation, vol 1, 3rd edn. Trench & Co., London, Translated by E. Ray Lankester, Kegan Paul
21. Sokal R, Michener C (1958) A statistical method for evaluating systematic relationships. Univ Kans Sci Bull 38:1409–1438
22. Day WHE (1986) Computational complexity of inferring phylogenies from dissimilarity matrices. Bull Math Biol 49:461–467
23. Lathrop RH (1994) The protein threading problem with sequence amino acid interaction preferences is NP-complete. Protein Eng 7(9):1059–1068
24. Subbiah S, Laurents DV, Levitt M (1993) Structural similarity of DNA-binding domains of bacteriophage repressors and the globin core. Curr Biol 3:141–148
25. Holm L, Sander C (1993) Protein structure comparison by alignment of distance matrices. J Mol Biol 233(1):123–138
26. Gibrat JF, Madej T, Bryant SH (1996) Surprising similarities in structure comparison. Curr Opin Struct Biol 6(3):377–385
27. Shindyalov IN, Bourne PE (1998) Protein structure alignment by incremental combinatorial extension of the optimum path. Protein Eng 11(9):739–747
28. Singh AP, Brutlag DL (1997) Hierarchical protein structure superposition using both secondary structure and atomic representations. In Proc. Fifth Int. Conf. on Intell. Sys. for Mol. Biol. AAAI Press, Menlo Park, CA, pp 284–293
29. Viksna J, Gilbert D (2001) Pattern matching and pattern discovery algorithms for protein topologies. Algorithms in bioinformatics: first international workshop, WABI 2001 proceedings, vol 2149, Lecture notes in computer science. Springer, New York, NY, pp 98–111
30. Edgar R, Domrachev M, Lash AE (2002) Gene expression omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res 30(1):207–210
31. Tusher VG, Tibshirani R, Chu G (2001) Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci 98(9):5116–5121
32. Lloyd SP (1982) Least squares quantization in PCM. IEEE Trans Inform Theor 28(2):129–137
33. Kohonen T (1982) Self-organized formation of topologically correct feature maps. Biol Cybern 43(1):59–69
34. van Dongen S (2000) Graph clustering by flow simulation. Ph.D. thesis, University of Utrecht, May 2000
35. King AD, Pržulj N, Jurisica I (2004) Protein complex prediction via cost-based clustering. Bioinformatics 20(17):3013–3020
36. Blatt M, Wiseman S, Domany E (1996) Superparamagnetic clustering of data. Phys Rev Lett 76:3251–3254
37. Bader GD, Hogue CW (2003) An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics 4:2