Lecture2-DataMining for Bioinformatics
Bioinformatics
Dr. Y. V. Lokeswari
Associate Professor
SSN College of Engineering
Data Mining in Bioinformatics
• Data mining in bioinformatics means extracting valuable information from large volumes of
biological data that are otherwise hard to interpret. It is a process that leads to knowledge discovery.
• It uses a range of techniques and algorithms to gain knowledge from biological sequence,
structure and microarray data.
• Biomedical Data Analysis
• Major Nucleotide Sequence Database, Protein Sequence Database, and Gene Expression
Database
• A DNA sequence is built from four nucleotide bases, namely adenine (A), cytosine (C), guanine (G) and
thymine (T), and specifies the genetic code of the organism.
• A protein sequence is a chain of amino acids drawn from an alphabet of 20, translated from the
coding region of a DNA sequence.
• Gene expression data measure the expression of a particular gene (upregulated, downregulated,
or non-expressing) under specific conditions in a cell.
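The relationship between the four-letter DNA alphabet and the 20-letter protein alphabet can be sketched in a few lines of Python. This is only an illustration: the function names (is_valid_dna, translate) are hypothetical, the table is the standard genetic code, and real coding regions need more care (reading frames, introns, alternative codon tables).

```python
from itertools import product

# The four DNA bases named in the slide.
DNA_ALPHABET = set("ACGT")

# Standard genetic code, written in TCAG order for each codon position.
# "*" marks a stop codon. Names here are illustrative assumptions.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(_BASES, repeat=3), _AMINO)
}


def is_valid_dna(seq):
    """Return True if seq is non-empty and uses only A, C, G and T."""
    return bool(seq) and set(seq.upper()) <= DNA_ALPHABET


def translate(coding_seq):
    """Translate a coding region codon by codon, stopping at the first stop codon."""
    seq = coding_seq.upper()
    if not is_valid_dna(seq):
        raise ValueError("sequence must contain only A, C, G and T")
    protein = []
    # Step through complete codons; trailing bases that do not fill a codon are ignored.
    for i in range(0, len(seq) - 2, 3):
        amino_acid = CODON_TABLE[seq[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("ATGGCCTAA"))
```

The table maps 64 codons onto 20 amino acids plus the stop signal, which is why several codons code for the same amino acid.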
Data mining = extracting valuable info from large amounts of incomprehensible biological
data (sequences, structures and microarrays) -----> leads to knowledge discovery.
Uses different techniques and algorithms.
DNA = sequence over the alphabet A, G, C, T.
There are regions in DNA (coding regions) that code for amino acids.
Protein sequence = chain built from 20 kinds of amino acids.
• The three major DNA sequence databases are:
• EMBL (http://www.ebi.ac.uk/embl/index.html), maintained by the European Bioinformatics Institute (EBI),
an outstation of the European Molecular Biology Laboratory (EMBL).
• GenBank (http://www.ncbi.nlm.nih.gov/Genbank/), maintained by the
National Center for Biotechnology Information (NCBI).
• DDBJ (http://www.ddbj.nig.ac.jp/Welcome-e.html), the DNA Data Bank of Japan at the National
Institute of Genetics (NIG) in Japan.
• The three databases have collaborated to form the International Nucleotide Sequence
Database Collaboration (http://www.ncbi.nlm.nih.gov/projects/collab/).
• The three major databases for protein sequences are:
• Swiss-Prot (http://www.ebi.ac.uk/swissprot/index.html), maintained by the Swiss Institute of
Bioinformatics (SIB).
• TrEMBL (http://www.ebi.ac.uk/trembl/index.html). The TrEMBL database, maintained by EBI,
contains the translations of all coding sequences (CDS) present in the EMBL Nucleotide Sequence
Database.
• PIR (http://pir.georgetown.edu/pirwww/). The Protein Information Resource (PIR), located at
Georgetown University Medical Center, is an integrated public bioinformatics resource that supports
genomic and proteomic research and scientific studies.
• The Microarray Gene Expression Data (MGED) Society (http://www.mged.org/index.html) is an
international organization of biologists, computer scientists, and data analysts that aims to facilitate
the sharing of microarray data generated by functional genomics and proteomics experiments.
• ArrayExpress at the EBI (http://www.ebi.ac.uk/arrayexpress/index.html) is a public repository
for microarray data.
• The Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/) at NCBI is a gene expression
and hybridization array data repository.
• Software Tools for Bioinformatics Research
• The software tools that facilitate research in bioinformatics can be broadly categorized into four
classes:
• (1) data retrieval tools, (2) sequence comparison and alignment tools, (3) pattern discovery tools,
and (4) visualization tools
• A major tool for data retrieval is Entrez. Others are DBGET/LinkDB and SRS (Sequence Retrieval System).
• Entrez is an integrated data retrieval system developed by NCBI that provides integrated access to a
wide range of data domains, including literature, nucleotide and protein sequences, complete
genomes, 3D structures, and more.
• One can use Entrez to:
• Identify a representative, well-annotated mRNA sequence record from the millions of sequences
in the Entrez Nucleotide data domain.
• Retrieve associated literature and protein records.
• Identify conserved domains within the protein.
• Identify known mutations within the gene or protein.
• Find a resolved three-dimensional structure for the protein or, in its absence, identify structures
with homologous sequences.
• View the genomic context of the gene and download the sequence region.
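Entrez can also be queried programmatically through NCBI's E-utilities web service. The sketch below builds an ESearch request for the Nucleotide domain and, optionally, fetches the matching record IDs. The helper names (build_esearch_url, search_ids) and the example query are assumptions made for illustration; check the E-utilities documentation for current parameters and usage limits before relying on it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(db, term, retmax=5):
    """Build an ESearch URL for one Entrez data domain (db) and a query term."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def search_ids(db, term, retmax=5):
    """Run the search over the network and return the list of matching record IDs."""
    with urlopen(build_esearch_url(db, term, retmax)) as response:
        payload = json.load(response)
    return payload["esearchresult"]["idlist"]


if __name__ == "__main__":
    # Hypothetical example query: BRCA1 records for human in the Nucleotide domain.
    query = "BRCA1[Gene] AND Homo sapiens[Organism]"
    print(build_esearch_url("nucleotide", query, retmax=3))
    # Uncomment to perform the actual network request:
    # print(search_ids("nucleotide", query, retmax=3))
```

Building the URL separately from the network call keeps the request easy to inspect and to test offline.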
• Sequence comparison and alignment tools are:
• BLAST (Basic Local Alignment Search Tool, available at http://www.ncbi.nlm.nih.gov/BLAST/)
• BLAST is used for comparing gene and protein sequences against others in public databases.
• FASTA (FAST Alignment, available at http://www.ebi.ac.uk/fasta33/)
• FASTA can be used for a fast protein comparison or a fast nucleotide comparison.
• For multiple sequence alignment, the available tools are ClustalW and Clustal Omega.
• Refer to https://www.youtube.com/watch?v=LokO-iFJdqc
• ClustalW can be used to align DNA or protein sequences in order to elucidate their relationships
as well as their evolutionary origin.
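To make the idea of local alignment concrete, here is a minimal sketch of the Smith-Waterman dynamic-programming score for two sequences. This is not how BLAST or FASTA are implemented (both use faster heuristics); it is the exact local alignment method they approximate. The scoring values (match=2, mismatch=-1, gap=-2) and the function name are assumptions chosen for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    # prev holds the previous row of the scoring matrix; row 0 is all zeros.
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Score for aligning a[i-1] with b[j-1].
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0, which lets an alignment start anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    # The shared region "ACG" gives three matches.
    print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

Keeping only two rows of the matrix is enough for the score; recovering the aligned residues themselves would require the full matrix and a traceback step.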
• Pattern discovery tools are used to search for patterns or features in the data.
• An important pattern discovery technique is cluster analysis.
• It is used to find groupings in a given dataset such that objects in the same group are similar to each
other while objects in different groups are dissimilar.
• Cluster analysis has been used extensively in gene expression data analysis (see
http://rana.lbl.gov/EisenSoftware.htm).
• Two useful integrated tools for pattern discovery are:
• Expression Profiler (http://ep.ebi.ac.uk/EP/)
• GeneQuiz (available at http://jura.ebi.ac.uk:8765/ext-genequiz/)
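As a rough illustration of cluster analysis on gene expression data, the sketch below groups genes by average-linkage hierarchical clustering with a Pearson correlation distance. It is a toy version under stated assumptions: the gene names and expression values are invented, the function names are hypothetical, and it is not the implementation used by any of the tools listed above.

```python
from itertools import combinations
from math import sqrt


def pearson_distance(x, y):
    """Return 1 - Pearson correlation, so similar profiles are close to 0."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 1.0  # a flat profile is treated as uncorrelated
    return 1.0 - cov / (norm_x * norm_y)


def average_linkage_clusters(profiles, k):
    """Merge the two closest clusters repeatedly until only k clusters remain."""
    clusters = [[name] for name in profiles]
    while len(clusters) > k:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
            # Average distance over all cross-cluster gene pairs.
            dist = sum(
                pearson_distance(profiles[a], profiles[b]) for a, b in pairs
            ) / len(pairs)
            if best is None or dist < best[0]:
                best = (dist, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return [sorted(cluster) for cluster in clusters]


if __name__ == "__main__":
    # Invented expression profiles across four conditions.
    profiles = {
        "geneA": [1.0, 2.0, 3.0, 4.0],
        "geneB": [2.0, 4.0, 6.0, 8.5],
        "geneC": [4.0, 3.0, 2.0, 1.0],
        "geneD": [8.0, 6.0, 4.0, 2.5],
    }
    print(average_linkage_clusters(profiles, k=2))
```

Genes whose expression rises together end up in one group and genes whose expression falls together in the other, which matches the definition above: similar within a group, dissimilar across groups.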
• Visualization tools allow an interactive, graphical display of genomic data.
• Most major genome analysis packages, such as Expression Profiler and GeneQuiz, have
an integrated visualization tool.
• Visualization tools available for bioinformatics data are:
• TreeView (available at http://rana.lbl.gov/EisenSoftware.htm)
• BioViews
• Genes_Graph
• Protein Explorer (available at http://www.proteinexplorer.org)