Bioinformatics PPT Section B Data Storage and Retrieval Group 3

Bioinformatics course module for biotechnology

ADDIS ABABA SCIENCE AND TECHNOLOGY UNIVERSITY

COLLEGE OF APPLIED AND NATURAL SCIENCE


DEPARTMENT OF BIOTECHNOLOGY
SECTION – B
BIOINFORMATICS GROUP ASSIGNMENT
DATA STORING AND RETRIEVING
Group Members ID NO
Aron Gebremedhin ETS1824/14
Blen Fekadu ETS1852/14
Deborah Fromsa ETS1861/14
Eliab Mihret ETS0516/14
Seret Baraki ETS1967/14
Sitra Hassen ETS1970/14

Submitted to: Dr. Adugna A.


Submission date: 21 May, 2024 G.G.
CONTENTS
1. Introduction
2. Biological Databases
2.1 Why are these Important?
2.2 Kinds of Biological Databases
3. Data Storage
3.1 Bioinformatics Data
3.2 Transforming Data to Knowledge
3.3 Data Warehousing
3.3.1 Data Warehouse Architecture
4. Data Quality
4.1 The Future of Biological Databases
5. Sequence retrieval
5.1 Types of Sequence Retrieval Databases
5.1.1 Sequence retrieval in a database - NCBI
5.1.2 Sequence retrieval in a database – EMBL
6. Data mining techniques in bioinformatics
6.1 Classification
6.2 Clustering
6.3 Association
6.4 Outlier detection
6.5 Regression
6.6 Prediction
7. Conclusion
Introduction
• Bioinformatics bridges the gap between biology and computer science, empowering
researchers to unlock the secrets encoded within biological sequences.
• Sequence retrieval, a fundamental bioinformatics technique, allows scientists to
access and analyze the vast databases of biological sequence data.
• This data serves as the blueprint for life, holding the key to understanding gene
function, evolutionary relationships, and potential drug targets.
• Sequence similarity search tools like BLAST (Basic Local Alignment Search Tool)
enable researchers to query these databases using a known sequence.
• Bioinformatics researchers are constantly developing new algorithms and databases
to facilitate rapid and accurate sequence retrieval.
Biological Databases
• Biological databases emerged as a response to the large volumes of data generated
by DNA sequencing technologies.
• One of the first databases to emerge was GenBank, an annotated collection of all
publicly available nucleotide sequences and their protein translations.
• It is maintained by the National Center for Biotechnology Information (NCBI), which
is part of the National Institutes of Health (NIH).
• GenBank paved the way for the Human Genome Project (HGP).
• The data stored in biological databases is organized for optimal analysis and
consists of two types: raw and curated (or annotated).
• Biological databases are complex, heterogeneous, and dynamic, yet inconsistent.
The inconsistency is due to the lack of standards at the ontological level.
Why are databases important?
• Data is submitted directly to biological databases for indexing, organization, and
data optimization.
• They help researchers find relevant biological data by making it available in a
machine-readable format.
• Much biological information is readily accessible through data mining tools, which
saves time and resources.
• Biological databases can be broadly classified as sequence and structure databases.
• Structure databases are for protein structures, while sequence databases are for
nucleic acid and protein sequences.
Kinds of Biological Databases
• Biological databases can be further classified as primary, secondary, and composite
databases.
• Primary databases contain information for sequence or structure only. Examples of
primary biological databases include:
– Swiss-Prot and PIR for protein sequences
– GenBank and DDBJ for nucleotide sequences
– Protein Data Bank (PDB) for protein structures
• Secondary databases store information derived from primary data, such as conserved
sequences, active site residues, and signature sequences. Data derived from Protein
Data Bank entries is also stored in secondary databases. Examples include:
– SCOP at Cambridge University
– CATH at University College London
– PROSITE of the Swiss Institute of Bioinformatics
– eMOTIF at Stanford
CNTD
• Composite databases combine a variety of primary databases, which eliminates the
need to search each one separately.
• Each composite database has different search algorithms and data structures.
• The NCBI hosts such databases, where links to the Online Mendelian Inheritance
in Man (OMIM) resource are also found.
Data Storage
• Among the most significant nucleotide sequence databases are DDBJ, GenBank, and EMBL.
• Major protein databases are:
– Swiss-Prot
– TrEMBL
– PIR
– Protein Data Bank
• Researchers typically use diverse information from multiple databases to support
the planning of experiments or the analysis and interpretation of results.
• Data warehousing, a convenient solution for managing different views of data and
ensuring data interoperability, has recently been applied in bioinformatics.
• The purpose of a data warehouse is to facilitate high-level analysis, summarization
of information, and extraction of new knowledge hidden in the data.
CNTD
• The types of bioinformatics databases are summarized in two major catalogues
of molecular databases: the annual database issue of Nucleic Acids Research and
DBCAT, a catalog of public databases.
• Bioinformatics data across diverse databases is often highly redundant.
• Frequently encountered data redundancy issues are:
– (1) fragments and partial entries of the same item (e.g. a sequence) may be stored in
several source records;
– (2) databases update and cross-reference one another, with the negative side effect of
occasionally creating duplicates and redundant entries and propagating errors;
– (3) the same sequence may be submitted to more than one database without
cross-referencing those records; and
– (4) the “owners” of a sequence record may submit the sequence more than once to the
same database.
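The redundancy issues listed above can be illustrated with a minimal sketch that drops exact duplicates and fragments fully contained in a longer entry. The record IDs and sequences are made up for illustration, and the function name `remove_redundant` is hypothetical; real curation pipelines use far more careful matching than a plain substring test.

```python
def remove_redundant(records):
    """Drop exact duplicates and fragments contained in a longer sequence.

    `records` maps a (hypothetical) record ID to its sequence string.
    """
    # Normalise case and whitespace so trivially different copies compare equal.
    cleaned = {rid: seq.upper().strip() for rid, seq in records.items()}
    # Visit the longest sequences first so fragments are checked against full entries.
    ordered = sorted(cleaned.items(), key=lambda item: (-len(item[1]), item[0]))
    kept = {}
    for rid, seq in ordered:
        if any(seq in longer for longer in kept.values()):
            continue  # exact duplicate or fragment of a record already kept
        kept[rid] = seq
    return kept


records = {
    "rec1": "ATGGCCATTGTAATGGGCCGC",
    "rec2": "atggccattgtaatgggccgc",  # same sequence submitted twice
    "rec3": "GCCATTGTAATG",           # fragment of rec1
    "rec4": "TTTTCCCCGGGGAAAA",       # unrelated entry
}
unique = remove_redundant(records)
print(sorted(unique))  # ['rec1', 'rec4']
```

Sorting by length first is a design choice: it guarantees that a fragment is always compared against the full-length entry rather than the other way round.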
Bioinformatics Data
• The bioinformatics data views include:
– biological sequences (DNA, RNA, and proteins)
– gene or protein expression
– functional properties
– molecular interactions
– clinical data, system
• The types of bioinformatics databases are summarized in the two major catalogues
of molecular databases:
– Nucleic Acids Research
– DBCAT
Transforming Data to Knowledge
• The transformation of data into knowledge is also known as knowledge discovery from
databases (KDD).
• The need for KDD arises from recent progress in biotechnology that has enabled the
identification of raw DNA or protein sequences in large numbers.
• One example is an allergen data system, which comprises a set of well-defined and
classified allergen entries combined with bioinformatics tools for the prediction of
allergenicity and the analysis of allergic cross-reactivity.
• If a new sequence shares an identical stretch of six or more consecutive amino acids
with any known allergen, it represents a potential allergen.
• The data preparation phase involves numerous steps of data conversion and data
cleaning before the data is well organized and can be used for prediction or data
mining tasks.
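The rule of six or more consecutive identical amino acids can be sketched as a shared k-mer check. This is only an illustration: the sequences below are invented toy strings, not real allergens, and the function names are hypothetical.

```python
def shared_kmers(query, known, k=6):
    """Return every stretch of k consecutive residues found in both sequences."""
    query_kmers = {query[i:i + k] for i in range(len(query) - k + 1)}
    known_kmers = {known[i:i + k] for i in range(len(known) - k + 1)}
    return query_kmers & known_kmers


def is_potential_allergen(query, allergens, k=6):
    """Flag the query if it shares at least one k-residue stretch with any allergen."""
    return any(shared_kmers(query, seq, k) for seq in allergens.values())


# Toy data, assumed for illustration only.
allergens = {"toy_allergen_1": "MKTLLVAGHEQWERTYIPASD"}

print(is_potential_allergen("GGSAHEQWERTNN", allergens))  # True
print(is_potential_allergen("MSNNPLLQQR", allergens))     # False
```

Building the k-mers as sets keeps the comparison to a single set intersection per known allergen.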
Data Warehousing
• Currently, the application of data warehousing principles to bioinformatics data is
not fully explored.
• The few examples of biological data warehouses include:
– a gene expression data warehouse
– GIMS, a genomic data warehouse
– TMID, a test bed implementation of view maintenance of protein sequences
• These data warehouses share two features:
– 1. Data extraction and integration from disparate sources, with alternative proposals for
organizing the consolidated data.
– 2. Facilitation of data analysis, using one or more data mining techniques for the discovery
of new knowledge (patterns, explanations, concepts, etc.) from the dataset.
CNTD
• Data integration focuses on achieving interoperability of different views of the data,
and several data integration solutions have been developed.
• The selection of data mining tools depends on the purpose of the data warehouse.
• Data integration systems for bioinformatics data adopt one of two approaches:
virtual or materialized integration.
• Virtual integration systems provide a software layer on top of multiple data sources
that are independently maintained.
• The main examples include:
– the Entrez retrieval system at NCBI, which facilitates access to and analysis of more
than twenty interlinked databases
– EnsEMBL and GenoMax
Data Warehouse Architecture
• A bioinformatics data warehouse requires several components for operation:
– (1) retrieval of data from databases,
– (2) a mechanism for cleaning data,
– (3) flexibility in manipulating the datasets, and
– (4) integration and design of purposeful analysis tools that can be used jointly or
independently.
• The major components of a data warehouse provide for initial and incremental data
integration, data annotation, and data mining.
• Data integration also includes subcomponents for data retrieval, data cleaning, and
data transformation.
• Data cleaning support tools are used to filter out irrelevant and redundant
records.
CNTD
• Once the initial dataset is created, the annotation component enables the addition of
value-added information from experimental results or related publications to the
dataset.
• Data cleaning is also supported at the annotation stage, facilitating the removal of
erroneous data propagated from the data sources.
• The analysis of data is enabled by incorporating general or specific data mining
tools.
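The retrieval, cleaning, transformation, and annotation components can be sketched as a tiny in-memory pipeline. Everything here is an assumption made for illustration: the two FASTA-formatted "sources" are toy strings, the function names are hypothetical, and a real warehouse would sit on a database rather than a dictionary.

```python
def parse_fasta(text):
    """Retrieval step: read (identifier, sequence) pairs from FASTA-formatted text."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def clean(records):
    """Cleaning step: drop empty entries and sequences already seen."""
    seen, kept = set(), []
    for name, seq in records:
        seq = seq.upper()
        if seq and seq not in seen:
            seen.add(seq)
            kept.append((name, seq))
    return kept


def transform(records):
    """Transformation step: store every record in one uniform structure."""
    return {name: {"sequence": seq, "length": len(seq)} for name, seq in records}


def annotate(warehouse, name, note):
    """Annotation step: attach value-added information to an existing entry."""
    warehouse[name].setdefault("notes", []).append(note)


source_a = ">seqA demo\nATGGCC\nATTGTA\n>seqB\nTTTTCC\n"
source_b = ">seqC\natggccattgta\n>seqD\n\n"  # seqC repeats seqA, seqD is empty

warehouse = transform(clean(parse_fasta(source_a) + parse_fasta(source_b)))
annotate(warehouse, "seqA", "hypothetical note taken from a publication")
print(sorted(warehouse))  # ['seqA', 'seqB']
```

Keeping each component as its own function mirrors the point that the tools can be used jointly or independently.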
Data Quality
• Data quality generally refers to the “fitness” of the data in the databases.
• Data quality can be assessed by measuring the agreement between the data views
presented by a system and the real-world entities they describe.
• Attaining perfect data quality is impossible, but the developers of biological
databases should strive for the highest level of quality.
• Using data analysis and visualization tools, curators inspect and correct the data for
consistency, accuracy, completeness, correctness, timeliness, relevance, and
uniqueness.
• The process of improving data quality, involving error detection and correction,
identification and removal of duplicates, and restructuring, is collectively known as
data cleaning.
The Future of Biological Databases
• Because of high-performance computational platforms, these databases have
become important in providing the infrastructure needed for biological research,
from data preparation to data extraction.
• Bioinformatics tools should be streamlined for analyzing the growing amount of
data generated from genomics, metabolomics, proteomics, and metagenomics.
• The growth of biological databases will pave the way for further studies on proteins
and nucleic acids, impacting therapeutics, biomedicine, and related fields.
Sequence retrieval
• This is the operation of accessing the precise order of nucleotides in DNA or RNA,
or of amino acids in a protein.
• Sequence information comes from many sources, some of which are more reliable than
others in different aspects of sequence curation.
• Many methods have been employed in determining genomic sequences, including
Sanger sequencing in both manual and automated forms.
• Recent advances in bioinformatics provide an efficient and effective way of
accessing sequences by computer.
• When the name of a gene or its ID number is given, it is possible to find and
retrieve its DNA sequence.
• The most consistent source of sequence data is sequencing centers.
Types of Sequence Retrieval Databases
• The main resources for sequence retrieval are three large global nucleotide
sequence repositories.
• They include the following:
– National Center for Biotechnology Information (NCBI) database – (www.ncbi.nlm.nih.gov/)
– European Molecular Biology Laboratory (EMBL) database – (www.ebi.ac.uk/embl/)
– DNA Database of Japan (DDBJ) database – (www.ddbj.nig.ac.jp/)
• They collect all publicly available DNA, RNA and protein sequence data and make
it available for free.
CNTD
• Genome-centered databases include the following:
– NCBI genomes: Entrez Life Sciences Search Engine (US National Institutes of
Health) www.ncbi.nlm.nih.gov/sites/gquery
– Ensembl genome browser (European Bioinformatics Institute) www.ensembl.org
– UCSC genome bioinformatics site (University of California at Santa Cruz)
www.genome.ucsc.edu
• Protein databases include:
– SwissProt
– TrEMBL
– PDB
Sequence retrieval in a database - NCBI
• The National Center for Biotechnology Information (NCBI) is a U.S. government-funded
national resource for molecular biology information.
• The NCBI database contains essentially the same data as the EMBL/DDBJ databases.
• Sequences in the NCBI Sequence Database (or EMBL/DDBJ) are identified by an
accession number.
• As well as the sequence itself, the NCBI database (or the EMBL/DDBJ databases)
also stores some additional annotation data for each sequence.
• Some of this annotation data was added by the person who determined the sequence
and submitted it to the NCBI database.
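Retrieval by accession number can be sketched with NCBI's public E-utilities interface. The sketch below only builds the request URL and makes no network call. The endpoint, the `nuccore` database name, and the example accession `NM_000546` are assumptions not taken from the slides, and `build_efetch_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Base address of the E-utilities "efetch" service (assumed, not from the slides).
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, db="nuccore", rettype="fasta"):
    """Build a URL asking for one record, identified by accession, as plain text."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return EFETCH_URL + "?" + urlencode(params)


url = build_efetch_url("NM_000546")
print(url)
```

Opening the resulting URL, for example with `urllib.request.urlopen(url)`, should return the record in FASTA format, provided the service is reachable and its usage guidelines are respected.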
CNTD
• The Sequence Retrieval System (SRS) is a homogeneous interface to over 80 biological
databases, developed at the European Bioinformatics Institute (EBI).
• It includes databases of sequences, metabolic pathways, transcription factors,
application results (like BLAST, SSEARCH, FASTA), protein 3D structures,
genomes, mappings, mutations, and locus-specific mutations.
• The establishment of online computer databases for storing and distributing sequence
data has made bioinformatics an invaluable tool for sequence retrieval.
Sequence retrieval in a database – EMBL
• The EBI (European Bioinformatics Institute) provides a comprehensive set of
sequence similarity algorithms that can be accessed interactively from the
EMBL–EBI World Wide Web site (http://www.ebi.ac.uk/Tools/).
• The EMBL Nucleotide Sequence Database can be searched as a whole or by
individual taxonomic division.
• The most commonly used algorithms available are Fasta (14) and WUBlast.
• Fasta will find a single high-scoring gapped alignment between the query nucleotide
sequence and database sequences.
• Comparisons between a nucleotide sequence and the protein databases can be made
using fastx/y, whilst tfastx/y allows comparisons between a protein sequence and the
translated DNA databank.
• In total, more than 200 databases are available for searching at the EBI.
Data mining techniques in bioinformatics
• Data mining is the method of discovering meaningful patterns, associations and
trends by mining the enormous amounts of data stored in various data sources.
• Data mining techniques allow data that is available in various formats to be
integrated into a single format.
• Data preprocessing is the part of data mining that transforms the raw data into a
suitable format.
• Data preprocessing consists of data cleaning, data transformation and data reduction.
Steps of Data mining
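The three preprocessing steps named above (cleaning, transformation, reduction) can be sketched on a small numeric table. The choices made here are assumptions for illustration: cleaning drops rows with missing values, transformation is min-max scaling, and reduction removes columns that carry no variation. The values are invented.

```python
def clean_rows(rows):
    """Data cleaning: discard rows that contain a missing value (None)."""
    return [row for row in rows if all(value is not None for value in row)]


def min_max_scale(rows):
    """Data transformation: rescale every column to the range 0..1."""
    scaled_columns = []
    for column in zip(*rows):
        low, high = min(column), max(column)
        span = high - low
        scaled_columns.append([(v - low) / span if span else 0.0 for v in column])
    return [list(row) for row in zip(*scaled_columns)]


def drop_constant_columns(rows):
    """Data reduction: remove columns whose value never changes."""
    columns = list(zip(*rows))
    keep = [i for i, column in enumerate(columns) if len(set(column)) > 1]
    return [[row[i] for i in keep] for row in rows]


raw = [
    [2.0, 10.0, 1.0],
    [4.0, None, 1.0],  # missing measurement, removed by cleaning
    [6.0, 30.0, 1.0],
    [8.0, 20.0, 1.0],
]
prepared = drop_constant_columns(min_max_scale(clean_rows(raw)))
print(prepared)
```

After these steps the table has three rows and two columns, with every remaining value between 0 and 1.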
Classification
• Classification is one of the most popular data mining tasks; it assigns the items in
a collection to target class labels.
• The main function of classification is to correctly predict the target class label for
each given set of input features.
• For instance, in the case of breast cancer, a classification model helps to predict
whether a tumor is benign or malignant.
• To create a classification model, the model is first trained; this step is also
known as the model building step.
• Prediction is the process of inferring the missing value for a new observation.
• Decision Tree, Bayesian Network, SVM, Random Forest, and KNN are some popular
classification algorithms for model building.
• Gene classification and protein structure prediction are important biological
analyses performed with data mining classification techniques.
CNTD
• Classification techniques help make sense of huge biological datasets by finding
interesting patterns, making predictions, and assigning classes.
• This type of analysis includes protein structure prediction, gene classification, cancer
classification based on microarray data, identification of gene expression,
protein-protein interactions, etc.
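As an illustration of classification with KNN, one of the algorithms named above, here is a minimal sketch that labels a sample by majority vote among its nearest training samples. The two-feature "tumor" measurements are invented toy values, and `knn_predict` is a hypothetical helper, not a library function.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest samples."""
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Training set: (feature vector, class label). Values are made up.
train = [
    ((1.0, 1.2), "benign"),
    ((1.3, 0.9), "benign"),
    ((0.8, 1.1), "benign"),
    ((3.9, 4.2), "malignant"),
    ((4.1, 3.8), "malignant"),
    ((4.4, 4.0), "malignant"),
]

print(knn_predict(train, (1.1, 1.0)))  # benign
print(knn_predict(train, (4.0, 4.1)))  # malignant
```

KNN has no separate training phase: the "model" is simply the stored training set, so the model building step reduces to collecting labelled examples.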
Clustering
• Clustering is used to find groups (clusters) of data such that each cluster contains
the most closely matched data.
• Clustering is similar to classification, but the difference is that data are grouped
by their similarities rather than by predefined class labels.
• K-means (distance-based), Hidden Markov Model (HMM), and Expectation–
Maximization (EM) are some important clustering algorithms applied in
bioinformatics experiments.
• HMM is used for biological sequence analysis.
• Clustering is broadly applied in microarray analysis to overcome the limitations of
class finding, where target class labels are frequently not known when the
experiment begins.
• It is useful, for example, when a researcher is trying to find out whether a disease
in a specific tissue or condition affects gene expression, or whether gene expression
differs between two groups.
• Clustering can be successfully exploited to describe genes of unknown function and
uncover patterns.
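A minimal K-means sketch shows how distance-based clustering groups unlabelled data. The points stand for invented expression values of six genes under two conditions. Starting from the first `k` points is a simplification assumed here to keep the run deterministic; practical implementations choose starting centroids more carefully.

```python
import math


def k_means(points, k, iterations=10):
    """Group points into k clusters by repeatedly assigning and re-averaging."""
    centroids = [points[i] for i in range(k)]  # simple deterministic start
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


points = [(1.0, 1.1), (5.0, 5.2), (0.9, 1.0), (5.1, 4.9), (1.2, 0.8), (4.8, 5.0)]
centroids, clusters = k_means(points, k=2)
print(clusters[0])
print(clusters[1])
```

No class labels are supplied anywhere: the two groups emerge purely from the distances between points.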
Association
• Association is a data mining task that discovers the probability of the co-occurrence
of items in a database.
• The Apriori algorithm is a standard algorithm for finding associations among
frequent item sets.
• In medical bioinformatics, an associated co-disease can be predicted from one disease.
• The Apriori algorithm has been used to predict co-diseases in diabetic
patients by analyzing different patient data sets.
• According to the reported results, most diabetic patients are at risk of strokes
and heart attacks in the future.
• Association rules are mainly useful for finding co-diseases in medical data analysis,
biomedical work, protein arrangements, and survey data.
• Association rules help to recognize the possibility of other illnesses in the presence
of a certain disease.
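A simplified Apriori-style search can illustrate how frequent item sets and rule confidence are computed. The patient records below are invented and carry no clinical meaning. The sketch also skips the subset-based pruning of the full Apriori algorithm and simply counts support for every joined candidate.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def apriori(transactions, min_support):
    """Level-wise search for item sets whose support reaches `min_support`."""
    items = sorted({item for t in transactions for item in t})
    level = [frozenset([i]) for i in items
             if support(frozenset([i]), transactions) >= min_support]
    frequent = {}
    while level:
        for itemset in level:
            frequent[itemset] = support(itemset, transactions)
        # Join step: merge frequent sets that differ by exactly one item.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == len(a) + 1}
        # Keep only the candidates that are themselves frequent.
        level = [c for c in candidates if support(c, transactions) >= min_support]
    return frequent


def confidence(antecedent, consequent, transactions):
    """Estimated probability of `consequent` given that `antecedent` is present."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)


# Toy patient records (assumed data).
transactions = [
    {"diabetes", "hypertension", "stroke"},
    {"diabetes", "hypertension"},
    {"diabetes", "stroke"},
    {"hypertension"},
    {"diabetes", "hypertension", "stroke"},
]
frequent = apriori(transactions, min_support=0.6)
rule = confidence(frozenset({"diabetes"}), frozenset({"stroke"}), transactions)
print(len(frequent), round(rule, 2))  # 5 0.75
```

Here the rule "diabetes implies stroke" has confidence 0.75 because three of the four toy records containing diabetes also contain stroke.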
Outlier detection
• Outlier detection in bioinformatics data stream mining has received
substantial attention from the research community.
• Outlier detection is a way to identify outliers in a database.
• ODRioVFDT has been applied to bioinformatics streaming data processing for discovering
and computing information about life phenomena.
• It can help to diagnose and cure diseases more efficiently.
• Outlier detection is useful for testing abnormal reactions to new medical
therapies.
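The idea of flagging abnormal values can be shown with a much simpler technique than the streaming method named above: a modified z-score based on the median absolute deviation (MAD). This is a substitute chosen for brevity, not an implementation of ODRioVFDT. The response values and the 3.5 threshold are assumptions.

```python
import statistics


def mad_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds `threshold`."""
    median = statistics.median(values)
    deviations = [abs(v - median) for v in values]
    mad = statistics.median(deviations)
    if mad == 0:
        return []  # no spread to measure against
    # 0.6745 scales the MAD so the score is comparable to a standard z-score.
    return [v for v, d in zip(values, deviations) if 0.6745 * d / mad > threshold]


# Invented responses of six patients to a therapy; one reacts very differently.
responses = [5.1, 4.9, 5.0, 5.2, 4.8, 9.7]
print(mad_outliers(responses))  # [9.7]
```

The median and MAD are used instead of the mean and standard deviation because a single extreme value distorts them far less.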
Regression
• Regression is used mainly as a form of planning and modeling to estimate the
likely value of a given variable in the presence of other variables.
• The prime objective of regression is to examine the relationship between
variables.
• Regression is frequently used in bioinformatics to predict the output value of a
biological process for a particular biological system under defined conditions.
• A significant aspect of the linear regression model is determining the regression
weights assigned to each gene.
• Linear regression improves feature selection for gene selection based on each gene's
level of divergence from the typical gene expression regression line.
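For a single predictor, the least-squares line has slope $\sum (x - \bar{x})(y - \bar{y}) / \sum (x - \bar{x})^2$ and intercept $\bar{y} - \text{slope} \cdot \bar{x}$. The sketch below fits such a line and then ranks points by their divergence from it, in the spirit of the last bullet. The expression values are invented, and the helper names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residuals(xs, ys, slope, intercept):
    """Signed distance of each observation from the fitted line."""
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


# Toy expression levels of five genes in a reference and a test sample.
reference = [1.0, 2.0, 3.0, 4.0, 5.0]
observed = [2.1, 3.9, 6.2, 8.0, 9.8]

slope, intercept = fit_line(reference, observed)
gaps = residuals(reference, observed, slope, intercept)
most_divergent = max(range(len(gaps)), key=lambda i: abs(gaps[i]))
print(round(slope, 2), round(intercept, 2), most_divergent)  # 1.95 0.15 2
```

The gene at index 2 lies furthest from the fitted line, so under this toy criterion it would be the first candidate for selection.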
Prediction
• Prediction is one of the most significant data mining techniques, as it is used to
project the types of data expected in the future.
• In bioinformatics, predictions can be made from DNA and amino acid sequences.
• Using data mining techniques, one can predict a protein's function on the basis of
structural similarity, which could help to predict which molecules or drugs can bind
efficiently to the protein.
• This is of prime importance for drug design.
Conclusion
• Bioinformatics bridges the gap between biology and computer science, empowering
researchers to unlock the secrets encoded within biological sequences.
• Biological databases emerged as a response to the large volumes of data generated
by DNA sequencing technologies.
• Data storage is an important part of bioinformatics.
• Sequence retrieval is the operation of accessing the precise order of nucleotides in
DNA or RNA, or of amino acids in a protein.
• Data mining is the method of discovering meaningful patterns, associations and
trends by mining the enormous amounts of data stored in various data sources.
END
