ADDIS ABABA SCIENCE AND TECHNOLOGY UNIVERSITY
COLLEGE OF APPLIED AND NATURAL SCIENCE
DEPARTMENT OF BIOTECHNOLOGY
SECTION – B
BIOINFORMATICS GROUP ASSIGNMENT
DATA STORING AND RETRIEVING

Group Members        ID NO
Aron Gebremedhin     ETS1824/14
Blen Fekadu          ETS1852/14
Deborah Fromsa       ETS1861/14
Eliab Mihret         ETS0516/14
Seret Baraki         ETS1967/14
Sitra Hassen         ETS1970/14
Submitted to: Dr. Adugna A.
Submission date: 21 May, 2024 G.G.

CONTENTS
1. Introduction
2. Biological Databases
2.1 Why are these Important?
2.2 Kinds of Biological Databases
3. Data Storage
3.1 Bioinformatics Data
3.2 Transforming Data to Knowledge
3.3 Data Warehousing
3.3.1 Data Warehouse Architecture
4. Data Quality
4.1 The Future of Biological Databases
5. Sequence retrieval
5.1 Types of Sequence Retrieval Databases
5.1.1 Sequence retrieval in a database – NCBI
5.1.2 Sequence retrieval in a database – EMBL
6. Data mining techniques in bioinformatics
6.1 Classification
6.2 Clustering
6.3 Association
6.4 Outlier detection
6.5 Regression
6.6 Prediction
7. Conclusion

Introduction
• Bioinformatics bridges the gap between biology and computer science, empowering researchers to unlock the secrets encoded within biological sequences.
• Sequence retrieval, a fundamental bioinformatics technique, allows scientists to access and analyze the vast databases of biological sequences.
• This data serves as the blueprint for life, holding the key to understanding gene function, evolutionary relationships, and potential drug targets.
• Sequence retrieval tools like BLAST (Basic Local Alignment Search Tool) enable researchers to query these databases using a known sequence.
• Bioinformatics researchers are constantly developing new algorithms and databases to facilitate rapid and accurate sequence retrieval.

Biological database
• Biological databases emerged as a response to the huge volume of data generated by low-cost DNA sequencing technologies.
• One of the first databases to emerge was GenBank, a collection of publicly available nucleotide sequences and their protein translations.
• It is maintained by the National Center for Biotechnology Information (NCBI), which is part of the National Institutes of Health (NIH).
• GenBank helped pave the way for the Human Genome Project (HGP).
• The data stored in biological databases is organized for optimal analysis and consists of two types: raw and curated (or annotated).
• Biological databases are complex, heterogeneous, dynamic, and yet inconsistent. The inconsistency is due to the lack of standards at the ontological level.

Why are databases important?
• Data is submitted directly to biological databases for indexing, organization, and data optimization.
• They help researchers find relevant biological data by making it available in a computer-readable format.
• Biological information becomes readily accessible through data mining tools, which saves time and resources.
• Biological databases can be broadly classified as sequence databases and structure databases.
• Structure databases hold protein structures, while sequence databases hold nucleic acid and protein sequences.

Kinds of Biological Databases
• Biological databases can be further classified as primary, secondary, and composite databases.
• Primary databases contain sequence or structure information only. Examples of primary biological databases include:
– Swiss-Prot and PIR for protein sequences
– GenBank and DDBJ for nucleotide sequences
– Protein Data Bank for protein structures
• Secondary databases store information derived from primary data, such as conserved sequences, active site residues, and signature sequences. Data from the Protein Data Bank, for example, is used to build secondary databases. Examples include:
– SCOP at Cambridge University
– CATH at University College London
– PROSITE of the Swiss Institute of Bioinformatics
– eMOTIF at Stanford

Kinds of Biological Databases (continued)
• Composite databases combine a variety of primary databases, which eliminates the need to search each one separately.
• Each composite database has its own search algorithms and data structures.
• The NCBI hosts such databases, including links to Online Mendelian Inheritance in Man (OMIM).

Data Storage
• Among the most significant DNA databases are DDBJ, GenBank, and EMBL.
• The major protein databases are:
– Swiss-Prot
– TrEMBL
– PIR
– Protein Data Bank
• Researchers typically draw on diverse information from multiple databases to support the planning of experiments or the analysis and interpretation of results.
• Data warehousing, a convenient solution for managing different views of data and ensuring data interoperability, has recently been applied in bioinformatics.
• The purpose of a data warehouse is to facilitate high-level analysis, summarization of information, and extraction of new knowledge hidden in the data.
Data Storage (continued)
• The types of bioinformatics databases are summarized in the two major catalogues of molecular databases: the annual database issue of Nucleic Acids Research, and DBCAT, a catalog of public databases.
• Bioinformatics data across diverse databases is often highly redundant.
• Frequently encountered data redundancy issues are:
– (1) fragments and partial entries of the same item (e.g. a sequence) may be stored in several source records;
– (2) databases update and cross-reference one another, with the negative side effect of occasionally creating duplicates and redundant entries and propagating errors;
– (3) the same sequence may be submitted to more than one database without cross-referencing those records; and
– (4) the "owners" of a sequence record may submit the sequence more than once to the same database.

Bioinformatics Data
• The bioinformatics data views include:
– biological sequences (DNA, RNA, and proteins)
– gene or protein expression
– functional properties
– molecular interactions
– clinical data, system

Transforming Data to Knowledge
• The transformation of data to knowledge is also known as knowledge discovery from databases (KDD).
• The need for KDD arises from recent progress in biotechnology, which has enabled the identification of raw DNA or protein sequences in large numbers.
• One such system comprises a set of well-defined and classified allergen entries combined with bioinformatics tools for the prediction of allergenicity and the analysis of allergic cross-reactivity.
• If a new sequence has a conserved region with sequence identity of six or more consecutive amino acids to any known allergen, it represents a potential allergen.
• The data preparation phase involves numerous steps of data conversion and data cleaning before the data is well organized and can be used for prediction or data mining tasks.
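
The six-residue rule above can be illustrated with a short sketch. It is not the tool used by any particular allergen database: the function names, the toy sequences, and the choice of plain exact k-mer matching are all assumptions made for illustration.

```python
from typing import Dict, Set


def shared_kmers(query: str, allergen: str, k: int = 6) -> Set[str]:
    """Return every length-k peptide that occurs in both sequences."""
    query, allergen = query.upper(), allergen.upper()
    # Index all k-mers of the known allergen once for fast lookup.
    allergen_kmers = {allergen[i:i + k] for i in range(len(allergen) - k + 1)}
    return {
        query[i:i + k]
        for i in range(len(query) - k + 1)
        if query[i:i + k] in allergen_kmers
    }


def flag_potential_allergens(
    query: str, known_allergens: Dict[str, str], k: int = 6
) -> Dict[str, Set[str]]:
    """Map allergen name -> shared k-mers, keeping only allergens with a hit."""
    hits = {}
    for name, seq in known_allergens.items():
        common = shared_kmers(query, seq, k)
        if common:
            hits[name] = common
    return hits


# Hypothetical toy data; these are not real allergen sequences.
KNOWN = {
    "toy_allergen_A": "MKTLLVAGHWQERTY",
    "toy_allergen_B": "PPGSSNNDDEEKK",
}
QUERY = "AAALLVAGHWCCC"

if __name__ == "__main__":
    for name, kmers in flag_potential_allergens(QUERY, KNOWN).items():
        print(name, sorted(kmers))
```

A query is flagged as soon as it shares one stretch of six identical residues with a known allergen. Exact matching is the simplest reading of the rule, so a real screening pipeline would probably combine it with alignment-based checks.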
Data Warehousing
• Currently, the application of data warehousing principles to bioinformatics data is not fully explored.
• The few examples of biological data warehouses are:
– a gene expression data warehouse
– GIMS, a genomic data warehouse
– TMID, a test bed implementation of view maintenance of protein sequences
• These data warehouses share two features:
– 1. Data extraction and integration from disparate sources, with alternative proposals for organizing the consolidated data.
– 2. Facilitation of data analysis, using one or more data mining techniques for the discovery of new knowledge (patterns, explanations, concepts, etc.) from the dataset.

Data Warehousing (continued)
• Data integration focuses on achieving interoperability of different views of the data, and several data integration solutions have been developed.
• The selection of data mining tools depends on the purpose of the data warehouse.
• Data integration systems for bioinformatics data adopt one of two approaches: virtual or materialized integration.
• Virtual integration systems provide a software layer on top of multiple data sources that are independently maintained.
• The main examples include:
– The Entrez retrieval system at NCBI, which facilitates access to and analysis of more than twenty interlinked databases.
– Other examples include Ensembl and GenoMax.

Data Warehouse Architecture
• A bioinformatics data warehouse requires several components for operation:
– (1) retrieval of data from databases,
– (2) a mechanism for cleaning data,
– (3) flexibility in manipulating the datasets, and
– (4) integration and design of purposeful analysis tools that can be used jointly or independently.
• The major components of a data warehouse provide for initial and incremental data integration, data annotation, and data mining.
• Data integration also includes subcomponents for data retrieval, data cleaning, and data transformation.
• Data cleaning support tools are used for filtering out irrelevant and redundant records.
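
As a rough illustration of filtering redundant records during data cleaning, the sketch below drops exact duplicates and fragments that are wholly contained in a longer entry. The record layout (an ID paired with a sequence string) and the function name are assumptions, and the quadratic comparison is only suitable for small datasets.

```python
from typing import List, Tuple

Record = Tuple[str, str]  # (record_id, sequence); an assumed, simplified layout


def remove_redundant(records: List[Record]) -> List[Record]:
    """Keep one copy of each sequence and drop fragments of longer entries."""
    # Normalise case and whitespace so trivially different copies compare equal.
    normalised = [(rid, "".join(seq.split()).upper()) for rid, seq in records]
    # Longest first, so every fragment is examined after its containing entry.
    ordered = sorted(normalised, key=lambda rec: len(rec[1]), reverse=True)

    kept: List[Record] = []
    for rid, seq in ordered:
        if not seq:
            continue  # empty entries carry no information
        # An exact duplicate is also a substring, so one test covers both cases.
        if any(seq in kept_seq for _, kept_seq in kept):
            continue
        kept.append((rid, seq))
    return kept


if __name__ == "__main__":
    toy = [
        ("r1", "ATGCCGTA"),
        ("r2", "atgccgta"),  # duplicate of r1 in a different case
        ("r3", "GCCG"),      # fragment of r1
        ("r4", "TTTT"),
        ("r5", ""),
    ]
    print(remove_redundant(toy))
```

Sorting by length first keeps the logic simple: each record only has to be compared against the entries already kept. A production warehouse would more likely use hashing or a dedicated clustering tool for this step.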
Data Warehouse Architecture (continued)
• Once the initial dataset is created, the annotation component enables the addition of value-added information from experimental results or related publications to the dataset.
• Data cleaning is also supported at the annotation stage, facilitating the removal of erroneous data propagated from the data sources.
• The analysis of data is enabled by incorporating general or specific data mining tools.

Data Quality
• Data quality generally refers to the "fitness" of the data in the databases.
• Data quality can be improved using data cleaning, the process of detecting and removing errors and discrepancies.
• Data quality can be assessed by measuring the agreement between the data views presented by a system and the real-world entities they describe.
• Attaining perfect data quality is impossible, but the developers of biological databases should strive for the highest level of quality.
• Using data analysis and visualization tools, curators inspect and correct the data for consistency, accuracy, completeness, correctness, timeliness, relevance, and uniqueness.
• The process of improving data quality, involving error correction, identification and removal of duplicates, and restructuring, is collectively known as data cleaning.

The Future of Biological Databases
• Because of high-performance computational platforms, these databases have become important in providing the infrastructure needed for biological research, from data preparation to data extraction.
• Bioinformatics tools should be streamlined for analyzing the growing amount of data generated from genomics, metabolomics, proteomics, and metagenomics.
• The growth of biological databases will pave the way for further studies on proteins and nucleic acids, impacting therapeutics, biomedicine, and related fields.

Sequence retrieval
• This is the operation of accessing the precise order of nucleotides or amino acids in a DNA, RNA, or protein molecule.
• Sequence information comes from many sources, some of which are more reliable than others in different aspects of sequence curation.
• Many methods have been employed to determine genomic sequences, including the Sanger method in its manual and automated forms.
• Recent advances in bioinformatics offer an efficient and effective way of accessing sequences by computer.
• When the name of a gene or its ID number is given, it is possible to find and retrieve its DNA sequence.
• The most consistent source of sequence data is the sequencing centers.

Types of Sequence Retrieval Database
• The main resources for sequence retrieval are three large databases that together serve as the global store of nucleotide sequences.
• They include the following:
– National Centre for Biotechnology Information (NCBI) database (www.ncbi.nlm.nih.gov/)
– European Molecular Biology Laboratory (EMBL) database (www.ebi.ac.uk/embl/)
– DNA Database of Japan (DDBJ) database (www.ddbj.nig.ac.jp/)
• They collect all publicly available DNA, RNA, and protein sequence data and make it available for free.

Types of Sequence Retrieval Database (continued)
• Genome-centered databases include the following:
– NCBI genomes: Entrez Life Sciences Search Engine (US National Institutes of Health), www.ncbi.nlm.nih.gov/sites/gquery
– Ensembl genome browser (European Bioinformatics Institute), www.ensembl.org
– UCSC genome bioinformatics site (University of California at Santa Cruz), www.genome.ucsc.edu
• Protein databases include:
– SwissProt
– TrEMBL
– PDB

Sequence retrieval in a database – NCBI
• The National Centre for Biotechnology Information (NCBI) is a U.S. government-funded national resource for molecular biology information.
• The NCBI database contains essentially the same data as the EMBL/DDBJ databases.
• Sequences in the NCBI Sequence Database (or EMBL/DDBJ) are identified by an accession number.
• As well as the sequence itself, the NCBI database (or the EMBL/DDBJ databases) stores additional annotation data for each sequence.
• Some of this annotation data was added by the person who determined the sequence and submitted it to the NCBI database.

Sequence retrieval in a database – NCBI (continued)
• The Sequence Retrieval System (SRS) is a homogeneous interface to over 80 biological databases that was developed at the European Bioinformatics Institute (EBI).
• It includes databases of sequences, metabolic pathways, transcription factors, application results (such as BLAST, SSEARCH, and FASTA), protein 3D structures, genomes, mappings, mutations, and locus-specific mutations.
• The establishment of online computer databases for storing and distributing sequence data has made bioinformatics an invaluable tool for sequence retrieval.

Sequence retrieval in a database – EMBL
• The EBI (European Bioinformatics Institute) provides a comprehensive set of sequence similarity algorithms that can be accessed interactively from the EMBL–EBI World Wide Web site (http://www.ebi.ac.uk/Tools/).
• The EMBL Nucleotide Sequence Database can be searched as a whole or by individual taxonomic division.
• The most commonly used algorithms available are FASTA (14) and WU-BLAST.
• FASTA finds a single high-scoring gapped alignment between the query nucleotide sequence and the database sequences.
• Comparisons between a nucleotide sequence and the protein databases can be made using fastx/y, whilst tfastx/y allows comparisons between a protein sequence and the translated DNA databank.
• In total, more than 200 databases are available for searching at the EBI.

Data mining techniques in bioinformatics
• Data mining is the method of discovering meaningful patterns, associations, and trends by mining the enormous amounts of data stored in various data sources.
• Data mining techniques allow data that is available in various formats to be integrated into a single format.
• Data preprocessing is the part of data mining that transforms the raw data into a suitable format.
• Data preprocessing consists of data cleaning, data transformation, and data reduction.

[Figure: Steps of data mining]

Classification
• Classification is one of the most popular data mining tasks. It assigns the items in a collection to target class labels.
• The main function of classification is to correctly predict the target class label for each given set of input features.
• For instance, in breast cancer a classification model helps to predict whether a tumor is benign or malignant.
• To create a classification model, the model is first trained. This step is also known as the model building step.
• Prediction is the process of inferring the missing value (the class label) for a new observation. Decision Tree, Bayesian Network, SVM, Random Forest, and KNN are some popular classification algorithms for model building.
• Gene classification and protein structure prediction are important biological analyses performed with data mining classification techniques.

Classification (continued)
• Classification techniques help make sense of huge biological datasets by finding interesting patterns and supporting prediction and classification.
• This type of analysis includes protein structure prediction, gene classification, cancer classification based on microarray data, identification of gene expression, protein-protein interactions, etc.

Clustering
• Clustering is used to find groups of data such that each cluster contains the most closely matched data.
• Clustering is similar to classification, but the groups are formed from similarities in the data itself rather than from predefined class labels.
• K-means (distance based), Hidden Markov Model (HMM), and Expectation–Maximization (EM) are some important clustering algorithms applied in bioinformatics experiments.
• HMM is used for biological sequence analysis.
• Clustering is broadly applied in microarray analysis to overcome the limitations of class finding, because the target class labels are frequently not known when the experiment begins.
• For example, a researcher may want to find out whether a disease in a specific tissue or under a specific condition affects the expression of a gene, or whether gene expression differs between two groups.
• Clustering can be successfully exploited to describe genes of unknown function and to uncover patterns.

Association
• Association is a data mining task that discovers the likelihood of the co-occurrence of items in a database.
• The Apriori algorithm is a standard algorithm for finding associations in frequent item sets.
• In medical bioinformatics, an associated co-disease can be predicted from one disease.
• The Apriori algorithm has been used to predict co-diseases in diabetic patients by analyzing different patient datasets.
• According to the results, most of the diabetic patients were at risk of strokes and heart attacks in the future.
• Association rules are mainly useful for finding co-diseases in medical data analysis, biomedical work, protein arrangements, and survey data.
• Association rules help to recognize the possibility of other illnesses in the presence of a certain disease.

Outlier detection
• Outlier detection in the mining of bioinformatics data streams has received considerable attention from the research community.
• Outlier detection is a way to identify outliers in a database.
• ODRioVFDT has been applied to the processing of bioinformatics streaming data for discovering and computing information about life phenomena.
• It can help to diagnose and cure disease more efficiently.
• Outlier detection is useful for detecting abnormal reactions to new medical therapies.
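
To make the idea of outlier detection concrete, here is a minimal sketch that uses the common interquartile range (IQR) rule on a small batch of measurements. It is deliberately not the ODRioVFDT streaming method named above, and the function name, the 1.5 multiplier, and the toy values are assumptions chosen for illustration.

```python
import statistics
from typing import List, Sequence, Tuple


def iqr_outliers(values: Sequence[float], factor: float = 1.5) -> List[Tuple[int, float]]:
    """Return (index, value) pairs lying outside the IQR fences."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    return [(i, v) for i, v in enumerate(values) if v < lower or v > upper]


if __name__ == "__main__":
    # Hypothetical expression-like measurements; the last one is suspicious.
    readings = [5.1, 4.9, 5.0, 5.2, 4.8, 5.0, 12.5]
    print(iqr_outliers(readings))
```

The IQR rule needs no assumption about the shape of the distribution, which makes it a reasonable first check before more specialised methods are tried. It works on a fixed batch, so streaming data would need a windowed or incremental variant.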
Regression
• Regression is used mainly as a form of planning and modeling to estimate the likely value of a given variable in the presence of other variables.
• The prime objective of regression is to examine the relationship between variables.
• Regression is frequently used in bioinformatics to predict the output value of a biological process for a particular biological system under defined conditions.
• A significant aspect of the linear regression model is determining the regression weight assigned to each gene.
• Linear regression improves feature selection for gene selection, based on how far each gene diverges from the typical gene expression regression line.

Prediction
• Prediction is one of the most significant data mining techniques, as it is used to project what the data will look like in the future.
• In bioinformatics, predictions can be made from DNA sequences and amino acid sequences.
• Using data mining techniques, one can predict a protein's function on the basis of structural similarity, which helps to predict which molecules or drugs can bind efficiently to the protein.
• This is of prime importance for drug design.

Conclusion
• Bioinformatics bridges the gap between biology and computer science, empowering researchers to unlock the secrets encoded within biological sequences.
• Biological databases emerged as a response to the huge volume of data generated by low-cost DNA sequencing technologies.
• Data storage is an important part of bioinformatics.
• Sequence retrieval is the operation of accessing the precise order of nucleotides or amino acids in a DNA, RNA, or protein molecule.
• Data mining is the method of discovering meaningful patterns, associations, and trends by mining the enormous amounts of data stored in various data sources.