Bioinformatics - The Genomic Revolution

The science of Bioinformatics, or computational biology, is increasingly being used to improve the quality of life as we know it. Bioinformatics has developed out of the need to understand the code of life, DNA. Massive DNA sequencing projects have evolved and aided in the growth of the science of informatics. DNA, the basic molecule of life, directly controls the fundamental biology of life. It codes for genes, which code for proteins, which determine the biological makeup of humans or any living organism. It is variations and errors in the genomic DNA which ultimately define the likelihood of developing diseases or resistance to these same disorders.

The ultimate goal of Bioinformatics is to uncover the wealth of biological information hidden in the mass of sequence data, to obtain a clearer insight into the fundamental biology of organisms, and to use this information to enhance the standard of life for mankind. It is being used now, and will be in the foreseeable future, in the areas of molecular medicine to help produce better and more customized medicines to prevent or cure diseases; it has environmental benefits in identifying waste cleanup bacteria; and in agriculture it can be used for producing high yield, low maintenance crops. These are just a few of the many benefits Bioinformatics will help develop.

The genomic era

The genomic era has seen a massive explosion in the amount of biological information available due to huge advances in the fields of molecular biology and genomics. Bioinformatics is the application of computer technology to the management and analysis of biological data. The result is that computers are being used to gather, store, analyze and merge biological data. Bioinformatics is an interdisciplinary research area at the interface between the biological and computational sciences. This new knowledge could have profound impacts on fields as varied as human health, agriculture, the environment, energy and biotechnology.

The greatest challenge facing the molecular biology community today is to make sense of the wealth of data that has been produced by the genome sequencing projects. Traditionally, molecular biology research was carried out entirely at the experimental laboratory bench, but the huge increase in the scale of data being produced in this genomic era has created a need to incorporate computers into the research process. Sequence generation, and its subsequent storage, interpretation and analysis, are entirely computer dependent tasks. However, the molecular biology of an organism is a very complex issue, with research being carried out at different levels, including the genome, proteome, transcriptome and metabolome levels. Following on from the explosion in the volume of genomic data, similar increases have been observed in the fields of proteomics, transcriptomics and metabolomics.

Recent developments in sequencing technology have produced the means of reading genes (DNA). The code describing even the smallest of organisms would fill many books, but scientists are ambitious people: they have started to decode "themselves" in the Human Genome Project. Vast computer databases accessible to researchers store this quantity of information. There are many different databases where DNA and protein sequence information are stored.

1. Molecular medicine

Disease may be inherited (as with an estimated 4000 hereditary diseases, including Cystic Fibrosis and Huntington's disease) or a result of the body's response to an environmental stress which causes alterations in the genome (e.g. cancers, heart disease, diabetes). The completion of the human genome means that we can search for the genes directly associated with different diseases and begin to understand the molecular basis of these diseases more clearly. This new knowledge of the molecular mechanisms of disease will enable better treatments, cures and even preventative tests to be developed.

1.1 More drug targets

At present, all drugs on the market target only about 500 proteins. With an improved understanding of disease mechanisms, and using computational tools to identify and validate new drug targets, more specific medicines that act on the cause, not merely the symptoms, of a disease can be developed. These highly specific drugs promise to have fewer side effects than many of today's medicines.

1.2 Personalized medicine

Clinical medicine will become more personalized with the development of the field of pharmacogenomics. This is the study of how an individual's genetic inheritance affects the body's response to drugs. At present, some drugs fail to make it to the market because a small percentage of the clinical patient population shows adverse effects to a drug due to sequence variants in their DNA. As a result, potentially life saving drugs never make it to the marketplace. Today, doctors have to use trial and error to find the best drug to treat a particular patient, as patients with the same clinical symptoms can show a wide range of responses to the same treatment. In the future, doctors will be able to analyze a patient's genetic profile and prescribe the best available drug therapy and dosage from the beginning.

1.3 Preventative medicine

With the specific details of the genetic mechanisms of diseases being unraveled, the development of diagnostic tests to measure a person's susceptibility to different diseases may become a distinct reality. Preventative actions, such as a change of lifestyle or having treatment at the earliest possible stage when it is more likely to be successful, could result in huge advances in our struggle to conquer disease.

1.4 Gene therapy

In the not too distant future, the potential for using genes themselves to treat disease may become a reality. Gene therapy is the approach used to treat, cure or even prevent disease by changing the expression of a person's genes. Currently, this field is in its infancy, with clinical trials for many different types of cancer and other diseases ongoing.

2. Microbial genome applications

Microorganisms are ubiquitous; they are found everywhere and have been found surviving in extreme conditions. Use has long been made of a variety of microbial properties in the baking and other industries. Taking advantage of the ability to determine complete genome sequences and their potential, the US Department of Energy (DOE) initiated the MGP (Microbial Genome Project) to sequence genomes of microbes useful in energy production, environmental cleanup, industrial processing and toxic waste reduction. By studying the genetic material of these organisms, scientists can begin to understand them at a very fundamental level and isolate the genes that give them their unique abilities to survive under extreme conditions.

2.1 Waste cleanup

Deinococcus radiodurans is known as the world's toughest bacterium and is the most radiation resistant organism known. Scientists are interested in this organism because of its potential usefulness in cleaning up waste sites that contain radiation and toxic chemicals. Microbial Genome Program (MGP) scientists are also determining the DNA sequence of the genome of C. crescentus, one of the organisms responsible for sewage treatment.
2.2 Climate change

Increasing levels of carbon dioxide emissions, mainly through the expanding use of fossil fuels for energy, are thought to contribute to global climate change. Recently, the DOE (Department of Energy, USA) launched a program to decrease atmospheric carbon dioxide levels. One method of doing so is to study the genomes of microbes that use carbon dioxide as their sole carbon source.

Sources: NCBI Education Page, NCBI Science Primer (Bioinformatics), EBI Education Page

Biological databases covered in the following sections:

Bibliographic Databases
Taxonomic Databases
Nucleotide Databases
Genomic Databases
Protein Databases
Microarray Databases

Bibliographic Databases

Scientific literature databases have been available since the 1960's; services that produce abstracts of the scientific literature began to make their data available in machine readable form in the early 1960's. You should be aware that none of the abstracting services has complete coverage. The best known is MEDLINE, and now PUBMED, abstracting mainly the medical literature. MEDLINE is accessible through EBI's SRS, and PUBMED is accessible through NCBI's ENTREZ. EMBASE is a commercial product for the medical literature. BIOSIS, the inheritor of the old Biological Abstracts, covers a broad biological field, while the Zoological Record indexes the zoological literature. CAB International maintains abstract databases in the fields of agriculture and parasitic diseases. AGRICOLA is for the agricultural field what MEDLINE is for the medical field. Beilstein Abstracts contains abstracts from the top journals in organic and related chemistry, published from 1980 to the present, and is available free. With the exception of MEDLINE, PUBMED and Beilstein Abstracts, the bibliographic databases are only available through commercial database vendors.

Taxonomic Databases - classification of all organisms

The Taxonomy Browser is a prominent taxonomic database maintained by the NCBI. The taxonomy is hierarchical and sequence-based, aiming to centralise the classification of all organisms represented in the databases with at least one nucleotide or protein sequence. The Taxonomy Browser can be used to view the taxonomic position of, or to retrieve sequence data for, a particular organism.

Nucleotide Databases

The international nucleotide sequence collaboration is a joint operation of EMBL-Bank at the European Bioinformatics Institute (EBI), the DNA Data Bank of Japan (DDBJ) at the Center for Information Biology (CIB), and GenBank at the National Center for Biotechnology Information (NCBI).

http://www.ddbj.nig.ac.jp/
http://www.ncbi.nlm.nih.gov/Genbank/

In Europe, the vast majority of the nucleotide sequence data produced is collected, organised and distributed by the EMBL Nucleotide Sequence Database, located at the EBI in Cambridge, an Outstation of the European Molecular Biology Laboratory (EMBL) in Heidelberg, Germany.

The nucleotide sequence databases are data repositories, accepting nucleic acid sequence data from the scientific community and making it freely available. The databases strive for completeness, with the aim of recording every publicly known nucleic acid sequence. These data are heterogeneous: they vary with respect to the source of the material (e.g. genomic versus cDNA), the intended quality (e.g. finished versus single pass sequences), the extent of sequence annotation, and the intended completeness of the sequence relative to its biological target (e.g. complete versus partial coverage of a gene or a genome). The nucleotide databases are distributed free of charge over the internet. DDBJ, GenBank and EMBL-Bank exchange new and updated data on a daily basis to achieve optimal synchronisation. The result is that they contain exactly the same information, except for sequences that have been added in the last 24 hours.
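Because the three repositories are synchronised, a record can be retrieved from any of them by its accession number. Below is a minimal sketch of doing so programmatically, assuming Biopython (Bio.Entrez and Bio.SeqIO) and network access to NCBI; the accession NC_001416 (the bacteriophage lambda genome) is used purely as an illustration, and the contact e-mail address is a placeholder.

# Hedged sketch: fetch one nucleotide entry from GenBank by accession number
# and inspect its identifier and annotation. Assumes Biopython and network
# access; the e-mail address is a placeholder required by NCBI's usage policy.
from Bio import Entrez, SeqIO

Entrez.email = "[email protected]"  # placeholder contact address

handle = Entrez.efetch(db="nucleotide", id="NC_001416",
                       rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")   # parse the GenBank flat file
handle.close()

print("Accession.version:", record.id)
print("Description:      ", record.description)
print("Length:           ", len(record.seq), "bp")
print("Feature count:    ", len(record.features))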
The EMBL Nucleotide Sequence Database (also known as EMBL-Bank) is divided into sections that reflect major taxonomic divisions. These taxonomic divisions include:

Invertebrates
Other Mammals
Mus musculus
Organelles
Bacteriophage
Plants

Each entry is identified by an accession number, followed by a sequence version number. The accession number part will be stable, but the version part is incremented whenever the sequence is updated.

Although the nucleotide sequence data are checked for integrity and obvious errors by the data library staff, the quality of the data is the responsibility of the submitter. As a consequence, there are many errors in the database: many sequence entries are mislabelled, contaminated, incompletely or erroneously annotated, or contain sequencing errors. In addition, the database is very redundant, in the sense that the same sequence from the same organism may be included many times, simply reflecting the redundancy of the original scientific reports.

Other types of nucleotide sequence databases

Genomes Server - gives access to a large number of complete genomes.
UniGene - a sequence-cluster database which addresses the redundancy problem by coalescing sequences that are sufficiently similar that one may reasonably infer they are derived from the same gene.
STACK - the 'Sequence Tag Alignment and Consensus Knowledgebase', another sequence-cluster database which addresses the same problem as UniGene.
EMBL-SVA - the 'EMBL Sequence Version Archive' server is a repository of all entries that have been made public since release 1 of the EMBL database. It comprises more than 100 million entries and includes entries pre-dating the first electronic release of the database in 1982.

Several specialised sequence databases are also available. Some of these deal with particular classes of sequence:

RDP - the 'Ribosomal Database Project' provides ribosome related data services to the scientific community, including online data analysis, rRNA derived phylogenetic trees, and aligned and annotated rRNA sequences.
HIV-SD - the 'HIV Sequence Database' collects, curates and annotates HIV and SIV sequence data and provides various tools for analysing these data.
IMGT - the 'ImMunoGeneTics database' specialises in Immunoglobulins, T cell receptors and the Major Histocompatibility Complex (MHC) of all vertebrate species.

Other nucleotide sequence databases focus on particular features, such as:

TRANSFAC - contains sequence information on transcription factors and transcription factor binding sites.
REBASE - the database for restriction enzymes and restriction enzyme sites.

Genomic Databases

For a number of organisms of major interest to geneticists, there is a long history of conventionally published catalogues of genes or mutations. In the past few years, most of these have been made available in electronic form, and a variety of new databases have been developed. These databases vary in the classes of data captured and in how these data are stored.

Genomes Server - this server gives access to hundreds of complete genome sequences, including those from archaea, bacteria, eukaryota, organelles, phages, plasmids, viroids and viruses.
Proteome Analysis - the Proteome Analysis database has been set up to provide comprehensive statistical and comparative analyses of the predicted proteomes of fully sequenced organisms.
Ensembl - a joint project between the EBI and the Wellcome Trust Sanger Institute that aims at developing a system that maintains automatic annotation of large eukaryotic genomes. Ensembl presents up-to-date sequence data and the best possible automatic annotation for metazoan genomes. Available now are human, mouse, rat, fugu, zebrafish, mosquito, Drosophila, C. elegans and C. briggsae.
FlyBase - the database for Drosophila melanogaster is one of the best-curated genetic databases.
MGD - the 'Mouse Genome Database' is one of the most comprehensively curated genetic databases.
RGD - the 'Rat Genome Database' curates and integrates rat genetic and genomic data and provides access to these data to support research using the rat as a genetic model for the study of human disease.
SGD - the 'Saccharomyces Genome Database' is another major yeast database. The MIPS yeast database is an important resource for information on the yeast genome and its products.
SPGP - the 'S. pombe Genome Project', based at the Sanger Institute, provides data on the fungus Schizosaccharomyces pombe.
AceDB - the database for genetic and molecular data concerning Caenorhabditis elegans. The database management system written for AceDB by R. Durbin and J. Thierry-Mieg has proved very popular and has been used in many other species-specific databases; AceDB is now the name of this database management system, resulting in some confusion relative to the C. elegans database. The entire database can be downloaded from the Sanger Institute.
HIV-SD - the 'HIV Sequence Database' collects, curates and annotates HIV and SIV sequence data and provides various tools for analysing these data.

Other examples of genomic databases include the CGSC database, which holds information on the strains in the CGSC collection, gene names, properties and linkage map, gene product information, and information on specific mutations. The 'E. coli Database collection' (ECDC), in Giessen, maintains curated gene-based sequence records for E. coli. EcoCyc, the 'Encyclopedia of E. coli Genes and Metabolism', is a database of E. coli genes and metabolic pathways.

Primary protein sequence databases - Swiss-Prot

Every entry in Swiss-Prot is thoroughly analysed and annotated by biologists, ensuring a high standard of annotation and the quality of the database. Swiss-Prot contains data that originates from a wide variety of organisms, from more than 6,000 different species. Half of the entries come from about 20 organisms which are the target of many studies (ranked by number of entries):

Homo sapiens
Saccharomyces cerevisiae
Escherichia coli
Mus musculus
Rattus norvegicus
Bacillus subtilis
Caenorhabditis elegans
Haemophilus influenzae
Schizosaccharomyces pombe
Methanococcus jannaschii
Bos taurus
Drosophila melanogaster
Mycobacterium tuberculosis
Gallus gallus
Arabidopsis thaliana
Salmonella typhimurium
Xenopus laevis
Synechocystis sp. (strain PCC 6803)
Sus scrofa
Oryctolagus cuniculus

More detailed statistics relating to Swiss-Prot composition can be obtained from the Swiss-Prot statistics page. The Swiss-Prot user manual is a comprehensive description of the database and its entries.
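Individual Swiss-Prot entries can be retrieved and parsed programmatically. The sketch below assumes Biopython (Bio.ExPASy and Bio.SwissProt) and network access to ExPASy; the accession P04637 (human p53) is used only as an illustrative example.

# Hedged sketch: fetch one Swiss-Prot entry from ExPASy and print a few of
# its annotated fields. Assumes Biopython and network access.
from Bio import ExPASy, SwissProt

accession = "P04637"                       # example accession (human p53)
handle = ExPASy.get_sprot_raw(accession)   # raw Swiss-Prot flat-file text
record = SwissProt.read(handle)
handle.close()

print("Entry name:     ", record.entry_name)
print("Description:    ", record.description)
print("Organism:       ", record.organism)
print("Sequence length:", record.sequence_length)
print("Cross-references:", len(record.cross_references), "links to other databases")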
Primary protein sequence databases - TrEMBL

There has been a tremendous increase in the amount of sequence data available, due to technological advances such as sequencing machines, the use of new biochemical methods such as PCR, and the implementation of projects to sequence complete genomes; together these have produced a flood of sequence information. Maintaining the high quality of Swiss-Prot entries is a time-consuming process that involves the extensive use of sequence analysis tools and manual annotation. TrEMBL, created in 1996 as a computer-annotated supplement to Swiss-Prot, makes new sequences available more quickly and also contains classes of entries (such as synthetic, truncated, pseudogene, patented and small fragment sequences) which Swiss-Prot is not interested in annotating.

Primary protein sequence databases - The Protein Information Resource (PIR)

PIR (Barker et al., 2001) was established in 1984 by the National Biomedical Research Foundation (NBRF) as a successor of the original NBRF Protein Sequence Database, developed over a long period by the late Margaret O. Dayhoff and published as the "Atlas of Protein Sequence and Structure" (Dayhoff et al., 1965; Dayhoff, 1979). Since 1988 the database has been maintained by PIR-International, a collaboration between the NBRF, the Munich Information Center for Protein Sequences (MIPS) and the Japan International Protein Information Database (JIPID).

The database is partitioned into four sections: PIR1, PIR2, PIR3 and PIR4. Entries in PIR1 are fully classified by superfamily assignment, fully annotated and fully merged with respect to other entries in PIR1. The annotation content as well as the level of redundancy reduction varies in PIR2 entries; many entries in PIR2 are merged, classified and annotated. Entries in PIR3 are not classified, merged or annotated; PIR3 serves as a temporary buffer for new entries. PIR4 was created to include sequences identified as not naturally occurring or expressed, such as known pseudogenes, unexpressed ORFs, synthetic sequences, and non-naturally occurring fusion, crossover or frameshift mutations. PIR also provides some degree of cross-referencing to other biomolecular databases by linking to the EMBL/DDBJ/GenBank nucleotide sequence databases, PDB, GDB, FlyBase, OMIM, SGD and MGD.

Specialised protein sequence databases

There are many specialised protein sequence databases. Some of them are quite small and only contain a handful of entries, while others are wider in scope and larger in size. As this category of databases is quite changeable, any list provided here would soon be outdated; however, a document is available which lists information sources for molecular biologists and is kept constantly up-to-date. A brief description of some specialised protein sequence databases follows:

Proteome Analysis - the Proteome Analysis database has been set up to provide comprehensive statistical and comparative analyses of the predicted proteomes of fully sequenced organisms.
GOA - GO, or Gene Ontology, is an international consortium of scientists with the editorial office based at the EBI. The goal of the GO consortium is to produce a dynamic controlled vocabulary that can be applied to all organisms, even while knowledge of gene and protein roles in cells is still accumulating and changing. The Gene Ontology Annotation (GOA) project is run in conjunction with GO and applies this controlled vocabulary to a non-redundant set of proteins described in the Swiss-Prot and TrEMBL databases.
MEROPS - this database (Rawlings and Barrett, 1999) provides a catalogue and structural classification of peptidases (i.e. all proteolytic enzymes), with an index of the peptidases by name and by organism.

Very often the sequence of an unknown protein is too distantly related to any protein of known structure to detect its resemblance by overall sequence alignment, but it can be identified by the occurrence in its sequence of a particular cluster of residue types, commonly known as a pattern, motif, signature, or fingerprint.
These motifs arise because of particular requirements on the structure of specific region(s) of a protein, which may be important, for example, for their binding properties or for their enzymatic activity. These requirements impose very tight constraints on the evolution of those limited (in size) but important portion(s) of a protein sequence. A signature modelling such a site must be as short as possible, should detect all or most of the sequences it is designed to describe, and should not give too many false positive results. In other words, it must exhibit both high sensitivity and high specificity.

There are a few databases available which use different methodologies and a varying degree of biological information on the characterised protein families, domains and sites. Examples of secondary protein databases include:

PROSITE - the special value of this database is the extensive documentation on many protein families, as defined by sequence domains or motifs. PROSITE contains biologically significant sites and patterns formulated in such a way that, with appropriate computational tools, it can rapidly and reliably identify to which family of proteins a new sequence belongs. The profile structure used in PROSITE is similar to, but slightly more general than, the one introduced by Gribskov and co-workers (Gribskov et al., 1987). Generalised profiles are remarkably similar to the specific type of Hidden Markov Models (HMMs) used in Pfam. A small sketch of this kind of pattern matching follows this list.

PRINTS - a different approach to pattern recognition, termed "fingerprinting", is used by this database. Within a sequence alignment, it is usual to find not one, but several motifs that characterise the aligned family. Diagnostically, it makes sense to use many, or all, of the conserved regions to build a family signature. In a database search, there is then a greater chance of identifying a distant relative, whether or not all parts of the signature are matched. The ability to tolerate mismatches, both at the level of residues within individual motifs and at the level of motifs within the fingerprint as a whole, renders fingerprinting a powerful diagnostic technique.

Pfam - another important secondary protein database is Pfam. The methodology used by Pfam to create protein family or domain signatures is Hidden Markov Models (HMMs). HMMs are closely related to profiles, but are based on probability theory methods. These allow a direct statistical approach to identifying and scoring matches, and also to combining information from a multiple alignment with prior knowledge. One feature that distinguishes HMMs and profiles from regular expressions and fingerprints is that the former allow the full extent of a domain to be identified in a sequence. They are thus particularly useful when analysing multidomain proteins. The biggest drawback of Pfam is its lack of biological information (annotation) on the protein families.

BLOCKS - blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins. The blocks for the Blocks Database are made automatically from the highly conserved regions in groups of proteins documented in InterPro.

SBASE - this is a protein domain library, a database that contains annotated structural, functional, ligand-binding and topogenic segments of protein sequences, cross-referenced to all major sequence databases and sequence pattern collections.
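The sketch below illustrates how a PROSITE-style pattern can be translated into a regular expression and scanned against a protein sequence. The pattern used, N-{P}-[ST]-{P} (the N-glycosylation site signature), is a standard PROSITE example, while the toy sequence and the helper function prosite_to_regex are invented for illustration; real analyses would use the PROSITE entries themselves and dedicated scanning tools.

# Minimal sketch: convert a simple PROSITE pattern into a Python regular
# expression and scan a protein sequence for matches. The helper only covers
# the common syntax elements ('-' separators, x, [...], {...}, (n)/(n,m)).
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE pattern into a Python regular expression."""
    regex = []
    for element in pattern.rstrip(".").split("-"):
        # split off an optional repetition count such as (2) or (2,4)
        m = re.match(r"^(.+?)(?:\((\d+(?:,\d+)?)\))?$", element)
        core, count = m.group(1), m.group(2)
        if core == "x":
            part = "."                        # any residue
        elif core.startswith("[") and core.endswith("]"):
            part = core                       # allowed residues
        elif core.startswith("{") and core.endswith("}"):
            part = "[^" + core[1:-1] + "]"    # forbidden residues
        else:
            part = core                       # a literal residue
        if count:
            part += "{" + count + "}"
        regex.append(part)
    return "".join(regex)

pattern = "N-{P}-[ST]-{P}"          # PROSITE N-glycosylation site signature
sequence = "MKNNSTWLLPVSNATGHIK"    # toy protein sequence (invented)

rx = re.compile(prosite_to_regex(pattern))
for match in rx.finditer(sequence):
    print(f"Match '{match.group()}' at position {match.start() + 1}")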
Secondary protein databases

These secondary protein sequence databases have become vital tools for identifying distant relationships in novel sequences and hence for inferring protein function. These databases have evolved by using signature-recognition methods to address different sequence analysis problems, resulting in rather different and independent databases. To perform a comprehensive analysis, a user therefore has to know several important things. For example, what are the resources and where can they be found? What is the difference between them in terms of diagnostic performance and family coverage? What do the different search outputs mean? Is it sufficient to use just one of the databases, and if so, which one?

Diagnostically, the most commonly used secondary protein databases, PROSITE, PRINTS and Pfam, have different areas of optimum application owing to the different strengths and weaknesses of their underlying analysis methods, such as regular expressions, profiles, HMMs and fingerprints. For example, regular expressions are likely to be unreliable in the identification of members of highly divergent superfamilies, whereas profiles and HMMs excel; fingerprints perform relatively poorly in the diagnosis of very short motifs, whereas regular expressions do well; and profiles and HMMs are less likely to give specific sub-family diagnoses, whereas fingerprints excel.

While all of the resources share a common interest in protein sequence classification, some, such as Pfam, focus on divergent domains; some, such as PROSITE, focus on functional sites; and others, such as PRINTS, focus on families, specialising in hierarchical definitions from super-family down to sub-family levels in order to pin-point specific functions. A number of sequence cluster databases, such as ProDom, are also commonly used in sequence analysis, for example to facilitate domain identification.

Secondary protein databases - InterPro

Unfortunately, these secondary databases do not share the same formats and nomenclature as each other, which makes the use of all of them in an automated way difficult. In response to this, the Swiss-Prot and TrEMBL group at the EBI has developed the Integrated resource of Protein domains and functional sites, more commonly known as InterPro (Apweiler et al., 1996). This database is an integration of the PROSITE, PRINTS, Pfam and ProDom databases. InterPro allows users access to a wider, complementary range of site and domain recognition methods in a single package.

In the task of sequence characterisation, we need more reliable, concerted methods for identifying protein family traits and for inheriting functional annotation. This is especially important given our dependence on automatic methods for assigning functions to the raw sequence data issuing from genome projects. Rationalising this process by creating a single coherent resource for diagnosis and documentation of protein families is difficult, given entirely different database formats, different search tools and different search outputs. InterPro is an attempt to address some of these issues. This new resource provides an integrated view of a number of commonly used pattern databases, and provides an intuitive interface for text- and sequence-based searches.

Flat-files submitted by each of the groups were systematically merged and dismantled. Where relevant, family annotations were amalgamated, and all method-specific annotation was separated out.
This process was complicated by the relationships that can exist, both between entries in the same database and between entries in different databases. Different types of parent-child relationship were defined, corresponding to the differentiation into 'sub-types' and 'sub-strings'. A sub-string means that a motif or set of motifs is contained within a region of sequence encoded by a wider pattern. Examples would be: a PROSITE pattern is typically contained within a PRINTS fingerprint; or a fingerprint might be contained within a Pfam domain. A sub-type means that one or more motifs are specific for a sub-set of the sequences captured by another, more general pattern. Examples would be: a super-family fingerprint may contain several family- and sub-family-specific fingerprints; or a generic Pfam domain may include several family fingerprints.

Structure databases

The number of known protein structures is increasing very rapidly, and these are available through the Protein Data Bank (PDB). The Nucleic Acid Database (NDB) is the database for structural information about nucleic acid molecules. The EBI Macromolecular Structure Database (MSD) group is the European project for the management and distribution of data on macromolecular structures; it has close ties with the Research Collaboratory for Structural Bioinformatics (RCSB), which in collaboration with MSD maintains and administers the PDB. The aim of the MSD project is to build a relational database of the PDB, and to use it to clean up, maintain and distribute the PDB data. Clean-up in this context means that the data relating to a single structure are internally consistent and free of errors. The Cambridge Crystallographic Data Centre (CCDC) provides a database of structures of small molecules, of interest to biologists concerned with protein-ligand interactions.

Microarrays and gene expression databases

Microarray technology makes use of the sequence resources created by the genome projects and other sequencing efforts to answer the question: which genes are expressed in a particular cell type of an organism, at a particular time and under particular conditions? For instance, microarrays allow comparison of gene expression between normal and diseased (e.g. cancerous) cells. There are several names for this technology - DNA microarrays, DNA arrays, DNA chips, gene chips and others. Sometimes a distinction is made between these names, but in fact they are all synonyms, as there are no standard definitions for which type of microarray technology should be called by which name.

Microarray technology and applications

Microarrays exploit the preferential binding of complementary single-stranded nucleic acid sequences. A microarray is typically a glass slide, on to which DNA molecules are attached at fixed locations (spots). There may be tens of thousands of spots on an array, each containing a huge number of identical DNA molecules (or fragments of identical molecules), of lengths from twenty to hundreds of nucleotides. (According to quick napkin calculations by Wilhelm Ansorge and John Quackenbush in Schnookeloch in Heidelberg on 4 October 2001, the number of DNA molecules in a microarray spot is of the order of 10^7-10^8.) For gene expression studies, each of these molecules ideally should identify one gene or one exon in the genome; in practice, however, this is not always so simple and may not even be generally possible, due to families of similar genes in a genome. Microarrays that contain all of the approximately 6000 genes of the yeast genome have been available since 1997.
The spots are either printed on the microarrays by a robot, or synthesised in situ by photo-lithography (similar to computer chip production) or by ink-jet printing. [Figure: an illuminated microarray, enlarged.] A typical spot diameter is of the order of 0.1 mm, and for some microarray types it can be even smaller.

There are different ways in which microarrays can be used to measure gene expression levels. One of the most popular microarray applications allows the comparison of gene expression levels in two different samples, e.g. the same cell type in a healthy and a diseased state. The total mRNA from the cells in the two different conditions is extracted and labelled with two different fluorescent labels: for example, a green dye for cells at condition 1 and a red dye for cells at condition 2 (to be more accurate, the labelling is typically done by synthesising single-stranded DNAs that are complementary to the extracted mRNA, using an enzyme called reverse transcriptase). Both extracts are washed over the microarray. Labelled gene products from the extracts hybridise to their complementary sequences in the spots due to preferential binding: complementary single-stranded nucleic acid sequences tend to attract each other, and the longer the complementary parts, the stronger the attraction.

The dyes enable the amount of sample bound to a spot to be measured by the level of fluorescence emitted when it is excited by a laser. If the RNA from the sample in condition 1 is in abundance, the spot will be green; if the RNA from the sample in condition 2 is in abundance, it will be red. If both are equal, the spot will be yellow, while if neither is present it will not fluoresce and will appear black. Thus, from the fluorescence intensities and colours for each spot, the relative expression levels of the genes in both samples can be estimated.

The raw data produced from microarray experiments are the hybridised microarray images. To obtain information about gene expression levels, these images must be analysed: each spot on the array is identified, and its intensity measured and compared to the background. This is called image quantitation, and it is done by image analysis software. To obtain the final gene expression matrix from spot quantitations, all the quantities related to each gene (either on the same array or on arrays measuring the same conditions in repeated experiments) have to be combined, and the entire matrix has to be scaled to make different arrays comparable.
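The sketch below illustrates, with invented numbers, how two-channel spot intensities are commonly turned into log-ratios, a usual first step from image quantitation towards a gene expression matrix. It assumes numpy; real pipelines also perform proper background correction and normalisation, which are only hinted at here.

# Toy illustration: compute log2(red/green) ratios for a handful of spots.
# Positive values indicate higher expression in condition 2 (red dye),
# negative values higher expression in condition 1 (green dye).
import numpy as np

genes = ["geneA", "geneB", "geneC", "geneD"]
red   = np.array([1500.0,  800.0, 2100.0,  90.0])   # condition 2 intensities (invented)
green = np.array([ 750.0,  820.0,  300.0,  85.0])   # condition 1 intensities (invented)
background = 50.0                                    # made-up constant background level

r = np.maximum(red - background, 1.0)    # crude background subtraction
g = np.maximum(green - background, 1.0)
log_ratio = np.log2(r / g)

for name, value in zip(genes, log_ratio):
    colour = "red" if value > 0.5 else "green" if value < -0.5 else "yellow/near-equal"
    print(f"{name}: log2 ratio = {value:+.2f}  ({colour})")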
Gene expression monitoring is not the only microarray application; another is SNP detection.

ArrayExpress

Microarrays are already producing massive amounts of data. These data, like genome sequence data, can help us to gain insights into underlying biological processes only if they are carefully recorded and stored in databases, where they can be queried, compared and analysed by different computer software programs. The EBI is currently establishing a public repository for microarray gene expression data, ArrayExpress, analogous to EMBL-Bank for DNA sequence data. In many respects gene expression databases are inherently more complex than sequence databases (this does not mean that developing, maintaining and curating the sequence databases is any less challenging). Conceptually, a gene expression database can be regarded as consisting of three parts: the gene expression data matrix, gene annotation and sample annotation.

Gene expression data have meaning only in the context of the particular biological sample and the exact conditions under which the samples were taken. For instance, if we are interested in finding out how different cell types react to treatments with various chemical compounds, we must record unambiguous information about the cell types and compounds used in the experiments. The EBI is participating in an effort to develop ontologies for sample annotation, analogous to the gene ontology for gene description.

Gene annotation can be taken care of to some extent by links to sequence databases. However, the complicated many-to-many relationships between genes in the gene expression matrix and the features on the array make it necessary to provide a full and detailed description of each feature on the array: one gene can relate to several features on the array. The lack of standards in gene naming is another difficulty - a table relating each array feature present in the database to the list of all synonymous names of the respective gene is an essential part of a gene expression database.

Microarray technology is still rapidly developing, so it is natural that there are currently no established standards for microarray experiments or for how the raw data should be processed. There are also no standard measurement units for gene expression levels. In the absence of such standards, the information about how exactly the gene expression data matrix was obtained should be kept in the database if the data are to be properly interpreted later. ArrayExpress stores all this information, the details of which are called the Minimum Information About a Microarray Experiment (MIAME), defined by the Microarray Gene Expression Database (MGED) consortium. MGED is a grass-roots movement that was founded at a meeting at the EBI in 1999, is supported by most of the important players in the microarray community, and has evolved far beyond the EBI. Another repository for gene expression data, GEO, is being developed at NCBI in the US, and DDBJ in Japan also has plans for one. All three groups face similar problems and are involved in MGED to some degree. A common data exchange format, MAGE-ML, is being developed in collaboration between MGED (with active participation of the EBI) and some major microarray companies.

Gene expression data analysis and Expression Profiler

Capturing and storing microarray data is not an end in itself. The amounts of data from even a single microarray experiment are so large that software tools have to be used to make any sense of them. Clustering and class prediction are typical methods currently used in gene expression data analysis (see Microarray Data Analysis). One of the popular gene expression data analysis tools is Expression Profiler, developed at the EBI. The Microarray Informatics Team at the EBI is actively working in many microarray data analysis areas using this and other tools. An example of such research is an approach to reverse engineering of gene regulatory networks, which is based on the hypothesis that genes that have similar expression profiles (i.e. similar rows in the gene expression matrix) should also have similar regulation mechanisms, as there must be a reason why their expression is similar under a variety of conditions. Therefore, if we cluster the genes by similarities in their expression profiles and take the sets of promoter sequences from the genes in such clusters, some of these sets of sequences may contain a 'signal', a specific sequence pattern such as a particular substring, which is relevant to the regulation of these genes (Vilo et al., 2000).
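As a small illustration of the clustering step mentioned above, the sketch below hierarchically clusters genes by the similarity of their expression profiles (rows of a gene expression matrix). The matrix is random toy data and the cluster count is arbitrary; real analyses would start from normalised measurements. It assumes numpy and scipy are available.

# Hedged sketch: hierarchical clustering of toy expression profiles using a
# correlation-based distance and average linkage, then cutting the tree into
# four flat clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
n_genes, n_conditions = 20, 6
expression = rng.normal(size=(n_genes, n_conditions))   # toy gene expression matrix

distances = pdist(expression, metric="correlation")     # 1 - Pearson correlation
tree = linkage(distances, method="average")              # average-linkage clustering
clusters = fcluster(tree, t=4, criterion="maxclust")     # cut into 4 clusters

for cluster_id in sorted(set(clusters)):
    members = [f"gene{i}" for i in np.where(clusters == cluster_id)[0]]
    print(f"cluster {cluster_id}: {', '.join(members)}")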
DNA sequencing

DNA is the main carrier of genetic information in living organisms. DNA molecules are extremely long and large, and consist of repeated nucleotides. Nucleotides consist of a nitrogenous base, a sugar moiety and a phosphate molecule. Adenine (A), thymine (T), guanine (G) and cytosine (C) are the four nitrogenous bases that make up DNA. The structure of a DNA molecule is double stranded, consisting of two DNA strands wound around each other to form a double helix. The nucleotides of the two strands are complementary to each other, such that adenine cross-links with thymine (A-T) and guanine cross-links with cytosine (G-C). It is the arrangement of these four bases that makes all the difference.

The goal of DNA sequencing is to determine the order of bases for a specific piece of DNA. The first successful attempt at sequencing a small portion of a gene, completed in 1971, required three years of work to determine 12 bp from the termini of lambda phage DNA.

Methods of DNA Sequencing
1) MAXAM AND GILBERT METHOD
2) SANGER METHOD
3) AUTOMATED DNA SEQUENCING
4) SEQUENCING WITH CAPILLARY GEL ELECTROPHORESIS

1. Maxam-Gilbert Chemical Sequencing

For this method, an in vitro DNA polymerase reaction is not required. First, the DNA fragment to be sequenced has to be isolated and radioactively labelled at one end only. One strand is then separated from the other to yield a population of identical strands labelled on one end. The mixture is divided into four samples, each of which is subjected to a different chemical reagent that destroys one or two specific bases. The four reagents destroy only G, only C, A and G, or T and C. The loss of a base makes the sugar-phosphate backbone more likely to break at that point. The reagent concentration is adjusted so that only about one in 50 of the target bases is destroyed per DNA fragment. This results in a mixture of different sized pieces carrying the radioactive label. When these are separated in the different lanes of a gel, they can be arranged in order of length, and the base destroyed at each site can be determined by noting in which lane or lanes the band appears. Thus, the sequence of bases in the strand can quite simply be read from the pattern of bands on the gel.

In the G-specific reaction, dimethyl sulphate methylates guanine at the N-7 position, which leads to instability of the glycosidic linkage; piperidine then displaces the opened 7-methylguanine and catalyses the elimination of both phosphates from the sugar. Adenine also becomes methylated, at N-3 rather than N-7, but does not react with piperidine to cause strand breakage. In the G+A reaction, the bases are protonated and the glycosidic bonds are broken by treatment with acid; piperidine is finally used to eliminate the affected bases and cleave the strand. Hydrazine is used to attack the pyrimidines in the C+T reaction, with piperidine again cleaving the strand at the modified positions. Only the labelled DNA fragments are detected, via autoradiography or fluorescence analysis. The DNA sequence can then be read from the bottom of the gel to the top by examining or "reading" the ladder of DNA bands, giving the sequence 5' to 3' (5' -> 3').

The technique is time consuming, costly and tedious; however, it has advantages:
1) It can be used for sequencing DNA that is unstable when cloned, and DNA with regions of secondary structure.
2) It is used to analyze genomic footprinting.
3) It is used in the study of DNA methylation.

2. Sanger dideoxy DNA Sequencing

This is the most extensively used and well known method of determining the sequence of the bases that make up DNA. The underlying principle is that the incorporation of a dideoxynucleotide into an extending DNA strand terminates the elongation, through its inability to incorporate any further bases. Dideoxynucleotide sequencing is commonly called Sanger sequencing, since Sanger devised the method. This technique utilizes 2',3'-dideoxynucleotide triphosphates (ddNTPs), molecules that differ from deoxynucleotides by having a hydrogen atom attached to the 3' carbon rather than an OH group. These molecules terminate DNA chain elongation because they cannot form a phosphodiester bond with the next deoxynucleotide.

In order to perform the sequencing, one must first convert double stranded DNA into single stranded DNA. This can be done by denaturing the double stranded DNA with NaOH. A Sanger reaction consists of the following: a strand to be sequenced (one of the single strands which was denatured using NaOH); DNA primers (short pieces of DNA that are both complementary to the strand to be sequenced and radioactively labeled at the 5' end); a mixture of a particular ddNTP (such as ddATP) with its normal dNTP (dATP in this case); and the other three dNTPs (dCTP, dGTP and dTTP). The concentration of ddATP should be 1% of the concentration of dATP. The logic behind this ratio is that after DNA polymerase is added, polymerization will take place and will terminate whenever a ddATP is incorporated into the growing strand. If the ddATP is only 1% of the total concentration of dATP, a whole series of labeled strands will result. Note that the lengths of these strands depend on the location of the terminating base relative to the 5' end. This reaction is performed four times, using a different ddNTP for each reaction.

When these reactions are completed, polyacrylamide gel electrophoresis (PAGE) is performed. Each reaction is loaded into one lane, for a total of four lanes. The gel is transferred to a nitrocellulose filter and autoradiography is performed, so that only the bands with the radioactive label on the 5' end will appear. In PAGE, the shortest fragments migrate the farthest. Therefore, the bottom-most band indicates that its particular dideoxynucleotide was added first to the labeled primer. For example, if the band that migrated the farthest was in the ddATP reaction mixture, then ddATP must have been added first to the primer, and its complementary base, thymine, must have been the base present at the 3' end of the sequenced strand. One can continue reading in this fashion. Note that if one reads the bases from the bottom up, one is reading the 5' to 3' sequence of the strand complementary to the sequenced strand. The sequenced strand itself can be read 5' to 3' by reading, from top to bottom, the bases complementary to those on the gel.
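The short sketch below encodes the reading logic just described: the gel, read from the bottom up, gives the 5' to 3' sequence of the strand synthesised from the labelled primer, and the sequenced (template) strand is then obtained as its reverse complement. The example read is invented.

# Minimal sketch: derive the sequenced template strand (5'->3') from a gel
# read taken bottom-up (i.e. the 5'->3' sequence of the synthesised strand).
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A-T, G-C pairing)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))

gel_read_bottom_up = "ATGCGTAC"        # invented base calls, bottom band first

template_5_to_3 = reverse_complement(gel_read_bottom_up)
print("Strand read from the gel (5'->3'):  ", gel_read_bottom_up)
print("Sequenced template strand (5'->3'): ", template_5_to_3)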
Traditional methods of manual DNA sequencing utilize radioactive isotopes such as phosphorus-32, sulfur-35 and phosphorus-33, incorporated into specific nucleotides (A, C, T, G). Radioactively labeled nucleotides allow the sequence to be read by a technique known as autoradiography. The gel containing the separated DNA segments is exposed to X-ray film for a period of time; the radiation causes dark spots on the film, indicating the positions of the labeled fragments. Next, the film is developed to reveal the pattern of the labeled nucleotides. Since no process exists to discriminate the different nucleotides by the spots on the film, each labeled nucleotide must have its own lane on the gel. Therefore, four individual lanes are required for manual sequencing in order to determine the full DNA sequence. An individual must interpret the results of this process, and typically the results are entered into a computer for storage and linking to other results.

AUTOMATED DNA SEQUENCING

Automated deoxyribonucleic acid (DNA) sequencing reduces the volume of low-level radioactive waste generated on campus, while providing a suitable alternative to manual DNA sequencing. Traditional methods of manual DNA sequencing utilize radioactive isotopes to label the DNA. Automated DNA sequencing utilizes fluorescent tracers instead of radioisotopes to label the DNA, thereby eliminating or significantly reducing the use of radioactive materials in some research laboratories.

Automated DNA sequencing utilizes a fluorescent dye to label the nucleotides instead of a radioactive isotope. The fluorescent dye is not an environmentally hazardous chemical and has no special handling or disposal requirements. Instead of using X-ray film to read the sequence, a laser is used to stimulate the fluorescent dye, and the fluorescent emissions are collected on a charge coupled device that is able to determine the wavelength. The Perkin-Elmer Applied Biosystems (ABI) DNA sequencers are designed to discriminate all four fluorescent dye wavelengths simultaneously, which allows for complete DNA sequencing in one lane on the gel.

Varying degrees of automation are available. For full automation, all that is required is to load a sample tray with template DNA; the equipment performs the labeling and analysis. The other option is to perform the labeling reactions with fluorescent dyes, load the samples onto a gel, and place the gel into the DNA sequencer; the equipment then performs the separation and analysis. The system automatically identifies the nucleotide sequence and saves the information on the computer. Thus, only a review of the data is necessary to ensure no anomalies were misidentified by the computer.

Benefits

Automated DNA sequencing equipment can eliminate the need for radioactive isotopes to label DNA, thereby reducing the volume of low-level radioactive waste generated on campus. As a general approximation, one template of manual DNA sequencing produces 83 mL of liquid waste and 0.167 gallon of solid waste, so every 45 templates processed by automated rather than manual DNA sequencing avoids this waste accordingly. Time is also saved by not having to perform autoradiography or the associated tasks required for working with radioactive materials, such as surveys and inventory/disposal documentation.

SEQUENCING WITH CAPILLARY GEL ELECTROPHORESIS

Capillary sequencers use capillary gel electrophoresis rather than slab gel electrophoresis: a separate thin capillary gel is used for each DNA sample, replacing glass slabs with a series of gel-filled glass tubes, each about the width of a human hair. The sequencers use these multiple tiny (capillary) tubes to run standard electrophoretic separations. The separations are much faster because the tubes dissipate heat well and allow the use of much higher electric fields, completing sequencing in shorter times. The machines automatically load the samples, run the separation, detect the fluorescence and clean out the capillaries between runs.

Advantages:
1. Better resolution, with no running over from one lane to another.
2. Separation of bands occurs much faster - a 10- to 15-fold increase in speed.

Both advantages were very important for Celera's sequencing of the human genome.
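Automated sequencers of this kind write their base calls to trace files, which can be read programmatically. The sketch below assumes Biopython's SeqIO support for the "abi" format (ABI .ab1 trace files) and a local trace file; the filename "sample.ab1" is a placeholder, not something referenced in the text above.

# Hedged sketch: read the base calls from an automated-sequencer trace file.
# Requires Biopython and a real .ab1 file in place of the placeholder name.
from Bio import SeqIO

record = SeqIO.read("sample.ab1", "abi")   # "sample.ab1" is a placeholder path

print("Trace ID:        ", record.id)
print("Called sequence: ", str(record.seq)[:60], "...")
print("Read length:     ", len(record.seq), "bases")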
Applications:

DNA sequencing, first devised in 1975, has become a powerful technique in molecular biology, allowing analysis of genes at the nucleotide level. For this reason, the tool has been applied to many areas of research. For example, the polymerase chain reaction (PCR), a method which rapidly produces numerous copies of a desired piece of DNA, requires first knowing the flanking sequences of that piece. Another important use of DNA sequencing is identifying restriction sites in plasmids; knowing these restriction sites is useful when cloning a foreign gene into the plasmid.

Before the advent of DNA sequencing, molecular biologists had to sequence proteins directly; now amino acid sequences can be determined more easily by sequencing a piece of cDNA and finding an open reading frame. In eukaryotic gene expression, sequencing has allowed researchers to identify conserved sequence motifs and determine their importance in the promoter region. Furthermore, a molecular biologist can utilize sequencing to identify the site of a point mutation. These are only a few examples illustrating the way in which DNA sequencing has revolutionized molecular biology.
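As a simple illustration of the "find an open reading frame" idea mentioned above, the sketch below scans an invented cDNA fragment for the longest forward-strand ORF and translates it with the standard genetic code via Biopython. The helper function and the example sequence are illustrative only; a real analysis would also examine the reverse strand and all six reading frames.

# Minimal sketch: locate the longest ATG...stop open reading frame on the
# forward strand of a toy cDNA sequence and translate it to protein.
from Bio.Seq import Seq

def longest_forward_orf(dna: str) -> str:
    """Return the longest ATG...stop ORF found in the three forward frames."""
    stops = {"TAA", "TAG", "TGA"}
    best = ""
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if codon == "ATG" and start is None:
                start = i                      # remember the first start codon
            elif codon in stops and start is not None:
                orf = dna[start:i + 3]         # include the stop codon
                if len(orf) > len(best):
                    best = orf
                start = None
    return best

cdna = "GGCATGAAATTTGGCGCATGCTGATCCATGGTTCGATAAGGC"   # invented cDNA fragment
orf = longest_forward_orf(cdna)
print("Longest ORF:", orf)
print("Protein:    ", Seq(orf).translate())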
