An Overview On Gene Expression Analysis: Dr. R. Radha, P. Rajendiran
1 (Department of Computer Science, S. D. N. B. Vaishnave College of Women, Chromepet, Chennai, Tamil Nadu, India) 2 (Department of Computer Science, Vidyaa Vikas Educational Institutions, Tiruchengode, Namakkal, Tamil Nadu, India)
Abstract: Recent advances in DNA microarray technology, also known as gene chips, allow the expression of thousands of genes to be measured in parallel under multiple experimental conditions [1]. This technology is having a significant impact on genomic studies. Disease diagnosis, drug discovery and toxicological research all benefit from microarray technology. Arrays are now widely used in basic biomedical research for mRNA expression profiling and are increasingly being used to explore patterns of gene expression in clinical research.
Keywords: ANN, Classification, Clustering, Gene expression, Microarray
I. Introduction
Various approaches have recently been used for outcome prediction using gene expression data. It has been shown that specific patterns of gene expression occur during different biological states, such as cell development, and during normal physiological responses in tissues and cells. Many data mining techniques help to analyze gene expression data [2]. Quantitative expression patterns of many genes can be generated in parallel by using techniques based on complementary DNA (cDNA) microarrays [3], [4].
II. Gene Expression Data
A microarray experiment typically assesses a large number of DNA sequences (genes, cDNA clones, or expressed sequence tags [ESTs]) under multiple conditions. These conditions may be a time series during a biological process (e.g., the yeast cell cycle) or a collection of different tissue samples [5]. The original gene expression matrix obtained from a scanning process contains noise, missing values, and systematic variations arising from the experimental procedure. Within a gene expression matrix, there are usually several particular macroscopic phenotypes of samples related to some diseases or drug effects. Typically only a small subset of genes helps to distinguish these phenotypes; the remaining genes in the gene expression matrix are irrelevant to the division of samples of interest and thus are regarded as noise in the data set [5].
One recent effort to understand how genes contribute to disease uses expression analysis to discover sub-classes of diffuse large B-cell lymphoma (DLBCL) [6]. Attempts to discover sub-classes of DLBCL by relying exclusively on morphological features have not been successful [3]. Alizadeh et al. [6] demonstrate that the molecular profile of a tumor obtained from cDNA microarrays can indeed be interpreted as a robust and clear picture of the tumor biology.
As described in [7], Patrick Brown's lab at Stanford has used microarrays to measure gene expression levels for the entire yeast genome (approximately 6400 distinct cDNA sequences) during the diauxic shift (the transition from sugar metabolism to ethanol metabolism), sporulation, and the entire cell cycle. These data sets are publicly available. The Brown lab also has an online guide to building your own arrayer and scanner. These microarrays have been commercialized by the microarray division of Incyte Pharmaceuticals (formerly Synteni). Incyte Gene Expression Microarrays (GEMs) are available with templates from human, rat, mouse, plant and microbial genomes.
Different approaches have recently been used for outcome prediction using gene expression profiles. In the Cox proportional hazards regression method [8,9], the genes most related to survival are first identified by a univariate Cox analysis, and a risk score is then defined as a linear weighted combination of the expression values of the identified genes [10,11].
Advances in techniques for high-throughput data gathering, such as microarrays and DNA sequencing machines, have opened up new research avenues in genomics. Large-scale biological research, such as genome projects, is now producing enormous quantities of genomic data using these rapidly growing technologies. Transforming this massive volume of data into useful biological knowledge is the present challenge. Different analysis tools are being developed to detect and understand the phenomena of gene regulation and physiological function and to assess the quality of a genomic sequence [12].
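The risk score used with the Cox approach above is simply a weighted sum of expression values over the selected genes. The following is a minimal sketch of that one step only; the gene names, weights and expression values are hypothetical, and in practice the weights would come from a fitted Cox model rather than being typed in by hand.

```python
def risk_score(expression, coefficients):
    """Linear weighted combination of expression values.

    `coefficients` maps each selected gene to its weight (hypothetical values
    here); genes that were not selected are simply ignored.
    """
    return sum(weight * expression[gene] for gene, weight in coefficients.items())


# Hypothetical weights for two selected genes and one patient's expression profile.
coefficients = {"geneA": 0.5, "geneB": -1.0}
patient = {"geneA": 2.0, "geneB": 1.0, "geneC": 7.0}

score = risk_score(patient, coefficients)  # 0.5 * 2.0 + (-1.0) * 1.0
```

A higher score would indicate higher predicted risk; where to place the cut-off between high-risk and low-risk groups is a separate modelling decision that this sketch does not address.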
www.iosrjournals.org
III. Classification
In data mining, classification is one of the most important tasks. It maps the data onto predefined targets. It is a form of supervised learning because the targets are predefined. The aim of classification is to build a classifier based on some cases with some attributes to describe the objects or one attribute to describe the group of the objects. The classifier is then used to predict the group of new cases from the domain based on the values of the other attributes.
The systematic classification of tumor types is crucial to achieving advances in cancer treatment and research. It has been suggested that tailoring therapies to tumor types differentiated by pathogenetic patterns may maximize the efficacy of the treatment and minimize toxicity for patients [14, 6]. Several limitations of the conventional classification techniques based on morphological features of the tumor have been reported in the literature [15]. Moreover, by analyzing complex patterns defined by molecular markers, it has been demonstrated that there are subtypes of acute leukemia, prostate cancer and non-Hodgkin's lymphomas [14].
There are two useful tasks in cancer classification: prediction of classes and discovery of classes. The prediction task consists of the assignment of particular tumor samples to known types of cancer. The discovery task refers to the unsupervised identification of relevant groups of samples and the characterization of subtypes of cancer. Their research aims to implement a discovery task based on a global expression analysis approach. Most approaches to the computational analysis of gene expression data focus on the functionally significant classification of genes in an unsupervised fashion and on the discrimination of high-risk patients from low-risk ones. Supervised learning techniques, on the other hand, use a training set to optimize the discrimination model.
The Artificial Neural Network (ANN) is a supervised method and a powerful tool for accurately detecting causal relationships [13]. Tamayo et al. have illustrated the value of Kohonen's self-organizing feature maps (SOFM) [16] for interpreting gene expression patterns during the yeast growth cycle and hematopoietic differentiation [17]. They identified predominant gene expression patterns in those biological processes that suggested, for instance, novel hypotheses about hematopoietic differentiation useful for the treatment of acute promyelocytic leukaemia. Similarly, based on a SOFM, Golub et al. [14] approach the problem of molecular classification of cancer.
Classification of biomedical data faces a special challenge because of the characteristics of the data: too few data examples with too many features. How to improve the classification performance or the generalization ability of a classifier in the biomedical domain has become an active research area. One approach is to build a fusion model that combines multiple classifiers into a single classifier that can achieve better performance than any of its individual component classifiers [12]. Chen et al. [18] proposed a sum classifier fusion model that combines multiple SVMs by applying fuzzy logic and genetic algorithms.
The most straightforward classifier design approach is based on the concept of similarity. In this approach, the distances between the test pattern whose class is to be decided and the known representatives or prototypes of the classes are measured. Given a training set and a similarity measure or metric, the k-nearest neighbors (k-NN) classifier decides the class membership of a test sample by finding the class membership of the k closest samples in the training set and taking a majority vote.
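The k-NN decision rule described above can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance as the metric and using made-up two-gene expression profiles; the function name `knn_predict` and the data are hypothetical and not taken from any cited work.

```python
import math
from collections import Counter


def knn_predict(train_samples, train_labels, test_sample, k=3):
    """Assign `test_sample` to the majority class among its k nearest training samples."""
    # Pair each training label with its Euclidean distance to the test sample.
    by_distance = sorted(
        (math.dist(sample, test_sample), label)
        for sample, label in zip(train_samples, train_labels)
    )
    # Majority vote over the k closest samples.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training set: expression of two genes in six samples of two classes.
train_samples = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(train_samples, train_labels, (0.5, 0.5)))  # A
print(knn_predict(train_samples, train_labels, (5.5, 5.5)))  # B
```

With real microarray data each sample would have thousands of features rather than two, which is one reason gene selection usually precedes distance-based classification.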
The k-NN classifier, which assigns test samples to the class of the nearest observations in the training set, is often used as a benchmark for other classifiers, since it generally offers reasonable classification performance [47]. In the nearest mean classifier, the prototypes are the class means, also called centres or centroids. Tibshirani et al. [48] suggested an enhancement of the nearest centroid classifier called Nearest Shrunken Centroids (NSC). (NSC is also referred to as PAM, Prediction Analysis of Microarrays, after the name of the associated paper and software.) In NSC, weak components of the class centroids are shrunk or deleted via soft-thresholding. The classification accuracy (expressed in terms of training, test, and cross-validation error rates) and the number of present (or undeleted) genes are plotted against a parameter called delta that adjusts the amount of shrinkage, and an optimal value for delta is selected by examining the error rates. Shrinkage eliminates the information that does not contribute towards class prediction, i.e., noise. The contribution or strength of each class centroid to the classification is measured by a t-statistic, where the numerator is the difference between the individual class mean and the overall mean and the denominator is the pooled estimate of the standard deviation inflated by a fudge factor.
Another popular classifier design approach is based on Artificial Neural Networks. A neural network (NN) consists of many interconnected processing elements, called neurons, resembling the structure of the human brain. Through different structures (varying numbers of layers and of neurons per layer), linear or nonlinear transfer functions used by the individual neurons, and training paradigms during which the weights of the connections are adjusted or tuned, an NN can model or reveal complex relationships among inputs and outputs exemplified or embedded in the training data [32].
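The soft-thresholding step of NSC mentioned above can be illustrated in isolation. The sketch below assumes the standardized centroid differences (the t-statistic-like values) have already been computed; the numbers and the name `soft_threshold` are hypothetical, and this is not the PAM software itself.

```python
def soft_threshold(d, delta):
    """Shrink d towards zero by delta; values within delta of zero become exactly 0."""
    magnitude = abs(d) - delta
    if magnitude <= 0:
        return 0.0
    return magnitude if d > 0 else -magnitude


# Hypothetical standardized differences for four genes in one class.
differences = [2.5, -0.3, -1.75, 0.5]
delta = 1.0

shrunken = [soft_threshold(d, delta) for d in differences]
surviving_genes = sum(1 for d in shrunken if d != 0.0)

print(shrunken)         # [1.5, 0.0, -0.75, 0.0]
print(surviving_genes)  # 2
```

Raising delta deletes more genes from the centroid, which is why the text describes plotting error rates and the number of surviving genes against delta before choosing its value.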
IV. Clustering
Clustering problems arise in many different applications, such as data mining and knowledge discovery, data compression, pattern recognition and pattern classification. For gene expression data, the goal is to group similar genes so that genes within the same cluster are similar to each other and different from genes in other clusters [19]. Clustering techniques have proven helpful for understanding gene function, gene regulation, cellular processes, and subtypes of cells. Genes with similar expression patterns (co-expressed genes) can be clustered together with similar cellular functions [5]. The purpose of clustering gene expression data is to reveal the natural structure inherent in the data. A good clustering algorithm should depend as little as possible on prior knowledge, for example by not requiring a predetermined number of clusters as an input parameter. Clustering algorithms for gene expression data should be capable of extracting useful information from noisy data. Gene expression data are often highly connected and may have intersecting and embedded patterns [20]. Clustering algorithms that also provide some graphical representation of the cluster structure are much favored by biologists.
Numerous clustering techniques are presently available for gene expression data, such as hierarchical clustering, which was commonly used in the early days. A common problem with this method is that visualizing the clustering results as a dendrogram becomes difficult when the data set is large [21]. In the popular k-means clustering method, the user must define the number of clusters in advance and is often uncertain of the right value. Clustering may be either crisp or fuzzy [22]. In hard (crisp) clustering, data are divided into distinct clusters, where each data element belongs to exactly one cluster. In fuzzy clustering, an object may belong to more than one cluster, and associated with each element is a set of membership levels.
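To make the k-means discussion above concrete, here is a minimal sketch of hard (crisp) k-means in plain Python. It assumes Euclidean distance and, purely for simplicity, initializes the centroids with the first k points; the expression profiles are made up. Note that the number of clusters `k` must be supplied by the user, which is exactly the limitation described above.

```python
import math


def kmeans(points, k, iterations=10):
    """Hard (crisp) k-means: each point ends up in exactly one cluster."""
    # Naive initialization (an assumption of this sketch): the first k points.
    centroids = [tuple(p) for p in points[:k]]

    def nearest(p):
        # Index of the centroid closest to point p.
        return min(range(k), key=lambda c: math.dist(p, centroids[c]))

    for _ in range(iterations):
        labels = [nearest(p) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    # Final assignment against the last centroids.
    return [nearest(p) for p in points], centroids


# Hypothetical expression profiles of four genes under two conditions.
profiles = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.2, 4.8)]
labels, centroids = kmeans(profiles, k=2)
print(labels)  # [0, 1, 0, 1]
```

The result depends on the initial centroids as well as on k; with a different ordering of the same four profiles this naive initialization can settle on a poor partition, which is why practical implementations use repeated random restarts or smarter seeding.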
Fuzzy clustering of microarray data has an advantage over crisp partitioning because of the great amount of imprecision and uncertainty associated with gene expression data [23]. Fuzzy c-means [24] and genetic algorithms (GA) [25],[26] have been used effectively in clustering gene expression data. The fuzzy c-means algorithm requires the number of clusters as an input parameter. The GA-based algorithms have been found to detect biologically relevant clusters but depend on proper tuning of the input parameters. The authors of [27] have presented a framework for the unsupervised analysis of gene expression data. They developed an interrelated two-way clustering method, which they applied to the gene expression matrices transformed from the raw microarray data. This approach detects significant patterns within samples while dynamically selecting significant genes that manifest the conditions of actual empirical interest. Through iterative clustering the number of genes is reduced, which improves the accuracy of sample class discovery. The method was shown to be effective in experiments with two multiple sclerosis data sets and a leukemia data set.
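The difference between crisp and fuzzy assignment can be seen in the membership step of fuzzy c-means. The sketch below computes the standard fuzzy c-means membership degrees of one point for a fixed set of cluster centres; it assumes Euclidean distance and a fuzzifier m = 2, the centres and points are hypothetical, and the centre-update step of the full algorithm is omitted.

```python
import math


def fuzzy_memberships(point, centers, m=2.0):
    """Membership degree of `point` in each cluster; the degrees sum to 1."""
    dists = [math.dist(point, c) for c in centers]
    if any(d == 0 for d in dists):
        # The point coincides with a centre: give it full membership there.
        hits = [1.0 if d == 0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    # u_i = 1 / sum_k (d_i / d_k) ** (2 / (m - 1))
    return [1.0 / sum((di / dk) ** exponent for dk in dists) for di in dists]


# Two hypothetical cluster centres and two gene profiles.
centers = [(0.0, 0.0), (4.0, 0.0)]

near_first = fuzzy_memberships((1.0, 0.0), centers)  # approximately [0.9, 0.1]
midway = fuzzy_memberships((2.0, 0.0), centers)      # [0.5, 0.5]
```

A crisp method would assign the first profile entirely to the first cluster, whereas the fuzzy memberships retain the information that it is also weakly associated with the second; a profile equidistant from both centres is shared equally.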
V. Conclusion
Gene expression profiling has great potential for accurate cancer diagnosis. In this paper, we have discussed advances in techniques for high-throughput data gathering, such as microarrays and DNA sequencing machines, that have opened up new research avenues in genomics. Large-scale biological research, such as genome projects, is now producing enormous quantities of genomic data using these rapidly growing technologies. Different analysis tools are being developed to detect and understand the phenomena of gene regulation and physiological function and to assess the quality of a genomic sequence.
References
[1] M. B. Eisen, P. T. Spellman, P. O. Brown and D. Botstein, Cluster analysis and display of genome-wide expression patterns, Proc. Natl. Acad. Sci. USA, Vol. 95, pp. 14863-8, (1998).
[2] P. J. Russell, Fundamentals of Genetics, Second Edition, (San Francisco, Addison Wesley Longman Inc., 2000).
[3] M. Schena, D. Shalon, R. W. Davis and P. O. Brown, Quantitative monitoring of gene expression patterns with a complementary DNA microarray, Science, 270, 467-470, 1995.
[4] M. B. Eisen and P. O. Brown, DNA arrays for analysis of gene expression, Methods Enzymol., 303, 179-205, (1999).
[5] Daxin Jiang, Chun Tang and Aidong Zhang, Cluster analysis for gene expression data: a survey, IEEE Transactions on Knowledge and Data Engineering, Vol. 16, No. 11, November 2004, pp. 1370-1384.
[6] A. A. Alizadeh et al., Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling, Nature, 403, 503-511, (2000).
[7] D'haeseleer, Shoudan Liang and Roland Somogyi, PSB99 Tutorial: Gene Expression Analysis and Modeling.
[8] D. R. Cox, Regression models and life-tables (with discussion), J. R. Stat. Soc., B34: 184-220, (1972).
[9] M. Lunn and D. R. McNeil, Applying Cox regression to competing risks, Biometrics, 51: 524-532, (1995).
[10] D. G. Beer, S. L. Kardia, C. C. Huang, T. J. Giordano, A. M. Levin, D. E. Misek, L. Lin, G. Chen, T. G. Gharib, D. G. Thomas, M. L. Lizyness, R. Kuick, S. Hayasaka, J. M. Taylor, M. D. Iannettoni, M. B. Orringer and S. Hanash, Gene expression profiles predict survival of patients with lung adenocarcinoma, Nat. Med., 8(8): 816-823, 2002.
[11] A. Rosenwald et al., The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma, NEJM, 346(25): 1937-1947, 2002.
[12] R. Radha, Gene Expression Analysis, International Journal of Advanced Science and Technology, Vol. 33, August 2011.
[13] J. Khan et al., Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks, Nat. Med., 7: 673-679, 2001.
[14] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield and E. S. Lander, Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science, 286, 531-537, 1999.
[15] F. Azuaje, Interpretation of genome expression patterns: computational challenges and opportunities, to be published by IEEE Engineering in Medicine and Biology, November 2000.
[16] T. Kohonen, Self-Organizing Maps, (Heidelberg, Springer, 1995).
[17] P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E. Lander and T. R. Golub, Interpreting patterns of gene expression with self-organizing maps: methods and applications to hematopoietic differentiation, Proceedings of the National Academy of Sciences of the U.S.A., 96, 2907-2912, (1999).
[18] Xiujuan Chen, Yong Li, Robert Harrison and Yan-Qing Zhang, Genetic fuzzy classification fusion of multiple SVMs for biomedical data, Journal of Intelligent & Fuzzy Systems, Volume 18, Issue 6, December 2007, IOS Press, Amsterdam.
[19] Han and Kamber, Data Mining: Concepts and Techniques, (Elsevier Publications, 2006).
[20] D. Jiang, J. Pei and A. Zhang, DHC: a density-based hierarchical clustering method for time series gene expression data, in Proceedings of BIBE2003: 3rd IEEE International Symposium on Bioinformatics and Bioengineering, Bethesda, Maryland, 2003, p. 393.
[21] Anil K. Jain and Richard C. Dubes, Algorithms for Clustering Data, (Prentice Hall, New Jersey, 1988).
[22] P. Valarmathie, Dr. MV. Srinath, Dr. T. Ravichandran and K. Dinakaran, Hybrid fuzzy c-means clustering technique for gene expression data, International Journal of Research and Reviews in Applied Sciences, ISSN: 2076-734X, EISSN: 2076-7366, Volume 1, Issue 1, October 2009.
[23] Anirban Mukhopadhyay, Ujjwal Maulik and Sanghamitra Bandyopadhyay, Efficient two stage fuzzy clustering of microarray gene expression data, International Conference on Information Technology (ICIT'06), 2006, IEEE.
[24] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, (New York, Plenum Press, 1981).
[25] S. Bandyopadhyay, A. Mukhopadhyay and U. Maulik, An improved algorithm for clustering gene expression data, Bioinformatics, Vol. 23(21), pp. 2859-2865, 2007.