Genomic DNA Analysis Using ANFIS and ANN: Günay Karlı, Adem Karadağ
International Burch University, Faculty of Engineering and IT, Department of IT, Sarajevo, Bosnia and Herzegovina 2 Bosna Sema - Educational Institutions, Sarajevo, Bosnia and Herzegovina
Abstract:- Gene expression is governed by the promoter, which frequently appears before its associated gene in a DNA sequence. To locate a gene in a given sequence, researchers first have to find the location of its promoter, so promoter identification remains a problem of major significance in biology. In this study, we employ ANN (Artificial Neural Network) and ANFIS (Adaptive Neuro-Fuzzy Inference System) classifiers to predict promoters in DNA sequences, and evaluate their performance. The results show that the classifiers compete with existing techniques for identifying promoter regions.
Keywords:- Promoter prediction, ANN, ANFIS, data mining, bioinformatics, DNA
I. INTRODUCTION
In biology the cell is regarded as the basic unit of life, being the lowest level of structure that supports all life activities [1]. Proteins are a major structural component of cells; they maintain the shape and structure of the cell and are responsible for molecular recognition and catalysis. DNA, which encodes the proteins, is the blueprint carrying all of the cell's information and the instructions for protein synthesis and regulation. Molecular biology posits that information is transmitted from the DNA strand through RNA to proteins, as illustrated in Fig. 1. In the initial stage, the RNA polymerase holoenzyme transcribes a gene into a messenger RNA (mRNA) sequence, using the DNA sequence as a template [2], [3]. This stage occurs at specific sites, given the high specificity of the chromosome orientations. "Blueprint" is a common name for the genetic code, which holds the instructions that cells need in order to adapt to their environment. Research corroborates that, in addition to carrying these instructions, the blueprint is the starting point for the synthesis of most molecules, including RNA and proteins. The instructions contained in the blueprint are read only through transcription and translation. In the subsequent stage, the ribosomes synthesize proteins according to the instructions carried on the mRNA. The overall process by which the information in a gene is used to synthesize a gene product is called gene expression.
The resulting gene products [2], [5] are mostly proteins, although non-protein-coding genes yield products such as rRNA and tRNA, small non-coding RNAs such as miRNA and piRNA, and diverse long non-coding RNAs that carry out regulatory functions. The specificity of transcription results from the RNA polymerase recognizing a region of the DNA sequence generally referred to as the promoter.
Fig. 2: Position of the promoter in a DNA sequence.
Promoters have a significant impact on transcription: these regions are numerous in the DNA molecule, which facilitates gene expression [3]. The promoter is chiefly responsible for initiating transcription from a DNA strand to an RNA strand. It is the sequence located upstream of the Transcriptional Start Site (TSS), as illustrated in Fig. 2 [6]. Transcription begins when the RNA polymerase binds to a promoter sequence in the DNA molecule and continues until the coding region has been transcribed. Starting from the promoter, the polymerase reads the DNA template strand from its 3' end toward its 5' end [26], [22].
II. MATERIAL AND METHODS
Promoter prediction has two classes, + and -. The + class denotes that a DNA sequence contains a promoter, and the - class denotes that it does not. This paper uses a supervised learning technique to predict promoter regions in DNA sequences.
2.1. Data Set
The E. coli promoter gene sequences were used to test the proficiency of the ANN. The data were collected from the UCI Repository [27] and comprise 106 promoter and non-promoter instances. This data set allows the ANN to be compared with the models existing in the literature, and results obtained on it are publicly available [4]. Each of the 106 DNA sequences is composed of 57 nucleotides. 53 of the sequences are labelled +, indicating the presence of a promoter, and the remaining 53 are labelled -. The (+) instances were aligned separately to allow for transcription. Each instance consists of a class label ((+) for a promoter), an instance name, and the DNA sequence itself, written with the letters A, T, G and C, which stand for Adenine, Thymine, Guanine and Cytosine [27].
2.2. Adaptive Neuro-Fuzzy Inference System (ANFIS)
ANFIS is a Sugeno-type fuzzy system placed in the framework of adaptive systems to facilitate adaptation and learning [28]. It is trained with a combination of least-squares estimation and back-propagation gradient descent.
In addition, a hybrid learning algorithm is used to identify the membership function parameters and the fuzzy IF-THEN rules, which have a single output or singleton [29]. The fuzzy inference system is considered here to have two inputs and a single output. A fuzzy if-then rule of the Takagi and Sugeno type [30] has the form:
If x is A and y is B then z = f(x, y)
where A and B are the fuzzy sets in the antecedent and z = f(x, y) is a crisp function in the consequent. f(x, y) is usually a polynomial in the input variables x and y. When f(x, y) is a first-order polynomial, the result is a first-order Sugeno fuzzy inference system, shown here with two rules:
Rule 1: If x is A1 and y is B1, then f1 = a1 x + b1 y + c1
Rule 2: If x is A2 and y is B2, then f2 = a2 x + b2 y + c2
The ANFIS structure (Fig. 3) contains five layers, excluding the input layer.
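The two-rule first-order Sugeno system above can be illustrated with a minimal Python sketch. It is not the implementation used in this study: the Gaussian membership functions, the product used for AND, and every parameter value are assumptions chosen for illustration. The overall output is taken as the firing-strength-weighted average of f1 and f2, the usual choice for Sugeno systems.

```python
import math

def gaussian(x, center, sigma):
    """Membership grade of x in a fuzzy set (assumed Gaussian shape)."""
    return math.exp(-0.5 * ((x - center) / sigma) ** 2)

def sugeno_two_rules(x, y, sets, consequents):
    """Evaluate a first-order Sugeno system.

    sets:        one (A_i, B_i) pair per rule, each set given as (center, sigma)
    consequents: one (a_i, b_i, c_i) triple per rule, f_i = a_i*x + b_i*y + c_i
    """
    weights, outputs = [], []
    for (set_a, set_b), (a, b, c) in zip(sets, consequents):
        # Firing strength of the rule: AND realized as a product (assumption).
        w = gaussian(x, *set_a) * gaussian(y, *set_b)
        weights.append(w)
        outputs.append(a * x + b * y + c)
    # Overall output: weighted average of the rule outputs.
    return sum(w * f for w, f in zip(weights, outputs)) / sum(weights)

# Hypothetical parameters, for illustration only.
sets = [((0.0, 1.0), (0.0, 1.0)),   # A1, B1
        ((1.0, 1.0), (1.0, 1.0))]   # A2, B2
consequents = [(1.0, 1.0, 0.0),     # f1 = x + y
               (2.0, 0.0, 1.0)]     # f2 = 2x + 1
z = sugeno_two_rules(0.5, 0.5, sets, consequents)
```

At x = y = 0.5 both rules fire equally in this made-up configuration, so z is the plain average of f1 = 1 and f2 = 2.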
Fig. 4: Neural dendrites, axon, and synapse [37]
When real neurons are modelled, their complexity is greatly simplified. An artificial neuron consists essentially of inputs, which play the role of synapses and are weighted by the strength of the respective signals, and a mathematical function that determines the activation of the neuron. A further function computes the output of the artificial neuron. According to [26], ANNs combine artificial neurons in order to process information. A precise definition of learning is difficult to give, even though the capability to learn is an essential characteristic of intelligence. In the ANN context, learning can be viewed as updating the network so that it performs a task efficiently [25], [29]. The desired output is obtained by modifying the ANN weights. Setting the weights by hand is complicated, if not impossible, so algorithms are used to adjust them [29], [30]. This paper uses the back-propagation algorithm, in which the ANN is organized in layers, signals are sent forward, and the errors are then propagated backwards [30]. The network receives its inputs through the neurons of the input layer, and the neurons of the output layer produce the output of the network. Fig. 5 illustrates a three-layer neural network with inputs and an output.
Fig. 5: Multi-layer neural network
Each neuron is composed of two units. The first unit adds the products of the weight coefficients and the input signals. The second unit realizes a nonlinear function, called the neuron activation function. Signal e is the adder output signal, and signal y is the output signal of the nonlinear element. Training requires a data set consisting of input signals x1 and x2 together with the desired output z. During network training, the ANN weights are modified by an algorithm that begins by applying both input signals from the training data set. The output signal values can then be determined for each neuron in the network [33] (see Fig. 5).
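The two units of a neuron and one weight modification can be sketched in a few lines of Python. This is a hedged illustration rather than the network trained in this study: the logistic (logsig) activation, the bias term, the learning rate and the numeric values are assumptions, and the update shown is a single gradient step for one output neuron under a squared-error loss.

```python
import math

def logsig(e):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-e))

def neuron(inputs, weights, bias=0.0):
    """Return (e, y): the adder output e and the activation output y."""
    e = sum(w * x for w, x in zip(weights, inputs)) + bias
    return e, logsig(e)

def train_step(inputs, weights, bias, target, rate=0.5):
    """One gradient-descent update of a single neuron toward target z."""
    _, y = neuron(inputs, weights, bias)
    # Error signal scaled by the derivative of logsig, y * (1 - y).
    delta = (target - y) * y * (1.0 - y)
    new_weights = [w + rate * delta * x for w, x in zip(weights, inputs)]
    return new_weights, bias + rate * delta

# Hypothetical training sample: inputs x1, x2 and desired output z.
x, z = [1.0, 2.0], 1.0
w, b = [0.5, -0.25], 0.0
e_before, y_before = neuron(x, w, b)
w, b = train_step(x, w, b, z)
e_after, y_after = neuron(x, w, b)
```

With these made-up numbers the adder output starts at e = 0, so y starts at 0.5 and moves toward the desired output after the update.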
Fig. 6: Teaching process of multi-layer NN
Each position in the 106 E. coli DNA sequences takes one of 4 values, which stand for A, T, G and C, i.e. Adenine, Thymine, Guanine and Cytosine. To train the ANN, the 57 nucleotides of each instance are presented as the ANN input. The network output is the class of the instance, either (+) or (-).
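One way to turn an instance into network inputs is sketched below. The paper does not state which numeric value is assigned to each letter, so the integer codes, the 1/0 class codes and the example sequence are all assumptions made for illustration.

```python
# Assumed integer codes for the four nucleotides (not taken from the paper).
NUCLEOTIDE_CODES = {"A": 1, "T": 2, "G": 3, "C": 4}
# Assumed numeric targets for the two classes.
CLASS_CODES = {"+": 1, "-": 0}

def encode_sequence(sequence, length=57):
    """Turn a nucleotide string into a list of numeric network inputs."""
    sequence = sequence.strip().upper()
    if len(sequence) != length:
        raise ValueError(f"expected {length} nucleotides, got {len(sequence)}")
    return [NUCLEOTIDE_CODES[base] for base in sequence]

def encode_instance(label, sequence):
    """Return (inputs, target) for one labelled instance."""
    return encode_sequence(sequence), CLASS_CODES[label]

# Made-up 57-nucleotide string, not a sequence from the data set.
example = "ATGC" * 14 + "A"
inputs, target = encode_instance("+", example)
```

A one-hot encoding (four inputs per position) is a common alternative when the ordering implied by integer codes is undesirable.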
III.
The experiments evaluate an approach for predicting and identifying promoters in DNA sequences using the ANN and ANFIS, and compare it with existing approaches from past research. A learning algorithm is applied in two phases, training and testing. Training establishes a classification model; testing applies the previously established model to unseen data. Standard 5-fold cross-validation was used to evaluate the ANN performance: the data set was randomly partitioned into 5 subsets, each keeping an equal ratio of (+) and (-) instances. The ANN was trained 5 times, each time using 4 subsets for training and retaining the remaining 1 for testing. As a result, 5 models were established during the cross-validation. The final prediction performance is the average of the results over the 5 test subsets.
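The partitioning described above can be sketched as follows. This is an illustrative Python sketch, not the authors' code: the function names and the random seed are hypothetical, and the labels list simply mirrors the 53 (+) and 53 (-) instances of the data set.

```python
import random

def stratified_folds(labels, k=5, seed=0):
    """Randomly split instance indices into k folds, class by class,
    so that every fold keeps the same ratio of classes."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        indices = [i for i, lab in enumerate(labels) if lab == cls]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin into the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds

def train_test_splits(folds):
    """Yield (train, test) index lists: each fold is the test set once."""
    for t, test in enumerate(folds):
        train = [i for f, fold in enumerate(folds) if f != t for i in fold]
        yield train, test

labels = ["+"] * 53 + ["-"] * 53
folds = stratified_folds(labels)
splits = list(train_test_splits(folds))
```

Because 53 is not divisible by 5, the folds hold 22 or 20 instances each, but every fold contains as many (+) as (-) instances.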
TP is true positive (promoter predicted as promoter)
FN is false negative (promoter predicted as non-promoter)
TN is true negative (non-promoter predicted as non-promoter)
FP is false positive (non-promoter predicted as promoter)
The detailed performance of the models in terms of sensitivity (SE), specificity (SP), accuracy (ACC) and Matthews correlation coefficient (MCC) is shown in the following tables. Several ANN structures with one layer, the logsig transfer function and the trainrp, traincgp and trainscg learning algorithms were tested, with the following results:
Table 1: The Performance of ANN-1
Transfer Function     logsig    logsig     logsig
Learning Algorithm    trainrp   traincgp   trainscg
Table 2: The Performance of ANN-2
Transfer Function     logsig    logsig     logsig
Learning Algorithm    trainrp   traincgp   trainscg
Table 3: The Performance of ANN-3
Transfer Function     logsig    logsig     logsig
Learning Algorithm    trainrp   traincgp   trainscg
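The SE, SP, ACC and MCC measures follow from the TP, FN, TN and FP counts through their standard definitions, sketched below in Python. The counts in the example are made up for illustration and are not results from this study; returning 0 when a denominator is zero is a convention assumed here.

```python
import math

def classification_metrics(tp, fn, tn, fp):
    """Return SE, SP, ACC and MCC from the four confusion-matrix counts."""
    def ratio(numerator, denominator):
        # Assumed convention: report 0.0 when the denominator is zero.
        return numerator / denominator if denominator else 0.0

    se = ratio(tp, tp + fn)                   # sensitivity
    sp = ratio(tn, tn + fp)                   # specificity
    acc = ratio(tp + tn, tp + fn + tn + fp)   # accuracy
    mcc = ratio(tp * tn - fp * fn,
                math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return {"SE": se, "SP": sp, "ACC": acc, "MCC": mcc}

# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=40, fn=10, tn=45, fp=5)
```

For these made-up counts SE is 0.80, SP is 0.90 and ACC is 0.85; MCC summarizes all four counts in a single value between -1 and 1.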
In the above experiments, the best result was an accuracy (ACC) of 0.84, produced by the ANN with one layer of 40 neurons, the logsig transfer function and the trainrp learning algorithm. In the experiments with ANFIS, the FIS was prepared using FCM (fuzzy c-means) clustering. This was done with genfis3, whose role is to generate a Sugeno-type FIS structure (fismat) from the input data (Xin, the 57-element nucleotide sequence) and the output data (Xout, two classes: promoter or non-promoter). The resulting system is summarized as follows:
ANFIS info:
Number of nodes: 9340
Number of linear parameters: 4640
Number of nonlinear parameters: 9120
Total number of parameters: 13760
Number of training data pairs: 80
Number of checking data pairs: 0
Number of fuzzy rules: 80
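The reported parameter counts can be reproduced with simple arithmetic. The sketch below assumes a first-order Sugeno system with one membership function per input per rule and two parameters per membership function (for example the centre and width of a Gaussian); the paper does not state these details, but the assumption matches the reported figures.

```python
def anfis_parameter_counts(n_inputs, n_rules, mf_params=2):
    """Count ANFIS parameters under the stated assumptions.

    Linear (consequent) parameters: one coefficient per input plus a
    constant term for every rule.
    Nonlinear (premise) parameters: mf_params per input per rule.
    """
    linear = n_rules * (n_inputs + 1)
    nonlinear = n_rules * n_inputs * mf_params
    return linear, nonlinear, linear + nonlinear

# 57 nucleotide inputs and 80 fuzzy rules, as in the ANFIS info above.
linear, nonlinear, total = anfis_parameter_counts(n_inputs=57, n_rules=80)
parameters_per_training_pair = total / 80
```

Under this assumption the counts come out as 4640 linear, 9120 nonlinear and 13760 in total, which is 172 parameters for each of the 80 training pairs.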
ACC SE SP MCC
In the promoter prediction experiments with ANFIS, the best result was an accuracy (ACC) of 0.70, which is unsatisfactory compared with the ANN. A review of the literature shows that leave-one-out cross-validation (LOOCV) is applied in promoter prediction, so it was also used when evaluating the performance of the ANN and ANFIS. In LOOCV, all of the data except a single observation are used for training, and the held-out observation is used for testing. Such techniques are an integral part of bioinformatics [43].
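The leave-one-out procedure and the resulting error count can be sketched as follows. The `fit_predict` callback is a hypothetical stand-in for any classifier (it is not an API from this study), and the toy data and majority-class classifier exist only to make the sketch runnable.

```python
def leave_one_out(n_samples):
    """Yield (train, test) index lists with exactly one held-out sample."""
    for held_out in range(n_samples):
        train = [i for i in range(n_samples) if i != held_out]
        yield train, [held_out]

def loocv_errors(samples, labels, fit_predict):
    """Count misclassified held-out samples.

    fit_predict(train_x, train_y, test_x) is a hypothetical callback that
    trains a classifier and returns the predicted label for test_x.
    """
    errors = 0
    for train, (held_out,) in leave_one_out(len(samples)):
        prediction = fit_predict([samples[i] for i in train],
                                 [labels[i] for i in train],
                                 samples[held_out])
        errors += prediction != labels[held_out]
    return errors

def majority_class(train_x, train_y, test_x):
    """Toy classifier: always predict the most frequent training label."""
    return max(set(train_y), key=train_y.count)

# Toy data, for illustration only.
toy_samples = [[0], [1], [2], [3], [4]]
toy_labels = [1, 1, 1, 1, 0]
toy_errors = loocv_errors(toy_samples, toy_labels, majority_class)
```

Applied to the 106 promoter instances, this procedure trains 106 models on 105 instances each, which is why errors on this data set are reported as fractions of 106.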
Table 5: The Errors of Some Machine Learning Algorithms on Promoter Data Set
Errors    System      Classifier
0/106     REX-1       Inductive L.A.
0/106     ANN         One hidden layer
2/106     IREM        Class-based entropy
4/106     KBANN       A hybrid ML system
8/106     BP          Standard backpropagation with one layer
11/106    ANFIS       genfis3
12/106    O'Neill     Ad hoc tech. from the bio. lit.
13/106    NearNeigh   A nearest neighbours algorithm
19/106    ID3         Quinlan's decision builder
Compared with the classifiers already applied to promoter prediction in the literature (Table 5), the ANN described in this paper makes fewer errors than the BP, ID3, KBANN, NearNeigh and O'Neill classifiers.
IV. CONCLUSIONS
Predicting and identifying the location of promoter regions in the DNA molecule remains a challenge in bioinformatics. This paper investigated the ANN and ANFIS techniques as potential solutions to this challenge. An ANN is a computational model based on the structure and functions of biological neural networks. The information that flows through the network affects its structure, because a neural network changes, or learns in a sense, based on its input and output. ANFIS uses a hybrid learning procedure to form an input-output mapping from the training data pairs, employing a fuzzy inference system in the framework of adaptive networks. The experiments show that employing an ANN for promoter prediction leads to promising results and improves the accuracy obtained. By contrast, the results show that the Adaptive Neuro-Fuzzy Inference System, as applied here, is not an adequate method for promoter prediction. In future work, the dimensionality of the data set should be reduced by means of feature selection in order to increase the accuracy of ANFIS.
REFERENCES
[1]. C. M. O'Connor and U. J. Adams, Essentials of Cell Biology, Cambridge: NPG Education, 2010.
[2]. R. Taft, K. Pang and T. Mercer, "Non-coding RNAs: regulators of disease," Journal of Pathology, no. 220, pp. 126-139, 2010.
[3]. D. Pe'er, From Gene Expression To Molecular Pathways, Hebrew University, 2003.
[4]. C. Gabriela and M.-I. Bocicor, "Promoter Sequences Prediction Using Relational Association Rule Mining," Evolutionary Bioinformatics, vol. 8, pp. 181-196, 2012.
[5]. L. T. Corporation, Introduction to Gene Expression, Life Technologies Corporation, 2010.
[22].
[32]. [33].