Data Mining in Bioinformatics
ABSTRACT
The field of bioinformatics generates a huge amount of data, and data mining techniques are needed to study data at this volume. The work in this field falls into three broad categories: (1) understanding gene sequencing, i.e. comparing differing genes of the same species; (2) investigating data analysis approaches in order to identify promising methods pertinent to human health; and (3) studying the diseases that affect humans and their characteristics.
INTRODUCTION
Bioinformatics is the application of computer science and information technology to the field of biology and medicine. Bioinformatics draws on algorithms, databases and information systems, web technologies, artificial intelligence and soft computing, information and computation theory, software engineering, data mining, image processing, modeling and simulation, signal processing, discrete mathematics, control and system theory, circuit theory, and statistics. It uses them to generate new knowledge of biology and medicine, and to improve and discover new models of computation (e.g. DNA computing, neural computing, evolutionary computing, immuno-computing, swarm-computing, cellular-computing).
A biological database is a large, organized body of persistent data, usually associated with computerized software designed to update, query, and retrieve components of the data stored within the system. A simple database might be a single file containing many records, each of which includes the same set of information. For example, a record in a nucleotide sequence database typically contains information such as a contact name, the input sequence with a description of the type of molecule, the scientific name of the source organism from which it was isolated, and often, literature citations associated with the sequence.
For researchers to benefit from the data stored in a database, two additional requirements must be met: (1) easy access to the information; and (2) a method for extracting only the information needed to answer a specific biological question.
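The record structure and the second requirement (extracting only the information needed) can be sketched in a few lines. This is a minimal illustration, not the layout of any real sequence database: the class name `SequenceRecord`, its field names, the `query` helper and all sample values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SequenceRecord:
    """One record of a toy nucleotide sequence database (hypothetical fields)."""
    contact_name: str
    sequence: str
    molecule_type: str
    organism: str
    citations: list = field(default_factory=list)


def query(records, **criteria):
    """Return only the records whose fields equal every given criterion."""
    return [
        r for r in records
        if all(getattr(r, name) == value for name, value in criteria.items())
    ]


# Made-up sample records, for illustration only.
records = [
    SequenceRecord("A. Researcher", "ATGGCC", "DNA", "Homo sapiens"),
    SequenceRecord("B. Researcher", "AUGGCC", "mRNA", "Mus musculus"),
    SequenceRecord("C. Researcher", "ATGAAA", "DNA", "Mus musculus"),
]

# Extract only what answers one specific question: mouse DNA sequences.
mouse_dna = query(records, organism="Mus musculus", molecule_type="DNA")
print([r.sequence for r in mouse_dna])
```

Every record carries the same set of fields, which is what makes a uniform query such as `query(records, organism=...)` possible.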
OVERVIEW
Data mining is the process of extracting patterns from data. Modern businesses see it as an increasingly important tool for transforming data into business intelligence, which gives them an informational advantage. It is currently used in a wide range of profiling practices, such as marketing, surveillance, fraud detection, and scientific discovery. The related terms data dredging, data fishing and data snooping refer to the use of data mining techniques to sample portions of the larger population data set that are (or may be) too small for reliable statistical inferences to be made about the validity of any patterns discovered (see also data-snooping bias). These techniques can, however, be used to create new hypotheses to test against the larger data populations.
DATA MINING ARCHITECTURE
The data mining process consists of several interrelated and interactive stages. The main stages are (1) domain understanding; (2) data selection; (3) cleaning and preprocessing; (4) discovering patterns; (5) interpretation; and (6) reporting and using the discovered knowledge.
M.S.R.I.T Deptt. Of Information Science & Engineering
1. lymphatics
2. block_of_affere
3. bl_of_lymph_c
4. bl_of_lymph_s
5. by_pass
6. extravasates
7. regeneration_of
8. early_uptake_in
9. lym_nodes_dimin
10. lym_nodes_enlar
11. changes_in_lym
12. defect_in_node
13. changes_in_node
14. changes_in_stru
15. special_forms
16. dislocation_of
17. exclusion_of_no
18. no_of_nodes_in
19. class
2. Basic Statistics
Once the data set has been loaded, Weka recognizes the attributes and, while scanning the data, computes some basic statistics on each attribute. The left panel in the figure below shows the list of recognized attributes, while the top panels show the names of the base relation (or table) and the current working relation.
Clicking on any attribute in the left panel shows the basic statistics for that attribute. For categorical attributes, the frequency of each attribute value is shown, while for continuous attributes we can obtain the min, max, mean, standard deviation, etc. The figure below illustrates this. It also shows the type of the attribute (numeric, nominal, etc.). For the nominal attribute Lead shown below, it gives the number of distinct values and lists each value along with its number of occurrences.
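The per-attribute summary described above can be approximated in plain Python. This is a sketch under stated assumptions, not Weka's implementation: the function name `summarize` and the sample columns are hypothetical, and the sample standard deviation is used here, which may differ from what Weka reports.

```python
import statistics
from collections import Counter


def summarize(values):
    """Summarize one attribute column: numeric stats or nominal frequencies."""
    if all(isinstance(v, (int, float)) for v in values):
        return {
            "type": "numeric",
            "min": min(values),
            "max": max(values),
            "mean": statistics.mean(values),
            # Sample standard deviation (n - 1 in the denominator).
            "stdev": statistics.stdev(values),
        }
    counts = Counter(values)
    return {"type": "nominal", "distinct": len(counts), "counts": dict(counts)}


# Made-up columns, for illustration only.
numeric_summary = summarize([1, 2, 2, 3, 7])
nominal_summary = summarize(["normal", "arched", "normal", "deformed"])

print(numeric_summary)
print(nominal_summary)
```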
3. Selecting and Filtering Data
In our sample data file, each record is uniquely identified by a Flight no. We need to remove this attribute before the data mining step. We can do this by using the attribute filters in WEKA. In the "Filter" panel, click on the "Choose" button. This shows a popup window with a list of available filters. Expand filters, then unsupervised, then attribute, and select the NumericToNominal filter, as shown in the figure below.
After this, click on the text box immediately to the right of the "Choose" button, then click "OK". The filter converts the data to nominal values, as required to run the Apriori algorithm. This is illustrated in the figure below.
3. Discretization
Some techniques, such as association rule mining, can only be performed on categorical data. This requires discretizing numeric or continuous attributes. There are several such attributes in this data set; for example, no_of_nodes is an integer, so we discretize it. In this case, we have opted to keep all of its values in the data. This means we can discretize simply by removing the keyword integer as the type of the "no_of_nodes" attribute in the ARFF file and replacing it with the set of discrete values. We do this directly by opening the lymph.arff file in Gedit, as shown in the figure below.
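The manual ARFF edit amounts to swapping the `integer` type for a brace-enclosed list of the values that occur. A minimal sketch of that rewrite follows; the helper name `to_nominal_declaration` and the observed values are made up for illustration and are not taken from lymph.arff.

```python
def to_nominal_declaration(name, observed_values):
    """Build an ARFF nominal declaration listing each distinct value once."""
    distinct = sorted(set(observed_values))
    value_list = ",".join(str(v) for v in distinct)
    return "@attribute {} {{{}}}".format(name, value_list)


# Declaration before the edit: the attribute is typed as an integer.
before = "@attribute no_of_nodes integer"

# Declaration after the edit, built from hypothetical observed values.
after = to_nominal_declaration("no_of_nodes", [1, 3, 2, 2, 5, 1])

print(before)
print(after)
```

Because every original value is kept as its own category, no information is lost; the attribute is only relabelled from numeric to nominal.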
Now select the attributes to be discretized, which here are 1, 2, 4, 5, 6 and 9, and set bins to 4. The discretized values are then shown.
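Binning into four intervals can be sketched as equal-width discretization. This assumes equal-width bins, which the text does not state explicitly; the function name `equal_width_bins` and the sample values are hypothetical.

```python
def equal_width_bins(values, bins=4):
    """Map each numeric value to a bin index in 0..bins-1 using equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    if width == 0:
        # All values are identical, so everything falls into the first bin.
        return [0] * len(values)
    labels = []
    for v in values:
        index = int((v - lo) / width)
        # The maximum value would land in bin `bins`; clamp it into the last bin.
        labels.append(min(index, bins - 1))
    return labels


# Made-up values, for illustration only.
binned = equal_width_bins([1, 2, 3, 4, 5, 6, 7, 8, 9], bins=4)
print(binned)
```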
4. Association Mining
Now that all the attributes have been discretized, we can perform association mining on the data set. The most commonly used algorithm is Apriori, which we also use here. Go to the Associate tab in Weka, click on "Choose" and select Apriori as the associator. Then click on the text box next to the "Choose" button. A dialog box appears. Here we change the number of rules from its default value to 20, which means the program will report no more than the top 20 rules. The upper bound for minimum support is set to 1.0 (100%) and the lower bound to 0.1 (10%). Apriori in WEKA starts with the upper bound support and incrementally decreases support (in steps of delta, which by default is 0.05 or 5%). The algorithm halts when either the specified number of rules has been generated or the lower bound for minimum support is reached. The significance testing option is only applicable in the case of confidence and is not used by default (-1.0). The figure below shows the final dialog box for Apriori. Set car to true and classIndex to 1, and specify the number of rules in numRules.
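The support-lowering loop described above can be sketched as follows. This is an illustrative approximation and not Weka's code: the function names, the `min_confidence` threshold of 0.9 and the sample transactions are assumptions, and the sketch mines general association rules rather than the class association rules selected by the car option.

```python
from itertools import combinations

EPS = 1e-9  # tolerance for floating-point threshold comparisons


def frequent_itemsets(transactions, min_support):
    """Level-wise search returning {itemset: support} for frequent itemsets."""
    n = len(transactions)
    level = {frozenset([item]) for t in transactions for item in t}
    found = {}
    while level:
        current = {}
        for candidate in level:
            sup = sum(candidate <= t for t in transactions) / n
            if sup + EPS >= min_support:
                current[candidate] = sup
        found.update(current)
        # Join frequent k-itemsets into (k+1)-item candidates.
        level = {a | b for a, b in combinations(list(current), 2)
                 if len(a | b) == len(a) + 1}
    return found


def rules_from(found, min_confidence):
    """Generate (lhs, rhs, support, confidence) rules from frequent itemsets."""
    rules = []
    for itemset, sup in found.items():
        for k in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, k)):
                conf = sup / found[lhs]
                if conf + EPS >= min_confidence:
                    rules.append((lhs, itemset - lhs, sup, conf))
    return rules


def mine_rules(transactions, num_rules=20, upper=1.0, lower=0.1,
               delta=0.05, min_confidence=0.9):
    """Lower min support from `upper` by `delta` until enough rules are found."""
    min_support = upper
    while True:
        rules = rules_from(frequent_itemsets(transactions, min_support),
                           min_confidence)
        if len(rules) >= num_rules or min_support - delta < lower - EPS:
            break
        min_support = round(min_support - delta, 10)
    rules.sort(key=lambda r: (-r[3], -r[2]))
    return rules[:num_rules], min_support


# Made-up transactions of attribute=value items, for illustration only.
transactions = [
    frozenset({"lymphatics=arched", "by_pass=no", "class=metastases"}),
    frozenset({"lymphatics=arched", "by_pass=no", "class=metastases"}),
    frozenset({"lymphatics=deformed", "by_pass=no", "class=malign_lymph"}),
    frozenset({"lymphatics=arched", "by_pass=yes", "class=metastases"}),
]

top_rules, final_support = mine_rules(transactions, num_rules=2)
for lhs, rhs, sup, conf in top_rules:
    print(sorted(lhs), "=>", sorted(rhs), "support", sup, "confidence", conf)
```

Starting from a high support and relaxing it step by step means the strongest, most widely supported rules are found first, and the search stops before weakly supported itemsets are enumerated.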
Now click "OK" and then click "Start". The results are displayed as shown below.
Visualization
The relationship between attributes can be shown as a graph by plotting the attributes of interest on the X and Y axes. To do this, use the Visualize button in the top panel of the Weka Explorer. Pressing the Visualize button gives the following screen.
Suppose we need to visualize the relation between Status and Departure time. Select the corresponding plot by clicking on the red square as shown above. The following screen appears, showing the relation between Status and Departure time.
Conclusion
Data mining can be of great help in aviation. Various predictions can be made on the basis of the collected data, which in turn can make the control and monitoring of air traffic and other related data much easier and more convenient.
ACKNOWLEDGEMENT
We would like to sincerely thank our teacher Prof. Sumanna, our HOD and all the faculty members. We would also like to thank our classmates and friends, without whom this would not have been possible.