Data Mining in Bioinformatics

The document discusses using data mining techniques like association rule mining to analyze biomedical data. It describes loading biomarker data into the WEKA tool and performing preprocessing steps like discretization and attribute filtering. The Apriori algorithm is then used to generate association rules between biomarkers and extract patterns from the data.

Uploaded by

keerthanpai
Copyright
© Attribution Non-Commercial (BY-NC)

DATA MINING IN BIOINFORMATICS

INDEX

I. Abstract
II. Introduction
III. Overview
IV. Implementation Using WEKA
V. Conclusion
VI. Acknowledgement

M.S.R.I.T Deptt. Of Information Science & Engineering

ABSTRACT

The field of bioinformatics generates huge amounts of data, and data mining techniques are needed to study data at this volume. Work on this data falls into three main categories: (1) understanding gene sequencing, i.e. comparing differing genes of the same species; (2) investigating data analysis approaches in order to identify promising methods pertinent to human health; and (3) studying the different diseases that affect humans, along with their characteristics.

INTRODUCTION

Bioinformatics is the application of computer science and information technology to the fields of biology and medicine. It draws on algorithms, databases and information systems, web technologies, artificial intelligence and soft computing, information and computation theory, software engineering, data mining, image processing, modeling and simulation, signal processing, discrete mathematics, control and system theory, circuit theory, and statistics. These are used both to generate new knowledge in biology and medicine and to improve and discover new models of computation (e.g. DNA computing, neural computing, evolutionary computing, immuno-computing, swarm computing, cellular computing).

A biological database is a large, organized body of persistent data, usually associated with computerized software designed to update, query, and retrieve components of the data stored within the system. A simple database might be a single file containing many records, each of which includes the same set of information. For example, a record in a nucleotide sequence database typically contains information such as a contact name, the input sequence with a description of the type of molecule, the scientific name of the source organism from which it was isolated, and often literature citations associated with the sequence.

For researchers to benefit from the data stored in a database, two additional requirements must be met: (1) easy access to the information; and (2) a method for extracting only the information needed to answer a specific biological question.

OVERVIEW

Data mining is the process of extracting patterns from data. Modern businesses see it as an increasingly important tool for transforming data into business intelligence, giving an informational advantage. It is currently used in a wide range of profiling practices, such as marketing, surveillance, fraud detection, and scientific discovery. The related terms data dredging, data fishing and data snooping refer to the use of data mining techniques on samples of a larger population data set that are (or may be) too small for reliable statistical inferences to be made about the validity of any patterns discovered (see also data-snooping bias). These techniques can, however, be used to create new hypotheses to test against the larger data populations.

DATA MINING ARCHITECTURE

The data mining process consists of several stages, which are interrelated and interactive. The main stages of the data mining process are: (1) domain understanding; (2) data selection; (3) cleaning and preprocessing; (4) discovering patterns; (5) interpretation; (6) reporting and using discovered knowledge.

Implementation using WEKA Tool


Weka (Waikato Environment for Knowledge Analysis) is a popular suite of machine learning software written in Java, developed at the University of Waikato, New Zealand. It is free software available under the GNU General Public License. Weka supports several standard data mining tasks: data preprocessing, clustering, classification, regression, visualization, and feature selection. All of Weka's techniques assume that the data is available as a single flat file or relation, where each data point is described by a fixed number of attributes.

Problem Statement:- We have considered a single multimedia schema describing a video library. The application should allow multiple movie makers working simultaneously to store, remove and manipulate different kinds of multimedia data; we assume that some material is gathered from a database. This application helps the manager of a video library to group customers according to purchase language, rating, cast, etc.

Working

1. Loading The Data

After Weka 3.6 has been installed, we launch the Explorer application of Weka. Now we need to load the dataset, which we have created as a .csv file. Click on choose file, change the file type to .csv, browse to the desired location and select the file. It is as shown below in the figure.

Our dataset contains the following attributes:

1. lymphatics
2. block_of_affere
3. bl_of_lymph_c
4. bl_of_lymph_s
5. by_pass
6. extravasates
7. regeneration_of
8. early_uptake_in
9. lym_nodes_dimin
10. lym_nodes_enlar
11. changes_in_lym
12. defect_in_node
13. changes_in_node
14. changes_in_stru
15. special_forms
16. dislocation_of
17. exclusion_of_no
18. no_of_nodes_in
19. class

2. Basic Statistics

Once the data set has been loaded, Weka recognizes the attributes and, while scanning the data, computes some basic statistics on each attribute. The left panel in the figure below shows the list of recognized attributes, while the top panels indicate the names of the base relation (or table) and the current working relation.

Clicking on any attribute in the left panel shows the basic statistics for that attribute. For categorical attributes, the frequency of each attribute value is shown, while for continuous attributes we can obtain the min, max, mean, standard deviation, etc. The figure below illustrates this. It also shows the type of the attribute (numeric, nominal, etc.). For the nominal attribute Lead shown below, it gives the number of distinct values and lists each value along with its number of occurrences.
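The per-attribute summary described above can be approximated outside Weka. The following is a minimal Python sketch, not Weka's own code: the sample rows and their values are made up for illustration, and the rule "an attribute is numeric if every value parses as a number" is an assumption of this sketch.

```python
import csv
import io
import statistics
from collections import Counter

# Hypothetical sample in CSV form; the values are invented for illustration.
SAMPLE_CSV = """lymphatics,no_of_nodes_in,class
arched,2,metastases
deformed,1,malign_lymph
arched,4,metastases
displaced,1,metastases
"""


def is_number(text):
    """Return True if the text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def summarize(csv_text):
    """Compute basic statistics for every column of a CSV document."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    summary = {}
    for name in rows[0].keys():
        values = [row[name] for row in rows]
        if all(is_number(v) for v in values):
            # Continuous attribute: min, max, mean, sample standard deviation.
            nums = [float(v) for v in values]
            summary[name] = {
                "type": "numeric",
                "min": min(nums),
                "max": max(nums),
                "mean": statistics.mean(nums),
                "stdev": statistics.stdev(nums),
            }
        else:
            # Categorical attribute: distinct values and their frequencies.
            summary[name] = {
                "type": "nominal",
                "distinct": len(set(values)),
                "counts": dict(Counter(values)),
            }
    return summary


summary = summarize(SAMPLE_CSV)
for attribute, stats in summary.items():
    print(attribute, stats)
```

For a real data set, the same function could be given the contents of the .csv file instead of the inline sample.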

The visualization graphs shown below are for all attributes.

3. Selecting and Filtering Data

In our sample data file, each record is uniquely identified by a Flight no. We need to remove this attribute before the data mining step. We can do this by using the attribute filters in WEKA. In the "Filter" panel, click on the "Choose" button. This shows a popup window with a list of available filters. Expand filters, then unsupervised, then attribute, and select the NumericToNominal filter. It is as shown below in the figure.

After this, click on the text box immediately to the right of the "Choose" button and then click "OK". This will convert the data to nominal, as required to run the Apriori algorithm. It is as illustrated below in the figure.
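To make the effect of a numeric-to-nominal conversion concrete, here is a small Python sketch of the idea. It is an illustration only, not Weka's NumericToNominal implementation: the function name and the sample column are hypothetical. Each distinct number simply becomes a label of its own.

```python
def numeric_to_nominal(column):
    """Turn a numeric column into nominal labels.

    Returns (labels, data): the sorted distinct values rendered as label
    strings, and the original column with every value replaced by its label.
    """
    def label(value):
        # Render 3.0 as "3" so integer-valued attributes get clean labels.
        return str(int(value)) if float(value).is_integer() else str(value)

    distinct = sorted(set(column))
    return [label(v) for v in distinct], [label(v) for v in column]


# Hypothetical values for a numeric attribute.
labels, data = numeric_to_nominal([2, 1, 4, 1, 2])
print(labels)
print(data)
```

After the conversion the values no longer carry any order or distance; "4" is just another category, which is what an association rule miner expects.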

3. Discretization

Some techniques, such as association rule mining, can only be performed on categorical data. This requires discretizing numeric or continuous attributes. There are several such attributes in this data set; for example, no_of_nodes_in is an integer attribute, so we discretize it. In this case, we have opted to keep all of its values in the data. This means we can discretize simply by removing the keyword integer as the type of the "no_of_nodes_in" attribute in the ARFF file and replacing it with the set of discrete values. We do this directly by opening the lymph.arff file in Gedit, as shown below in the figure.
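The manual edit of the ARFF header described above could also be scripted. The sketch below is one possible way to do it in Python under stated assumptions: the header fragment and the value set [1, 2, 3] are hypothetical (the real values must be read from the data), and the helper only handles declarations whose type is a single token such as integer or numeric.

```python
import re


def declare_nominal(arff_text, attribute, values):
    """Replace the declared type of one ARFF attribute with a nominal value set."""
    value_set = "{" + ",".join(str(v) for v in values) + "}"
    # Match "@attribute <name> <type>" on a single line, with optional quotes
    # around the attribute name.
    pattern = re.compile(
        r"^(@attribute[ \t]+'?" + re.escape(attribute) + r"'?[ \t]+)\S+[ \t]*$",
        re.IGNORECASE | re.MULTILINE,
    )
    new_text, count = pattern.subn(lambda m: m.group(1) + value_set, arff_text)
    if count != 1:
        raise ValueError(
            f"expected exactly one declaration of {attribute!r}, found {count}"
        )
    return new_text


# Hypothetical header fragment; the actual lymph.arff header may differ.
header = (
    "@relation lymph\n"
    "@attribute by_pass {no,yes}\n"
    "@attribute no_of_nodes_in integer\n"
    "@data\n"
)
edited = declare_nominal(header, "no_of_nodes_in", [1, 2, 3])
print(edited)
```

Raising an error when the declaration is not found exactly once is a deliberate choice, so that a misspelled attribute name does not silently leave the file unchanged.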

Then select Discretize from the filters menu.

Now select the attributes to be discretized, which here are 1, 2, 4, 5, 6 and 9, and set bins = 4. The discretized values are then shown.
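As a rough illustration of what discretizing into 4 bins means, the following Python sketch applies equal-width binning to a list of numbers. It assumes equal-width binning is the mode in use, the sample values are hypothetical, and the handling of values that fall exactly on a bin boundary is a simplification that may differ from Weka's Discretize filter.

```python
def equal_width_bins(values, bins=4):
    """Assign each value the index (0 .. bins-1) of its equal-width bin."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so everything falls into the first bin.
        return [0] * len(values)
    width = (hi - lo) / bins
    codes = []
    for value in values:
        index = int((value - lo) / width)
        # The maximum value would land in bin number `bins`; clamp it into the last bin.
        codes.append(min(index, bins - 1))
    return codes


# Hypothetical numeric attribute with values from 1 to 8.
codes = equal_width_bins([1, 2, 3, 4, 5, 6, 7, 8], bins=4)
print(codes)  # [0, 0, 1, 1, 2, 2, 3, 3]
```

With the range 1 to 8 and 4 bins, each bin is 1.75 wide, so two consecutive integers share each bin.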

4. Association Mining

Now that all the attributes have been discretized, we can perform association mining on the dataset. The most commonly used algorithm is Apriori, which we also use here. Go to the Associate tab in Weka, click on Choose and select Apriori as the associator. Then click on the text box next to the Choose button. A dialog box appears. Here we change the default number of rules to 20, which means that the program will report no more than the top 20 rules. The upper bound for minimum support is set to 1.0 (100%) and the lower bound to 0.1 (10%). Apriori in WEKA starts with the upper bound support and incrementally decreases support (in delta steps, by default 0.05 or 5%). The algorithm halts when either the specified number of rules has been generated or the lower bound for minimum support is reached. The significance testing option is only applicable in the case of confidence and is by default not used (-1.0). The figure below shows the final dialog box for Apriori. Set car to true and classIndex to 1, and specify the number of rules in numRules.
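The interplay of support, confidence and the step-wise lowering of minimum support can be sketched in a few lines of Python. This is a simplified illustration, not Weka's Apriori: it enumerates item sets by brute force instead of using Apriori's level-wise candidate pruning, it only produces rules with a single item on the right-hand side, it does not model the car (class association rules) option, and the transactions and the 0.9 confidence threshold are assumptions of this sketch.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the item set."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def rules_at(transactions, min_support, min_confidence):
    """All rules 'antecedent -> single item' meeting both thresholds."""
    items = sorted(set().union(*transactions))
    rules = []
    for size in range(2, len(items) + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            s = support(itemset, transactions)
            if s == 0 or s < min_support:
                continue
            for consequent in combo:
                antecedent = itemset - {consequent}
                confidence = s / support(antecedent, transactions)
                if confidence >= min_confidence:
                    rules.append((tuple(sorted(antecedent)), consequent, s, confidence))
    return rules


def mine(transactions, num_rules=20, upper=1.0, lower=0.1, delta=0.05,
         min_confidence=0.9):
    """Lower minimum support from `upper` in `delta` steps until enough rules
    are found or `lower` is reached; return the final support and the rules."""
    min_support = upper
    while True:
        rules = rules_at(transactions, min_support, min_confidence)
        if len(rules) >= num_rules or min_support <= lower:
            break
        min_support = max(lower, round(min_support - delta, 10))
    # Best rules first: highest confidence, then highest support.
    rules.sort(key=lambda r: (-r[3], -r[2], r[0], r[1]))
    return min_support, rules[:num_rules]


# Hypothetical discretized records, one set of "attribute=value" items each.
transactions = [
    {"by_pass=no", "extravasates=no"},
    {"by_pass=no", "extravasates=no"},
    {"by_pass=no", "extravasates=no", "class=metastases"},
    {"class=metastases"},
]
final_support, top_rules = mine(transactions, num_rules=2)
for antecedent, consequent, s, c in top_rules:
    print(antecedent, "->", consequent, "support", s, "confidence", c)
```

On this toy data the loop stops at a minimum support of 0.75, the first level at which two rules with confidence of at least 0.9 exist.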

Now we click OK and then click Start. The results are displayed as shown below.

Visualization

The relationship between attributes can be shown as graphs by plotting, on the X and Y axes, the attributes whose relation is to be visualized, using the Visualize button in the top panel of the Weka Explorer. Pressing the Visualize button brings up the following screen.

Suppose we need to visualize the relation between Status and Departure time. Select the corresponding plot by clicking on the red square as shown above. The following screen appears, showing the relation between Status and Departure time.

Conclusion

Data mining can be of great help in aviation. Various predictions can be made on the basis of the data collected, which in turn can make the control and monitoring of air traffic and other related data easy and convenient.

ACKNOWLEDGEMENT

We would like to sincerely thank our teacher Prof. Sumanna, our HOD and all the faculty members. We would also like to thank our classmates and friends, without whom this would not have been possible.

