0% found this document useful (0 votes)

29 views9 pages

Text Categorization Using Association Rule and Naïve Bayes Classifier

The document presents a method for text categorization that uses association rule mining to derive feature sets from pre-classified documents, followed by using a Naive Bayes classifier on the derived features to perform the final categorization. It first preprocesses documents by removing stop words. Frequently occurring words are collected from each document and treated as items in a transaction to apply the Apriori algorithm and discover associated word sets. These associated words act as features for the Naive Bayes classifier.

Uploaded by

Unmesh Phalak

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

29 views9 pages

Text Categorization Using Association Rule and Naïve Bayes Classifier

Uploaded by

Unmesh Phalak

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 9

Text Categorization using Association Rule and Nave Bayes Classifier

S M Kamruzzaman and Chowdhury Mofizur Rahman* Department of Computer Science & Engineering International Islamic University Chittagong, Chittagong-4203, Bangladesh *Department of Computer Science & Engineering United International University, Bangladesh Email: [email protected], *[email protected]
English word found in that position. Nave Bayes categorization is given by: VNB = argmax P (Vj) P (aiVj) To summarize, the Nave Bayes categorization VNB is the categorization that maximizes the probability of observing the words that were actually found in the example documents, subject to the usual Nave Bayes independence assumption. The first term can be estimated based on the fraction of each class in the training data. The following equation is used for estimating the second term: nk + 1 n + vocabulary where n is the total number of word positions in all training examples whose target value is Vj, nk is the number of times that word is found among these n words positions, and vocabularyis the total number of distinct words found within the training data. The proposed system is given a set of example documents. We first preprocess the text documents by parsing and removing stop words (Frank). We then collect frequently occurring words from each document. Each document is treated as a transaction and the set of frequently occurring words are viewed as a set of items in the transaction. We then apply association mining method (Frank, 2000) to discover sets of associated words in the documents. These set of associated words act as features. We then classify new documents using Nave Bayes approach but using derived feature sets.

Abstract
As the amount of online text increases, the demand for text categorization to aid the analysis and management of text is increasing. Text is cheap, but information, in the form of knowing what classes a text belongs to, is expensive. Automatic categorization of text can provide this information at low cost, but the classifiers themselves must be built with expensive human effort, or trained from texts which have themselves been manually classified. Text categorization using Association Rule and Nave Bayes Classifier is proposed here. Instead of using words word relation i.e association rules from these words is used to derive feature set from pre-classified text documents. Nave Bayes Classifier is then used on derived features for final categorization.

Keywords: Text categorization, association rule, Apriori algorithm, confidence, support, frequent itemsets, Nave Bayes classifier. 1. Introduction
Text categorization is the automated assigning of natural language texts to predefined categories based on their content. Text categorization is the primary requirement of Text Retrieval systems, which retrieve texts in response to a user query, and Text Understanding systems, which transform text in some way such as producing summaries, answering questions or extracting data. There exist some algorithms for learning to classify text based on the Nave Bayes Classifier. The probabilistic approaches for learning to classify text are described by Lewis (Lewis et al. 1992). In applying Nave Bayes Classifier, each word position in a document is defined as an attribute and the value of that attribute to be the

2. Mining Association Rules

Association rule mining finds interesting association or correlation relationships among a large set of data items. The discovery of interesting association relationships among huge amounts of

transaction records can help in many decision making processes. 2.1 Data Mining Popularly referred to as Knowledge Discovery in Databases (KDD), Data Mining is the automated extraction of patterns representing knowledge implicitly stored in large databases, data warehouses, and other massive information repositories. Standard data mining methods may be integrated with information retrieval techniques and the construction or use of hierarchies specifically for text data as well as disciplineoriented term categorization systems (such as in chemistry, medicine, law, or economics). Text databases are databases that contain word descriptions for objects. These word descriptions are usually not simple keywords but rather long sentences or paragraphs, such as product specifications, error or bug reports, warning messages, summary reports, notes, or other documents. The widely used and well-known data mining functionalities are Characterization and Discrimination, content based analysis (Hayes, 1990), Association Analysis, Categorization and Prediction (Han, 2001), Cluster Analysis (Lewis, 1990), Outlier Analysis, Evolution Analysis. For our text categorization purpose we have used Association Analysis for generating associative word sets. 2.2 Association Rule Let us consider the following assumptions for representing the Association rule in terms of mathematical representation, J = {i1, i2, , im} be a set of items. D = Set of database transactions where each transaction T is a set of items such that T J. Each transaction is associated with an identifier, called TID. A, B = Set of items. A transaction T is said to contain A if and only if A T. An association rule is an implication of the form A B, where A J, B J, and A B = The rule A B holds in the transaction set D with support S, where S is the percentage of transaction in D that contain A B, i.e., Support (A B) = P (A B). The rule A B has confidence C in the transaction set D if C is the percentage of transaction in D containing A that also contain B, i.e., confidence (A B) = P (BA) = [support count(A B) / support count(A)].

We now define some of the terminologies. Rules that satisfy both a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf) are called strong. A set of items is referred to as an itemset. An itemset that contains k items is a k-itemset. The occurrence frequency of an itemset is the number of transactions that contain the itemset. This is also known, simply, as the frequency, support count, or count of the itemset. An itemset satisfies minimum support if the occurrence frequency of the itemset is greater than or equal to the product of min_sup and the total number of transactions in D. The number of transactions required for the itemset to satisfy minimum support is therefore referred to as the minimum support count. If an itemset satisfies minimum support, then it is a frequent itemset. 2.3 The Apriori Algorithm Apriori is an influential algorithm for mining frequent itemsets for Boolean association rules. The name of the algorithm is based on the fact that the algorithm uses prior knowledge of frequent itemset properties. Association rule mining is a two steps process. 1. Find all frequent itemsets: By definition, each of these itemsets will occur at least as frequently as a pre-defined minimum support count. 2. Generate strong Association rules from the frequent itemsets: By definition, these rules must satisfy minimum confidence. Apriori employes an iterative approach known as a level-wise search, where k-itemsets are used to explore (k+1)-itemsets. First, the set of frequent 1itemsets is found. This set is denoted L1. L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found. The finding of each Lk requires one full scan of the database. An important property called Apriori property, based on the observation is that, if an itemset I is not frequent, that is, P(I ) < min_sup then if an item A is added to the itemset I, the resulting itemset (i.e., I A) cannot occur more frequently than I. Therefore, I A is not frequent either, that is, P(I A) < min_sup. To understand how Apriori property is used in the algorithm, let us look at how Lk-1 is used to find Lk. A two-step process is followed, consisting of join and prune actions.

The join step: To find Lk, a set of candidate kitemsets is generated by joining Lk-1 with itself. This set of candidates is denoted by Ck. Let l1 and l2 be itemsets in Lk-1 then l1 and l2 are joinable if their first (k-2) items are in common, i.e., (l1[1]=l2[1]) . (l1[2]=l2[2]) ... (l1[k-2]=l2[k-2]) . (l1[k-1]<l2[k-1]). The prune step: Ck is the superset of Lk. A scan of the database to determine the count of each candidate in Ck would result in the determination of Lk (itemsets having a count no less than minimum support in Ck). But this scan and computation can be reduced by applying the Apriori property. Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset. Hence if any (k-1)-subset of a candidate k-itemset is not in Lk-1, then the candidate cannot be frequent either and so can be removed from Ck. 2.4 Illustration of Apriori Algorithm Consider an example of Apriori, based on the following transaction database, D of Figure:1, with 9 transactions, to illustrate Apriori algorithm.
TID T100 T200 T300 T400 T500 T600 T700 T800 T900 List of item_IDs I1, I2, I5 I2, I4 I2, I3 I1, I2, I4 I1, I3 I2, I3 I1, I3 I1, I2, I3, I5 I1, I2, I3

1. In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1. The algorithm simply scans all of the transactions in order to count the number of occurrences of each item. 2. If minimum support count is set to 2, frequent 1-itemsets, L1, can then be determined from candidate 1-itemsets satisfying minimum support. 3. To discover the set of frequent 2-itemsets, L2 , the algorithm uses L1 | L1 to generate a candidate set of 2-itemsets (Figure: 4). 4. The transactions D are scanned and the support count of each candidate itemset in C2 is accumulated (Figure: 5). 5. The set of 2-itemsets, L2 (Figure: 6), is then determined, consisting of those candidate 2-itemsets in C2 having minimum support.
Itemset {I1, I2} {I1, I3} {I1, I4} {I1, I5} {I2, I3} {I2, I4} {I2, I5} {I3, I4} {I3, I5} {I4, I5} Figure: 4 C2 Itemset {I1, I2} {I1, I3} {I1, I5} {I2, I3} {I2, I4} {I2, I5} Figure: 6 Sup.count 4 4 2 4 2 2 L2 Itemset {I1, I2} {I1, I3} {I1, I4} {I1, I5} {I2, I3} {I2, I4} {I2, I5} {I3, I4} {I3, I5} {I4, I5} Figure: 5 Sup.count 4 4 1 2 4 2 2 0 1 0 C2

Figure: 1 Itemset {I1} {I2} {I3} {I4} {I5} Figure: 2 Itemset {I1} {I2} {I3} {I4} {I5} Figure: 3 Sup. count 6 7 6 2 2 C1 Sup. count 6 7 6 2 2 L1 Itemset {I1, I2, I3} {I1, I2, I5} Figure: 7 C3

Itemset {I1, I2, I3} {I1, I2, I5} Figure: 8

Sup.count 2 2 C3

Itemset {I1, I2, I3} {I1, I2, I5} Figure: 9

Sup.count 2 2 L3

6. The generation of the set of candidate 3itemsets, C3, is detailed in Figure: 7 to Figure: 9. Here C3 = L2 | L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I5}, {I2, I4, I5}}. Based on the Apriori property that all subsets of a frequent itemset must also be frequent, the resultant candidate itemsets will be as in Figure: 7. 7. The transactions in D are scanned in order to determine L3, consisting of those candidate 3itemsets in C3 having minimum support (Figure: 9). 8. The algorithm uses L3 | L3 to generate a candidate set of 4-itemsets, C4. Although the join results in {{I1, I2, I3, I5}}, this itemset is pruned since its subset {{I2, I3, I5}} is not frequent. Thus, C4 = {}, and the algorithm terminates. 2.5 Implementation of Association Rule on Text Data Let us consider a set of transaction where each document is considered as a transaction as follows: 1. algorithm, network, graph, multicast, processor, system, parallel 2. cluster, network, design, message, processor, system, framework 3. algorithm, software, graph, method, session, analysis, parallel 4. switch, load, design, power, path, system, timing 5. cable, load, energy, power, current, motor, signal After implementation of the Association rule (considering minimum support as 0.4 & confidence 1) we will get, a. {algorithm, graph} {parallel} from 1, 3 b. {network, processor}{system} from 1, 2 c. {design} {system} from 2, 4 d. {load} {power} from 4, 5

inconsistent data are commonplace properties of large real world databases and data warehouses. 3.1 Training Data Abstracts from different thesis, research papers are considered as training document for developing a model for classifying new documents of unknown class. Most of the papers are collected from World Wide Web. Three categories of papers from Computer Science, Electrical and Electronic Engineering and Mechanical Engineering are considered as training documents.

3.2 Data

3. Preparing Text for categorization

Text categorization is the automated assigning of natural language texts to predefined categories based on their content (Hayes 1990, Sundheim, B 1991). Data stored in most text database are semistructured data in that they are neither completely unstructured nor completely structured (Han 2001). During the first stage the full text of a document to be classified must be parsed to produce a list of potential features that could serve as a basis for categorization. Incomplete, noisy and

Assumption, Consideration and Cleaning Each abstract is considered as a Transaction in the Text data. So number of abstracts is equal to the number of transactions in the Transaction set (Text data). The next step is to clean the text data by removing unnecessary words. It is obvious that in a text document only few words can be termed as keywords, characterize the document. Unlike considering all words in a text, in our thesis work we have considered only those words that are related to the subject of the text. Some filtering process is adopted in order to remove unnecessary words in many text retrieval, text categorization, and keyword extraction processes. We have followed a procedure which is similar to those conventional processes for filtering text data and collecting subject related words or keywords. First, all stop words in addition to periods, commas, and punctuations from the text are removed. Second, we delete all words other than frequent words. We define a word as frequent if it occurs more than once in a text. For counting a word whether it is frequent or not, we assume singular and plural form of a word as same and keeping the singular form in the text. Finally, the remaining frequent words are considered as a single transaction data in the set of database transaction. This process is applied to all text data (abstracts) before applying association mining to the transaction database. 3.3 Deriving associated word set from Training data In this paper, total 115 numbers of abstracts (Mitchell, 1997, www) are used as training data for learning to classify text from all three categories, of which 47 are from Computer Science, 48 are

from Electrical and Electronic Engineering and the rest 20 are from Mechanical Engineering papers. After preprocessing the text data association rule mining is applied to the set of transaction data where each frequent word set from each abstract is considered as a single transaction. A partial list of generated large word set with their occurrence frequency in corresponding categories of Computer Science, Electrical and Electronic and Mechanical Engineering is given below in Table 4.1. The term large is used here because any subset (items more than one) of the frequent word set is also frequent according to the property of Apriori algorithm and therefore is not mentioned in the list. The support and confidence is set to 0.02 and 0.75 accordingly. From the generated word set after applying association mining on training data we have found the following information based on the result. Total No. of Word Set = 107 Total No. of Word Set from Computer Science = 43 Total No. of Word Set from Electrical & Electronic = 47 Total No. of Word Set from Mechanical =17 Now we can recall the Nave Bayes classifier for probability calculation. NB = argmax P (j) P (aij) The calculation for first term is based on the fraction of each target class in the training data. Prior probability for Computer Science= 0.402 Prior probability for Electrical & Electrical = 0.44 Prior probability for Mechanical = 0.16 Then the second term of the equation is calculated by the following equation after adopting mestimate approach (Lewis 1990) in order to avoid zero probability value, nk + 1 .. (D) n + vocabulary where, n=Total no of word set position in all training examples whose target value is j nk = No. of times the word set found among all the training examples whose target value is j vocabulary =The total number of distinct word set found within all the training data Replacing values for each category from Table 3.1 to equation (D) we will get probability values for each word set. The probability values for some of the word set is listed below in Table 3.2

Large Word Set Found graph, algorithm technology, processor, system design, system message-passing, system oscillation, system, power, model distribution, load, feeder, system multicast, message-passing, system destination, multicast, approach system, result, model power, control, system problem, graph, algorithm message, communication, system stability, system, power
multidestination, message-passing, system

customer, feeder instability, experiment virtual, routing device, power block, power voltage, power shear, stress generator, test current, signal stability, control, system, power, model, strategy, device, oscillation change, distribution, system, load, customer, temperature, feeder pinout, framework, processor, technology, system, design approach, message-passing, multicast, destination, system broadcast, message, multicast, approach, destination
distribution, power, system, load, feeder

Number of Occurrence in Documents CS EE ME 5 4 4 4 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2

multidestination, communication, message, system, message-passing power, damping, model, oscillation, system
irregular, multicast, algorithm, system

algorithm, message-passing, multicast, system effect, system, power, load

multicast, network, message, algorithm

shear, experiment, rate, stress sequential, generator, circuit, test

Table 3.1: Word set with occurrence frequency

4. Text Categorization using Nave Bayes Classifier

Before classifying a new document the text data (abstract), target class of which is to be

determined, is again preprocessed similar to the process applied to training data.

Large Word Set graph, algorithm technology, processor, system design, system message-passing, system oscillation, system, power, model distribution, load, feeder, system multicast, message-passing, system destination, multicast, approach system, result, model power, control, system problem, graph, algorithm message, communication, system stability, system, power multidestination, messagepassing, system customer, feeder instability, experiment virtual, routing device, power block, power voltage, power shear, stress generator, test current, signal stability, control, system, power, model, strategy, device, oscillation Change, distribution, system, load, customer, temperature, feeder pinout, framework, processor, technology, system, design approach, message-passing, multicast, destination, system broadcast, message, multicast, approach, destination distribution, power, system, load, feeder Probability Value CS
0.04 0.033 0.033 0.033 0.0065 0.0065 0.027 0.027 0.0065 0.0065 0.027 0.027 0.0065 0.027 0.0065 0.0079 0.027 0.0065 0.0065 0.0065 0.0079 0.0065 0.0065 0.0065

EE
0.0067 0.0067 0.0067 0.0067 0.026 0.026 0.0067 0.0067 0.026 0.026 0.0067 0.0067 0.026 0.0067 0.026 0.0079 0.0067 0.026 0.026 0.026 0.0079 0.026 0.026 0.019

ME
0.0067 0.0067 0.0067 0.0067 0.0065 0.0065 0.0067 0.0067 0.0065 0.0065 0.0067 0.0067 0.0065 0.0067 0.0065 0.031 0.0067 0.0065 0.0065 0.0065 0.031 0.0065 0.0065 0.0065

4.1 Applying Nave Bayes Theorem in Text Categorization The steps for preprocessing and classifying a new document can be summarized as follows: Remove periods, commas, punctuation, stop words. Collect words that have occurrence frequency more than once in the document. View the frequent words as word sets. Search for matching word set(s) or its subset (containing items more than one) in the list of word sets collected from training data with that of subset(s) (containing items more than one) of frequent word set of new document. Collect the corresponding probability values of matched word set(s) for each target class. Calculate the probability values for each target class from Nave Bayes categorization theorem. Following the steps mentioned above, we can determine the target class of a new document. We will show an example in the next section and classify it according to the steps described. 4.2 Classifying a New Document Consider the following text (abstract) which can be any one of the categories of Computer Science, Electrical and Electronic Engg. or Mechanical Engg. Example: This paper discusses feedback control problems like regularization, noninteraction and linearization, for affine nonlinear singular systems. First, based on the constrained dynamic algorithm in affine nonlinear systems, an algorithm is introduced. By using such an algorithm, sufficient and necessary conditions are derived for the solvability of regularization problem. Then, another algorithm is proposed, based on which a sequence of integers can be defined for the system. It is shown that under some mild conditions, the dynamic part of singular systems can be linearized by using a regular feedback. Finally, an example is provided to illustrate the main results. After preprocessing the above text we have found the following frequent words: {feedback, problem, regularization, affine, nonlinear, singular, system, based, dynamic, algorithm, using, condition} Now search for word set(s) or its subset(s) from the list of word sets in Table: 5.2 matching with subset of frequent word set of new document. The

0.0065

0.019

0.0065

0.02

0.0067

0.02

0.0067

0.02

0.0067

0.0065

0.019

0.0065

Table 3.2: Word set with probability value

following probability values in different categories are found accordingly. Matched Word Set from Training data CS EE ME

{problem, graph, algorithm } .027 .0067 .0067

{ irregular, multicast, algorithm, system } 0.02 {algorithm, message-passing, multicast, system } 0.02 {dynamic, system, interaction} 0.0065 { multidestination, based, multicast, system } 0.02 {using, parameter, system } 0.0065 { condition, algorithm } 0.02 0.0067 0.0067 0.0067 0.019 0.0067 0.019 0.0067 0.0067 0.0065 0.0067 0.0065 0.0067

Matching subsets from frequent words of new document to be considered for probability calculation are:
1.{algorithm,problem} 2.{algorithm,system} 3.{dynamic,system} 4.{system,based} 5.{system,using} 6.{algorithm,condition}

The prior probability and probability values of word sets calculated using Nave Bayes equation are: Prior probability P (CS) = 0.40, P (EE) = 0.44, P (ME) = 0.16
P({algorithm,problem}CS)=0.027, P({algorithm,problem}ME)=0.0067 P({algorithm,condition}CS)=0.02, P({algorithm,condition}ME)=0.0067 P({algorithm,system}CS)=0.02, P({algorithm,system}ME)=0.0067 P({dynamic,system}CS)=0.0065, P({dynamic,system}ME)=0.0065 P({based,system}CS)=0.02, P({based,system}ME)=0.0067 P({using,system}CS)=0.0065, P({using,system}ME)=0.0065 P({algorithm,problem}EE)=0.0067, P({algorithm,condition}EE)=0.0067, P({algorithm,system}EE)=0.0067, P({dynamic,system}EE)=0.019, P({based,system}EE)=0.0067, P({using,system}EE)=0.019,

4.3 Taking the Effect of Number of Matching Words In the previous examples we consider only the probability values of word sets and the number of matching words in a word set has no effect in calculation. But we can take the effect of number of matching words by multiplying the fraction of matched words to the probability values during the calculation of probability for each target class. Example Given a connected graph G = (V; E) with n vertices and m edges, the distance between two vertices in G is the weight of the shortest path between them. A subgraph G0 is a tspanner (an approximate t-spanner) of G if, for every u, v 2 V , the distance between u and v in G0 is at most t (f(t)) times longer than the distance in G, where f(t) is a polynomial function of variable t and t <= f(t) < n. In this paper parallel algorithms for finding approximate tspanners on both unweighted graphs and weighted graphs are given. If G is an unweighted graph, our algorithm requires O( ntk log n) time and M(n) processors, and the spanner generated has size of O(( ntk )1+1=t +n) and factor of O(tk+1); otherwise our algorithm requires O(( ntk )2 + (ntk)1 + 2 = (t?2) log n) time and O(n2) processors. After preprocessing the above text we have found the following Frequent words: {graph, vertices, distance, t-spanner, approximate, time, algorithm, unweighted, require, log, processor} Now search for word set(s) or its subset(s) from the list of word sets in Table: 5.2 matching with subset of frequent word set of new document. The following probability values in different categories are found accordingly. Matched Word Set from Training data

For Computer Science =0.40.0270.0270.020.00650.020.0065 = 0.00000000000492804 For Electrical & Electronic =0.440.00670.00670.00670.0190.00670.0 19 = 0.000000000000320080405964 For Mechanical =0.160.00670.00670.00670.00650.00670. 0065 = 0.000000000000013622157796 From the above result we found the document classified as Computer Science.

{graph, algorithm} 0.04 0.0067 0.0067 {problem,graph,algorithm}0.027 0.0067 0.0067 {time,bound, algorithm}0.02 0.0067 0.0067

Matching subsets from frequent words of new document to be considered for probability calculation are: 1. {algorithm, graph} 2.{algorithm, time} Therefore two-third (2/3) of the word sets {problem, graph, algorithm} & {time, bound, algorithm} matched with the subset of frequent

words 1 & 2. The prior probability and probability values of word sets taking the effect of the fraction of matched word sets using nave Bayes equation are: Prior probability P(CS) = 0.40; P(EE) = 0.44; P(ME) = 0.16 P({algorithm,graph}CS) = 0.04 & 0.027*2/3; P({algorithm,graph}EE) = 0.0067 & 0.0067*2/3; P({algorithm,graph}ME) = 0.0067 & 0.0067*2/3P({algorithm,time}CS)=0.02*2/3; P({algorithm,time}EE) = 0.0067*2/3; P({algorithm,time}ME) = 0.0067*2/3 For Computer Science =0.40.040.027 2/3X0.2 2/3 = 0.000003878496 For Electrical & Electronic =0.440.0067 0.00672/30.00672/3= .000000059405504708 For Mechanical =0.160.00670.00672/30.00672/3 = 0.000000021602001712

text categorization. But this possibility of failure can be reduced by considering increased number of training data. For example, we can consider the following frequent words collected after preprocessing an abstract for categorization. 5.2 Efficiency of classifying a text We have considered only a total 115 number of documents as training data (Mitchell, 1997) which is very few and insufficient compared to Naive Bayes example of text categorization where 20,000 documents taken for developing the learning system and that system gives 89% efficiency in text categorization. We have started with 60 documents (20+20+20) initially then increase the number to 115. Increase in doubled the number of documents also doubled the number of generated word sets, which in turn increases significantly its ability to classify a text.

6. Conclusion and Future Work

In our training set of data, although all the abstracts have almost equal size in length, they have slightly different number of frequent words after preprocessing them. In order to avoid null attribute value in any transaction in the set of transaction database, we have considered 13(thirteen) frequent words from each text. The reason is that, null attribute values in the transaction set produce word sets containing null values. These word sets containing null values have no use in categorization. Texts with less than 13 frequent words are discarded (remaining 115 documents) and are not considered as training data. A process is followed for selecting 13 frequent words from documents having frequent words more than 13 based on occurrence frequency and position of frequent words from the beginning of a text in case of same frequency words. In other words, higher frequency words are considered first, then for the same frequency words, word that occurs earlier from the beginning of the text gets priority for selection over others. Considering increased number of attributes for generating associated word sets, increase the possibility of generating greater number of words in a word set and also increase the total number of word sets.

5. Experimental Results
In this work, classifying a new document depends on the associated word sets generated from training documents. So the number of training documents is vital in generating the number of word sets used to determine the class of a new document. The greater number of word sets from training documents reduces the possibility of failure to classify a new document. 5.1 Comparison with Nave Bayes Categorization - Word Set of items two (at least) or more is generated from Association mining. So there is no option for considering a single word using association concept. - Association mining largely reduces the number of words to be considered for classifying texts, keeping only words having association between them. - Possibility of words common in more than one target classes is higher than the possibility of word set in more than one target classes. So considering a single word for categorization increases the possibility of wrong categorization. - Considering word set instead of word for text categorization increases the possibility of failure of

References
Agrawal, R; Mannila, H; Srikant, R; Toivonan, H; Verkamo, A, 1996. Fast discovery of Association Rules, Advances in knowledge discovery and data mining. Bayer, Thomas, Renz, Ingrid, Stein, Michael and Kressel, Ulrich, 1996. Domain and Language Independent Feature Extraction for Statistical Text Categorization Computation and language, 7. David D. Lewis, 1992. Feature Selection and Feature Extraction for Text Categorization, appeared in Speech and Natural Language, Proceedings of a workshop held at Harriman, New York, February 23-26, 1992. Morgan Kaufmann, San Mateo, CA, pp. 212-217. Eibe Frank, Automatic Keyphrase Extraction, http://www.nzdl.org/kea/ Eibe Frank and Ian H. Witten, 2000. Data Mining: Practical Machine Learning Larning Tools and Techniques with Java Implementation, Morgan Kaufmann Publisher: CA. Hayes, Philip J. and Steren P. Weinstein, 1990. CONSTRUE/TIS: a system for content-based indexing of a database of news stories, in IAAI Jiawei Han and Micheline Kamber, 2001. Data Mining: Concepts and Techniques, Morgan Kaufmann Publisher: CA. Lewis, D. and Croft, W., 1990. Term clustering of syntactic phrases, in ACM SIGIR-90, PP. 385404. Lewis, David D. 1992. Representationand Learning in Information Retrieval, Ph. D thesis, University of Massachusetts. Mitchell M. T., 1997. Machine Learning, McGraw Hill, New York, 1997. Sundheim, B, ed. 1991. Proceddings of the Third Message Understanding Evaluation and Conference, Morgan Kaufmann, Los Altos, CA. www.cs.waikato.ac.nz/ml/weka

Apriori Algorithm
No ratings yet
Apriori Algorithm
28 pages
p132 Closet
No ratings yet
p132 Closet
11 pages
An Efficient Algorithm For Mining
No ratings yet
An Efficient Algorithm For Mining
6 pages
Compusoft, 3 (10), 1140-1142 PDF
No ratings yet
Compusoft, 3 (10), 1140-1142 PDF
3 pages
Improved Method For Pattern Discovery in Text Mining
No ratings yet
Improved Method For Pattern Discovery in Text Mining
5 pages
Literature Survey On Various Frequent Pattern Mining Algorithm
No ratings yet
Literature Survey On Various Frequent Pattern Mining Algorithm
7 pages
Comparative Evaluation of Association Rule Mining Algorithms With Frequent Item Sets
No ratings yet
Comparative Evaluation of Association Rule Mining Algorithms With Frequent Item Sets
7 pages
Apriori Based Novel Frequent Itemset Mining Mechanism: Issn No
No ratings yet
Apriori Based Novel Frequent Itemset Mining Mechanism: Issn No
8 pages
Association Rule Mining For Modelling Academic Resources Using FP Growth Algorithm PDF
No ratings yet
Association Rule Mining For Modelling Academic Resources Using FP Growth Algorithm PDF
6 pages
Text Categorization Using Association Rule and Naïve Bayes Classifier
No ratings yet
Text Categorization Using Association Rule and Naïve Bayes Classifier
9 pages
Mining Frequent Itemset-Association Analysis
No ratings yet
Mining Frequent Itemset-Association Analysis
59 pages
Top 9 Data Science Algorithms
No ratings yet
Top 9 Data Science Algorithms
152 pages
Performance Analysis of Distributed Association Rule Mining With Apriori Algorithm
No ratings yet
Performance Analysis of Distributed Association Rule Mining With Apriori Algorithm
5 pages
Report of 2nd Defence
No ratings yet
Report of 2nd Defence
6 pages
Q) Frequent Itemset Generation: States That If An Itemset Is Frequent, Then All of Its Subsets Must Also Be Frequent. This
No ratings yet
Q) Frequent Itemset Generation: States That If An Itemset Is Frequent, Then All of Its Subsets Must Also Be Frequent. This
9 pages
A Survey On Association Rule Mining For
No ratings yet
A Survey On Association Rule Mining For
8 pages
Association Rule Mining Using Apriori Al PDF
No ratings yet
Association Rule Mining Using Apriori Al PDF
11 pages
3final CH 5 Concept
No ratings yet
3final CH 5 Concept
101 pages
Feature Extraction and Reduction by Using ModifiedApriori Algorithm
No ratings yet
Feature Extraction and Reduction by Using ModifiedApriori Algorithm
9 pages
DM Lect7
No ratings yet
DM Lect7
26 pages
Efficient Preprocessing and Patterns Identification Approach For Text Mining
No ratings yet
Efficient Preprocessing and Patterns Identification Approach For Text Mining
6 pages
New Microsoft PowerPoint Presentation
No ratings yet
New Microsoft PowerPoint Presentation
12 pages
DM Lab Manual
No ratings yet
DM Lab Manual
32 pages
Study On Application of Apriori Algorithm in Data Mining
No ratings yet
Study On Application of Apriori Algorithm in Data Mining
4 pages
Efficient Mining Frequent Itemsets Algorithms: Marghny H. Mohamed Mohammed M. Darwieesh
No ratings yet
Efficient Mining Frequent Itemsets Algorithms: Marghny H. Mohamed Mohammed M. Darwieesh
11 pages
Mining Association Rules From Infrequent Itemsets: A Survey
No ratings yet
Mining Association Rules From Infrequent Itemsets: A Survey
8 pages
ATC - Lecture - Notes - Data Mining Techniques - 2021
No ratings yet
ATC - Lecture - Notes - Data Mining Techniques - 2021
77 pages
Mining Frequent Patterns, Association and Correlations - Basic Concepts and Methods
No ratings yet
Mining Frequent Patterns, Association and Correlations - Basic Concepts and Methods
55 pages
Generalized Association Rule Mining Using Genetic Algorithms
No ratings yet
Generalized Association Rule Mining Using Genetic Algorithms
11 pages
Extraction of Interesting Association Rules Using Genetic Algorithms
No ratings yet
Extraction of Interesting Association Rules Using Genetic Algorithms
8 pages
DWDM Unit 4 (R22)
No ratings yet
DWDM Unit 4 (R22)
25 pages
Jurnal Information Retrieval
No ratings yet
Jurnal Information Retrieval
4 pages
Data Mining Micro PGDM
No ratings yet
Data Mining Micro PGDM
40 pages
DWDM Unit 4
No ratings yet
DWDM Unit 4
12 pages
Improve Text Classification Accuracy Based On Classifier Fusion Methods
No ratings yet
Improve Text Classification Accuracy Based On Classifier Fusion Methods
6 pages
Ijctt V27P116
No ratings yet
Ijctt V27P116
7 pages
DMDW - Association Analysis
No ratings yet
DMDW - Association Analysis
12 pages
Unit 3
No ratings yet
Unit 3
69 pages
I Jcs It 2014050535
No ratings yet
I Jcs It 2014050535
5 pages
Understanding Association Rule in Data Mining
No ratings yet
Understanding Association Rule in Data Mining
4 pages
FALLSEM2022-23 SWE2009 ETH VL2022230101117 Reference Material I 25-08-2022 Frequent Pattern Mining
No ratings yet
FALLSEM2022-23 SWE2009 ETH VL2022230101117 Reference Material I 25-08-2022 Frequent Pattern Mining
42 pages
Volume 2, No. 5, April 2011 Journal of Global Research in Computer Science Research Paper Available Online at WWW - Jgrcs.info
No ratings yet
Volume 2, No. 5, April 2011 Journal of Global Research in Computer Science Research Paper Available Online at WWW - Jgrcs.info
3 pages
Apriori
No ratings yet
Apriori
27 pages
2010 - An Optimized Distributed Association Rule Mining Algorithm in Parallel and Distributed Data Mining With XML Data For Improved Response Time
No ratings yet
2010 - An Optimized Distributed Association Rule Mining Algorithm in Parallel and Distributed Data Mining With XML Data For Improved Response Time
14 pages
Apriori
No ratings yet
Apriori
27 pages
Efficient Apriori Algorithm Using Enhanced Transaction Reduction Approach
No ratings yet
Efficient Apriori Algorithm Using Enhanced Transaction Reduction Approach
5 pages
Thabet Slimani - Efficiant Analysis of Pattern and Association Rule Mining Approaches
No ratings yet
Thabet Slimani - Efficiant Analysis of Pattern and Association Rule Mining Approaches
14 pages
A Study On Document Classification Using Machine Learning Techniques
No ratings yet
A Study On Document Classification Using Machine Learning Techniques
6 pages
Association Rule-A Tool For Data Mining: Praveen Ranjan Srivastava
No ratings yet
Association Rule-A Tool For Data Mining: Praveen Ranjan Srivastava
6 pages
DM 100
No ratings yet
DM 100
17 pages
DWDM Unit 2 and 3
No ratings yet
DWDM Unit 2 and 3
31 pages
Chapter 3
No ratings yet
Chapter 3
26 pages
FDS Unit02
No ratings yet
FDS Unit02
16 pages
An Optimized Distributed Association Rule Mining Algorithm in Parallel and Distributed Data Mining With XML Data For Improved Response Time
No ratings yet
An Optimized Distributed Association Rule Mining Algorithm in Parallel and Distributed Data Mining With XML Data For Improved Response Time
14 pages
Association Rule: Association Rule Learning Is A Popular and Well Researched Method For Discovering
No ratings yet
Association Rule: Association Rule Learning Is A Popular and Well Researched Method For Discovering
10 pages
1association Analysis-Apriori
No ratings yet
1association Analysis-Apriori
67 pages
Apriori
No ratings yet
Apriori
27 pages
Online Message Categorization Using Apriori Algorithm
No ratings yet
Online Message Categorization Using Apriori Algorithm
7 pages
Closet - An Efficient Algorithm For Mining Frequent
No ratings yet
Closet - An Efficient Algorithm For Mining Frequent
8 pages
Alternating Decision Tree: Fundamentals and Applications
From Everand
Alternating Decision Tree: Fundamentals and Applications
Fouad Sabry
No ratings yet

Text Categorization Using Association Rule and Naïve Bayes Classifier

Uploaded by

Text Categorization Using Association Rule and Naïve Bayes Classifier

Uploaded by

Text Categorization using Association Rule and Nave Bayes Classifier

2. Mining Association Rules

Itemset {I1, I2, I3} {I1, I2, I5} Figure: 8

Itemset {I1, I2, I3} {I1, I2, I5} Figure: 9

3. Preparing Text for categorization

Number of Occurrence in Documents CS EE ME 5 4 4 4 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2

algorithm, message-passing, multicast, system effect, system, power, load

shear, experiment, rate, stress sequential, generator, circuit, test

Table 3.1: Word set with occurrence frequency

4. Text Categorization using Nave Bayes Classifier

determined, is again preprocessed similar to the process applied to training data.

Table 3.2: Word set with probability value

{problem, graph, algorithm } .027 .0067 .0067

6. Conclusion and Future Work

You might also like