
Text Mining: An Improvised Feature Based Model

Approach

Shivaprasad KM Dr. T Hanumantha Reddy


Department of Computer Science and Engineering Department of Computer Science and Engineering
Rao Bahadur Y Mahabaleswarappa Engineering College, Rao Bahadur Y Mahabaleswarappa Engineering College,
Affiliated to VTU Belagavi Affiliated to VTU Belagavi
Ballari-583 104, Karnataka, India Ballari-583 104, Karnataka, India
[email protected] [email protected]

Abstract— In this knowledge era, a plethora of textual information is collected and stored in databases around the world, with the internet the largest database of all. Discovering knowledge from these databases is not simple; an automatic feature selection approach is therefore necessary when preprocessing textual documents for data mining. Feature selection focuses on identifying relevant data and helps to understand and visualize the data; it also reduces the training and processing time for huge amounts of data and increases the accuracy of subsequent data mining tasks. Text mining offers various methods to fetch data of interest from vast databases. Text clustering is one of the most important areas in text mining; it includes text preprocessing, dimension reduction by selecting some terms (features), and finally clustering using the selected terms. Feature selection plays a vital role in this process. In this paper, a bread model is proposed that processes a text document using an input term set. Based on the application methodology of the first principles of instruction, the model phases are implemented to provide effective results.

Keywords— Bread model, Feature selection, First principles of instruction, Text mining.

I. INTRODUCTION

Text mining is the burgeoning process of discovering and extracting information from large unstructured textual resources. It has high potential value in dealing with large and complex sets of textual documents that contain much irrelevant and noisy information. To weed out that irrelevant and noisy information, the feature selection method is embraced. Feature selection is a long-standing method which aims to remove irrelevant and noisy information by retaining only the relevant and informative data for use in text mining, and it opens new research doors for text mining. Two questions arise in feature selection. The first is: what features can represent the text effectively for machine learning? The second is: what is the best way to prune a large set of features down to a manageable set of the most discriminating ones? The answer to the first question depends on the available processing power, the language and corpora being worked with and, most importantly, the specific problem being tackled. For the second question, various pruning approaches can be tried, including: classifiers which identify and build the relevant information; pointwise mutual information, which measures the association of an attribute with a class to find the features that are strongly relevant or irrelevant; and the chi-square method, which evaluates the difference between the feature sets that arise. Broadly, there are two approaches for selecting the 'best' features: the filter and wrapper approaches. In the filter method, the subset is selected without considering any learning algorithm, usually before processing; in the wrapper method, the feature set is evaluated using the algorithm itself.
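To make the filter/wrapper distinction concrete, here is a minimal Java sketch (illustrative only, not from this paper: the score table and the stand-in evaluator are hypothetical). The filter ranks features by a precomputed, algorithm-independent score, while the wrapper greedily grows a subset by repeatedly re-evaluating a (stubbed) learning algorithm on candidate subsets:

    import java.util.*;
    import java.util.function.Function;

    // Sketch contrasting filter vs. wrapper feature selection.
    public class FilterVsWrapper {
        // Filter: rank features by an algorithm-independent score, take the top k.
        static List<String> filterSelect(Map<String, Double> score, int k) {
            return score.entrySet().stream()
                    .sorted((a, b) -> Double.compare(b.getValue(), a.getValue()))
                    .limit(k).map(Map.Entry::getKey).toList();
        }

        // Wrapper: greedily add the feature that most improves the learner's
        // (here: a stubbed) evaluation of the candidate subset.
        static List<String> wrapperSelect(Set<String> features,
                                          Function<Set<String>, Double> evaluate, int k) {
            Set<String> chosen = new LinkedHashSet<>();
            while (chosen.size() < k) {
                String best = null;
                double bestAcc = -1;
                for (String f : features) {
                    if (chosen.contains(f)) continue;
                    Set<String> trial = new LinkedHashSet<>(chosen);
                    trial.add(f);
                    double acc = evaluate.apply(trial); // would train/test a classifier
                    if (acc > bestAcc) { bestAcc = acc; best = f; }
                }
                chosen.add(best);
            }
            return new ArrayList<>(chosen);
        }

        public static void main(String[] args) {
            Map<String, Double> score = Map.of("gene", 0.9, "system", 0.2, "cluster", 0.7);
            System.out.println(filterSelect(score, 2)); // [gene, cluster]

            // Hypothetical evaluator: pretend accuracy is the mean score of the subset.
            Function<Set<String>, Double> eval =
                s -> s.stream().mapToDouble(score::get).average().orElse(0);
            System.out.println(wrapperSelect(score.keySet(), eval, 2)); // [gene, cluster]
        }
    }

In practice the wrapper's evaluator would train and test an actual classifier on each candidate subset, which is why wrapper methods are typically far more expensive than filters.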
The bread model is proposed for effective pattern matching and for measuring the significance of a term set or feature set in matching a document. As the document is the key element in text mining, it is necessary to know its relevance, which can be determined through a significant term set. A document can exist in any form, from formal to ad hoc. Identifying the term set or features of the document is essential, and this term set should portray the meaning of the document accurately; accordingly, selecting it requires knowledge of the dataset. Using this term set, the bread model calculates the term frequency Tf in the input text document. The bread model approach is built on the first principles of instruction, which can be applied effectively to problem-centered applications. The bread model involves the stages of identification of the problem, extraction of data, analysis of data, and calculation of term frequency.

II. RELATED WORK

From the earlier days, classification of text [1] has been directed towards web documents. Geibel et al proposed a method to classify web documents using the document structure; a much more powerful approach is to combine the structure with linguistic and semantic information.

Natural Language Processing (NLP) extracts meaning by making use of linguistic concepts such as part-of-speech tagging and grammatical structure. Marchisio et al utilize NLP techniques, writing their own parsers that do the entire parsing.

Bunescu and Mooney adapt Support Vector Machines (SVM) to compare NLP and non-NLP techniques; the SVM technique is broadly used in the field of text mining. Boontham et al propose three different approaches for text categorization: simple word matching, Latent Semantic Analysis (LSA), and topic models. Atkinson uses genetic algorithms, where features are usually represented as binary vectors. There are still many other effective methods for feature selection or extraction, such as Information Gain (IG), Document Frequency (DF), Chi-square, Term Strength (TS), and Term Contribution (TC) [2].

III. FEATURE SELECTION

Feature selection [3] is the process of deriving a new subset from large sets of textual data. Data mining techniques cannot directly deal with such large data, so reducing it to a feature-set representation makes the data mining task easier. The main difficulty in text classification is the high dimensionality of the feature space; feature selection reduces or compacts this space, and the reduced feature space is then used by classifiers for text classification. Feature selection reduces computational complexity and increases accuracy by removing noisy features. It can be performed by different methods, selecting features either with reference to a classifier or independently of any classifier.

A. Feature Selection for Text Mining

Recently, there has been tremendous growth in computer technologies and their omnipresent usage, which has led to the accumulation of a large number of documents on the internet. Data processing capacity cannot cope with this speed of accumulation, and the accumulated documents are predominantly unstructured. Text mining [4] is therefore one of the most important tasks in data mining. When data mining techniques are applied to textual documents, a document is considered an instance (or transaction), while terms (words or phrases) are considered features (or items). A number of approaches can be effectively applied to such textual data. Most feature selection approaches are based on a scoring scheme for terms [5]: the score of a feature represents the quality of the term in the document dataset, and a term set with a high score is important or relevant to the dataset. In supervised approaches, term scores are based on a labelled training set, i.e., on class information. Popular supervised feature selection approaches include information gain (IG), mutual information (MI) and χ2 statistics (CHI) [6]. Unsupervised feature selection approaches are based on heuristics for estimating the quality of terms in a dataset; for a dataset of textual documents, the heuristics generally focus on term distribution across the dataset. Popular unsupervised feature selection approaches include document frequency (DF) and term strength (TS).

Document frequency (DF) is a simple but effective measure for feature selection. Yang et al [7] concluded that DF is among the best measures (as good as IG and CHI) for selecting informative features. The document frequency of a term is the number of documents in which the term occurs [8]. The feature selection approach calculates the document frequency of every term and removes the terms whose document frequency is below a predefined threshold. The basic assumption is that frequent terms are more important and relevant to the dataset than infrequent ones.
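As a minimal sketch of this DF criterion (the toy corpus and the threshold value below are hypothetical, not the paper's data), the following Java fragment counts each term at most once per document and keeps only the terms whose document frequency meets the threshold:

    import java.util.*;

    // A minimal sketch of document-frequency (DF) feature selection:
    // keep only terms that occur in at least `threshold` documents.
    public class DfFilter {
        public static Set<String> selectByDf(List<String> documents, int threshold) {
            Map<String, Integer> df = new HashMap<>();
            for (String doc : documents) {
                // Count each term at most once per document.
                Set<String> seen = new HashSet<>(Arrays.asList(doc.toLowerCase().split("\\W+")));
                for (String term : seen) {
                    if (!term.isEmpty()) df.merge(term, 1, Integer::sum);
                }
            }
            Set<String> selected = new TreeSet<>();
            for (Map.Entry<String, Integer> e : df.entrySet()) {
                if (e.getValue() >= threshold) selected.add(e.getKey());
            }
            return selected;
        }

        public static void main(String[] args) {
            // Hypothetical toy corpus; any .txt contents could be substituted.
            List<String> docs = Arrays.asList(
                "text mining extracts patterns from text",
                "feature selection reduces the feature space",
                "clustering groups text documents");
            // Keep terms that appear in at least 2 of the 3 documents.
            System.out.println(selectByDf(docs, 2)); // prints [text]
        }
    }

With a threshold of 2 on this three-document corpus, only "text" survives; every other term occurs in a single document and is discarded.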
Term strength, proposed in [9] initially for stop-word removal, estimates the strength of a term based on how likely it is to appear in "closely-related" documents. It rests on the heuristic that documents with many shared words are related, and that terms in the heavily overlapping area of related documents are relatively informative [7]. The approach has two steps:

• Finding pairs of similar documents. This step calculates the similarity between all pairs of documents in the dataset, sim(di, dj), using the cosine value of the two document vectors. Two documents di and dj are considered "similar" if sim(di, dj) is above a predefined threshold ξ.

• Calculating term strength (see the sketch below). The strength of a term, s(t), is computed from the estimated conditional probability that the term t occurs in a document di given that it occurs in a document dj which is similar to di:

s(t) = P(t ∈ di | t ∈ dj, sim(di, dj) ≥ ξ).

As mentioned earlier, unsupervised feature selection approaches save the cost of labelling data and avoid the inaccuracy caused by a lack of homogeneity between the training and test datasets in the supervised process. This characteristic is especially important for text mining tasks, in which we always need to deal with an incredibly huge number of documents on various topics.
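To make the two steps concrete, the following Java sketch is one illustrative reading of the method in [9] (the corpus, the threshold ξ, and the use of binary term vectors are hypothetical assumptions). It finds similar ordered pairs by cosine similarity, then estimates s(t) as the fraction of similar pairs (dj, di) with t ∈ dj for which t ∈ di as well:

    import java.util.*;

    // Sketch of term strength: s(t) = P(t in di | t in dj, sim(di,dj) >= xi).
    public class TermStrength {
        // Binary bag-of-words representation: a document as a set of terms.
        static Set<String> terms(String doc) {
            return new HashSet<>(Arrays.asList(doc.toLowerCase().split("\\W+")));
        }

        // Cosine similarity of two binary term vectors: |A ∩ B| / sqrt(|A|*|B|).
        static double cosine(Set<String> a, Set<String> b) {
            Set<String> inter = new HashSet<>(a);
            inter.retainAll(b);
            return inter.size() / Math.sqrt((double) a.size() * b.size());
        }

        public static void main(String[] args) {
            List<String> docs = Arrays.asList(   // hypothetical corpus
                "text mining finds patterns in text data",
                "pattern mining in large text databases",
                "cooking recipes for pasta");
            double xi = 0.3;                     // hypothetical threshold

            Map<String, Integer> cond = new HashMap<>();   // t in dj, pair similar
            Map<String, Integer> joint = new HashMap<>();  // ... and t in di
            for (int i = 0; i < docs.size(); i++) {
                for (int j = 0; j < docs.size(); j++) {
                    if (i == j) continue;
                    Set<String> di = terms(docs.get(i)), dj = terms(docs.get(j));
                    if (cosine(di, dj) < xi) continue;     // step 1: similar pairs only
                    for (String t : dj) {                  // step 2: conditional counts
                        cond.merge(t, 1, Integer::sum);
                        if (di.contains(t)) joint.merge(t, 1, Integer::sum);
                    }
                }
            }
            for (String t : cond.keySet()) {
                double s = (double) joint.getOrDefault(t, 0) / cond.get(t);
                System.out.printf("s(%s) = %.2f%n", t, s);
            }
        }
    }

Here the unrelated "cooking" document never enters a similar pair, so its terms receive no strength at all, while terms shared by the two related documents score highest.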

B. FPI (First Principles of Instruction)

A first principle is a relationship that is always true under appropriate conditions, regardless of program or practice. The learning facilitated by a given program is in direct proportion to its implementation of the first principles of instruction. The principles were identified as follows:

• Analyze instructional theories, models, programs, and products to extract general first principles of instruction.
• Identify the cognitive processes associated with each principle.
• Identify empirical support for each principle.
• Describe the implementation of these principles in a variety of different instructional theories and models.
• Identify prescriptions for instructional design associated with these principles.

1) FPI - Instruction Phases

This instructional model suggests that the most effective learning environments are problem-based and involve five distinct, interrelated phases of learning. The model can be applied effectively to any problem-centered situation and can provide substantial results.

a) Problem Centered: Identification of the problem, which can progress from simple to complex.
b) Activation: Recall prior knowledge or experience and create a learning situation for the new problem.
c) Demonstration: Demonstrate or show a model of the skill required for the new problem.
d) Application: Apply the acquired skills to the new problem.
e) Integration: Provide the capability to carry the acquired skill over to another new situation.

Fig.1 Phases of First Principles of Instruction (Self-generated)

IV. PROPOSED BREAD MODEL ALGORITHM

The bread model is named after the breadboard used in electronic circuits. A breadboard, synonymous with 'prototype', is a constructional base for building and testing electronic circuits; it helps to build circuits effectively, so that understanding and debugging them is easier and quicker. Similarly, the bread model depicts the data extracted from it: like a breadboard, it tries to solve the problem of text mining in an effective and easy manner through easy understanding and visualization. Many approaches can be used for classifying and clustering the data; here the process is considered in different phases.

Fig.2 Phases of Proposed Bread Model (Self-generated)

The bread model system proceeds with the following observations. The term set T is defined using the application-phase algorithm of the First Principles of Instruction (FPI) to determine whether the document belongs to the application category or not. In this bread model, a term is any sequence of words separated from other terms. The term set defined by the FPI is used to calculate the term frequency and to determine the significance of the term set; a well-selected subset of the set of all terms, one that effectively portrays the input document, is considered. Let D = {d1, d2, d3} be a database of text documents; the selected input dataset is a member of some larger database. T is the set of all terms, defined by the application phase of the FPI, occurring in a document d. The following parameters are used:

• D -> Database
• T -> Unique term set
• S -> Sentences in the document
• Tf -> Term frequency

A. Implementation Steps

1. Select the term set 'T' (feature term keywords), a collection of terms t based on FPI.
2. Input the document 'd', taken from the database set 'D', along with the term set T.
3. From the given document 'd', extract each sentence 's' from the sentence set S and go to step 4.
4. Find the term frequency (Tf) of the term set 'T' in 's'. If no match with the application feature term set keywords is encountered, allow the user to make a decision based on the sentence by displaying it on the screen.
5. Repeat steps 3 and 4 until all the sentences s are read from the input document d.
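A minimal Java sketch of steps 1-5 follows (an interpretation, not the paper's implementation: the FPI term set, the document text, and the regex-based sentence splitting are hypothetical assumptions):

    import java.util.*;

    // A minimal sketch of the bread model loop (steps 1-5): scan each
    // sentence of a document and tally the frequency Tf of each FPI term.
    public class BreadModel {
        public static void main(String[] args) {
            // Step 1: term set T -- hypothetical application-phase keywords.
            Set<String> T = new HashSet<>(Arrays.asList("apply", "practice", "task"));

            // Step 2: input document d (hypothetical text standing in for a .txt file).
            String d = "Learners apply the new skill to a task. "
                     + "The task is varied with practice. No theory here.";

            // Step 3: extract each sentence s from the sentence set S.
            String[] S = d.split("(?<=[.!?])\\s+");

            Map<String, Integer> tf = new HashMap<>();
            for (String s : S) {
                boolean matched = false;
                // Step 4: find the term frequency Tf of the term set T in s.
                for (String word : s.toLowerCase().split("\\W+")) {
                    if (T.contains(word)) {
                        tf.merge(word, 1, Integer::sum);
                        matched = true;
                    }
                }
                // Step 4 (no match): defer to the user by displaying the sentence.
                if (!matched) System.out.println("Needs user decision: " + s);
            }   // Step 5: the loop repeats until all sentences are read.

            System.out.println("Term frequencies Tf: " + tf);
        }
    }

On this toy input, the third sentence contains no FPI keyword and is shown to the user, while the tally yields Tf values such as task=2, apply=1, practice=1.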

Fig 3: Dataflow diagram of Bread Model

B. Relevance of Documents Based on the Significance of the Term Set

The first phase of this pattern matching requires a term set or feature set which effectively portrays the meaning of the input document. Different implementation approaches would generate different sets of relevant and irrelevant terms. It must be conceived that the generation of relevant or irrelevant terms is not only important in itself; it should also provide manageable results to the user. Identification of this term set makes the pattern matching process effective and practical. In the experiment, terms can be manually classified into the following five groups:

• Topics, tasks, approaches, applications: association, classify, cluster.
• Concepts, terms: term, pattern, set, database, text, algorithm. These are concepts used more frequently in data mining than in other IT topics.
• Words specially used for a topic of data mining: frequency, large itemset, a priori (association rule mining); gene, tissue (data mining for biology); attribute, dimension (database mining); sequence, parallel, regression (data mining approaches).
• Words that are also popular in other IT topics: system, approach, software.
• Common words: show, define, increase, analyze, accurate, automatic, intelligent, challenge.

Here, the classification of terms into the above five groups makes the "manual labeling" process easier and more accurate. It also justifies which items are relevant or irrelevant; for this, knowledge of the dataset is essential. The significance of an individual term set can be calculated using the formula:

ST = Tf / ∑ Tf

where
ST – significance of the term set,
Tf – term frequency of the individual term set,
∑ Tf – sum of the term frequencies of all term sets.

Based on this significance ST of the term set, the document can be considered relevant or irrelevant.
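As a worked illustration of this formula (the term sets T1-T4 and their frequencies are hypothetical, chosen only to mirror the qualitative outcome reported in the results below, not the paper's measured values), the following Java sketch normalizes each term set's frequency by the total:

    import java.util.*;

    // Sketch of the significance measure: ST = Tf / sum of all Tf.
    public class Significance {
        public static void main(String[] args) {
            // Hypothetical term-set frequencies, e.g. from the bread model scan.
            Map<String, Integer> tf = new LinkedHashMap<>();
            tf.put("T1", 4);
            tf.put("T2", 1);
            tf.put("T3", 5);
            tf.put("T4", 10);

            int total = tf.values().stream().mapToInt(Integer::intValue).sum(); // sum Tf = 20
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                double st = (double) e.getValue() / total;   // ST = Tf / sum Tf
                System.out.printf("ST(%s) = %.2f%n", e.getKey(), st);
            }
            // With these numbers T4 scores highest (0.50) and T2 lowest (0.05);
            // a document is then judged relevant if ST clears a chosen threshold.
        }
    }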
C. Experimental Results

The proposed bread model system is implemented using the Java toolkit. The system can be applied to a textual document of any size. The experiment deals with independent datasets of .txt documents.

In the dataset, the bread model looks for suitable occurrences or matches of the terms and calculates the term frequency (Tf). The significance of each individual term set can then be determined using the formula illustrated above. The document is taken as relevant or irrelevant based on the calculated Tf, with respect to a predefined threshold.

Table 1: Experimental Result

Fig 4: Graphical representation of Result

From the above results, we can observe that term set T4 is highly significant and T2 has low significance. Based on this term frequency Tf, the document can be considered relevant or irrelevant with respect to the predefined threshold.

CONCLUSION

This paper gives a simple demonstration of a feature selection approach based on the application property of the First Principles of Instruction, together with a bread model algorithm. The system is used to quantify the application feature supported by a document with the help of a feature term set of keywords. The technique requires an adequate set of keywords to support the concept of application in the document, and those feature set keywords are generated with the help of human judgement. The system can be applied to a textual document of any size and can be considered a cost-effective tool that provides effective results.

In the future, the bread model can be used to analyze learning material or any other textual document with various features based on the properties of the First Principles of Instruction. The bread model approach can be further extended to multiple keywords in multiple documents.

ACKNOWLEDGMENT

I would like to thank my guide, Dr. T. Hanumantha Reddy, for providing excellent guidance, encouragement and inspiration throughout this work. Without his invaluable guidance, this work would never have been successful. I would also like to thank my family and friends, who have been a source of encouragement and inspiration throughout the duration of this work.
REFERENCES
[1] S. Wu, P.A. Flach, "Feature Selection with Labelled and Unlabelled Data," in Proc. of the ECML/PKDD'02 Workshop on Integration and Collaboration Aspects of Data Mining, Decision Support and Meta-Learning, University of Helsinki, 2002, pp. 156-167.
[2] A. Kao and S.R. Poteet (Eds.), Natural Language Processing and Text Mining, Springer-Verlag London Limited, 2007, ISBN-13: 978-1-84628-175-4.
[3] K. Nirmala, M. Pushpa, "Feature Based Text Classification Using Application Term Set," International Journal of Computer Applications (0975-8887), Vol. 52, No. 10, August 2012.
[4] D. Koller, M. Sahami, "Toward Optimal Feature Selection," in L. Saitta (Ed.), Proc. of the Thirteenth International Conference on Machine Learning (ICML '96), San Francisco, 1996, pp. 284-292.
[5] M. Kantardzic, Data Mining: Concepts, Models, Methods, and Algorithms, Wiley-IEEE Press, 2003.
[6] Y. Yang, J.O. Pedersen, "A Comparative Study on Feature Selection in Text Categorization," in Proc. of the 14th International Conference on Machine Learning (ICML-97), Morgan Kaufmann, San Francisco, US, 1997, pp. 412-420.
[7] Y. Yang, J.O. Pedersen, "A Comparative Study on Feature Selection in Text Categorization," in Proc. of the 14th International Conference on Machine Learning (ICML-97), Morgan Kaufmann, San Francisco, US, 1997, pp. 412-420.
[8] J.W. Wilbur, K. Sirotkin, "The Automatic Identification of Stop Words," Journal of Information Science, Vol. 18, 1992, pp. 45-55.
[9] D. Mladenic, "Feature Subset Selection in Text-Learning," in Proc. of the European Conference on Machine Learning, 1998, pp. 95-100.
[10] P. Geibel, U. Krumnack, O. Pustylnikow, A. Mehler, et al., "Structure-Sensitive Learning of Text Types," in AI 2007: Advances in Artificial Intelligence, Vol. 4830, pp. 642-646.
[11] A. Schenker, "Graph-Theoretic Techniques for Web Content Mining," PhD thesis, University of South Florida, 2003.
[12] K.R. Gee and D.J. Cook, "Text Classification Using Graph-Encoded Linguistic Elements," in FLAIRS Conference 2005, pp. 487-492.
