
J. Cent. South Univ. (2015) 22: 1833−1840


DOI: 10.1007/s11771-015-2702-8

WordNet-based lexical semantic classification for text corpus analysis

LONG Jun(龙军)1, WANG Lu-da(王鲁达)1, LI Zu-de(李祖德)1, ZHANG Zu-ping(张祖平)1, YANG Liu(杨柳)2

1. School of Information Science and Engineering, Central South University, Changsha 410075, China;
2. School of Software, Central South University, Changsha 410075, China
© Central South University Press and Springer-Verlag Berlin Heidelberg 2015

Abstract: Many text classification systems depend on statistical term measures to implement document representation. Such representations ignore the lexical semantic contents of terms and the mutual information that can be distilled from them, leading to text classification errors. This work proposes a document representation method, the WordNet-based lexical semantic VSM, to solve the problem. Using WordNet, the method constructs a data structure of semantic-element information to characterize lexical semantic contents, and adapts maximum entropy (ME) modeling to disambiguate word stems. Then, in the lexical-semantic space of the corpus, the lexical-semantic eigenvector of each document representation is built by calculating the weight of each synset, and is applied to the widely recognized NWKNN algorithm. On the text corpus Reuter-21578 and an adjusted version of it produced by lexical replacement, the experimental results show that the lexical-semantic eigenvector achieves better F1 scores and smaller dimensionality than the term-statistic eigenvector based on TF-IDF. The formation of document representation eigenvectors gives the method a wide prospect of classification applications in text corpus analysis.

Key words: document representation; lexical semantic content; classification; eigenvector

1 Introduction

Text corpus analysis is an important task, and clustering and classification are its key procedures. Text classification is an active research area in information retrieval, machine learning and natural language processing. Classification algorithms based on eigenvectors, such as KNN, SVM and ELM, prevail in this field, and eigenvector-based document classification is a widely used technology for text corpus analysis. Relevant classification algorithms and experiments are typically based on eigenvectors of document representation, and the key issue is that eigenvector-based classification algorithms depend on the VSM [1].

TF-IDF (term frequency–inverse document frequency) [2] is a prevalent method for characterizing documents, and its essence is a statistical term measure. Many document representation methods based on TF-IDF can construct a vector space model (VSM) of a text corpus. Similarly, many other document representation methods exploit statistical term measures, such as Bag-of-Words [3] and Minwise hashing [4]. For document representation, these methods are perceived as statistical methods of feature extraction.

However, in the information retrieval field, statistical term measures neglect lexical semantic content. As a result, corpus analysis basically operates at the level of term strings, and it disregards lexical replacement of an original document, which can easily deceive text corpus analysis.

A semantic approach is an effective technology for document analysis. It can capture the semantic features of the words under analysis and, based on them, characterize and classify the document. The close relationship between the syntax and the lexical semantic contents of words has attracted considerable interest in both linguistics and computational linguistics.

The design and implementation of WordNet-based lexical semantic classification take particular account of lexical semantic content. Unlike traditional statistical methods of feature extraction, our work develops a new term measure which characterizes lexical semantic contents and provides a practical method of document representation that can handle the impact of lexical replacement. The document representation is normalized as an eigenvector; consequently, it can be applied to VSM-dependent classification algorithms. Theoretical analysis and a large number of experiments are carried out to verify the effectiveness of this method.

Foundation item: Project(2012AA011205) supported by National High-Tech Research and Development Program (863 Program) of China;
Projects(61272150, 61379109, M1321007, 61301136, 61103034) supported by the National Natural Science Foundation of China;
Project(20120162110077) supported by Research Fund for the Doctoral Program of Higher Education of China; Project(11JJ1012)
supported by Excellent Youth Foundation of Hunan Scientific Committee, China
Received date: 2014−03−21; Accepted date: 2014−10−11
Corresponding author: WANG Lu-da, PhD Candidate, Lecturer; Tel: +86−18613082443; E-mail: [email protected]
2 Analysis of statistical term measure

In the information retrieval field, similarity and correlation analysis of a text corpus needs to implement corresponding document representations for diverse algorithms. Many practicable methods of document representation share a basic mechanism: the statistical term measure.

Typical statistical methods of feature extraction include TF-IDF, based on lexical term frequency, and shingle hashing, based on consecutive terms [5]. Many TF-IDF-based feature extraction methods employ the simple assumption that frequent terms are also significant [2], and these methods quantify the extent to which terms are useful for characterizing the document in which they appear [2]. Besides, for hashing measures based on fingerprinted shingles, a sequence of k consecutive terms in a document is called a shingle. A selection algorithm then determines which shingles to store in a hash table, and various estimation techniques are used to determine which shingle is copied and from which document most of the content originated [5].

These methods for document representation are perceived as the mode using statistical term measures. Unlike ontology-based methods [6], document representations based on statistical term measures ignore the recognition of lexical semantic contents. This causes the document representation to lose the mutual information [7] of term meanings which comes from synonyms in different samples. Moreover, lexical replacement of an original document cannot be represented literally by purely statistical mechanisms of term measure. Our comment on statistical term measures and document representation can be clarified by analyzing the small text corpus of Example 1.

Example 1.
Sample A: Men love holiday.
Sample B: Human enjoys vacation.

In Example 1, the two simple sentences are viewed as two document samples, and these two documents comprise the small corpus. Evidently, the meanings of Sample A and Sample B are essentially equivalent, so the correlation and semantic similarity between the two documents are considerable. Meanwhile, Sample B can be regarded as a derivative of Sample A obtained by lexical replacement of the original document. Text segmentation divides each document into meaningful terms, such as words, sentences, or topics; in Example 1, all words of the documents are taken as terms. Obviously, document representations built from statistical term measures do not perform well on Example 1, as listed in Tables 1 and 2.

Table 1 Statistical term measures on Sample A
Term                 Men   Love   Holiday   Human   Enjoys   Vacation
Weight (frequency)    1     1       1         0       0         0

Table 2 Statistical term measures on Sample B
Term                 Men   Love   Holiday   Human   Enjoys   Vacation
Weight (frequency)    0     0       0         1       1         1

Comparing Tables 1 and 2, positive weights never coexist in the same term of the two samples. These two orthogonal term-weight vectors demonstrate that statistical term measures for document representation cannot effectively signify the semantic similarity of the corpus in Example 1, and they do not recognize or represent the lexical semantic contents of the two documents. As a result, these two vectors cannot provide any mutual information of term meanings.
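This orthogonality can be checked directly. The following minimal Python sketch is illustrative only; the fixed vocabulary and raw term frequencies mirror Tables 1 and 2 and are not taken from any implementation in the paper.

```python
from collections import Counter
from math import sqrt

VOCABULARY = ["men", "love", "holiday", "human", "enjoys", "vacation"]

def term_vector(text):
    """Term-frequency vector over the fixed vocabulary of Tables 1 and 2."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in VOCABULARY]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

sample_a = term_vector("Men love holiday")       # [1, 1, 1, 0, 0, 0]
sample_b = term_vector("Human enjoys vacation")  # [0, 0, 0, 1, 1, 1]
print(cosine(sample_a, sample_b))                # 0.0: no term weight is shared
```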
3 Proposed program

3.1 Motivation and theoretical analysis

For text corpus analysis, document representations which depend on statistical term measures lose the mutual information of term meanings. Besides, in different documents, term meanings are related to specific synonyms which are involved in the lexical semantic contents. Thus, this work resorts to WordNet [8], a lexical database for English, for extracting lexical semantic contents. The proposed method of document representation then constructs a lexical-semantic VSM of the text corpus to define the eigenvectors used for text classification.

In WordNet, a form is represented by a string of ASCII characters, and a sense is represented by the set of (one or more) synonyms that have that sense [8]. Synonymy is WordNet's basic relation, because WordNet uses sets of synonyms (synsets) to represent word senses. That is, as shown in Fig. 1, one word refers to several synsets.

Fig. 1 Common semantic-element of words

In WordNet, because one word or term refers to particular synsets, our motivation is that several particular synsets can strictly describe the meaning of one word and thereby characterize its lexical semantic contents.
Then, our method defines these particular synsets as the semantic-elements of a word.

Based on the above definition, the involved semantic-elements can characterize the lexical semantic contents of Example 1, which accomplishes feature extraction of lexical semantic contents. For instance, in Fig. 1, the words human and man belong to different document samples in Example 1, and the common semantic-element homo, which simultaneously describes the meanings of human and man, can gain mutual information [7] between the term meanings. Moreover, our document representation is able to capture the lexical semantic mutual information between samples which lies in the same synonyms of different documents.
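This shared semantic-element can be inspected with any WordNet interface. The sketch below uses NLTK's WordNet corpus reader as a stand-in (the authors cite the MIT Java WordNet Interface [10]); it intersects the synsets of human and man, and in WordNet 3.0 the intersection is expected to contain the synset whose lemmas include homo, the common semantic-element of Fig. 1. Exact synset names depend on the WordNet version installed.

```python
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

# Every word maps to several synsets, i.e. candidate semantic-elements.
human_synsets = set(wn.synsets("human"))
man_synsets = set(wn.synsets("man"))

# The common semantic-elements are the synsets shared by both words.
for synset in human_synsets & man_synsets:
    print(synset.name(), [lemma.name() for lemma in synset.lemmas()])
# In WordNet 3.0 the output is expected to include
# homo.n.02 ['homo', 'man', 'human_being', 'human']
```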
probability P(xi, yj), and function F (exi , y j ) is estimated
According to the statistical theory of
communications, our work needs further analysis for by calculating the frequency e xi , y j , and modulo N.
theoretical proof. The analysis first introduces some of In example 1, joint probability P(xi, yj) is estimated
the basic formulae of information theory [2, 7], which by counting the frequency of the common semantic-
are used in our theoretical development of samples elements, and modulo N. For instance, the words human
mutual information. Now, let xi and yj be two distinct and man are described by the common semantic-element
terms (events) from finite samples (event spaces) X and Y. homo (shown in Fig. 1). In reality, P(human, man)=
Then, let X or Y be random variable representing F(homo) mod N>0, as a result, lexical semantic mutual
distinct lexical semantic contents in sample X or Y, information between Sample A and Sample B,
which occurs with certain probabilities. In reference to I (X ; Y ), is positive. Thus, the analysis proves that the
above definitions, mutual information between X and semantic-elements and feature extraction of lexical
Y , represents the reduction of uncertainty about either semantic contents can provide the probability-weighted
X or Y when the other is known. The mutual amount of information (PWI) [2] between document
information between samples, I (X ; Y ), is specially samples on the lexical semantic level.
defined as
P( xi , y j ) 3.2 Lexical-semantic VSM of text corpus
I (X ; Y ) = å å P( xi , y j ) lg P( x ) P( y (1) In our work, documents are represented using the
xi Î X y j ÎY i j)
vector space model (VSM). The VSM represents each
In the statistical methods of feature extraction, document as a vector of identifiers [1]. Each dimension
probability P(xi) or P(yj) is estimated by counting the corresponds to a separate feature value. If a feature
number of observations (frequency) of xi or yj in sample occurs in the document, its value in the vector is non-
X or Y, and normalizing by N, the size of the corpus. zero. Several different ways of computing these values,
Joint probability, P(xi, yj), is estimated by counting the also known as (term) weights, have been developed.
number of times (related frequency) that term xi equals For organizing the lexical-semantic VSM of text
(is related to) yj in the respective samples of themselves, corpus in the lexical-semantic space, the procedures are
and normalizing by N. as follows. In the first place, for feature extraction of
Taking the Example 1, between any term xi in lexical semantic contents, our work makes a data
Sample A and any term yj in Sample B, there is not any structure of semantic-element information. Secondly, the
counting of times that xi equals yj. As a result, on corpus work uses EM modeling to disambiguate word stems.
Example 1, the statistical term measures indicate P(xi, Lastly, it constructs a lexical-semantic space and builds
yj)=0 so the samples mutual information I (X ; Y ) = 0. lexical-semantic eigenvectors in the space to characterize
Thus, the analysis verifies that the statistical methods of document samples.
feature extraction lose mutual information of term 1) The data structure of semantic-element
meanings. information comprises relevant information of each
On the other hand, for feature extraction of lexical semantic-element in a document sample, which is
semantic contents, our method uses several particular formalized as a data element, listed in Table 3. It can
semantic-elements to describe the meaning of one word record all important information of semantic-elements in
or term. In different samples, words can be related to a document, such as synset ID, weight, sample ID and
other words described by same semantic-elements. Then, relevant information of words.
lexical semantic mutual information between samples, Note that, in a record of the data structure, each
1836 J. Cent. South Univ. (2015) 22: 1833−1840
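To make the contrast between Eqs. (1) and (2) concrete, the short sketch below estimates both quantities on Example 1. It is only an illustration under simplifying assumptions: the "mod N" terms of Eq. (2) are read here as normalization by N, and each of the three word pairs (human/man, love/enjoys, holiday/vacation) is assumed to share exactly one synset, as Fig. 1 shows for human/man.

```python
from math import log10

def mutual_information(pairs):
    """Sum of p_xy * lg(p_xy / (p_x * p_y)) over (joint, marginal, marginal) triples."""
    return sum(p_xy * log10(p_xy / (p_x * p_y)) for p_xy, p_x, p_y in pairs if p_xy > 0)

N = 6.0  # six terms, and six candidate semantic-elements, in the toy corpus

# Eq. (1): term-level estimate. No term of Sample A ever equals a term of Sample B,
# so every joint probability is zero and the mutual information vanishes.
term_pairs = [(0.0, 1 / N, 1 / N)] * 9
print(mutual_information(term_pairs))      # 0.0

# Eq. (2): semantic-element estimate. Each of the three pairs is assumed to share
# one synset, so the joint frequencies are positive and the estimate is positive.
semantic_pairs = [(1 / N, 1 / N, 1 / N)] * 3
print(mutual_information(semantic_pairs))  # about 0.39 > 0
```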
3.2 Lexical-semantic VSM of text corpus

In our work, documents are represented using the vector space model (VSM). The VSM represents each document as a vector of identifiers [1]; each dimension corresponds to a separate feature value, and if a feature occurs in the document, its value in the vector is non-zero. Several different ways of computing these values, also known as (term) weights, have been developed.

The lexical-semantic VSM of a text corpus is organized in the lexical-semantic space by the following procedure. First, for feature extraction of lexical semantic contents, our work builds a data structure of semantic-element information. Secondly, the work uses ME modeling to disambiguate word stems. Lastly, it constructs a lexical-semantic space and builds lexical-semantic eigenvectors in that space to characterize the document samples.

1) The data structure of semantic-element information comprises the relevant information of each semantic-element in a document sample, formalized as a data element as listed in Table 3. It records all the important information of the semantic-elements in a document, such as the synset ID, weight, sample ID and relevant information of the words.

Table 3 Data structure of semantic-element information
Item                         Explanation
Synset ID                    Identification of the synset
Set of synonyms              All synonyms in the identical synset; WordNet uses sets of synonyms (synsets) to represent word senses [8]
Weight (frequency)           Frequency of the semantic-element in a document sample (sum of the semantic members' frequencies)
Sample ID                    Identification of the document sample
Semantic member              A linked list (shown in Fig. 2(a)) which carries all original words of the terms referring to the semantic-element and their word stem(s)
Semantic members frequency   A linked list (shown in Fig. 2(b)) which carries the frequency of each original word of the terms (that refer to the semantic-element) one by one

Note that, in a record of the data structure, each original word in inflected form [9] referring to the semantic-element, together with its word stem(s) in base form [9−10], is recorded by the linked list of the semantic member (shown in Fig. 2(a)). According to the WordNet framework [8], when an original word refers to more than one word stem, the linked list of the semantic member expands the node of that original word to register all of its word stems.

Fig. 2 Linked lists of semantic member (a) and semantic members frequency (b)

Meanwhile, the linked list of semantic members frequency is shown in Fig. 2(b). It records the frequency of each original word, one by one, in the original word order of the semantic member. Both linked lists carry the essential information of the original words and word stems in the semantic-element.
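A minimal Python rendering of one record of Table 3 might look as follows; the field names paraphrase the table rows rather than quoting the authors' identifiers, and ordinary dictionaries stand in for the linked lists of Fig. 2.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class SemanticElement:
    """One record of semantic-element information (cf. Table 3)."""
    synset_id: str                  # identification of the synset
    synonyms: List[str]             # all synonyms in the identical synset
    sample_id: str                  # identification of the document sample
    # semantic member: original word -> its word stem(s) in base form (Fig. 2(a))
    members: Dict[str, List[str]] = field(default_factory=dict)
    # semantic members frequency: original word -> frequency (Fig. 2(b))
    member_freq: Dict[str, int] = field(default_factory=dict)

    @property
    def weight(self) -> int:
        """Weight (frequency): sum of the semantic members' frequencies."""
        return sum(self.member_freq.values())

record = SemanticElement(
    synset_id="homo.n.02",
    synonyms=["homo", "man", "human_being", "human"],
    sample_id="sample_A",
    members={"men": ["man"]},
    member_freq={"men": 1},
)
print(record.weight)  # 1
```

Keeping the per-word frequencies separate from the aggregate weight is what allows the diversity of the semantic members, used for stem disambiguation below, to be computed per semantic-element.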
2) On the basis of the data structure of semantic-element information, the semantic member needs to disambiguate the word stems of each original word. When an original word refers to more than one word stem in base form, the semantic-element must ensure that the original word refers to only one word stem. To select exactly one word stem for an original word (shown in Fig. 3), this method employs the maximum entropy model [11]. ME modeling provides a framework for integrating information for classification from many heterogeneous information sources [12].

In our model, it is supposed that the diversity [13] of the semantic members implies the significance of the semantic-element and the rationality of the existing semantic members.

Assume a set of original words X and a set of its word stems C. The function cl(x): X→C chooses the word stem c with the highest conditional probability, which makes sure that the original word x refers to only one stem: cl(x) = arg max_c p(c|x). Each feature [12] of an original word is calculated by a function associated with a specific word stem c, which takes the form of Eq. (3), where S_i is the number of semantic members of semantic-element i, P_j is the proportion of the frequency of original word j to the weight of semantic-element i, and -\sum_{j=1}^{S_i} P_j \log_2 P_j indicates the semantic member diversity of semantic-element i in a document, in the form of the Shannon-Wiener index [13−14].

The conditional probability p(c|x) is defined by Eq. (4). The parameter of semantic-element i [12], α_i, is the frequency of the original word x in semantic-element i, K is the number of semantic-elements that word stem c refers to, and Z(x) is a value ensuring that the sum of all conditional probabilities for this context is equal to 1.

f_i(x, c) = \begin{cases} -\sum_{j=1}^{S_i} P_j \log_2 P_j, & \text{if } x \text{ refers to } c \text{ and } c \text{ refers to semantic-element } i \\ 0, & \text{otherwise} \end{cases}                    (3)

p(c|x) = \frac{1}{Z(x)} \prod_{i=1}^{K} \alpha_i^{f_i(x, c)}                    (4)

The above equations aim at finding the highest conditional probability p(c|x) and at using the function cl(x) to ensure that the original word x refers to only one word stem (as in Fig. 3). After the semantic-elements have characterized the lexical semantic contents of a document preliminarily, the specified ME modeling is applied to implement the disambiguation of word stems. Necessarily, the relevant items in the data structure of semantic-element information are then modified, such as the semantic member, the frequency of the original word, and the weight; furthermore, some relevant semantic-elements are eliminated.
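Read this way, stem selection can be sketched in a few lines. The snippet below is a simplified reading of Eqs. (3) and (4) with invented container names; it is not the authors' implementation, and Z(x) is only used to report normalized scores, since it cancels in the arg max. The example word "axes", which could stem to "axe" or "axis", is hypothetical.

```python
from math import log2

def diversity(member_freqs):
    """Shannon-Wiener diversity of one element's semantic members (the sum in Eq. (3))."""
    weight = sum(member_freqs)
    proportions = [f / weight for f in member_freqs if f > 0]
    return -sum(p * log2(p) for p in proportions)

def choose_stem(word, candidate_stems, elements):
    """cl(x) = argmax_c p(c|x), with p(c|x) as in Eq. (4).

    elements: list of dicts with keys 'stems' (word stems the synset refers to) and
    'member_freq' (original word -> frequency), a stand-in for Table 3 records.
    """
    scores = {}
    for stem in candidate_stems:
        score = 1.0
        for element in elements:
            if stem in element["stems"] and word in element["member_freq"]:
                alpha = element["member_freq"][word]                   # frequency of x in element i
                feature = diversity(element["member_freq"].values())   # f_i(x, c)
                score *= alpha ** feature
        scores[stem] = score
    z = sum(scores.values()) or 1.0
    best = max(scores, key=scores.get)
    return best, {stem: s / z for stem, s in scores.items()}

# Hypothetical case: the inflected form "axes" could stem to "axe" or to "axis".
elements = [
    {"stems": ["axis"], "member_freq": {"axes": 2, "axis": 3}},
    {"stems": ["axe"], "member_freq": {"axes": 1}},
]
print(choose_stem("axes", ["axe", "axis"], elements))
```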
Fig. 3 1:1 reference of original word

3) The document representation uses the vector space model (VSM). In the text corpus, once all referred semantic-elements are fixed by the disambiguation of word stems, each identical synset ID of all semantic-elements fills one dimension of the lexical-semantic VSM. In the lexical-semantic VSM, each document representation is marked in the lexical-semantic space of the text corpus; specifically, each document sample, identified by its sample ID, is represented by a lexical-semantic eigenvector. The lexical-semantic VSM represents a document doc_x using a lexical-semantic eigenvector d_x ∈ R^m, given as

d_x = (d_x(1), d_x(2), ..., d_x(m))                    (5)

where m is the number of identical synset IDs of all semantic-elements in the corpus, and d_x(i) is the feature value on the ith synset, given as d_x(i) = F_S(s_i, doc_x)·F_IDF(s_i) for all i = 1 to m. F_S(s_i, doc_x) is the weight (frequency) of the ith corresponding semantic-element s_i in document doc_x, and F_IDF(s_i) = lg(D/N_DF(s_i)) is the inverse document frequency of s_i, where D is the total number of documents in the corpus and N_DF(s_i) is the number of documents in which the ith synset appears at least once.
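Given per-document semantic-element weights, building the eigenvector of Eq. (5) is a single weighting pass. The sketch below assumes each document has already been reduced to a mapping from synset ID to its F_S weight (for instance from the data structure above); the names and the toy input are illustrative.

```python
from math import log10

def lexical_semantic_vectors(doc_weights):
    """doc_weights: dict doc_id -> {synset_id: F_S weight}; returns dict doc_id -> eigenvector."""
    # One dimension per distinct synset ID in the corpus (the m dimensions of Eq. (5)).
    synsets = sorted({s for weights in doc_weights.values() for s in weights})
    D = len(doc_weights)  # total number of documents in the corpus
    n_df = {s: sum(1 for w in doc_weights.values() if s in w) for s in synsets}
    f_idf = {s: log10(D / n_df[s]) for s in synsets}  # F_IDF(s_i) = lg(D / N_DF(s_i))
    return {
        doc: [weights.get(s, 0) * f_idf[s] for s in synsets]  # d_x(i) = F_S(s_i, doc) * F_IDF(s_i)
        for doc, weights in doc_weights.items()
    }

vectors = lexical_semantic_vectors({
    "doc1": {"homo.n.02": 2, "vacation.n.01": 1},
    "doc2": {"vacation.n.01": 3},
})
print(vectors["doc1"], vectors["doc2"])
```

The term-statistic eigenvector used for comparison in Section 4 is built the same way, with terms and term frequencies in place of synset IDs and semantic-element weights.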
4 Experiment and its results

To test the lexical-semantic VSM and verify the lexical semantic classification, this work uses two sorts of eigenvector to represent documents in two datasets and employs an effective algorithm to classify the documents. After that, the contrast between our method and a typical statistical method displays the effect of this work.

4.1 Eigenvectors for document representation

In our work, the experiments use two sorts of eigenvector to represent a document sample: 1) the lexical-semantic eigenvector in the lexical-semantic VSM, as in Eq. (5); 2) the term-statistic eigenvector in the term-space, which takes different numbers of features selected by information gain [15]. Using the typical statistical method of feature extraction, TF-IDF, the term-statistic eigenvector d_x ∈ R^n is given as [2]

d_x = (d_x(1), d_x(2), ..., d_x(n))                    (6)

where n is the number of terms in the corpus, and d_x(j) is the feature value on the jth term, given as d_x(j) = F_TF(w_j, doc_x)·F_IDF(w_j) for all j = 1 to n, where F_TF(w_j, doc_x) is the frequency of the term w_j in document doc_x and F_IDF(w_j) is the inverse document frequency of w_j.

4.2 Datasets

Our experiments use two corpora: Reuter (http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html) and an adjusted corpus based on Reuter-21578.

1) Reuter. The Reuters-21578 text categorization test collection contains documents collected from the Reuters newswire in 1987. It is a standard text categorization benchmark and contains 135 categories. Our experiments used a subset consisting of 20 categories, which has approximately 3500 documents (listed in Table 4).

2) Adjusted corpus (based on Reuter-21578). After selecting the subset of Reuter-21578, the dataset unites with it lexical-replacement documents derived from 10% of the subset originals. Specifically, each lexical-replacement document is changed from an original document in the subset. For instance, in Table 5, the semantic contents of the lexical replacement and the original are similar, and their meanings are essentially equivalent.

Table 4 Distribution of all categories in subset of Reuter-21578
Category   Cotton  Earn  Cpi  Rubber  Sugar  Money-fx  Bop  Grain  Heat  Money-supply
Sample       27    761    75     40    145      574     47    489    16      113
Category   Silver  Tin  Crude  Hog  Nat-gas  Jobs  Cocoa  Trade  Housing  Nickel
Sample       16     32    483    16     48     50     59    441      16       5
Table 5 Lexical replacement of <REUTERS … NEWID="40">
Original: Stable interest rates and a growing economy are expected to provide favorable conditions for further growth in 1987, president Brian O'Malley told shareholders at the annual meeting. Standard Trustco previously reported assets of 1.28 billion dlrs in 1986, up from 1.10 billion dlrs in 1985. Return on common shareholders' equity was 18.6% last year, up from 15% in 1985.
Lexical replacement: Unchanging accrual rates of deposit and an uprising economy are anticipated to render favourable status for further increment in 1987, president Brian O'Malley said to stockholders at the yearly meeting. Standard Trustco antecedently covered assets of 1.28 billion dlrs in 1986, upward from 1.10 billion dlrs in 1985. Return on common stockholders' equity was 18.6% last year, upward from 15% in 1985.
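The paper does not describe the exact procedure used to generate such replacement documents, but the idea of substituting words with WordNet synonyms can be sketched as follows (NLTK again assumed as the WordNet interface; this is an illustration of the idea, not the authors' tool).

```python
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def replace_with_synonym(word):
    """Return a WordNet synonym that differs from the word itself, if any exists."""
    for synset in wn.synsets(word):
        for lemma in synset.lemmas():
            candidate = lemma.name().replace("_", " ")
            if candidate.lower() != word.lower():
                return candidate
    return word  # no synonym found: keep the original word

sentence = "Stable interest rates and a growing economy"
print(" ".join(replace_with_synonym(w) for w in sentence.split()))
# Output depends on the installed WordNet version; stop words are replaced too,
# so a real generator would also filter by part of speech and word frequency.
```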
4.3 Classification using NWKNN algorithm

In text corpus analysis, the effectiveness of KNN classification depends especially on the selection of the data eigenvectors. To tackle the unbalanced text corpus, the experiments use an optimized KNN classifier, the NWKNN (Neighbor-Weighted K-Nearest Neighbor) algorithm [14]. For NWKNN classification, each document d is represented [15−16] using both the lexical-semantic eigenvector and the term-statistic eigenvector. Formally, the decision rule [14] of NWKNN classification can be written as

score(d, c_i) = \mathrm{Weight}_i \left[ \sum_{d_j \in KNN(d)} Sim(d, d_j)\, \delta(d_j, c_i) \right]                    (7)

\delta(d_j, c_i) = \begin{cases} 1, & d_j \in c_i \\ 0, & d_j \notin c_i \end{cases}                    (8)

where KNN(d) indicates the set of K nearest neighbors of document d; Sim(d, d_j) denotes the similarity between documents d and d_j, computed as the cosine value between the eigenvectors of d and d_j [14]; and δ(d_j, c_i) is the classification of document d_j with respect to class c_i. Besides, according to the experience with the NWKNN algorithm [14], the parameter of Weight_i, the exponent [14], ranges from 2.0 to 6.0.
4.4 Performance measure

To evaluate the text classification system, the performance measure uses the F1 measure [17]. This measure combines recall and precision in the following way:

F_1 = \frac{2 \cdot R_{\mathrm{Recall}} \cdot P_{\mathrm{Precision}}}{R_{\mathrm{Recall}} + P_{\mathrm{Precision}}}                    (9)

where R_Recall is the recall and P_Precision is the precision. The F1 measure can display the effect of different kinds of data on a text classification system [17]. For ease of comparison, our experiments summarize the F1 scores over the different categories using the macro-averages of the F1 scores; in the same way, the macro-recall and macro-precision can be obtained [17].
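Macro-averaging simply applies Eq. (9) per category and averages the per-category results; a short helper (illustrative, with made-up numbers) is:

```python
def f1(recall, precision):
    """Eq. (9): harmonic mean of recall and precision."""
    return 2 * recall * precision / (recall + precision) if (recall + precision) else 0.0

def macro_scores(per_category):
    """per_category: dict category -> (recall, precision); returns macro recall/precision/F1."""
    n = len(per_category)
    macro_r = sum(r for r, _ in per_category.values()) / n
    macro_p = sum(p for _, p in per_category.values()) / n
    macro_f1 = sum(f1(r, p) for r, p in per_category.values()) / n
    return macro_r, macro_p, macro_f1

print(macro_scores({"earn": (0.92, 0.88), "grain": (0.70, 0.75)}))
```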
Besides, to express the dimension reduction in relative terms, the experiments compare the numbers of dimensions against an increasing number of documents, which indicates the corpus scale. The comparison between the two sorts of eigenvectors on dimensionality can display the optimization of the dimension problem [18].

4.5 Experimental results

To accomplish three-fold cross validation, the experiments conduct the training-test procedure on the Reuter and adjusted-corpus datasets three times alternately, and use the average of the three performances as the final result.

Using NWKNN classification, Fig. 4 shows the F1 measure curves of the lexical-semantic eigenvector and the term-statistic eigenvector on Reuter. Note that the exponent takes 3 empirically [14]. It is obvious that the lexical-semantic eigenvector beats the term-statistic eigenvectors under all selected feature numbers of the term-space [16] by 4%−7% on Reuter.

Fig. 4 Classification result of lexical-semantic eigenvector and term-statistic eigenvector with different term-space feature numbers on Reuter

Using different exponents in NWKNN, Fig. 5 illustrates the F1 measure comparison between the lexical-semantic eigenvector and the term-statistic eigenvector on Reuter. Note that the feature number of the term-space takes 10000. With the increase of the exponent, the lexical-semantic eigenvector performs better on Reuter and beats the term-statistic eigenvector by approximately 4% on average.

Using different exponents in NWKNN, Fig. 6 describes the macro-precision and macro-recall comparison between the lexical-semantic eigenvector and the term-statistic eigenvector on Reuter. Note that the feature number of the term-space takes 10000. It is apparent that, with the increase of the exponent, the curves accord with the experience of NWKNN [14]. Meanwhile, the macro-precision and macro-recall of the lexical-semantic eigenvector are superior to those of the term-statistic eigenvector by 7% and 8% on Reuter on average.
Fig. 5 Classification result of lexical-semantic eigenvector and term-statistic eigenvector with different exponents on Reuter

Fig. 6 Classification recall and precision of lexical-semantic eigenvector with different exponents and term-statistic eigenvector on Reuter

Using NWKNN classification, Fig. 7 shows the F1 measure curves for the lexical-semantic eigenvector and the term-statistic eigenvector on the adjusted corpus. Note that the exponent takes 3 empirically [14]. It is obvious that the lexical-semantic eigenvector beats the term-statistic eigenvectors under all selected feature numbers of the term-space [16] by 10%−13% on the adjusted corpus.

Fig. 7 Classification result of lexical-semantic eigenvector and term-statistic eigenvector with different term-space feature numbers on adjusted corpus

Using different exponents in NWKNN, Fig. 8 illustrates the F1 measure comparison between the lexical-semantic eigenvector and the term-statistic eigenvector on the adjusted corpus. Note that the feature number of the term-space takes 10000. With the increase of the exponent, the lexical-semantic eigenvector performs better on the adjusted corpus and beats the term-statistic eigenvector by 10% on average.

Fig. 8 Classification result of lexical-semantic eigenvector and term-statistic eigenvector with different exponents on adjusted corpus

Using different exponents in NWKNN, Fig. 9 describes the macro-precision and macro-recall comparison between the lexical-semantic eigenvector and the term-statistic eigenvector on the adjusted corpus. Note that the feature number of the term-space takes 10000. It is apparent that, with the increase of the exponent, the curves accord with the experience of NWKNN [14]. Meanwhile, the macro-precision and macro-recall of the lexical-semantic eigenvector are superior to those of the term-statistic eigenvector by 11% and 12% on the adjusted corpus on average.

Fig. 9 Classification recall and precision of lexical-semantic eigenvector with different exponents and term-statistic eigenvector on adjusted corpus
According to Eqs. (5) and (6), Fig. 10 reports the dimensionalities of the lexical-semantic eigenvector and the term-statistic eigenvector on Reuter. Note that the number of documents ranges from 200 to 2800. After the number of documents reaches 1650, the dimensionality of the lexical-semantic eigenvector is less than that of the term-statistic eigenvector. This indicates the improvement of dimension reduction achieved by our method.

Fig. 10 Dimensionalities of lexical-semantic eigenvector and term-statistic eigenvector on Reuter-21578

5 Conclusion and future work

1) A data structure of semantic-element information is constructed to record the relevant information of each semantic-element in a document sample. It can characterize lexical semantic contents and be adapted for the disambiguation of word stems.
2) The lexical-semantic eigenvector, used with the NWKNN algorithm, achieves better classification performance than the term-statistic eigenvector, which stands for the typical statistical method of feature extraction, especially under the impact of lexical replacement.
3) Our method of document representation demonstrates an improvement of dimension reduction for text classification.

As for future work, the research includes applying more current algorithms to the lexical-semantic eigenvector for text corpus analysis, and developing a method for representing semi-structured documents such as XML on the basis of semantic-elements.

References

[1] JING L P, NG M K, HUANG JOSHUA Z. Knowledge-based vector space model for text clustering [J]. Knowledge and Information Systems, 2010, 25(1): 35−55.
[2] ZHANG Wen, YOSHIDA Taketoshi, TANG Xi-jin. A comparative study of TF*IDF, LSI and multi-words for text classification [J]. Expert Systems with Applications, 2011, 38(3): 2758−2765.
[3] ZHANG Yin, JIN Rong, ZHOU Zhi-hua. Understanding bag-of-words model: A statistical framework [J]. International Journal of Machine Learning and Cybernetics, 2010, 1(1/2/3/4): 43−52.
[4] LI P, SHRIVASTAVA A, KONIG A C. b-Bit minwise hashing in practice [C]// Proceedings of the 5th Asia-Pacific Symposium on Internetware. New York: ACM, 2013: 13−22.
[5] HAMID A O, BEHZADI B, CHRISTOPH S, HENZINGER M. Detecting the origin of text segments efficiently [C]// Proceedings of the 18th International Conference on World Wide Web. New York: ACM, 2009: 61−70.
[6] SANCHEZ D, BATET M. A semantic similarity method based on information content exploiting multiple ontologies [J]. Expert Systems with Applications, 2013, 40(4): 1393−1399.
[7] CHURCH K W, HANKS P. Word association norms, mutual information, and lexicography [J]. Computational Linguistics, 1990, 16(1): 22−29.
[8] MILLER G A. WordNet: A lexical database for English [J]. Communications of the ACM, 1995, 38(11): 39−41.
[9] LINTEAN M, RUS V. Measuring semantic similarity in short texts through greedy pairing and word semantics [C]// Proceedings of the 25th International Florida Artificial Intelligence Research Society Conference. Marco Island, USA: AAAI, 2012: 244−249.
[10] MIT. MIT Java Wordnet Interface (JWI) [EB/OL]. [2013−12−20]. http://projects.csail.mit.edu/jwi/api/edu/mit/jwi/morph/WordnetStemmer.html.
[11] ZHAO Ling-yun, LIU Fang-ai, ZHU Zhen-fang. Frontier and future development of information technology in medicine and education: Identification of evaluation collocation based on maximum entropy model [M]. 1st ed. New York: Springer, 2013: 713−721.
[12] HWANG M, CHOI C, KIM P. Automatic enrichment of semantic relation network and its application to word sense disambiguation [J]. IEEE Transactions on Knowledge and Data Engineering, 2011, 23(6): 845−858.
[13] KEYLOCK C J. Simpson diversity and the Shannon–Wiener index as special cases of a generalized entropy [J]. Oikos, 2005, 109(1): 203−207.
[14] TAN S. Neighbor-weighted k-nearest neighbor for unbalanced text corpus [J]. Expert Systems with Applications, 2005, 28(4): 667−671.
[15] AGGARWAL C C, ZHAI C X. Mining text data: A survey of text classification algorithms [M]. 1st ed. New York: Springer, 2012: 163−222.
[16] TATA S, PATEL J M. Estimating the selectivity of tf-idf based cosine similarity predicates [J]. ACM SIGMOD Record, 2007, 36(2): 7−12.
[17] van RIJSBERGEN C. Information retrieval [M]. London: Butterworths Press, 1979.
[18] YAN Jun, LIU Ning, YAN Shui-cheng, YANG Qiang, FAN Wei-guo, WEI Wei, CHEN Zheng. Trace-oriented feature analysis for large-scale text data dimension reduction [J]. IEEE Transactions on Knowledge and Data Engineering, 2011, 23(7): 1103−1117.

(Edited by YANG Hua)
