WordNet-based Lexical Semantic Classification For Text Corpus Analysis
1. School of Information Science and Engineering, Central South University, Changsha 410075, China;
2. School of Software, Central South University, Changsha 410075, China
© Central South University Press and Springer-Verlag Berlin Heidelberg 2015
Abstract: Many text classification methods depend on statistical term measures to implement document representation. Such document representations ignore the lexical semantic contents of terms and the distilled mutual information, leading to text classification errors. This work proposes a document representation method, the WordNet-based lexical semantic VSM, to solve the problem. Using WordNet, the method constructs a data structure of semantic-element information to characterize lexical semantic contents, and adjusts ME modeling to disambiguate word stems. Then, in the lexical-semantic space of the corpus, the lexical-semantic eigenvector of document representation is built by calculating the weight of each synset and is applied to the widely recognized NWKNN algorithm. On the text corpus Reuters-21578 and its adjusted version with lexical replacement, experimental results show that the lexical-semantic eigenvector outperforms the TF-IDF-based term-statistic eigenvector in both F1 measure and dimensionality scale. The formation of document representation eigenvectors gives the method wide prospects for classification applications in text corpus analysis.
Foundation item: Project(2012AA011205) supported by National High-Tech Research and Development Program (863 Program) of China;
Projects(61272150, 61379109, M1321007, 61301136, 61103034) supported by the National Natural Science Foundation of China;
Project(20120162110077) supported by Research Fund for the Doctoral Program of Higher Education of China; Project(11JJ1012)
supported by Excellent Youth Foundation of Hunan Scientific Committee, China
Received date: 2014−03−21; Accepted date: 2014−10−11
Corresponding author: WANG Lu-da, PhD Candidate, Lecturer; Tel: +86−18613082443; E-mail: [email protected]
Experiments are carried out to verify the effectiveness of this method.

2 Analysis of statistical term measure

In the information retrieval field, similarity and correlation analysis of a text corpus needs corresponding document representations for diverse algorithms. Many practicable methods of document representation share a basic mechanism: the statistical term measure.

Typical statistical methods of feature extraction include TF-IDF, based on lexical term frequency, and shingle hash, based on consecutive terms [5]. Many TF-IDF-based methods of feature extraction employ the simple assumption that frequent terms are also significant [2], and they quantify the extent to which terms are useful for characterizing the document in which they appear [2]. Besides, in hashing measures based on fingerprinted shingles, a sequence of k consecutive terms in a document is called a shingle. A selection algorithm then determines which shingles to store in a hash table, and various estimation techniques are used to determine which shingle is copied and from which document most of the content originated [5].

These methods for document representation are perceived as the mode using statistical term measures. Unlike ontology methods [6], document representations based on statistical term measures ignore the recognition of lexical semantic contents. This causes the document representation to lose the mutual information [7] of term meanings which comes from synonyms in different samples. Moreover, lexical replacement of a document original cannot be represented literally by purely statistical mechanisms of term measure. Our comment on statistical term measures and document representation can be clarified by analyzing the small text corpus of Example 1.

Example 1.
Sample A: Men love holiday.
Sample B: Human enjoys vacation.

In Example 1, the two simple sentences are viewed as two document samples, and these two documents comprise the small corpus. Evidently, the meanings of sample A and sample B are essentially equivalent; thus, the correlation and semantic similarity between these two documents are considerable. Meanwhile, sample B can be regarded as a derivative of sample A via lexical replacement of the document original. Text segmentation divides each document into meaningful terms, such as words, sentences, or topics; as to Example 1, all words of the documents are divided as terms. Obviously, the document representations of Example 1 by statistical term measures do not perform well, as listed in Tables 1 and 2.

Table 1 Statistical term measures on Sample A
Term               | Men | Love | Holiday | Human | Enjoys | Vacation
Weight (frequency) |  1  |  1   |    1    |   0   |   0    |    0

Table 2 Statistical term measures on Sample B
Term               | Men | Love | Holiday | Human | Enjoys | Vacation
Weight (frequency) |  0  |  0   |    0    |   1   |   1    |    1

Comparing Tables 1 and 2, positive weights never coexist on the same term across the two samples. These two orthogonal term-weight vectors demonstrate that statistical term measures for document representation cannot effectively signify the semantic similarity within the corpus of Example 1, and that they do not recognize or represent the lexical semantic contents of these two documents. As a result, these two vectors cannot provide mutual information of term meanings.
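To make the orthogonality concrete, here is a minimal Python sketch (illustrative only; the names are ours, not part of the original method) that rebuilds the weight vectors of Tables 1 and 2 and computes their cosine similarity:

```python
import math

# The two document samples of Example 1, lowercased and tokenized by whitespace.
sample_a = "men love holiday".split()
sample_b = "human enjoys vacation".split()

# Shared term vocabulary of the small corpus (the columns of Tables 1 and 2).
vocab = sorted(set(sample_a) | set(sample_b))

def term_frequency_vector(tokens, vocab):
    """Statistical term measure: raw term frequency over the vocabulary."""
    return [tokens.count(term) for term in vocab]

def cosine(u, v):
    """Cosine similarity of two weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

va = term_frequency_vector(sample_a, vocab)  # [0, 1, 0, 1, 1, 0]
vb = term_frequency_vector(sample_b, vocab)  # [1, 0, 1, 0, 0, 1]
print(cosine(va, vb))                        # 0.0
```

The cosine similarity is exactly 0 although the two samples are paraphrases of each other: the statistical term measure carries no mutual information between, say, "holiday" and "vacation".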
3 Proposed program

3.1 Motivation and theoretical analysis

For text corpus analysis, document representations which depend on statistical term measures shall lose the mutual information of term meanings. Besides, in different documents, term meanings are related to specific synonyms, which are involved in lexical semantic contents. Thus, this work resorts to WordNet [8], a lexical database for English, to extract lexical semantic contents. Then, the method of document representation constructs a lexical semantic VSM of the text corpus to define the eigenvector for text classification.

In WordNet, a form is represented by a string of ASCII characters, and a sense is represented by the set of (one or more) synonyms that have that sense [8]. Synonymy is WordNet's basic relation, because WordNet uses sets of synonyms (synsets) to represent word senses. That is, as shown in Fig. 1, one word refers to several synsets.

Fig. 1 Common semantic-element of words

In WordNet, because one word or term refers to particular synsets, our motivation is that several particular synsets can strictly describe the meaning of one word for characterizing lexical semantic contents.
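As a quick check of this motivation, the following sketch uses NLTK's WordNet corpus reader (a substitute for the MIT Java WordNet Interface [10] employed in this work) to list the synsets shared by the synonym pairs of Example 1:

```python
# Requires: pip install nltk, then nltk.download('wordnet') once.
from nltk.corpus import wordnet as wn

def semantic_elements(word):
    """Return the set of synsets (candidate semantic-elements) a word refers to."""
    return {synset.name() for synset in wn.synsets(word)}

# Synonym pairs from Example 1 share at least one synset -- the common
# semantic-element that plain term statistics cannot see.
print(semantic_elements("holiday") & semantic_elements("vacation"))
print(semantic_elements("love") & semantic_elements("enjoy"))
# Both intersections are expected to be non-empty, e.g. a shared noun sense
# of "vacation" and a shared verb sense meaning "get pleasure from".
```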
Then, our method defines these particular synsets as the semantic-elements of a word. Based on the above definition, the involved semantic-elements can characterize the lexical semantic contents of Example 1, which shall accomplish the feature extraction of the document samples.

Fig. 2 Linked lists of semantic member (a) and semantic members frequency (b)

Meanwhile, the linked list of semantic members frequency is shown in Fig. 2(b). It records the frequency of each original word, one by one, in the original word order. The mutual information of the corpus, $I(X;Y)$, is re-defined to be

$$I(X;Y) = \sum_{x_i \in X} \sum_{y_j \in Y} \left[ F(e_{x_i}, y_j) \bmod N \right] \lg \frac{F(e_{x_i}, y_j) \bmod N}{\left[ F(e_{x_i}) \bmod N \right] \times \left[ F(e_{y_j}) \bmod N \right]}$$

and an entropy-style quantity, $\sum_{j=1}^{S_i} P_j \log_2 P_j$, is involved in the weighting of semantic-element $i$.

For the disambiguation of word stems, a binary feature function is defined over each semantic-element:

$$f_i(x, c) = \begin{cases} 1, & \text{if } x \text{ refers to } c \text{ and } c \text{ refers to semantic-element } i \\ 0, & \text{otherwise} \end{cases} \quad (3)$$

$$p(c \mid x) = \frac{1}{Z(x)} \prod_{i=1}^{K} \lambda_i^{f_i(x,c)} \quad (4)$$

The above equations aim at finding the highest conditional probability $p(c \mid x)$, and the function $\hat{c}(x)$ ensures that each original word $x$ refers to only one word stem (like Fig. 3).
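A minimal sketch of how Eqs. (3) and (4) pick a single stem. The feature weights $\lambda_i$ are assumed to be already estimated by the ME training, and the dictionaries are simplified stand-ins for the paper's data structure of semantic-element information:

```python
def f(i, x, c, refers, elements):
    """Eq. (3): 1 if original word x refers to stem c and stem c refers to
    semantic-element i, else 0."""
    return 1 if c in refers.get(x, set()) and i in elements.get(c, set()) else 0

def p_c_given_x(x, candidates, lam, refers, elements):
    """Eq. (4): p(c|x) = (1/Z(x)) * prod_i lam[i] ** f_i(x, c)."""
    raw = {}
    for c in candidates:
        score = 1.0
        for i, weight in lam.items():
            if f(i, x, c, refers, elements):
                score *= weight
        raw[c] = score
    z = sum(raw.values())  # normalizing factor Z(x)
    return {c: score / z for c, score in raw.items()}

def choose_stem(x, candidates, lam, refers, elements):
    """Keep the most probable stem, so that x maps to exactly one word stem."""
    probs = p_c_given_x(x, candidates, lam, refers, elements)
    return max(probs, key=probs.get)

# Toy data (hypothetical): "loves" could stem to a verb sense or a noun sense.
refers = {"loves": {"love.v", "love.n"}}
elements = {"love.v": {1}, "love.n": {2}}
lam = {1: 2.0, 2: 0.5}
print(choose_stem("loves", ["love.v", "love.n"], lam, refers, elements))  # love.v
```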
After the semantic-elements characterize the lexical semantic contents of a document preliminarily, the specified ME modeling is applied to implement the disambiguation of word stems. Necessarily, the relevant items in the data structure of semantic-element information shall be modified, such as the semantic member, the frequency of the original word, and the weight. Furthermore, some of the relevant semantic-elements shall be eliminated.

The experiment then employs an effective algorithm, NWKNN, to classify the documents. After that, the contrast between our method and the typical statistical method displays the effect of this work.

4.1 Eigenvectors for document representation

In our work, the experiments use two sorts of eigenvectors to represent a document sample: 1) the lexical-semantic eigenvector in the lexical-semantic VSM shown in Eq. (5); 2) the term-statistic eigenvector in the term space, which takes different numbers of features selected by information gain [15]. Using the typical statistical method of feature extraction, TF-IDF, the term-statistic eigenvector $d_x \in \mathbb{R}^n$ is given as [2]

$$d_x = (d_x(1), d_x(2), \ldots, d_x(n)) \quad (6)$$
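For concreteness, here is a small sketch of building the term-statistic eigenvector of Eq. (6). The standard tf × log(N/df) weighting below is our assumption, since [2] compares several TF-IDF variants:

```python
import math

def tfidf_eigenvector(doc, corpus, vocab):
    """d_x = (d_x(1), ..., d_x(n)): one TF-IDF weight per selected feature term."""
    n_docs = len(corpus)
    vector = []
    for term in vocab:
        tf = doc.count(term)                        # term frequency in this document
        df = sum(1 for d in corpus if term in d)    # document frequency in the corpus
        idf = math.log(n_docs / df) if df else 0.0  # inverse document frequency
        vector.append(tf * idf)
    return vector

# Usage: documents as term lists; vocab holds the n selected feature terms.
corpus = [["men", "love", "holiday"], ["human", "enjoys", "vacation"]]
print(tfidf_eigenvector(corpus[0], corpus, ["love", "holiday", "vacation"]))
```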
For classification, the NWKNN algorithm [14] scores every candidate category $c_i$ of a document $d$ as

$$\mathrm{score}(d, c_i) = \mathrm{Weight}_i \sum_{d_j \in \mathrm{KNN}(d)} \mathrm{Sim}(d, d_j)\, \delta(d_j, c_i) \quad (7)$$

$$\delta(d_j, c_i) = \begin{cases} 1, & d_j \in c_i \\ 0, & d_j \notin c_i \end{cases} \quad (8)$$
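Below is a sketch of the decision rule of Eqs. (7) and (8), taking cosine similarity as Sim. The category weights $\mathrm{Weight}_i$ are treated as inputs here; [14] derives them from the category sizes and an exponent parameter:

```python
import math

def cosine(u, v):
    """Sim(d, d_j): cosine similarity between two eigenvectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def nwknn_score(d, c_i, knn, weights):
    """Eq. (7): Weight_i * sum over the k nearest neighbors d_j of
    Sim(d, d_j) * delta(d_j, c_i); knn is a list of (vector, category) pairs,
    and the equality test plays the role of delta in Eq. (8)."""
    return weights[c_i] * sum(cosine(d, d_j) for d_j, cat in knn if cat == c_i)

def classify(d, knn, weights):
    """Assign d to the category with the highest NWKNN score."""
    categories = {cat for _, cat in knn}
    return max(categories, key=lambda c: nwknn_score(d, c, knn, weights))
```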
Fig. 7 Classification result of lexical-semantic eigenvector and term-statistic eigenvector with different term-space feature numbers on adjusted corpus

Fig. 9 Classification recall and precision of lexical-semantic eigenvector with different exponents and term-statistic eigenvector on adjusted corpus
These results accord with the experience of NWKNN [14]. Meantime, the macro-precision and macro-recall of the lexical-semantic eigenvector are superior to those of the term-statistic eigenvector by 11% and 12%, respectively, on average on the adjusted corpus.

According to Eqs. (5) and (6), Fig. 10 reports the dimensionalities of the lexical-semantic eigenvector and the term-statistic eigenvector on Reuters-21578, where the number of documents ranges from 200 to 2800. After the number of documents reaches 1650, the dimensionality of the lexical-semantic eigenvector is less than that of the term-statistic eigenvector. It indicates the improvement of dimension reduction in our method.

Fig. 10 Dimensionalities of lexical-semantic eigenvector and term-statistic eigenvector on Reuters-21578

5 Conclusion and future work

1) A data structure of semantic-element information is constructed to record the relevant information of each semantic-element in a document sample. It can characterize lexical semantic contents and be adapted for the disambiguation of word stems.

2) The lexical-semantic eigenvector using the NWKNN algorithm achieves better classification performance than the term-statistic eigenvector, which stands for the typical statistical method of feature extraction, especially under the impact of lexical replacement.

3) Our method of document representation demonstrates the improvement of dimension reduction for text classification.

As for this work, future research includes applying more current algorithms to the lexical-semantic eigenvector for text corpus analysis, and developing a method for representing semi-structured documents, such as XML, on the basis of semantic-elements.

References

[1] JING L P, NG M K, HUANG J Z. Knowledge-based vector space model for text clustering [J]. Knowledge and Information Systems, 2010, 25(1): 35−55.
[2] ZHANG Wen, YOSHIDA Taketoshi, TANG Xi-jin. A comparative study of TF*IDF, LSI and multi-words for text classification [J]. Expert Systems with Applications, 2011, 38(3): 2758−2765.
[3] ZHANG Yin, JIN Rong, ZHOU Zhi-hua. Understanding bag-of-words model: A statistical framework [J]. International Journal of Machine Learning and Cybernetics, 2010, 1(1/2/3/4): 43−52.
[4] LI P, SHRIVASTAVA A, KONIG A C. b-Bit minwise hashing in practice [C]// Proceedings of the 5th Asia-Pacific Symposium on Internetware. New York: ACM, 2013: 13−22.
[5] HAMID A O, BEHZADI B, CHRISTOPH S, HENZINGER M. Detecting the origin of text segments efficiently [C]// Proceedings of the 18th International Conference on World Wide Web. New York: ACM, 2009: 61−70.
[6] SANCHEZ D, BATET M. A semantic similarity method based on information content exploiting multiple ontologies [J]. Expert Systems with Applications, 2013, 40(4): 1393−1399.
[7] CHURCH K W, HANKS P. Word association norms, mutual information, and lexicography [J]. Computational Linguistics, 1990, 16(1): 22−29.
[8] MILLER G A. WordNet: A lexical database for English [J]. Communications of the ACM, 1995, 38(11): 39−41.
[9] LINTEAN M, RUS V. Measuring semantic similarity in short texts through greedy pairing and word semantics [C]// Proceedings of the 25th International Florida Artificial Intelligence Research Society Conference. Marco Island, USA: AAAI, 2012: 244−249.
[10] MIT. MIT Java Wordnet Interface (JWI) [EB/OL]. [2013−12−20]. http://projects.csail.mit.edu/jwi/api/edu/mit/jwi/morph/WordnetStemmer.html.
[11] ZHAO Ling-yun, LIU Fang-ai, ZHU Zhen-fang. Identification of evaluation collocation based on maximum entropy model [M]// Frontier and future development of information technology in medicine and education. 1st ed. New York: Springer, 2013: 713−721.
[12] HWANG M, CHOI C, KIM P. Automatic enrichment of semantic relation network and its application to word sense disambiguation [J]. IEEE Transactions on Knowledge and Data Engineering, 2011, 23(6): 845−858.
[13] KEYLOCK C J. Simpson diversity and the Shannon–Wiener index as special cases of a generalized entropy [J]. Oikos, 2005, 109(1): 203−207.
[14] TAN S. Neighbor-weighted k-nearest neighbor for unbalanced text corpus [J]. Expert Systems with Applications, 2005, 28(4): 667−671.
[15] AGGARWAL C C, ZHAI C X. A survey of text classification algorithms [M]// Mining text data. 1st ed. New York: Springer, 2012: 163−222.
[16] TATA S, PATEL J M. Estimating the selectivity of tf-idf based cosine similarity predicates [J]. ACM SIGMOD Record, 2007, 36(2): 7−12.
[17] van RIJSBERGEN C. Information retrieval [M]. London: Butterworths Press, 1979.
[18] YAN Jun, LIU Ning, YAN Shui-cheng, YANG Qiang, FAN Wei-guo, WEI Wei, CHEN Zheng. Trace-oriented feature analysis for large-scale text data dimension reduction [J]. IEEE Transactions on Knowledge and Data Engineering, 2011, 23(7): 1103−1117.

(Edited by YANG Hua)