WordNet-based Lexical Semantic Classification For Text Corpus Analysis
1. School of Information Science and Engineering, Central South University, Changsha 410075, China;
2. School of Software, Central South University, Changsha 410075, China
© Central South University Press and Springer-Verlag Berlin Heidelberg 2015
Abstract: Many text classification methods depend on statistical term measures to implement document representation. Such document representations ignore the lexical semantic contents of terms and the distilled mutual information, leading to text classification errors. This work proposes a document representation method, the WordNet-based lexical semantic VSM, to solve the problem. Using WordNet, the method constructs a data structure of semantic-element information to characterize lexical semantic contents, and adjusts ME modeling to disambiguate word stems. Then, in the lexical-semantic space of the corpus, the lexical-semantic eigenvector of document representation is built by calculating the weight of each synset and is applied to the widely recognized NWKNN algorithm. On the text corpus Reuters-21578 and its adjusted version with lexical replacement, experimental results show that the lexical-semantic eigenvector outperforms the TF-IDF-based term-statistic eigenvector in both F1 measure and dimensionality scale. The formation of document representation eigenvectors gives the method wide prospects for classification applications in text corpus analysis.
Foundation item: Project(2012AA011205) supported by National High-Tech Research and Development Program (863 Program) of China;
Projects(61272150, 61379109, M1321007, 61301136, 61103034) supported by the National Natural Science Foundation of China;
Project(20120162110077) supported by Research Fund for the Doctoral Program of Higher Education of China; Project(11JJ1012)
supported by Excellent Youth Foundation of Hunan Scientific Committee, China
Received date: 2014−03−21; Accepted date: 2014−10−11
Corresponding author: WANG Lu-da, PhD Candidate, Lecturer; Tel: +86−18613082443; E-mail: [email protected]
Experiments are carried out to verify the effectiveness of this method.

2 Analysis of statistical term measure

In the information retrieval field, similarity and correlation analysis of a text corpus needs corresponding document representations for diverse algorithms. Many practicable methods of document representation share a basic mechanism: the statistical term measure.

Typical statistical methods of feature extraction include TF-IDF, based on lexical term frequency, and shingle hash, based on consecutive terms [5]. Many TF-IDF-based methods of feature extraction employ the simple assumption that frequent terms are also significant [2], and they quantify the extent to which terms are useful for characterizing the document in which they appear [2]. Besides, in hashing measures based on fingerprinted shingles, a sequence of k consecutive terms in a document is called a shingle. A selection algorithm then determines which shingles to store in a hash table, and various estimation techniques are used to determine which shingle is copied and from which document most of the content originated [5].

These methods for document representation are perceived as the mode using statistical term measures. Unlike ontology methods [6], document representations based on statistical term measures ignore the recognition of lexical semantic contents. This causes the document representation to lose the mutual information [7] of term meanings which comes from synonyms in different samples. Moreover, lexical replacement of a document original cannot be represented literally by purely statistical mechanisms of term measure. Our comment on statistical term measures and document representation can be clarified by analyzing the small text corpus of Example 1.

Example 1.
Sample A: Men love holiday.
Sample B: Human enjoys vacation.

In Example 1, the two simple sentences are viewed as two document samples, and these two documents comprise the small corpus. Evidently, the meanings of sample A and sample B are essentially equivalent; thus, the correlation and semantic similarity between these two documents are considerable. Meanwhile, sample B can be regarded as a derivative of sample A via lexical replacement of the document original. Text segmentation divides each document into meaningful terms, such as words, sentences, or topics; as to Example 1, all words of the documents are divided as terms. Obviously, the document representations of Example 1 by statistical term measures do not perform well, as listed in Tables 1 and 2.

Table 1 Statistical term measures on Sample A
Term               | Men | Love | Holiday | Human | Enjoys | Vacation
Weight (frequency) |  1  |  1   |    1    |   0   |   0    |    0

Table 2 Statistical term measures on Sample B
Term               | Men | Love | Holiday | Human | Enjoys | Vacation
Weight (frequency) |  0  |  0   |    0    |   1   |   1    |    1

Comparing Tables 1 and 2, positive weights never coexist on the same term across the two samples. These two orthogonal term-weight vectors demonstrate that statistical term measures for document representation cannot effectively signify the semantic similarity within the corpus of Example 1, and that they do not recognize or represent the lexical semantic contents of these two documents. As a result, these two vectors cannot provide mutual information of term meanings.
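To make the orthogonality concrete, here is a minimal Python sketch (illustrative only; the names are ours, not part of the original method) that rebuilds the weight vectors of Tables 1 and 2 and computes their cosine similarity:

```python
import math

# The two document samples of Example 1, lowercased and tokenized by whitespace.
sample_a = "men love holiday".split()
sample_b = "human enjoys vacation".split()

# Shared term vocabulary of the small corpus (the columns of Tables 1 and 2).
vocab = sorted(set(sample_a) | set(sample_b))

def term_frequency_vector(tokens, vocab):
    """Statistical term measure: raw term frequency over the vocabulary."""
    return [tokens.count(term) for term in vocab]

def cosine(u, v):
    """Cosine similarity of two weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

va = term_frequency_vector(sample_a, vocab)  # [0, 1, 0, 1, 1, 0]
vb = term_frequency_vector(sample_b, vocab)  # [1, 0, 1, 0, 0, 1]
print(cosine(va, vb))                        # 0.0
```

The cosine similarity is exactly 0 although the two samples are paraphrases of each other: the statistical term measure carries no mutual information between, say, "holiday" and "vacation".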
3 Proposed program

3.1 Motivation and theoretical analysis

For text corpus analysis, document representations which depend on statistical term measures shall lose the mutual information of term meanings. Besides, in different documents, term meanings are related to specific synonyms, which are involved in lexical semantic contents. Thus, this work resorts to WordNet [8], a lexical database for English, to extract lexical semantic contents. Then, the method of document representation constructs a lexical semantic VSM of the text corpus to define the eigenvector for text classification.

In WordNet, a form is represented by a string of ASCII characters, and a sense is represented by the set of (one or more) synonyms that have that sense [8]. Synonymy is WordNet's basic relation, because WordNet uses sets of synonyms (synsets) to represent word senses. That is, as shown in Fig. 1, one word refers to several synsets.

Fig. 1 Common semantic-element of words

In WordNet, because one word or term refers to particular synsets, our motivation is that several particular synsets can strictly describe the meaning of one word for characterizing lexical semantic contents.
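As a quick check of this motivation, the following sketch uses NLTK's WordNet corpus reader (a substitute for the MIT Java WordNet Interface [10] employed in this work) to list the synsets shared by the synonym pairs of Example 1:

```python
# Requires: pip install nltk, then nltk.download('wordnet') once.
from nltk.corpus import wordnet as wn

def semantic_elements(word):
    """Return the set of synsets (candidate semantic-elements) a word refers to."""
    return {synset.name() for synset in wn.synsets(word)}

# Synonym pairs from Example 1 share at least one synset -- the common
# semantic-element that plain term statistics cannot see.
print(semantic_elements("holiday") & semantic_elements("vacation"))
print(semantic_elements("love") & semantic_elements("enjoy"))
# Both intersections are expected to be non-empty, e.g. a shared noun sense
# of "vacation" and a shared verb sense meaning "get pleasure from".
```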
Then, our method defines these particular synsets as the semantic-elements of a word. Based on the above definition, the involved semantic-elements can characterize the lexical semantic contents of Example 1, which shall accomplish the feature extraction of the document samples.

Fig. 2 Linked lists of semantic member (a) and semantic members frequency (b)

Meanwhile, the linked list of semantic members frequency is shown in Fig. 2(b). It records the frequency of each original word, one by one, in the original word order. The mutual information of the corpus, $I(X;Y)$, is re-defined to be

$$I(X;Y) = \sum_{x_i \in X} \sum_{y_j \in Y} \left[ F(e_{x_i}, y_j) \bmod N \right] \lg \frac{F(e_{x_i}, y_j) \bmod N}{\left[ F(e_{x_i}) \bmod N \right] \times \left[ F(e_{y_j}) \bmod N \right]}$$

and an entropy-style quantity, $\sum_{j=1}^{S_i} P_j \log_2 P_j$, is involved in the weighting of semantic-element $i$.

For the disambiguation of word stems, a binary feature function is defined over each semantic-element:

$$f_i(x, c) = \begin{cases} 1, & \text{if } x \text{ refers to } c \text{ and } c \text{ refers to semantic-element } i \\ 0, & \text{otherwise} \end{cases} \quad (3)$$

$$p(c \mid x) = \frac{1}{Z(x)} \prod_{i=1}^{K} \lambda_i^{f_i(x,c)} \quad (4)$$

The above equations aim at finding the highest conditional probability $p(c \mid x)$, and the function $\hat{c}(x)$ ensures that each original word $x$ refers to only one word stem (like Fig. 3).
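A minimal sketch of how Eqs. (3) and (4) pick a single stem. The feature weights $\lambda_i$ are assumed to be already estimated by the ME training, and the dictionaries are simplified stand-ins for the paper's data structure of semantic-element information:

```python
def f(i, x, c, refers, elements):
    """Eq. (3): 1 if original word x refers to stem c and stem c refers to
    semantic-element i, else 0."""
    return 1 if c in refers.get(x, set()) and i in elements.get(c, set()) else 0

def p_c_given_x(x, candidates, lam, refers, elements):
    """Eq. (4): p(c|x) = (1/Z(x)) * prod_i lam[i] ** f_i(x, c)."""
    raw = {}
    for c in candidates:
        score = 1.0
        for i, weight in lam.items():
            if f(i, x, c, refers, elements):
                score *= weight
        raw[c] = score
    z = sum(raw.values())  # normalizing factor Z(x)
    return {c: score / z for c, score in raw.items()}

def choose_stem(x, candidates, lam, refers, elements):
    """Keep the most probable stem, so that x maps to exactly one word stem."""
    probs = p_c_given_x(x, candidates, lam, refers, elements)
    return max(probs, key=probs.get)

# Toy data (hypothetical): "loves" could stem to a verb sense or a noun sense.
refers = {"loves": {"love.v", "love.n"}}
elements = {"love.v": {1}, "love.n": {2}}
lam = {1: 2.0, 2: 0.5}
print(choose_stem("loves", ["love.v", "love.n"], lam, refers, elements))  # love.v
```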
After the semantic-elements characterize the lexical semantic contents of a document preliminarily, the specified ME modeling is applied to implement the disambiguation of word stems. Necessarily, the relevant items in the data structure of semantic-element information shall be modified, such as the semantic member, the frequency of the original word, and the weight. Furthermore, some of the relevant semantic-elements shall be eliminated.

The experiment then employs an effective algorithm, NWKNN, to classify the documents. After that, the contrast between our method and the typical statistical method displays the effect of this work.

4.1 Eigenvectors for document representation

In our work, the experiments use two sorts of eigenvectors to represent a document sample: 1) the lexical-semantic eigenvector in the lexical-semantic VSM shown in Eq. (5); 2) the term-statistic eigenvector in the term space, which takes different numbers of features selected by information gain [15]. Using the typical statistical method of feature extraction, TF-IDF, the term-statistic eigenvector $d_x \in \mathbb{R}^n$ is given as [2]

$$d_x = (d_x(1), d_x(2), \ldots, d_x(n)) \quad (6)$$
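For concreteness, here is a small sketch of building the term-statistic eigenvector of Eq. (6). The standard tf × log(N/df) weighting below is our assumption, since [2] compares several TF-IDF variants:

```python
import math

def tfidf_eigenvector(doc, corpus, vocab):
    """d_x = (d_x(1), ..., d_x(n)): one TF-IDF weight per selected feature term."""
    n_docs = len(corpus)
    vector = []
    for term in vocab:
        tf = doc.count(term)                        # term frequency in this document
        df = sum(1 for d in corpus if term in d)    # document frequency in the corpus
        idf = math.log(n_docs / df) if df else 0.0  # inverse document frequency
        vector.append(tf * idf)
    return vector

# Usage: documents as term lists; vocab holds the n selected feature terms.
corpus = [["men", "love", "holiday"], ["human", "enjoys", "vacation"]]
print(tfidf_eigenvector(corpus[0], corpus, ["love", "holiday", "vacation"]))
```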
For classification, the NWKNN algorithm [14] scores every candidate category $c_i$ of a document $d$ as

$$\mathrm{score}(d, c_i) = \mathrm{Weight}_i \sum_{d_j \in \mathrm{KNN}(d)} \mathrm{Sim}(d, d_j)\, \delta(d_j, c_i) \quad (7)$$

$$\delta(d_j, c_i) = \begin{cases} 1, & d_j \in c_i \\ 0, & d_j \notin c_i \end{cases} \quad (8)$$
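Below is a sketch of the decision rule of Eqs. (7) and (8), taking cosine similarity as Sim. The category weights $\mathrm{Weight}_i$ are treated as inputs here; [14] derives them from the category sizes and an exponent parameter:

```python
import math

def cosine(u, v):
    """Sim(d, d_j): cosine similarity between two eigenvectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def nwknn_score(d, c_i, knn, weights):
    """Eq. (7): Weight_i * sum over the k nearest neighbors d_j of
    Sim(d, d_j) * delta(d_j, c_i); knn is a list of (vector, category) pairs,
    and the equality test plays the role of delta in Eq. (8)."""
    return weights[c_i] * sum(cosine(d, d_j) for d_j, cat in knn if cat == c_i)

def classify(d, knn, weights):
    """Assign d to the category with the highest NWKNN score."""
    categories = {cat for _, cat in knn}
    return max(categories, key=lambda c: nwknn_score(d, c, knn, weights))
```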
Fig. 7 Classification result of lexical-semantic eigenvector and term-statistic eigenvector with different term-space feature numbers on adjusted corpus

Fig. 9 Classification recall and precision of lexical-semantic eigenvector with different exponents and term-statistic eigenvector on adjusted corpus
These results accord with the experience of NWKNN [14]. Meantime, the macro-precision and macro-recall of the lexical-semantic eigenvector are superior to those of the term-statistic eigenvector by 11% and 12%, respectively, on average on the adjusted corpus.

According to Eqs. (5) and (6), Fig. 10 reports the dimensionalities of the lexical-semantic eigenvector and the term-statistic eigenvector on Reuters-21578, where the number of documents ranges from 200 to 2800. After the number of documents reaches 1650, the dimensionality of the lexical-semantic eigenvector is less than that of the term-statistic eigenvector. It indicates the improvement of dimension reduction in our method.

Fig. 10 Dimensionalities of lexical-semantic eigenvector and term-statistic eigenvector on Reuters-21578

5 Conclusion and future work

1) A data structure of semantic-element information is constructed to record the relevant information of each semantic-element in a document sample. It can characterize lexical semantic contents and be adapted for the disambiguation of word stems.

2) The lexical-semantic eigenvector using the NWKNN algorithm achieves better classification performance than the term-statistic eigenvector, which stands for the typical statistical method of feature extraction, especially under the impact of lexical replacement.

3) Our method of document representation demonstrates the improvement of dimension reduction for text classification.

As for this work, future research includes applying more current algorithms to the lexical-semantic eigenvector for text corpus analysis, and developing a method for representing semi-structured documents, such as XML, on the basis of semantic-elements.

References

[1] JING L P, NG M K, HUANG J Z. Knowledge-based vector space model for text clustering [J]. Knowledge and Information Systems, 2010, 25(1): 35−55.
[2] ZHANG Wen, YOSHIDA Taketoshi, TANG Xi-jin. A comparative study of TF*IDF, LSI and multi-words for text classification [J]. Expert Systems with Applications, 2011, 38(3): 2758−2765.
[3] ZHANG Yin, JIN Rong, ZHOU Zhi-hua. Understanding bag-of-words model: A statistical framework [J]. International Journal of Machine Learning and Cybernetics, 2010, 1(1/2/3/4): 43−52.
[4] LI P, SHRIVASTAVA A, KONIG A C. b-Bit minwise hashing in practice [C]// Proceedings of the 5th Asia-Pacific Symposium on Internetware. New York: ACM, 2013: 13−22.
[5] HAMID A O, BEHZADI B, CHRISTOPH S, HENZINGER M. Detecting the origin of text segments efficiently [C]// Proceedings of the 18th International Conference on World Wide Web. New York: ACM, 2009: 61−70.
[6] SANCHEZ D, BATET M. A semantic similarity method based on information content exploiting multiple ontologies [J]. Expert Systems with Applications, 2013, 40(4): 1393−1399.
[7] CHURCH K W, HANKS P. Word association norms, mutual information, and lexicography [J]. Computational Linguistics, 1990, 16(1): 22−29.
[8] MILLER G A. WordNet: A lexical database for English [J]. Communications of the ACM, 1995, 38(11): 39−41.
[9] LINTEAN M, RUS V. Measuring semantic similarity in short texts through greedy pairing and word semantics [C]// Proceedings of the 25th International Florida Artificial Intelligence Research Society Conference. Marco Island, USA: AAAI, 2012: 244−249.
[10] MIT. MIT Java Wordnet Interface (JWI) [EB/OL]. [2013−12−20]. http://projects.csail.mit.edu/jwi/api/edu/mit/jwi/morph/WordnetStemmer.html.
[11] ZHAO Ling-yun, LIU Fang-ai, ZHU Zhen-fang. Identification of evaluation collocation based on maximum entropy model [M]// Frontier and future development of information technology in medicine and education. 1st ed. New York: Springer, 2013: 713−721.
[12] HWANG M, CHOI C, KIM P. Automatic enrichment of semantic relation network and its application to word sense disambiguation [J]. IEEE Transactions on Knowledge and Data Engineering, 2011, 23(6): 845−858.
[13] KEYLOCK C J. Simpson diversity and the Shannon–Wiener index as special cases of a generalized entropy [J]. Oikos, 2005, 109(1): 203−207.
[14] TAN S. Neighbor-weighted k-nearest neighbor for unbalanced text corpus [J]. Expert Systems with Applications, 2005, 28(4): 667−671.
[15] AGGARWAL C C, ZHAI C X. A survey of text classification algorithms [M]// Mining text data. 1st ed. New York: Springer, 2012: 163−222.
[16] TATA S, PATEL J M. Estimating the selectivity of tf-idf based cosine similarity predicates [J]. ACM SIGMOD Record, 2007, 36(2): 7−12.
[17] van RIJSBERGEN C. Information retrieval [M]. London: Butterworths Press, 1979.
[18] YAN Jun, LIU Ning, YAN Shui-cheng, YANG Qiang, FAN Wei-guo, WEI Wei, CHEN Zheng. Trace-oriented feature analysis for large-scale text data dimension reduction [J]. IEEE Transactions on Knowledge and Data Engineering, 2011, 23(7): 1103−1117.

(Edited by YANG Hua)