Text Features Extraction Based On TF-IDF Associating Semantic
NaiYao Wang
Software College, Yunnan University
Kunming 650091, China
e-mail: [email protected]
Abstract—The TF-IDF (term frequency–inverse document frequency) algorithm extracts text features on the basis of word statistics. It considers only the expressions of words that are identical in all texts, such as their ASCII codes, without considering that words could also be represented by their synonyms. Separating words with the same or similar meanings therefore causes a loss of partial information when text features are extracted. Representing words in this way requires the similarity of words to be captured, and the similarity among words has to be obtained from the meaning of the words in texts. In order to improve the accuracy of text feature extraction, this paper uses the word2vec model to train word vectors on the corpus and obtain their semantic features. After excluding words with low TF-IDF values, a density clustering algorithm is used to cluster the remaining words according to word-vector similarity. As a result, similar words are clustered together and can represent each other. Experiments show that applying the TF-IDF algorithm again and constructing a VSM (vector space model) with these clusters as feature units can effectively improve the accuracy of text feature extraction.

Keywords—Text feature; TF-IDF; Word vector; Semantic features; Clustering

I. INTRODUCTION

Text feature extraction is an important step in data mining and information retrieval. It quantifies the feature words extracted from a text to represent the text's information, and converts the text from an unstructured original text into structured information that a computer can recognize and process [1]; that is, the text is described and replaced through dimensionality reduction of its word space and the establishment of a mathematical model. In the process of extracting text features, irrelevant or redundant features are deleted, and finally the important features (sentences, words or characters, etc.) are combined with their weights to reflect the information contained in the text [2].

Text characterization based on word statistics is often used to extract text features. The BOW (bag of words) [3] and TF-IDF (term frequency–inverse document frequency) [4] are the most typical models. These models simplify the extraction process and are easy to understand. However, when words are extracted as text features, each word in the text is treated as a separate unit, so the semantic features of the text cannot be obtained effectively. For example, "kid" and "child" belong to the same concept in everyday life, but the traditional methods of text feature extraction treat the two words as different concepts, resulting in the loss of the text information that the text features are meant to represent.

Extracting the semantic features of each word in a text requires measuring the context in which the word appears. Word2Vector, which trains a word vector for each word based on its context in the text, plays an important role in extracting word semantics. Word2Vector is a technique for transforming word representations into space vectors. It mainly uses ideas from deep learning to train on a corpus, associating the context of each word and mapping the words into different N-dimensional vectors [5]. In this way, the semantic features of each word can be expressed and recognized by the computer.

For data that already carries semantic features, representing words by one another requires calculating semantic similarity and storing synonyms uniformly. Using the density clustering algorithm [6] in machine learning, words with similar meanings can be clustered. This algorithm can
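As a rough illustration of the word-vector training described above, the sketch below uses the gensim implementation of word2vec; the corpus file name, the hyper-parameters and the example word are placeholders chosen for illustration, not settings taken from this paper.

```python
# Minimal sketch (not the authors' code): train context-based word vectors with
# gensim's word2vec. "corpus.txt" and the hyper-parameters are placeholders.
from gensim.models import Word2Vec

# One document per line, words already separated by spaces.
with open("corpus.txt", encoding="utf-8") as f:
    sentences = [line.split() for line in f if line.strip()]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Each word now has an N-dimensional vector learned from its context, so that
# semantically related words (e.g. "kid" and "child") end up close together.
print(model.wv.most_similar("kid", topn=5))   # assumes "kid" occurs in the corpus
```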
and set a threshold; the words whose TF-IDF is less than the threshold are all excluded.

In equation (1), $TF(w)$ denotes the frequency of the word $w$ in the text, $count(w)$ and $count(w_{t_j})$ respectively denote the number of occurrences of the word $w$ in the target text and in the corpus sample $t_j$, and $m$ denotes the number of corpus samples containing the word $w$.

$$TF(w) = \frac{count(w)}{\sum_{j=1}^{m} count(w_{t_j})} \qquad (1)$$

In equation (2), $IDF(w)$ denotes the inverse document frequency of the word $w$, $m$ denotes the number of corpus samples containing the word $w$, and $n$ denotes the total number of corpus texts.

$$IDF(w) = \ln\!\left(\frac{n}{m+1}\right) \qquad (2)$$

$$TFIDF(w) = \frac{count(w)}{\sum_{j=1}^{m} count(w_{t_j})} \cdot \ln\!\left(\frac{n}{m+1}\right) \qquad (3)$$

This paper uses the density clustering algorithm to cluster words. The density clustering algorithm is mainly based on a set of "neighborhood" parameters $(\varepsilon, MinPts)$, which specify how closely the samples are distributed. In addition, the word-vector density (distance) is measured by the Euclidean distance between two vectors.

According to the idea of the density clustering algorithm, we cluster the elements of the word-vector set $V_{final}$, where $V_{final} = \{v_1, v_2, \ldots, v_n\}$ corresponds to $W_{final}$ in equation (4). First, the $\varepsilon$-neighborhood of each vector is found and the core-object set $\Omega$ is determined; then a core object $v_{r1}$ is randomly selected from $\Omega$ as a seed. All the vectors that are density-reachable from it form a cluster $c_1$, and the core objects contained in $c_1$ are then removed from $\Omega$, that is, $\Omega = \Omega - c_1$. Another seed $v_{r2}$ is randomly selected from the updated set $\Omega$ to generate the next cluster. This process is repeated until $\Omega$ is empty [13]. The process of clustering words is shown in "Fig. 1".
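To make equations (1)–(3) concrete, here is a small self-contained sketch of the weighting as defined above; the toy corpus and the function name are illustrative assumptions, not code from the paper.

```python
# A rough sketch of equations (1)-(3); the toy corpus below is illustrative only.
import math

def tfidf(word, target_text, corpus):
    """target_text: token list of the target text; corpus: list of token lists."""
    containing = [doc for doc in corpus if word in doc]   # corpus samples containing w
    m, n = len(containing), len(corpus)                   # m samples contain w, n texts total
    denom = sum(doc.count(word) for doc in containing) or 1
    tf = target_text.count(word) / denom                  # equation (1)
    idf = math.log(n / (m + 1))                           # equation (2)
    return tf * idf                                       # equation (3)

corpus = [["kid", "plays", "outside"], ["child", "reads"], ["dog", "runs"]]
print(tfidf("kid", ["kid", "kid", "plays"], corpus))
```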
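The clustering step just described can likewise be sketched with scikit-learn's DBSCAN and Euclidean distance; the example words and the random stand-in vectors are placeholders, and the eps/min_samples values are the ones reported later in the experiment section.

```python
# Sketch: DBSCAN over word vectors with Euclidean distance. `words` and the
# random `vectors` stand in for the vocabulary kept after TF-IDF filtering.
import numpy as np
from sklearn.cluster import DBSCAN

words = ["kid", "child", "dog", "puppy"]      # placeholder vocabulary
vectors = np.random.rand(len(words), 100)     # placeholder for trained word2vec vectors

labels = DBSCAN(eps=0.034, min_samples=1, metric="euclidean").fit_predict(vectors)

clusters = {}
for word, label in zip(words, labels):
    clusters.setdefault(label, []).append(word)   # words sharing a label represent each other
print(clusters)                                    # e.g. {0: ['kid', 'child'], 1: ['dog'], ...}
```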
used to represent the TF-IDF feature of each word in its cluster. In this paper, $TF'(w^c)$, $IDF'(w^c)$ and $TFIDF'(w^c)$ are used to represent the term frequency, inverse document frequency and term frequency–inverse document frequency of the word $w$. This part assumes that the word $w$ is included both in the target text and in the cluster $c$.

Equation (5) calculates the term frequency of the word $w$ corresponding to cluster $c$, where $k$ is the number of words in cluster $c$ and $w_j^c$ represents the $j$-th word in cluster $c$.

$$TF'(w^c) = \sum_{j=1}^{k} TF(w_j^c) \qquad (5)$$

Equation (6) calculates the inverse document frequency of the word $w$ corresponding to cluster $c$, where $n$ is the same as in equation (2) and represents the total number of samples in the corpus, and $m'$ represents the total number of samples to which the words in cluster $c$ belong.

$$IDF'(w^c) = \ln\!\left(\frac{n}{m'+1}\right) \qquad (6)$$

Equation (7) calculates the TF-IDF of the word $w$ according to equations (5) and (6).

$$TFIDF'(w^c) = \ln\!\left(\frac{n}{m'+1}\right)\sum_{j=1}^{k} TF(w_j^c) \qquad (7)$$

E. Constructing Vector Space Model Based on Clustering Results

Similar to the TF-IDF algorithm, this paper uses the vector space model for text characterization. The former's feature unit is the TF-IDF of each word, as shown in "Fig. 3", while the latter's feature unit is the TF-IDF of each virtual word $w^c$, as shown in "Fig. 4". For each word whose text features need to be extracted later, we no longer consider its individual TF-IDF, but instead use the TF-IDF of the virtual word $w^c$ corresponding to the cluster $c$ to which it belongs, which is represented as follows:

$$TFIDF(w_j^c) = TFIDF'(w^c)$$

This solves the problem of considering only the expressions of words that are identical in all texts (such as their ASCII codes) without considering that they have other expressions, namely synonyms. So, for Example 1, it is possible to associate "monkey" with "gorilla" based on the meaning of the word.

  Word   w_1               w_2               ...   w_x
  T_1    TFIDF_T1(w_1)     TFIDF_T1(w_2)     ...   TFIDF_T1(w_x)
  T_2    TFIDF_T2(w_1)     TFIDF_T2(w_2)     ...   TFIDF_T2(w_x)
  ...    ...               ...               ...   ...
  T_y    TFIDF_Ty(w_1)     TFIDF_Ty(w_2)     ...   TFIDF_Ty(w_x)

Figure 3. Vector space model constructed by TF-IDF

  Word   w^c1                  w^c2                  ...   w^cx
         (w_1^c1 ~ w_k1^c1)    (w_1^c2 ~ w_k2^c2)    ...   (w_1^cx ~ w_kx^cx)
  T_1    TFIDF'_T1(w^c1)       TFIDF'_T1(w^c2)       ...   TFIDF'_T1(w^cx)
  T_2    TFIDF'_T2(w^c1)       TFIDF'_T2(w^c2)       ...   TFIDF'_T2(w^cx)
  ...    ...                   ...                   ...   ...
  T_y    TFIDF'_Ty(w^c1)       TFIDF'_Ty(w^c2)       ...   TFIDF'_Ty(w^cx)

Figure 4. Vector space model constructed after clustering

V. EXPERIMENT AND RESULTS

A. Experimental Data

The experiment selected the Wikipedia corpus and the abstracts of papers crawled from CNKI as experimental data. From the CNKI website, 5,000 paper abstracts about "computer artificial intelligence" and "chemical pharmacy" were crawled, 2,500 texts for each topic. For each topic, 2,000 texts are used as target texts for calculating TF-IDF and 500 texts are used for testing. The Wikipedia corpus contains 3,271,863 texts, which are separated into different paragraphs by "\n" and stored in a single text file that serves as the auxiliary corpus for the experiment.

B. Experimental Process

1) For the corpus:

a) We first segment the words in the entire corpus, keeping the character "\n" in the process.

b) The Word2vec tool is then used to train on the corpus, whose words have been segmented but whose texts have not yet been separated.

c) After obtaining the word vectors, we divide the corpus according to "\n" and save the result as separate texts.

2) For each target text:

a) Using all the independent texts in the corpus as the auxiliary corpus, calculate the TF-IDF of all words in the target text according to equation (3), set the threshold value to 0.0005, and delete the word vectors whose TF-IDF is lower than this value.

b) The neighborhood parameters (ε, MinPts) are set to ε = 0.034 and MinPts = 1, and the remaining word vectors are clustered using DBSCAN (Density-Based Spatial Clustering of Applications with Noise). MinPts is set to 1 essentially because the purpose of using this algorithm in this paper is to obtain words as close as possible to each word, so that every word can be a core object.

c) Calculate the TF-IDF of each cluster according to equation (7).
d) We use the TF-IDF of each cluster as the text feature unit, combine these feature units into a text feature in a specified order, and finally construct a vector space model that can represent the text features.

C. Analysis of Results

This paper measures the effect of text characterization by the classification results on the texts. We classify the texts with KNN (k-Nearest Neighbor) [14] during the experiment. The KNN algorithm works by taking new, unlabeled data, comparing each of its features with the features of the labeled data, and then adopting the classification label of the most similar data [15].

When classifying a test text, we extract the 5 most similar texts among the 5,000 texts whose text features have been obtained. We stipulate that when at least three of these similar texts belong to the same class A, the test text is also classified into A, as "Fig. 5" shows. In addition, when the average distance between the test text and its 5 most similar texts is greater than 1.36, the test text is considered not to belong to any class, or to belong to the "other" classes, regardless of which category it would otherwise be assigned to, as "Fig. 6" shows.

We divided 1,000 texts into 5 tests and took 200 samples each time (100 "computer artificial intelligence" texts and 100 "chemical pharmaceutical" texts). Finally, the classification results of the 5 tests are compared with the actual data. The classification accuracy rates of the two methods are shown in "Fig. 5" and "Fig. 6".

"Fig. 5" shows the results when the tested texts are only classified into the "computer artificial intelligence" class or the "chemical pharmaceutical" class.

[Fig. 5: chart of the correct rate of the two methods; y-axis "Correct rate".]

"Fig. 6" shows the test results when the "other classes" option is added for texts whose average distance from their 5 most similar texts is greater than 1.36.

Based on the above second classification method, we use the Macro-F1 of the F1-measure [16] standard to evaluate the classification performance. The corresponding average values are shown in "Tab. 1" and "Tab. 2".

"Tab. 1" shows the results obtained using only the vector space model constructed with the TF-IDF algorithm.

TABLE I. TEST RESULTS FOR TF-IDF

              computer AI    chemical PM    average
  Num         98.2           82.0           -
  Recall      75.4%          62.2%          68.8%
  Precision   76.8%          75.9%          76.4%
  Macro-F1    76.1%          68.4%          72.25%

"Tab. 2" shows the results obtained using the vector space model constructed by the method of this paper.

TABLE II. TEST RESULTS FOR THE METHOD OF THIS PAPER

              computer AI    chemical PM    average
  Num         101.0          82.2           -
  Recall      81.0%          66.8%          73.9%
  Precision   80.2%          81.3%          80.75%
  Macro-F1    80.6%          73.3%          77.0%

For these five experiments, it can be seen from "Tab. 1" that when only the TF-IDF algorithm is used, an average of 98.2 articles are classified into "computer artificial intelligence" and 75.4 of them are classified correctly, while an average of 82 articles are classified into "chemical pharmacy"
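As one possible reading of the classification rule described above (the 5 most similar texts, a majority of at least three, and an "other" outcome when the average distance exceeds 1.36), here is a hypothetical sketch; the feature vectors and label arrays are placeholders for the constructed VSM features.

```python
# Sketch of the top-5 decision rule with the 1.36 distance cut-off described above.
from collections import Counter
import numpy as np

def classify(test_vec, train_vecs, train_labels, k=5, max_avg_dist=1.36):
    dists = np.linalg.norm(train_vecs - test_vec, axis=1)      # distance to every training text
    nearest = np.argsort(dists)[:k]                            # indices of the k most similar texts
    if dists[nearest].mean() > max_avg_dist:
        return "other"                                         # too far from every known class
    label, votes = Counter(train_labels[i] for i in nearest).most_common(1)[0]
    return label if votes >= 3 else "other"                    # at least 3 of 5 must agree
```

Whether the 1.36 cut-off is checked before or after the majority vote is not spelled out in the paper; the sketch applies it first.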
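The Macro-F1 figures in Tab. 1 and Tab. 2 follow the usual macro-averaged definition; the sketch below uses scikit-learn with placeholder labels and is not the authors' evaluation code.

```python
# Macro-averaged recall, precision and F1 over the two classes; labels are placeholders.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = ["AI", "PM", "AI", "PM", "AI", "PM"]    # actual classes of the test texts
y_pred = ["AI", "PM", "PM", "PM", "AI", "AI"]    # classes assigned by the classifier

print("Recall   :", recall_score(y_true, y_pred, average="macro"))
print("Precision:", precision_score(y_true, y_pred, average="macro"))
print("Macro-F1 :", f1_score(y_true, y_pred, average="macro"))
```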
the final required feature model. Compared with the traditional TF-IDF algorithm, the accuracy of text feature extraction is higher. Compared with feature extraction methods based on domain knowledge engineering, this paper uses a ready-made Word2vector model based on deep-learning ideas to extract text features from different fields, which does not require spending much time on constructing domain knowledge graphs to extract semantic features. Nevertheless, for words that are not included in the corpus or whose context information is blurred, this paper cannot effectively extract their semantic features. The extraction of these semantic features still needs to be done with semantic rules and specialized domain knowledge graphs. In future research work, we will improve the related work on the basis of this paper.

ACKNOWLEDGEMENT

This work is supported by: (i) the Natural Science Foundation of China (NSFC) under Grant Nos. 61402397, 61263043 and 61663046; (ii) the Yunnan Applied Fundamental Research Project under Grant No. 2016FB104; (iii) the Yunnan Provincial Young Academic and Technical Leaders Reserve Talents program under Grant No. 2017HB005; (iv) the Yunnan Provincial Innovation Team under Grant No. 2017HC012; (v) the National Nature Science Fund Project (61562093) and the Key Project of the Applied Basic Research Program of Yunnan Province (2016FA024); (vi) the MOE Key Laboratory of Educational Informatization for Nationalities (YNNU) Open Funding Project (EIN2017001).

REFERENCES

[1] V. Singh, B. Kumar, and T. Patnaik, "Feature Extraction Techniques for Handwritten Text in Various Scripts: a Survey," International Journal of Soft Computing & Engineering, vol. 3, pp. 238-241, 2013.
[2] X. Chen, S. F. Li, and Y. F. Wang, "The feature extraction of the text based on the deep learning," in Proceedings of the 2014 International Conference on Network Security and Communication Engineering (NSCE 2014), Advanced Science and Industry Research Center, 2014, p. 5.
[3] Y. Tian and J. Zhang, "Improvement of Linked Data Fusion Algorithm Based on Bag of Words," Library Journal, vol. 35, pp. 17-22, 2016.
[4] W. Zhang, T. Yoshida, and X. Tang, "A comparative study of TF*IDF, LSI and multi-words for text classification," Expert Systems with Applications, vol. 38, pp. 2758-2765, 2010.
[5] M. Y. Jiang, R. Liu, and F. Wang, "Word Network Topic Model Based on Word2Vector," IEEE Xplore, pp. 241-247.
[6] Y. Kim et al., "DBCURE-MR: An efficient density-based clustering algorithm for large data using MapReduce," Information Systems, vol. 42, pp. 162-166, 2013.
[7] S. Q. Xue and Y. J. Niu, "Research on Chinese text similarity based on vector space model," Electronic Design Engineering, 2016.
[8] L. H. Patil and M. Atique, "A novel feature selection based on information gain using WordNet," Science and Information Conference, IEEE, pp. 625-629, 2013.
[9] Y. Lu and M. Liang, "Improvement of Text Feature Extraction with Genetic Algorithm," New Technology of Library & Information Service, vol. 38, pp. 523-525, 2014.
[10] L. H. Wang, "An Improved Method of Short Text Feature Extraction Based on Words Co-Occurrence," Applied Mechanics & Materials, vol. 519-520, pp. 842-845, 2014.
[11] H. Liang et al., "Text feature extraction based on deep learning: a review," EURASIP Journal on Wireless Communications & Networking, vol. 2017, p. 211, 2017.
[12] S. Zhou et al., "Characteristic representation method of document based on Word2vector," Journal of Chongqing University of Posts & Telecommunications, vol. 30, pp. 272-279, 2018.
[13] Y. Chen et al., "A fast clustering algorithm based on pruning unnecessary distance computations in DBSCAN for high-dimensional data," Pattern Recognition, vol. 83, 2018.
[14] S. Tan, "Neighbor-weighted K-nearest neighbor for unbalanced text corpus," Expert Systems with Applications, vol. 28, pp. 667-671, 2005.
[15] T. Denoeux, "A k-nearest neighbor classification rule based on Dempster-Shafer theory," IEEE Transactions on Systems, Man and Cybernetics, vol. 25, no. 5, pp. 804-813, 1995.
[16] S. He et al., "The Capability Analysis on the Characteristic Selection Algorithm of Text Categorization Based on F1 Measure Value," IEEE Xplore, pp. 742-746.