Text Features Extraction Based On TF-IDF Associating Semantic
NaiYao Wang
Software College, Yunnan University
Kunming 650091, China
e-mail: [email protected]
Abstract—The TF-IDF (term frequency–inverse document frequency) algorithm extracts text features on the basis of word statistics. It considers only the expressions of words that are identical in all texts, such as their ASCII codes, without considering that words could also be represented by their synonyms. Separating words with the same or similar meanings therefore causes a loss of partial information when text features are extracted. Representing words in this way requires the similarity of words to be captured, and the similarity among words has to be obtained from the meaning of the words in texts. In order to improve the accuracy of text feature extraction, this paper uses the word2vec model to train word vectors on the corpus and obtain their semantic features. After excluding words with low TF-IDF values, a density clustering algorithm is used to cluster the remaining words according to word-vector similarity. As a result, similar words are clustered together and can represent each other. Experiments show that applying the TF-IDF algorithm again and constructing a VSM (vector space model) with these clusters as feature units can effectively improve the accuracy of text feature extraction.

Keywords—Text feature; TF-IDF; Word vector; Semantic features; Clustering

I. INTRODUCTION

Text feature extraction is an important step in data mining and information retrieval. It quantifies the feature words extracted from a text to represent the text's information, and converts the text from an unstructured original text into structured information that a computer can recognize and process [1]; that is, the text is described and replaced through dimensionality reduction of its word space and the establishment of a mathematical model. In the process of extracting text features, irrelevant or redundant features are deleted, and finally the important features (sentences, words or characters, etc.) are combined with their weights to reflect the information contained in the text [2].

Text characterization based on word statistics is often used to extract text features. The BOW (bag of words) [3] and TF-IDF (term frequency–inverse document frequency) [4] are the most typical models. These models simplify the extraction process and are easy to understand. However, when words are extracted as text features, each word in the text is treated as a separate unit, so the semantic features of the text cannot be obtained effectively. For example, "kid" and "child" belong to the same concept in everyday life, but the traditional methods of text feature extraction treat the two words as different concepts, resulting in the loss of the text information that the text features are meant to represent.

Extracting the semantic features of each word in a text requires measuring the context in which the word appears. Word2Vector, which trains a word vector for each word based on its context in the text, plays an important role in extracting word semantics. Word2Vector is a technique for transforming word representations into space vectors. It mainly uses ideas from deep learning to train on a corpus, associating the context of each word and mapping the words into different N-dimensional vectors [5]. In this way, the semantic features of each word can be expressed and recognized by the computer.

For data that already carries semantic features, representing words by one another requires calculating semantic similarity and storing synonyms uniformly. Using the density clustering algorithm [6] in machine learning, words with similar meanings can be clustered. This algorithm can
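As a rough illustration of the word-vector training described above, the sketch below uses the gensim implementation of word2vec; the corpus file name, the hyper-parameters and the example word are placeholders chosen for illustration, not settings taken from this paper.

```python
# Minimal sketch (not the authors' code): train context-based word vectors with
# gensim's word2vec. "corpus.txt" and the hyper-parameters are placeholders.
from gensim.models import Word2Vec

# One document per line, words already separated by spaces.
with open("corpus.txt", encoding="utf-8") as f:
    sentences = [line.split() for line in f if line.strip()]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Each word now has an N-dimensional vector learned from its context, so that
# semantically related words (e.g. "kid" and "child") end up close together.
print(model.wv.most_similar("kid", topn=5))   # assumes "kid" occurs in the corpus
```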
and set a threshold; the words whose TF-IDF is less than the threshold are all excluded.

In equation (1), $TF(w)$ denotes the frequency of the word $w$ in the text, $count(w)$ and $count(w_{t_j})$ respectively denote the number of occurrences of the word $w$ in the target text and in the corpus sample $t_j$, and $m$ denotes the number of corpus samples containing the word $w$.

$$TF(w) = \frac{count(w)}{\sum_{j=1}^{m} count(w_{t_j})} \qquad (1)$$

In equation (2), $IDF(w)$ denotes the inverse document frequency of the word $w$, $m$ denotes the number of corpus samples containing the word $w$, and $n$ denotes the total number of corpus texts.

$$IDF(w) = \ln\!\left(\frac{n}{m+1}\right) \qquad (2)$$

$$TFIDF(w) = \frac{count(w)}{\sum_{j=1}^{m} count(w_{t_j})} \cdot \ln\!\left(\frac{n}{m+1}\right) \qquad (3)$$

This paper uses the density clustering algorithm to cluster words. The density clustering algorithm is mainly based on a set of "neighborhood" parameters $(\varepsilon, MinPts)$, which specify how closely the samples are distributed. In addition, the word-vector density (distance) is measured by the Euclidean distance between two vectors.

According to the idea of the density clustering algorithm, we cluster the elements of the word-vector set $V_{final}$, where $V_{final} = \{v_1, v_2, \ldots, v_n\}$ corresponds to $W_{final}$ in equation (4). First, the $\varepsilon$-neighborhood of each vector is found and the core-object set $\Omega$ is determined; then a core object $v_{r1}$ is randomly selected from $\Omega$ as a seed. All the vectors that are density-reachable from it form a cluster $c_1$, and the core objects contained in $c_1$ are then removed from $\Omega$, that is, $\Omega = \Omega - c_1$. Another seed $v_{r2}$ is randomly selected from the updated set $\Omega$ to generate the next cluster. This process is repeated until $\Omega$ is empty [13]. The process of clustering words is shown in "Fig. 1".
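To make equations (1)–(3) concrete, here is a small self-contained sketch of the weighting as defined above; the toy corpus and the function name are illustrative assumptions, not code from the paper.

```python
# A rough sketch of equations (1)-(3); the toy corpus below is illustrative only.
import math

def tfidf(word, target_text, corpus):
    """target_text: token list of the target text; corpus: list of token lists."""
    containing = [doc for doc in corpus if word in doc]   # corpus samples containing w
    m, n = len(containing), len(corpus)                   # m samples contain w, n texts total
    denom = sum(doc.count(word) for doc in containing) or 1
    tf = target_text.count(word) / denom                  # equation (1)
    idf = math.log(n / (m + 1))                           # equation (2)
    return tf * idf                                       # equation (3)

corpus = [["kid", "plays", "outside"], ["child", "reads"], ["dog", "runs"]]
print(tfidf("kid", ["kid", "kid", "plays"], corpus))
```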
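The clustering step just described can likewise be sketched with scikit-learn's DBSCAN and Euclidean distance; the example words and the random stand-in vectors are placeholders, and the eps/min_samples values are the ones reported later in the experiment section.

```python
# Sketch: DBSCAN over word vectors with Euclidean distance. `words` and the
# random `vectors` stand in for the vocabulary kept after TF-IDF filtering.
import numpy as np
from sklearn.cluster import DBSCAN

words = ["kid", "child", "dog", "puppy"]      # placeholder vocabulary
vectors = np.random.rand(len(words), 100)     # placeholder for trained word2vec vectors

labels = DBSCAN(eps=0.034, min_samples=1, metric="euclidean").fit_predict(vectors)

clusters = {}
for word, label in zip(words, labels):
    clusters.setdefault(label, []).append(word)   # words sharing a label represent each other
print(clusters)                                    # e.g. {0: ['kid', 'child'], 1: ['dog'], ...}
```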
used to represent the TF-IDF feature of each word in its cluster. In this paper, $TF'(w^c)$, $IDF'(w^c)$ and $TFIDF'(w^c)$ are used to represent the term frequency, inverse document frequency and term frequency–inverse document frequency of the word $w$. This part assumes that the word $w$ is included both in the target text and in the cluster $c$.

Equation (5) calculates the term frequency of the word $w$ corresponding to cluster $c$, where $k$ is the number of words in cluster $c$ and $w_j^c$ represents the $j$-th word in cluster $c$.

$$TF'(w^c) = \sum_{j=1}^{k} TF(w_j^c) \qquad (5)$$

Equation (6) calculates the inverse document frequency of the word $w$ corresponding to cluster $c$, where $n$ is the same as in equation (2) and represents the total number of samples in the corpus, and $m'$ represents the total number of samples to which the words in cluster $c$ belong.

$$IDF'(w^c) = \ln\!\left(\frac{n}{m'+1}\right) \qquad (6)$$

Equation (7) calculates the TF-IDF of the word $w$ according to equations (5) and (6).

$$TFIDF'(w^c) = \ln\!\left(\frac{n}{m'+1}\right)\sum_{j=1}^{k} TF(w_j^c) \qquad (7)$$

E. Constructing Vector Space Model Based on Clustering Results

Similar to the TF-IDF algorithm, this paper uses the vector space model for text characterization. The former's feature unit is the TF-IDF of each word, as shown in "Fig. 3", while the latter's feature unit is the TF-IDF of each virtual word $w^c$, as shown in "Fig. 4". For each word whose text features need to be extracted later, we no longer consider its individual TF-IDF, but instead use the TF-IDF of the virtual word $w^c$ corresponding to the cluster $c$ to which it belongs, which is represented as follows:

$$TFIDF(w_j^c) = TFIDF'(w^c)$$

This solves the problem of considering only the expressions of words that are identical in all texts (such as their ASCII codes) without considering that they have other expressions, namely synonyms. So, for Example 1, it is possible to associate "monkey" with "gorilla" based on the meaning of the word.

  Word   w_1               w_2               ...   w_x
  T_1    TFIDF_T1(w_1)     TFIDF_T1(w_2)     ...   TFIDF_T1(w_x)
  T_2    TFIDF_T2(w_1)     TFIDF_T2(w_2)     ...   TFIDF_T2(w_x)
  ...    ...               ...               ...   ...
  T_y    TFIDF_Ty(w_1)     TFIDF_Ty(w_2)     ...   TFIDF_Ty(w_x)

Figure 3. Vector space model constructed by TF-IDF

  Word   w^c1                  w^c2                  ...   w^cx
         (w_1^c1 ~ w_k1^c1)    (w_1^c2 ~ w_k2^c2)    ...   (w_1^cx ~ w_kx^cx)
  T_1    TFIDF'_T1(w^c1)       TFIDF'_T1(w^c2)       ...   TFIDF'_T1(w^cx)
  T_2    TFIDF'_T2(w^c1)       TFIDF'_T2(w^c2)       ...   TFIDF'_T2(w^cx)
  ...    ...                   ...                   ...   ...
  T_y    TFIDF'_Ty(w^c1)       TFIDF'_Ty(w^c2)       ...   TFIDF'_Ty(w^cx)

Figure 4. Vector space model constructed after clustering

V. EXPERIMENT AND RESULTS

A. Experimental Data

The experiment selected the Wikipedia corpus and the abstracts of papers crawled from CNKI as experimental data. From the CNKI website, 5,000 paper abstracts about "computer artificial intelligence" and "chemical pharmacy" were crawled, 2,500 texts for each topic. For each topic, 2,000 texts are used as target texts for calculating TF-IDF and 500 texts are used for testing. The Wikipedia corpus contains 3,271,863 texts, which are separated into different paragraphs by "\n" and stored in a single text file that serves as the auxiliary corpus for the experiment.

B. Experimental Process

1) For the corpus:

a) We first segment the words in the entire corpus, keeping the character "\n" in the process.

b) The Word2vec tool is then used to train on the corpus, whose words have been segmented but whose texts have not yet been separated.

c) After obtaining the word vectors, we divide the corpus according to "\n" and save the result as separate texts.

2) For each target text:

a) Using all the independent texts in the corpus as the auxiliary corpus, calculate the TF-IDF of all words in the target text according to equation (3), set the threshold value to 0.0005, and delete the word vectors whose TF-IDF is lower than this value.

b) The neighborhood parameters (ε, MinPts) are set to ε = 0.034 and MinPts = 1, and the remaining word vectors are clustered using DBSCAN (Density-Based Spatial Clustering of Applications with Noise). MinPts is set to 1 essentially because the purpose of using this algorithm in this paper is to obtain words as close as possible to each word, so that every word can be a core object.

c) Calculate the TF-IDF of each cluster according to equation (7).
d) We use the TF-IDF of each cluster as the text feature unit, combine these feature units into a text feature in a specified order, and finally construct a vector space model that can represent the text features.

C. Analysis of Results

This paper measures the effect of text characterization by the classification results on the texts. We classify the texts with KNN (k-Nearest Neighbor) [14] during the experiment. The KNN algorithm works by taking new, unlabeled data, comparing each of its features with the features of the labeled data, and then adopting the classification label of the most similar data [15].

When classifying a test text, we extract the 5 most similar texts among the 5,000 texts whose text features have been obtained. We stipulate that when at least three of these similar texts belong to the same class A, the test text is also classified into A, as "Fig. 5" shows. In addition, when the average distance between the test text and its 5 most similar texts is greater than 1.36, the test text is considered not to belong to any class, or to belong to the "other" classes, regardless of which category it would otherwise be assigned to, as "Fig. 6" shows.

We divided 1,000 texts into 5 tests and took 200 samples each time (100 "computer artificial intelligence" texts and 100 "chemical pharmaceutical" texts). Finally, the classification results of the 5 tests are compared with the actual data. The classification accuracy rates of the two methods are shown in "Fig. 5" and "Fig. 6".

"Fig. 5" shows the results when the tested texts are only classified into the "computer artificial intelligence" class or the "chemical pharmaceutical" class.

[Fig. 5: chart of the correct rate of the two methods; y-axis "Correct rate".]

"Fig. 6" shows the test results when the "other classes" option is added for texts whose average distance from their 5 most similar texts is greater than 1.36.

Based on the above second classification method, we use the Macro-F1 of the F1-measure [16] standard to evaluate the classification performance. The corresponding average values are shown in "Tab. 1" and "Tab. 2".

"Tab. 1" shows the results obtained using only the vector space model constructed with the TF-IDF algorithm.

TABLE I. TEST RESULTS FOR TF-IDF

              computer AI    chemical PM    average
  Num         98.2           82.0           -
  Recall      75.4%          62.2%          68.8%
  Precision   76.8%          75.9%          76.4%
  Macro-F1    76.1%          68.4%          72.25%

"Tab. 2" shows the results obtained using the vector space model constructed by the method of this paper.

TABLE II. TEST RESULTS FOR THE METHOD OF THIS PAPER

              computer AI    chemical PM    average
  Num         101.0          82.2           -
  Recall      81.0%          66.8%          73.9%
  Precision   80.2%          81.3%          80.75%
  Macro-F1    80.6%          73.3%          77.0%

For these five experiments, it can be seen from "Tab. 1" that when only the TF-IDF algorithm is used, an average of 98.2 articles are classified into "computer artificial intelligence" and 75.4 of them are classified correctly, while an average of 82 articles are classified into "chemical pharmacy"
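As one possible reading of the classification rule described above (the 5 most similar texts, a majority of at least three, and an "other" outcome when the average distance exceeds 1.36), here is a hypothetical sketch; the feature vectors and label arrays are placeholders for the constructed VSM features.

```python
# Sketch of the top-5 decision rule with the 1.36 distance cut-off described above.
from collections import Counter
import numpy as np

def classify(test_vec, train_vecs, train_labels, k=5, max_avg_dist=1.36):
    dists = np.linalg.norm(train_vecs - test_vec, axis=1)      # distance to every training text
    nearest = np.argsort(dists)[:k]                            # indices of the k most similar texts
    if dists[nearest].mean() > max_avg_dist:
        return "other"                                         # too far from every known class
    label, votes = Counter(train_labels[i] for i in nearest).most_common(1)[0]
    return label if votes >= 3 else "other"                    # at least 3 of 5 must agree
```

Whether the 1.36 cut-off is checked before or after the majority vote is not spelled out in the paper; the sketch applies it first.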
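The Macro-F1 figures in Tab. 1 and Tab. 2 follow the usual macro-averaged definition; the sketch below uses scikit-learn with placeholder labels and is not the authors' evaluation code.

```python
# Macro-averaged recall, precision and F1 over the two classes; labels are placeholders.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = ["AI", "PM", "AI", "PM", "AI", "PM"]    # actual classes of the test texts
y_pred = ["AI", "PM", "PM", "PM", "AI", "AI"]    # classes assigned by the classifier

print("Recall   :", recall_score(y_true, y_pred, average="macro"))
print("Precision:", precision_score(y_true, y_pred, average="macro"))
print("Macro-F1 :", f1_score(y_true, y_pred, average="macro"))
```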
the final required feature model. Compared with the traditional TF-IDF algorithm, the accuracy of text feature extraction is higher. Compared with feature extraction methods based on domain knowledge engineering, this paper uses a ready-made Word2vector model based on deep-learning ideas to extract text features from different fields, which does not require spending much time on constructing domain knowledge graphs to extract semantic features. Nevertheless, for words that are not included in the corpus or whose context information is blurred, this paper cannot effectively extract their semantic features. The extraction of these semantic features still needs to be done with semantic rules and specialized domain knowledge graphs. In future research work, we will improve the related work on the basis of this paper.

ACKNOWLEDGEMENT

This work is supported by: (i) the Natural Science Foundation of China (NSFC) under Grant Nos. 61402397, 61263043 and 61663046; (ii) the Yunnan Applied Fundamental Research Project under Grant No. 2016FB104; (iii) the Yunnan Provincial Young Academic and Technical Leaders Reserve Talents program under Grant No. 2017HB005; (iv) the Yunnan Provincial Innovation Team under Grant No. 2017HC012; (v) the National Nature Science Fund Project (61562093) and the Key Project of the Applied Basic Research Program of Yunnan Province (2016FA024); (vi) the MOE Key Laboratory of Educational Informatization for Nationalities (YNNU) Open Funding Project (EIN2017001).

REFERENCES

[1] V. Singh, B. Kumar, and T. Patnaik, "Feature Extraction Techniques for Handwritten Text in Various Scripts: a Survey," International Journal of Soft Computing & Engineering, vol. 3, pp. 238-241, 2013.
[2] X. Chen, S. F. Li, and Y. F. Wang, "The feature extraction of the text based on the deep learning," in Proceedings of the 2014 International Conference on Network Security and Communication Engineering (NSCE 2014), Advanced Science and Industry Research Center, 2014, p. 5.
[3] Y. Tian and J. Zhang, "Improvement of Linked Data Fusion Algorithm Based on Bag of Words," Library Journal, vol. 35, pp. 17-22, 2016.
[4] W. Zhang, T. Yoshida, and X. Tang, "A comparative study of TF*IDF, LSI and multi-words for text classification," Expert Systems with Applications, vol. 38, pp. 2758-2765, 2010.
[5] M. Y. Jiang, R. Liu, and F. Wang, "Word Network Topic Model Based on Word2Vector," IEEE Xplore, pp. 241-247.
[6] Y. Kim et al., "DBCURE-MR: An efficient density-based clustering algorithm for large data using MapReduce," Information Systems, vol. 42, pp. 162-166, 2013.
[7] S. Q. Xue and Y. J. Niu, "Research on Chinese text similarity based on vector space model," Electronic Design Engineering, 2016.
[8] L. H. Patil and M. Atique, "A novel feature selection based on information gain using WordNet," Science and Information Conference, IEEE, pp. 625-629, 2013.
[9] Y. Lu and M. Liang, "Improvement of Text Feature Extraction with Genetic Algorithm," New Technology of Library & Information Service, vol. 38, pp. 523-525, 2014.
[10] L. H. Wang, "An Improved Method of Short Text Feature Extraction Based on Words Co-Occurrence," Applied Mechanics & Materials, vol. 519-520, pp. 842-845, 2014.
[11] H. Liang et al., "Text feature extraction based on deep learning: a review," EURASIP Journal on Wireless Communications & Networking, vol. 2017, p. 211, 2017.
[12] S. Zhou et al., "Characteristic representation method of document based on Word2vector," Journal of Chongqing University of Posts & Telecommunications, vol. 30, pp. 272-279, 2018.
[13] Y. Chen et al., "A fast clustering algorithm based on pruning unnecessary distance computations in DBSCAN for high-dimensional data," Pattern Recognition, vol. 83, 2018.
[14] S. Tan, "Neighbor-weighted K-nearest neighbor for unbalanced text corpus," Expert Systems with Applications, vol. 28, pp. 667-671, 2005.
[15] T. Denoeux, "A k-nearest neighbor classification rule based on Dempster-Shafer theory," IEEE Transactions on Systems, Man and Cybernetics, vol. 25, no. 5, pp. 804-813, 1995.
[16] S. He et al., "The Capability Analysis on the Characteristic Selection Algorithm of Text Categorization Based on F1 Measure Value," IEEE Xplore, pp. 742-746.