Text Vectorization Using Data Mining Methods
Kravchenko Yury, Southern Federal University
Abstract—With the incredible increase in the amount of text data, the need to develop efficient methods for processing and analyzing it grows. In this context, feature extraction from text is an urgent task for solving many text mining and information retrieval problems. Traditional text feature extraction methods such as TF-IDF and bag-of-words are effective and characterized by intuitive interpretability, but they suffer from the "curse of dimensionality" and are unable to capture the meanings of words. On the other hand, modern distributed methods effectively capture the hidden semantics, but they are computationally intensive and uninterpretable. This paper proposes a new concept-mining-based text vectorization method called Bag of Weighted Concepts (BoWC) that aims to generate representations with low dimensionality and high representational ability. BoWC vectorizes a document according to the concept information it contains: it creates concepts by clustering word vectors, then uses the frequencies of these concept clusters to represent document vectors. To enrich the resulting document representation, new weighting functions are proposed for concept weighting based on statistics extracted from word embedding information. This work is a development of previous research in which the proposed method was introduced and tested on a text classification task. Here, the empirical tests are extended to include tuning the parameters of the proposed method and analyzing the effect of each on its efficiency. The proposed method has been tested in two data mining tasks, document clustering and classification, with five different benchmark data sets, and it was compared with several baselines, including bag-of-words, TF-IDF, averaged word embeddings, Bag-of-Concepts, and VLAC. The proposed method outperforms most baselines in terms of the minimum number of features and maximum classification and clustering accuracy.

Keywords — Text vectorization, text mining, concept mining, classification, clustering, educational data mining

I. INTRODUCTION

The widespread use of computers and the Internet in all areas of education has resulted in the availability of large amounts of data. To take advantage of this huge source of information, it is necessary to develop innovative techniques for mining and extracting hidden information from these data. The application of data mining techniques to improve educational engineering applications is not new and is referred to as educational data mining (EDM). EDM is a research field that aims to prepare techniques for exploring the specialized forms of information collected from the educational sector and to apply these techniques to improve understanding of students and the environment in which they learn [1-3].

Among the various forms of data, text is the most common medium for information transmission and interaction in education. Accordingly, text mining is essential for taking advantage of this huge source of information, and it has been effectively applied to analyze educational data in many types of research to improve the learning process [4].

For example, text mining has been used for the personalization of recommendations within adaptive e-learning systems [5], where the students' sentiments towards a course have served as feedback for teachers. This information has been used to support personalized learning, as in [6], where the authors propose to analyze students' text feedback automatically, performing sentiment analysis tasks to predict the level of teaching performance.

In addition, several systems have been developed that use text mining for automated tagging of essays and short free-text responses [7]. Similarly, in [8], text analysis is used for real-time formative assessment of short-answer questions. Moreover, there are many important services and applications that automate education services, such as plagiarism detection systems [9] and paraphrasing programs [10], which are used effectively in the education field. These techniques serve to reduce time and cost and to improve the reliability and generalizability of the assessment process.

To apply various text mining methods, raw text must be converted into a format that a machine can understand [11]. The first step in making text documents machine-readable is text vectorization. Vectorization is a feature extraction process for constructing digital vector representations of natural-language text [12].

Traditional text vectorization methods such as TF-IDF and bag-of-words are effective and characterized by simplicity, efficiency, and intuitive interpretability, but they suffer from the "curse of dimensionality" and do not take into account the semantic relationships between words. On the other hand, modern distributional semantic representation methods effectively capture the hidden semantics, but they are computationally intensive and uninterpretable [13].

One of the proposed solutions to overcome these problems is to use concepts instead of words to represent the text. This approach relies on concept mining as a prerequisite for text vectorization, and it is driven by the fact that concepts can provide powerful insights into the meaning, provenance, and similarity of documents.

Several methods for exploiting concepts in generating document representations have been presented, and they can be categorized by the source of the information into two approaches. The first relies on external sources such as knowledge bases or lexical databases to discover concepts [14-17]. The methods based on this approach first extract the characteristic keywords of the document and then assign these words to the corresponding entries in the knowledge base, passing through a word disambiguation phase.
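The concept-based idea described in this introduction (cluster word vectors into "concepts", then describe a document by its concept frequencies) can be illustrated with a toy example. The two-dimensional embeddings and helper names below are invented for illustration only and are not the authors' implementation:

```python
# Toy sketch of concept-based vectorization: cluster word vectors into
# "concepts", then represent each document by concept frequencies.
# The tiny embedding table below is made up for illustration.
import numpy as np
from sklearn.cluster import KMeans

embeddings = {
    "cat": [1.0, 0.1], "dog": [0.9, 0.2], "puppy": [0.95, 0.15],
    "bank": [0.1, 1.0], "loan": [0.2, 0.9], "credit": [0.15, 0.95],
}
words = list(embeddings)
vectors = np.array([embeddings[w] for w in words])

# Each cluster of word vectors acts as one "concept".
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
concept_of = dict(zip(words, kmeans.labels_))

def vectorize(doc_tokens, n_concepts=2):
    """Represent a document by the frequency of each concept."""
    vec = np.zeros(n_concepts)
    for token in doc_tokens:
        if token in concept_of:
            vec[concept_of[token]] += 1
    return vec

doc = ["dog", "puppy", "loan"]
print(vectorize(doc))  # one animal-dominated component, one finance component
```

The resulting vector has one dimension per concept rather than one per vocabulary word, which is the dimensionality-reduction motivation given above.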
In the experimental section of the previous work [22], the efficiency of the BoWC-generated representations was evaluated by using them in a document classification task with five benchmark data sets, and their performance was evaluated using V-measure [31].

A. Data sets and preprocessing

To evaluate the proposed method, we have used five data sets [32, 33] presented in Table 1.

TABLE I. AN OVERVIEW OF DATA SET STATISTICS

Dataset | # Documents | # Classes | Mean number of in-vocabulary words per document | # Documents with < 10 embedded words
BBC     | 2225        | 5         | 207 | 0%
RE      | 8491        | 8         | 66  | 1.97%
OH      | 5380        | 7         | 79  | 24.38%
20NG    | 18821       | 20        | 135 | 2.19%
WebKB   | 4199        | 4         | 98  | 4.23%

The BBC data set [34] consists of 2225 documents from the BBC news website covering stories in five topical areas from 2004-2005.

The Reuters (RE) data set contains articles from the Reuters news feed. This paper uses the R8 partition of the Reuters data set, which contains 8491 documents.

20Newsgroups (20NG) is a collection of 18,821 documents distributed into 20 different newsgroup categories with an approximately uniform distribution.

The OHSUMED (OH) data set is a subset of clinical paper abstracts from the Medline database. It consists of the 23 Medical Subject Headings (MeSH) disease categories. In this work, a partition of 5380 documents was used.

The WebKB data set contains web pages from various sections of computer science collected by the World Wide Knowledge Base (Web->Kb) project of the CMU text learning group [35]. These web pages have been divided into 7 different classes: students, faculty, staff, etc. In this work, a preprocessed WebKB data set was used, which contains 4 different classes and a total of 4199 documents.

The same preprocessing steps were applied to all data sets: converting the text to lowercase and breaking it into the longest non-empty alphanumeric character sequences that contain at least three letters. Numbers, white spaces, punctuation, accent marks, stop words, rare words (occurring fewer than 20 times in the corpus), and words without embeddings (out-of-vocabulary) were removed.

Both types of word embedding models were used: pre-trained models and models trained on the same data set, referred to in this work as self-embedding.

Lemmatization was performed with self-trained embeddings only, in order to reduce the number of out-of-vocabulary words when using the pre-trained models. On the other hand, stemming was not performed, because it is only a heuristic process that cuts off the ends of words and often removes derivational affixes. However, a preprocessed (stemmed and lemmatized) version of the WebKB data set was used, because the original data set could not be retrieved.

We have used three of the most efficient embedding models: GloVe, word2vec, and FastText. As pre-trained models, we have used the FastText model [36] with 1 million word vectors trained with sub-word information on Wikipedia 2017, the UMBC webbase corpus, and the statmt.org news data set (16B tokens); the word2vec model² trained on the Google News corpus (3 billion words); and the GloVe model trained on the Common Crawl data set, which contains vectors for 1.9 million English words.

Since we are interested in the semantic relationship between terms in the search for concepts, the window size used to train the self-embedding models was set to 10, since a window smaller than 5 gives results that are more syntactic in nature, and a window larger than 5 gives more semantic results.

B. Evaluation metrics

F1-measure. The F-measure is one of the common measures for rating how successful a classifier is [37]. It is the harmonic mean of precision and recall, which are given by the following formulas:

Precision = TP / (TP + FP), (12)

Recall = TP / (TP + FN),

F-score = (1 + β²) · Precision · Recall / (β² · Precision + Recall).

Here, C is the class label, True Positive TP is the count of documents correctly labeled by the classifier as belonging to class C, and False Positive FP is the number of documents incorrectly labeled by the classifier as belonging to class C. Meanwhile, False Negative FN is the number of documents that belong to class C but are incorrectly identified by the classifier as not belonging to class C, and True Negative TN is the number of documents that do not belong to class C and are correctly labeled by the classifier as not belonging to class C.

The β coefficient can be adjusted to favor either the Precision or the Recall of the algorithm. If β is greater than 1, Precision is weighted more strongly in the calculation; if β is less than 1, Recall is weighted more strongly. β is 1.0 by default.

Due to the imbalance and different sizes of the used data sets, the F1-score with micro-averaging was used in this research [38]; it is defined by the following formula, where the sums are taken over all classes:

Micro-averaged F-score = (1 + β²) · ΣTP / ((1 + β²) · ΣTP + β² · ΣFN + ΣFP).

V-measure. For the clustering task, knowing the ground-truth labels, an intuitive V-measure is determined using conditional entropy analysis. V-measure is an entropy-based measure that explicitly measures how successfully the criteria of homogeneity H (each cluster contains only members of one class) and completeness C (all members of a given class belong to the same cluster) are satisfied. The V-measure is computed as the harmonic mean of the homogeneity and completeness estimates, just as Precision and Recall are usually combined into an F-measure [31]:

V-measure = (1 + β) · H · C / (β · H + C).

The beta coefficient is set exactly as in the F1-score formula. In this research, the default value of one was adopted.

C. Documents vectorization

For the implementation of the BoWC method, first the preprocessing of documents was carried out, and then the concept dictionary was built according to Algorithm 1. For this purpose, all the unique terms in the corpus were extracted and their representations generated using an embedding model with word vectors of size 300, giving the so-called dictionary of terms. Based on this dictionary, a concept dictionary was built by applying the spherical K-means algorithm with a similarity threshold of 0.6 for word clustering.

To study and analyze the efficiency of the proposed method, several BoWC implementations were tested against each other in two text mining tasks: classification and clustering.

For the classification task, the number of concepts was systematically increased from 10 to 300, and the threshold value Ɵ from 0.2 to 0.8. A Support Vector Machine (SVM) was used in these experiments as a classifier on top of the feature generation methods. For our method, the "POLY" kernel was best suited to the resulting vectors. For the other methods, the results were validated with all SVM kernel types (POLY, Linear, and RBF), and the highest value among them was selected. Moreover, 10-fold cross-validation was applied in each prediction instance to decrease the chance of overfitting the data and producing biased results.

Correspondingly, for the clustering task, the previous experiment was repeated with 100 and then 200 concepts, with the similarity threshold increasing from 0.2 to 0.8 for each data set. The k-means algorithm was used as the clustering algorithm.

For the VLAC³ method, the attached API was used to re-implement the experiments on our data sets due to differences in the data source and evaluation criteria. Thirty concepts (9000 features) were accepted, taking into account that a larger number of concepts requires large computational costs [9]. VLAC was implemented with pre-trained GloVe embeddings, as they performed best among all other embedding types [9].

Considering that BOC is a special case of the proposed BoWC method, it was implemented through the code of the BoWC method. As shown in [8], Bag-of-Concepts, in comparison with TF-IDF, requires at least 100 concepts to achieve competitive results. So, to maximize Bag-of-Concepts accuracy, the number of concepts was set to 200. The similarity threshold value used with BOC was not mentioned by the author, so it was empirically determined and set to 0.4 for all experiments. For the averaged word embeddings method, pre-trained GloVe and self-trained word2vec embeddings were used.

All methods were realized and implemented in Python in the Google Colab environment. An Asus notebook with an Intel Core i7 processor at 1.99 GHz was used.

D. Results and discussions

The results of comparing the proposed BoWC with the considered baselines in the document classification and clustering tasks are presented in Table 2 and Table 3.

As can be seen, for the Reuters data set classification task, our method outperformed all baselines with a significant difference in performance (0.05-1% minimum). In the BBC data set classification task, our method outperformed BOC, bag-of-words, and self-averaged word2vec with only 100 concepts, while it performed similarly to TF-IDF with 200 concepts. However, BoWC could not outperform VLAC (9000) on this data set.

Moreover, it is noted that all methods achieve high classification accuracy on this data set, which, in the authors' opinion, is due to the fact that the average number of in-vocabulary words is relatively large (see Table 1), which means that each document contains enough words to extract clear semantics for each concept, thus ensuring the generation of a higher-quality concept vector.

Knowing that in this work a preprocessed version of the WebKB data set was used (where the words were stemmed in a way that is not compatible with our method), our method achieved the best performance (87.77%) after TF-IDF (89.5%). This is an additional point that distinguishes our method from the others: regardless of the complexity of the used data set, BoWC achieves very good results, competing with strong baselines.

TABLE II. CLASSIFICATION RESULTS, MEASURED BY MICRO-AVERAGE F1

Methods                  | BBC    | RE    | OH     | 20NG  | WebKB
BoWCglove (100)          | 97.01  | 92.86 | 62.997 | 69.00 | 76.25
BoWCglove (200)          | 97.61  | 91.23 | 65.35  | 74.42 | 81.07
BoWCw2v (100)            | 96.86  | 92.4  | 65.99  | 68.58 | 78.55
BoWCw2v (200)            | 96.65  | 93.15 | 63.00  | 74.31 | 83.13
BoWCFastText (100)       | 95.51  | 91.2  | 66.23  | 66.19 | 77.74
BoWCFastText (200)       | 96.26  | 92.02 | 68.94  | 68.48 | 81.92
Self BoWCw2v (100)       | 96.84  | 93.49 | 64.11  | 66.02 | 85.25
Self BoWCw2v (200)       | 97.29  | 94.26 | 68.28  | 73.18 | 86.08
Self BoWCglove (100)     | 95.8   | 92.29 | 64.78  | 84.95 | 43.44
Self BoWCglove (200)     | 96.26  | 92.72 | 68.62  | 85.43 | 52.33
Self BoWCFastText (100)  | 96.99  | 93.65 | 68.36  | 70.07 | 86.24
Self BoWCFastText (200)  | 97.89  | 94.75 | 69.95  | 77.07 | 87.77
TF-IDF                   | 97.89  | 94.7  | 74.31  | 91.49 | 89.5
BoW                      | 96.98  | 94.15 | 69.89  | 83.85 | 85.56
Averaged GloVe (300)     | 97.75  | 93.32 | 67.39  | 78.7  | 80.22
self-averaged w2v (300)  | 91.599 | 87.81 | 56.78  | 62.37 | 83.27
BOC CF-IDF (200)         | 95.48  | 90.9  | 48.8   | 69.44 | 78.26
VLAC (9000)              | 98.2   | 94.06 | 69.37  | 86.4  | 85.54

a. Underlined values are the highest results for the proposed method, whereas bold values are the best results compared to all other methods for a single data set.

² We used the GloVe and word2vec models from gensim: https://radimrehurek.com/gensim/models/word2vec.html
³ We used the implementation provided by the authors at https://github.com/MaartenGr/VLAC
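Both evaluation metrics described above are available off the shelf in scikit-learn; the following sketch uses invented toy label arrays (not the paper's data) to show the relevant calls:

```python
# Toy illustration of the evaluation metrics above using scikit-learn.
# The label arrays are invented for demonstration only.
from sklearn.metrics import f1_score, v_measure_score

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

# Micro-averaged F1 aggregates TP/FP/FN over all classes before
# computing the score, which suits imbalanced multi-class data sets.
micro_f1 = f1_score(y_true, y_pred, average="micro")

# V-measure: harmonic mean of homogeneity and completeness
# (beta = 1.0 by default, matching the value adopted in the paper).
clusters = [1, 1, 0, 0, 2, 2]  # a perfect clustering up to relabeling
v = v_measure_score(y_true, clusters)

print(micro_f1, v)
```

Note that V-measure is invariant to cluster relabeling, so the permuted-but-perfect clustering above still scores 1.0, while micro-F1 for single-label multi-class prediction reduces to plain accuracy.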
In the OHSUMED series of experiments, it is interesting that BOC performs the worst among all the baselines, achieving a micro-average F1 value of 48.8% versus 69.95% for our method. This is a very important point, especially since our method follows the same approach used in BOC, and it confirms the efficiency of our weighting function and its ability to capture semantics at the term and concept levels at the same time. It is also worth noting that our method outperformed Bag-of-Concepts on all data sets by a significant margin (between 2-20% depending on the data set) with a 50% saving in the number of features.

For 20Newsgroups, which is a relatively large data set compared to the others, all concept-based vectorization methods performed poorly. Since 20Newsgroups documents are organized into categories and subcategories, i.e., there is similarity between categories, a large number of concepts is required to clearly separate them. For this reason, the TF-IDF method achieved the best performance (91%), since it is based on terms, not concepts. However, although the VLAC method works at the conceptual level, its vectors (9,000 features) contain detailed information about the concept words, and this explains why it achieved an acceptable result (85.6%).

Clustering is a more complex task that requires vectors to be highly expressive, so that they can clearly separate classes. However, depending on the word embeddings used, our BoWC method with only 100 concepts was able to outperform all other baselines on most data sets, except for 20Newsgroups, where 200 concepts were required (Table 3). The VLAC clustering experiment with the 20Newsgroups data set was ruled out due to its high computational cost, as the clustering algorithm failed after fully loading 12.72 GB of RAM.

TABLE III. CLUSTERING RESULTS, MEASURED BY V-MEASURE

Methods                  | BBC   | RE    | OH    | 20NG  | WebKB
BoWCglove (100)          | 0.728 | 0.449 | 0.104 | 0.373 | 0.109
BoWCglove (200)          | 0.733 | 0.45  | 0.118 | 0.385 | 0.149
BoWCw2v (100)            | 0.666 | 0.445 | 0.16  | 0.345 | 0.128
BoWCw2v (200)            | 0.616 | 0.452 | 0.127 | 0.338 | 0.123
BoWCFastText (100)       | 0.515 | 0.379 | 0.143 | 0.275 | 0.166
BoWCFastText (200)       | 0.625 | 0.375 | 0.155 | 0.301 | 0.206
Self BoWCw2v (100)       | 0.54  | 0.505 | 0.123 | 0.211 | 0.337
Self BoWCw2v (200)       | 0.536 | 0.493 | 0.125 | 0.232 | 0.333
Self BoWCglove (100)     | 0.794 | 0.574 | 0.114 | 0.311 | 0.329
Self BoWCglove (200)     | 0.810 | 0.574 | 0.117 | 0.396 | 0.340
Self BoWCFastText (100)  | 0.768 | 0.591 | 0.129 | 0.316 | 0.318
Self BoWCFastText (200)  | 0.771 | 0.545 | 0.140 | 0.391 | 0.328
TF-IDF                   | 0.663 | 0.513 | 0.122 | 0.362 | 0.313
BoW                      | 0.209 | 0.248 | 0.027 | 0.021 | 0.021
Averaged GloVe (300)     | 0.774 | 0.481 | 0.109 | 0.381 | 0.219
self-averaged w2v (300)  | 0.631 | 0.557 | 0.115 | 0.325 | 0.324
BOC CF-IDF (200)         | 0.743 | 0.527 | 0.1   | 0.394 | 0.088
VLAC (9000)              | 0.808 | 0.456 | 0.118 | (not run) | 0.286

With so many differences between the data sets, it is hard to pinpoint the exact reason for the differences in performance. One obvious reason could be the number of out-of-vocabulary words per document, the proportion of which varies among the data sets used (see Table 1).

As mentioned above, the BoWC method depends on the word embedding type, the number of concepts, and the similarity threshold used to detect the document concepts. In the following, we discuss in detail the effect of each of these parameters on the performance of the method. We illustrate the dependence using only one data set (Reuters), since it exhibits the general behavior on the whole.

The similarity threshold and the type of embeddings. The role of the similarity threshold is to verify the occurrence of a concept within the document: when the similarity between a document word and the center of a cluster is greater than the threshold, the M cluster words closest to the center are matched with that word. Accordingly, the higher the threshold, the fewer matching operations (Fig. 2). In other words, the higher the threshold value, the less likely a concept is to appear in a document, which results in irrational concept weights and thus affects the accuracy of classification/clustering.

Fig. 2. Dependence of Reuters classification accuracy (in terms of micro-average F1 score) on the similarity threshold values

In general, it is observed that for threshold values higher than 0.5, the accuracy decreases significantly over all data sets. However, looking at each type of embedding separately, we see that for the word2vec and FastText embeddings, all threshold values (from 0.2 to 0.7) give close results according to the micro-averaged F1-measure.

Moreover, the effect of the threshold value also depends on the type of embedding, and this is natural, given that the representational ability of the embedding vectors varies from one model to another according to the nature and size of the data on which the embedding model was trained.

Fig. 4 and Fig. 5 confirm this result: the overall performance of BoWC is significantly affected by the type of word embedding used, and the self-trained word embeddings used for BoWC performed significantly better than pre-trained word embeddings.
Fig. 3. Dependence of the five data sets' classification accuracy (in terms of micro-average F1 score) on the type of embedding model

The number of concepts. To analyze the dependence between the number of concepts (for different threshold values) and classification accuracy (in terms of micro-averaged F1 score), a series of Reuters classification experiments was carried out using BoWC with self-trained FastText. The number of concepts was gradually increased from 10 to 300, with the similarity threshold values increased from 0.2 to 0.6. It is noticeable that BoWC begins to provide comparable performance with only 50 concepts. Precisely, it outperformed BOC by 1% with just 20 concepts, and it also outperformed averaged GloVe with 90 concepts. We can also notice that for more than 200 concepts the accuracy begins to decrease slightly, due to the fact that the concepts become overlapping and therefore non-discriminatory.

Fig. 4. Dependence of the five data sets' clustering accuracy (in terms of V-measure score) on the type of embedding model

Another important point related to word embeddings is that in the proposed method's implementation algorithm, a word from a document can belong to more than one concept at the same time (soft attribution). This is because the embedding models used in this research generate a single vector for a word regardless of its context, and therefore it cannot be definitively judged that a word belongs to one concept and not to another. Thus, to check that a concept occurs in a particular document, every word in the document has to be matched with all concepts.

Fig. 5. Dependence of accuracy (in terms of micro-average F1 score) on the number of concepts

However, it may be possible to control the number of comparisons between concepts and document words through the use of contextual word embedding models, such as Embeddings from Language Models (ELMo) [39] and Bidirectional Encoder Representations from Transformers (BERT) [40], which create different embeddings for a single word depending on its context. The use of such models makes the concept clusters purer, since only words with similar meanings and contexts will be grouped. On the other hand, the document words will be represented with better accuracy, and it becomes possible to apply the condition that a word belongs to only one concept (hard attribution).

VI. CONCLUSION

Increasing the efficiency of various applications and information systems for engineering education is associated with the development and use of text analysis methods to solve such tasks as the classification and clustering of text, the creation of question-answering systems, content recommendation, etc., to enhance learning and meet the needs of students and teachers.

In connection with the importance of these techniques, this paper has introduced a novel method for the generation of textual features, namely Bag of Weighted Concepts (BoWC). BoWC vectorizes a document according to the concept information it contains, clustering word vectors to create concepts and then using the frequencies of these concept clusters, enriched with similarity information of the clusters' words, to represent the document vectors. The method is characterized by a new concept weighting function that combines term-level semantics with concept-level semantics, allowing the production of more valuable features and low-dimensional vectors, which means better accuracy and lower computational cost when applied with text mining algorithms such as classification and clustering.

In two experiments, the performance of BoWC was measured and benchmarked on the Reuters, 20Newsgroups, BBC, OHSUMED, and WebKB data sets using SVM and KNN classifiers. It was tested against several baselines, including bag-of-words, TF-IDF, Bag-of-Concepts, averaged word embeddings, and VLAC. On average, BoWC was shown to outperform most baselines in terms of the minimum number of features and maximum classification and clustering accuracy. While this work has focused on text classification and clustering, many other tasks, such as document retrieval, topic extraction, and educational content recommendation, can be solved with BoWC.

Future research is planned with the aim of reducing the computational cost by developing a mechanism that reduces the number of matches between document words and concepts. One planned action is to carry out a 'feature selection' phase by applying a term weighting process, in order to represent a document by only its most characteristic words. Another direction could focus on testing contextual word embedding models such as ELMo and BERT, which rely on word contexts when creating embeddings. This could allow the application of hard matching, so that each document word can belong to only one concept, and thus the formation of clusters with better representational ability.

ACKNOWLEDGMENT

The study was supported by Russian Science Foundation grant № 22-21-00316, https://rscf.ru/project/22-21-00316/, at the Southern Federal University.