0% found this document useful (0 votes)
19 views

Text Vectorization Using Data Mining Methods

This document discusses text vectorization methods for educational data mining. Traditional methods like TF-IDF and bag-of-words have limitations, while modern distributed methods are computationally intensive. The paper proposes a new concept-mining based method called Bag of Weighted Concepts (BoWC) that aims to generate low-dimensional, high-representational text vectors. BoWC clusters word vectors to create concepts, then represents documents based on concept frequencies and weighting. The method is tested on document clustering and classification tasks, outperforming baselines with fewer features and higher accuracy.

Uploaded by

1494834438
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
19 views

Text Vectorization Using Data Mining Methods

This document discusses text vectorization methods for educational data mining. Traditional methods like TF-IDF and bag-of-words have limitations, while modern distributed methods are computationally intensive. The paper proposes a new concept-mining based method called Bag of Weighted Concepts (BoWC) that aims to generate low-dimensional, high-representational text vectors. BoWC clusters word vectors to create concepts, then represents documents based on concept frequencies and weighting. The method is tested on document clustering and classification tasks, outperforming baselines with fewer features and higher accuracy.

Uploaded by

1494834438
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 11

See discussions, stats, and author profiles for this publication at: https://www.researchgate.

net/publication/353592446

TEXT VECTORIZATION USING DATA MINING METHODS

Article in IZVESTIYA SFedU ENGINEERING SCIENCES · July 2021


DOI: 10.18522/2311-3103-2021-2-154-167

CITATIONS READS
4 477

3 authors:

Ali Mansour Juman Mohammad


Southern Federal University Southern Federal University
9 PUBLICATIONS 12 CITATIONS 4 PUBLICATIONS 4 CITATIONS

SEE PROFILE SEE PROFILE

Kravchenko Yury
Southern Federal University
54 PUBLICATIONS 179 CITATIONS

SEE PROFILE

All content following this page was uploaded by Ali Mansour on 14 August 2023.

The user has requested enhancement of the downloaded file.


Text vectorization method based on concept mining
using clustering techniques
Ali Mansour Juman Mohammad Yury Kravchenko
Dept. of Computer Aided Design Dept. of Computer Aided Design Dept. of Computer Aided Design
Southern Federal University “sfedu” Southern Federal University “sfedu” Southern Federal University “sfedu”
Taganrog, Russia Taganrog, Russia Taganrog, Russia
mansour.mh.ali@gmail.ru juman.hs.mohammad@gmail.com krav-jura@yandex.ru

Abstract—With the incredible increase in the amount of text been effectively applied to analyze educational data in many
data, the need to develop efficient methods for processing and types of research to improve the learning process [4].
analyzing it increases. In this context, feature extraction from text
is an urgent task to solve many texts mining and information For example, text mining has been used for the
retrieval problems. Traditional text feature extraction methods personalization of recommendations within adaptive e-
such as TF-IDF and bag-of-words are effective and characterized learning systems [5], where the students' sentiments towards a
by intuitive interpretability, but suffer from the “curse of course have served as feedback for teachers. This information
dimensionality”, and they are unable to capture the meanings of has been used to support personalized learning, as in [6],
words. On the other hand, modern distributed methods effectively where they propose to analyze the students’ text feedback
capture the hidden semantics, but they are computationally automatically performing sentiment analysis tasks to predict
intensive, and uninterpretable. This paper proposes a new the level of teaching performance.
concept-mining-based text vectorization method called Bag of
weighted Concepts BoWC that aims to generate representations In addition, several systems have been developed that use
with low dimensions and high representational ability. BoWC text mining for automated tagging of essays and short free-text
vectorizes a document according to the concepts’ information it responses [7]. Similarly in [8], text analysis is used for real-
contains, where it creates concepts by clustering word vectors, then time formative assessment purposes of short answer
uses the frequencies of these concept clusters to represent questions. Moreover, there are many important services and
document vectors. To enrich the resulted document applications in automating education services, such as
representation, new weighting functions are proposed for concept plagiarism detection systems [9] and paraphrasing programs
weighting based on statistics extracted from word embedding [10], which are used effectively in the education field. These
information. This work is a development of previous research in techniques serve to reduce the time and cost and improve the
which the proposed method was developed and tested on a text reliability and generalizability of the assessment process.
classification task. In this work, empirical tests were extended to
include tuning the parameters of the proposed method and To apply various text mining methods, raw text must be
analyzing the effect of each on the efficiency of the method. The converted into a format that a machine can understand [11].
proposed method has been tested in two data mining tasks; The first step to making text documents machine-readable is
document clustering and classification, with five various to perform text vectorization. Vectorization is a feature
benchmark data sets, and it was compared with several baselines, extraction process for constructing digital vector
including Bag-of-words, TF-IDF, Averaged word embeddings, representations of text in natural language [12].
Bag-of-Concepts, and VLAC. The proposed method outperforms
most baselines in terms of the minimum number of features and Traditional text vectorization methods such as TF-IDF and
maximum classification and clustering accuracy. bag-of-words are effective and characterized by simplicity,
efficiency, and intuitive interpretability, but they suffer from
Keywords — Text vectorization, text mining, concept mining, the “curse of dimensionality”, and they do not take into
classification, clustering, educational data mining account the semantic relationship between words. On the
other hand, modern distributional semantic representation
I. INTRODUCTION methods effectively capture the hidden semantics, but they are
The widespread use of computers and the Internet in all computationally intensive, and uninterpretable [13].
areas of education, has resulted in the availability of large
One of the proposed solutions to overcome these problems
amounts of data. To take advantage of this huge source of
is to use concepts instead of words to represent the text. This
information, it is necessary to develop innovative techniques
approach relies on concept mining as a prerequisite for text
for mining and extracting hidden information from these data.
vectorization, and it is driven by the fact that concept can
The application of data mining techniques to improve
provide powerful insights into the meaning, provenance, and
educational engineering applications is not new and is referred
similarity of documents.
to as educational data mining (EDM). EDM is a research field
that aims to prepare techniques for exploring a specialized Several methods for exploiting concepts in generating
form of information collected from the educational sector and documents representations have been presented, and they can
to apply these techniques to improve understanding about be categorized according to the source of the information into
students and the environment in which they learn [1-3]. two approaches. The first relies on external sources such as
knowledge bases or lexical databases to discover concepts
Among the various forms of data, the text is the most
[14-17]. The methods based on this approach first extract the
common way for information transmission and interaction in
characteristic keywords of the document and then assign these
education. Accordingly, text mining is essential to take
words to the corresponding entries in the knowledge base,
advantage of this huge source of information, where it has
passing through a word disambiguation phase.

978-1-6654-0577-5/22/$31.00 ©2022 IEEE


The second approach is the distributed representation- representations in the implementation of two important text
based approach, which relies only on text as a source of mining tasks, namely classification and clustering.
information [18-20]. The methods of this approach extract This paper is organized as follows. First, we describe
concepts based on distributed representation and machine different text vectorization methods (section Ⅱ). Then in
learning algorithms, where the meaning of a word is inferred Section Ⅲ the problem statement is described. In section Ⅳ,
by analyzing its congruence with context features. While the our method for documents vectorization is presented and then
knowledge-based approach gives clearer concepts, it suffers detailed. First of all, in section Ⅳ.A we describe the
from the limitation of the knowledge base that does not algorithm for building a concepts dictionary. Then in section
contain all the vocabulary words, which may affect the
Ⅳ.B, we explain how to generate the document vector
completeness of the resulting representation.
(document representation): we describe the proposed
Motivated by the amazing advances made by word concepts weighting functions including concept frequency
embedding techniques [21] (distributed representation), in this CF, a novel concept weighting function CF-EDF, and finally
work a new concept-mining-based method for generating a heuristic function for extracting additional discriminatory
conceptual document representation is presented. The driving information for a concept. Experiments and evaluation results
challenge for the developed method is to generate a document for this method are presented in Section Ⅴ. Conclusion and
representation with fewer, interpretable, and more informative expectations are drawn in Section Ⅵ.
features capable of efficiently characterizing the content of the
document. II. RELATED WORKS
This work is an extension of the work presented in a A. Bag-of-words.
previous paper [22], in which a method named BoCW (bag of The Bag-of-words method is based on the assumption that
weighted concepts) has been developed. Following the the frequency of words in a document can appropriately reflect
distributed representation approach, BoCW represents text the similarities and differences between documents.
documents based on the hidden concepts' information it Therefore, document feature vectors generated by BOW
contains. This research includes the following main represent occurrences of each word in the document [23]. This
contributions: method has a serious drawback: frequent words can dominate
1. The weighting function based on the inverse document the feature space, while rare words can carry more valuable
frequency has been replaced by a monotonically decreasing information.
function, which allows better control of the influence of the To improve the representation of the "bag of words", the
document frequency at the corpus level; TF-IDF method has been proposed [24], which reflects the
2. To determine the importance of a concept in the significance of a word for a document in the corpus, that is, a
document, a heuristic function has been introduced. This weighing mechanism, where Term Frequency (TF) is the
function extracts additional discriminatory information for the number of times a term appears in the document, and the
concepts by measuring the similarity of the words that make inverse frequency of the IDF document measures rarity term
up the concept with the document words; in the entire corpus. By denoting the total number of
documents in a collection |D|, the concepts of term frequency
3. A software product1 has been developed to evaluate the and reciprocal document frequency are combined to obtain a
proposed method and compare it with other baselines. total weight for each term in each document as follows:
Thus, the vectors generated by the proposed method
combine term-level semantics with concept-level semantics 𝑡𝑓 − 𝑖𝑑𝑓(𝑡, 𝑑, 𝐷) = 𝑡𝑓(𝑡, 𝑑) ⋅ 𝑖𝑑𝑓(𝑡, 𝐷) =
and hence, contain more valuable information, and most
importantly, the sizes of these vectors are much smaller. = ⋅ 𝑙𝑜𝑔
| |
, (1)
∑ 𝐷 ∈𝐷𝑡∈𝐷
This characterization of the proposed method has many
applications in education and can be used to develop students' where 𝒏𝒕 is the number of occurrences of the word 𝒕 in
skills in many key areas of AI research, such as natural document 𝒅 and 𝒏𝒌 is the total number of words in this
language interface design, providing automatic answers and document. 𝑫𝒊 – number of documents from collection D in
content recommendations, and data mining-based self- which t occurs.
learning methods. These applications are considered in many
educational programs, such as "Applied Systems of Artificial However, these methods have two main problems. The
Intelligence" in the discipline "Extraction and Representation first is the problem of dimensionality, since the number of
of Knowledge in Intelligent Systems" as well as within the features in the resulting vectors increases significantly as the
discipline "Artificial Intelligence Methods in Building size of the corpus increases. Consequently, if the size of the
Automated Information Systems" as part of the educational vectors becomes too large and sparse, traditional distance
program "Automated Information systems", in areas measures such as the Euclidean distance or the cosine distance
(09.04.01 Informatics and Computer Science and 09.04.03 become meaningless. The second problem is that the resulting
Applied Informatics). representation does not take into account the semantic
relationship between words.
In this paper, we aim to consolidate and enhance the
previously obtained results by expanding the experimental B. Word embeddings
tests section to include tunning the parameters of the proposed
method to analyze its impact on the quality of the resulting
1
The implementation code can be found on https://github.com/Ali-
MH-Mansour/BoWC-Method
Word embedding models are capable of extracting BOW and TF-IDF in the classification problem for finding the
semantically – enriched representations of words. Among two most similar documents among triples of documents [28].
them Word2Vec is a popular model for mapping words in a
document to a vector representation. It combines multiple D. Vector of Locally Aggregated Concepts.
two-layer neural networks to construct embeddings, namely Following the same approach as BOC, Vectors of Locally
the Continuous Bag-of-words (CBOW) and Skip-gram Aggregated Concepts (VLAC), groups words to form N
architectures [21]. In the CBOW architecture, the model concepts. However, instead of calculating the frequency of
predicts a target word 𝒘𝒊 based on the set of its context words. clustered word embeddings, VLAC adopts the same approach
In contrast, the Skip-gram architecture tries to predict a set of as the VLAD (Locally Aggregated Vector Descriptors)
context words given a target word 𝒘𝒊 . This task is posed as method which is a features generation method used for image
maximizing the average log probability of finding a word in classification [19, 29]. In VLAD, images are represented by
the context: the occurrence count of its clustered features (i.e.,
descriptors), where it leverages image feature clusters instead
∑ ∑ | | 𝑙𝑜𝑔𝑝 𝑤 𝑤   of words clusters, in addition to include first-order statistics in
the feature vectors. Similar to the original VLAD approach,
VLAC takes each cluster’s sum of residuals with respect to its
The hidden layer represents the word vectors as the
centroid and concatenates those to create a feature vector.
relationships between words and context are learned. The
main drawback of Word2Vec is that word embeddings are Given a word embedding 𝒘𝒊 of size D assigned to cluster
generated locally within documents while ignoring the global center ck. Then, for each word in a document, VLAC
representation of words across all documents. Models such as computes the element-wise sum of residuals of each word
GloVe (global vectors for word representation), solve this embedding to its assigned cluster center. This results in k
problem as they create a large co-occurrence count (word × feature vectors, one for each concept, and all of size D. All
context) in order to learn the global representation of a word feature vectors are then concatenated, power normalized, and
[25]. FastText [26] is an extension of Word2Vec proposed by finally, l2 normalization is applied as with the original VLAD
Facebook in 2016. Instead of entering individual words into approach. The resulting feature vectors contain more valuable
the neural network, FastText splits the words into several n- information than the BOC features vectors due to the
grams (subwords). Thus, it produces word embeddings for all additional inclusion of these first-order statistics. However,
the n-grams given the training data set. This allows rare words this method generates vectors of relatively large sizes with
to be properly represented since it is highly likely that some of high computational costs. If 30 concepts were to be created
their n-grams also appears in other words. The most from 300 words attachments, the resulting document vector
commonly technique to represent whole documents using would contain 30 × 300 values.
word embedding, is to simply calculate the average word
embedding vector of a document [27]. III. THE PROBLEM STATEMENT
C. Bag-of-Concepts (BOC) Let Ɗ be the text dictionary, i.e., a list of unique words that
appear in a collection of text documents. Let 𝒙𝒊 𝝐𝑹𝑫 be the
The Bag-of-Concepts (BOC) method described in [28] word embedding vector of the i-th word of the dictionary Ɗ,
generates document word vectors using a set of word2vec where D is the dimensionality of the word embedding. The set
models that embed semantically similar words into a of all word embedding vectors is denoted by 𝜺 = {𝒙𝒊 , 𝒊 =
neighboring area. This allows neighboring words to be 𝟏, … , |𝑫|}.
grouped into one common cluster of concepts. Bag-of-
Concepts generates word clusters by applying spherical k- Also, let N be the number of text documents to be encoded
means to word embeddings. The resulting clusters contain using BoWC method. Each document is described by the
words with a similar meaning and are therefore called embedding vectors of its words 𝑵𝒊 : 𝒙𝒊𝒋 ∈ 𝜺, (𝒊 = 𝟏, … , 𝑵, 𝒋 =
concepts. Similar to the bag of words, each document vector 𝟏, … , 𝑵𝒊 ) , where Ni is the number of words of the i-th
is represented by the frequency of each concept cluster in the document. Each word in the document has an embedding
document. To mitigate the influence of concepts that appear vector, that is, the feature vector 𝒙𝒊𝒋 - is the embedding vector
in most documents, a weighting scheme similar to TF_IDF is for the j-th word of the i-th text document. The number of the
used, replacing the term TF frequency with the CF concept extracted feature vectors may vary from one document to
frequency. Hence, it is called CF-IDF (Concept Frequency other. Thus, the task is to find a mapping 𝒚 = 𝒇(𝒙): 𝑹𝑫 → 𝑹𝑪
with Inverse Document Frequency) and is calculated based on such that the transformed feature vector 𝒚𝒊 𝝐𝑹𝑪 , preserves
the following equation: (most of) the information or structure in 𝑹𝑫 . The optimal
mapping 𝒚 = 𝒇(𝒙) will be one that does not increase the
| |
𝐶𝐹 − 𝐼𝐷𝐹 𝑐 , 𝑑 , 𝐷 = ⋅ 𝑙𝑜𝑔 ,  minimum probability of error. Accordingly, it is required to
∑ 𝑑∈𝐷𝑐 ∈𝑑 find a mapping from word space to concept space that allows
each document to be represented by a vector of fixed length.
where |𝑫| - the number of documents in the collection, and 𝑵𝒊 : 𝒄𝒊𝒋 = (𝒊 = 𝟏, … , 𝑲, 𝒋 = 𝟏, … , 𝑵𝒊 ), where 𝑲 is the number
the denominator is the number of documents from the of extracted concepts, and 𝒄𝒊𝒋 is a feature of the j-th concept
collection D, in which the concept c occurs; 𝒏𝒄 is the number
of the i-th document.
of occurrences of concept c in document d, and 𝒏𝒌 is the total
number of concepts in this document. This method solves the IV. THE BAG OF WEIGHTED CONCEPTS METHOD (BOWC)
problem of large dimensions in BOW method as it non-
linearly downsizes when converting word space to concept To accomplish this task, a fundamentally new method has
space based on semantic similarity. In addition, it was shown been developed for representing text documents in the form of
that BOC provides a better document representation than numerical vectors of fixed length. The proposed method
converts the document into a vector according to the concept
information it contains. To do this, a dictionary of concepts T B. Document represnetation
is generated, and then document vectors are generated based For a given document d, a feature vector of size k is
on the statistics of the frequencies of these concepts in the created, 𝑽𝒅 = (с𝒅𝟏 , с𝒅𝟐 , … , с𝒅𝑲 ) , where k is the number of
document. Fig. 1 illustrates the basic operations involved in
concepts and с𝒅𝒊 expresses the degree of significance (weight)
the proposed method.
of the i-th concept in a document. Similar to the BOC method,
A. Mining concepts and building the concept dictionary с𝒅𝒊 is calculated by comparing the words of the document with
First, the set of documents is processed and the unique the concepts, i.e., the similarity between the document word
terms that make up the vocabulary are extracted, then, the vector and the centroid vector of the cluster (representing the
embedding vectors are generated for them, in preparation for concept) is measured, and then the occurrence of the concept
building the concepts dictionary. All word embedding vectors that exceeds a certain threshold Ɵ is recorded. Threshold value
are clustered into Nk clusters. Then, a spherical k-means Ɵ is determined experimentally as it is discussed later in the
algorithm is applied, in which cosine similarity is used as a section V. Concept frequency is given by the following
distance metric. formula:
It is necessary to ensure a minimum number of words in
each cluster that represents the concept; For this, the resulting 𝐶𝐹 𝑐 , 𝑑 , 𝐷 = ∑  
cluster size is checked and expanded with synonyms to reach
the acceptable size. Next, the words in each cluster are where 𝒏𝒌 is the total number of concepts in the document, and
arranged according to proximity to the center of the cluster, 𝒏с is the number of occurrences of the concept c in the
and then the M-words closest to the centroid of the cluster are document, which is calculated by the following binary
selected. As a result, a group of concepts is obtained, each of function 𝒈(𝒔):
which is represented by M words belonging to one general
concept or having a common hyperonym. The concept 1, 𝑠 > Ɵ
dictionary is represented as follows, 𝑪𝑫 = 𝑔(𝑠) = . (5)
0, 𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑧𝑒
(𝒘𝟏𝟏 , 𝒘𝟏𝟐 , … , 𝒘𝟐𝐌 , 𝒘𝟐𝟏 , 𝒘𝟐𝟐 , … , 𝒘𝟐𝐌 , … , 𝒘𝑲𝟏 , 𝒘𝑲
𝟐 , … , 𝒘𝑲
𝐌 ) , where
𝒋
𝒘𝐢 is the i-th word of the j-th cluster. The similarity 𝒔 between two words vectors X and Y is
calculated as:
Algorithm 1. Building the dictionary of concepts

Input Set 𝑫 = {𝒅𝟏 , 𝒅𝟐 , … , 𝒅𝑵 } of N documents 𝑆 = 𝑠𝑖𝑚𝑖𝑙𝑎𝑟𝑖𝑡𝑦(𝑋, 𝑌) = 𝑐𝑜𝑠(𝜃) = ‖ ‖⋅‖ ‖
=
𝑴𝒅𝒆𝒔𝒊𝒓𝒆𝒅 // the minimum number of words per
cluster ∑
Output Concept dictionary 𝑪𝑫 = [ ] = , (6)
∑ ∑
1: Represent each document as a sequence of words
2: Create a vocabulary of words Ɗ
3: Initialize word embedding Ɛ with Ɗ where Xi and Yi are the components of the vector of words X
Initialize Dictionary CD, by running spherical k- and Y, respectively.
4:
means on Ɛ CF-EDF weighting function. To reduce the influence of
5: 𝑪 = 𝑐𝑜𝑢𝑛𝑡(𝑪𝑫) // number of concepts concepts that appear in most documents, the BOC method
6: 𝑾𝒉𝒊𝒍𝒆 𝑖 < 𝐶 𝒅𝒐 uses a weighting scheme similar to TF-IDF, in which the
7: 𝑴𝒓𝒆𝒂𝒍 = 𝑐𝑜𝑢𝑛𝑡(𝑐𝑙𝑢𝑠𝑡𝑒𝑟𝑠[𝒊]) concept frequency CF is weighted by the inverse document
// the number of words needed to expand frequency IDF (3). However, using this formula at the concept
the cluster level is ineffective because the concept frequency at the
8: 𝑳 = 𝑴𝒅𝒆𝒔𝒊𝒓𝒆𝒅 − 𝑴𝒓𝒆𝒂𝒍 collection level is much higher than the frequency of one of
9: 𝒊𝒇(𝐿 > 0) 𝒕𝒉𝒆𝒏 its terms. Mathematically, this leads to very small values of
10: 𝒇𝒐𝒓 𝑗 = 0 𝑡𝑜 𝐿 𝒅𝒐 the logarithm – close to zero, which indicates a false
// get synonym for the word j conclusion.
11: 𝒘 ≔ most_similar (𝑪𝑫[𝒊, 𝒋]) To overcome this problem, the authors propose a new
12: 𝒘∗ = 𝑒𝑚𝑏𝑒𝑑𝑑𝑖𝑛𝑔(𝒘) weighting function based on a previously established
// Add this word to the current monotonic decrease function [30], which allows improving
cluster the quality of adjusting the influence of the document
13: 𝑪𝑫[𝒊] ← 𝒘∗ frequency at the document collection level. The function is
𝒆𝒏𝒅 given by the following formula:
𝒆𝒍𝒔𝒆
14: 𝑠𝑜𝑟𝑡(𝑐𝑙𝑢𝑠𝑡𝑒𝑟𝑠[𝒊]) 𝑓(𝐹) = 𝑒 ⋅
 
15: 𝑪𝑫[ ] ← 𝑐𝑙𝑢𝑠𝑡𝑒𝑟𝑠[𝑴𝒅𝒆𝒔𝒊𝒓𝒆𝒅 ]
𝒆𝒏𝒅 where α is a constant, and F is the frequency of the document,
𝒆𝒏𝒅 calculated by the formula:
16: Save clusters as a dictionary CD
𝑑∈𝐷𝑐 ∈𝑑
𝐹=( | |
) 

The exponential function was chosen to ensure that the 𝒇


value is in the range [0,1].
Fig. 1. The main functions involved in the implementation of BoWC

embedding models was adopted which is the pre-trained


The resulted weighting function, which is called CF-EDF GloVe embedding [25]. The results showed that BoWC
(concept frequency – exponential document frequency), takes outperformed most of the baselines according to V-measure,
the following form: however, the relationship between the number of concepts
and classification accuracy was not clear and the correlation
𝑑∈𝐷𝑐 ∈𝑑 curve was fluctuating, due to the difference of concepts
𝐶𝐹 − 𝐸𝐷𝐹 𝑐 , 𝑑 , 𝐷 = ⋅ 𝑒𝑥𝑝 (− ).  generated with each run of the concepts dictionary building
∑ | |
step. Also, the effect of the similarity threshold on the quality
Enhancing concepts using similarity values. In order to of the resulting vectors was not analyzed but was manually
determine the importance of the concept in the document, the set for all experiments
authors propose a heuristic function for extracting additional In this work we aim to expand the experimental tests
discriminatory information for the concept. This function section to include tunning the parameters of the proposed
measures the similarity of the words that make up the concept method and analyzing its impact on the quality of the
with the words in the document to which it belongs. For resulting representations. In addition, the performance of the
document d with N words belonging to the c-th concept, the method is analyzed with the use of both types of embeddings:
similarity is calculated as follows: self- and pretrained embeddings.
For this purpose, our proposed method BoWC was
𝑆 = ∑ 𝑚𝑎𝑥 𝑠𝑖𝑚(𝑢 , 𝑣 )  evaluated using five text data sets, performing two text mining
tasks; clustering and classification. BoWC is dependent on the
quality of word embeddings, the number of concepts created
where 𝒖𝒊 is the i-th word of the document belonging to the and the threshold Ɵ used to detect the document concepts.
concept; v is the j-th word of the concept. The max function Therefore, several BoWC implementations were tested
returns the highest similarity score resulting from comparing against each other with different numbers of concepts and
the similarity of the document word with the words of the different values of Ɵ in both the classification and clustering
concept to which it belongs. The final formula for concept tasks. This experiment served as a way to explore how the
weighing is as follows: BoWC is affected by the number of concepts generated and
how Ɵ affects the representational ability of the resulting
𝐵𝑜𝑊𝐶 = 𝑓 𝑐 ,𝑑 ,𝐷 = vectors. The second series of experiments were concerned
with analyzing the effect of word embedding used on the
= 𝐶𝐹 − 𝐸𝐷𝐹(𝑐 , 𝑑 , 𝐷) ⋅ 𝑒𝑥 𝑝 𝑆   quality of the BoWC-generated vectors. Below, the used data
sets and the evaluation metrics are briefly described then
V. EXPERIMENTAL RESEARCH results are discussed.

In the experimental section of the previous work [22], the A. Data sets and preprocessing
efficiency of the BoWC-generated representations was To evaluate the proposed method, we have used five data
evaluated by using them in a document classification task sets [32, 33] presented in Table 1.
with five benchmark data sets, and their performance was
evaluated using V-measure [31]. One type of words
TABLE I. AN OVERVIEW OF DATA SETS STATISTICS Google News corpus (3 billion words) and Glove model which
# # Mean number # was trained on the Common Crawl data set and contains
Dataset Documents Classes of in– Documents vectors for 1.9 million English words.
vocabulary with < 10
words per embedded Since we are interested in the semantic relationship
document words between terms in search of concepts, the window size used to
BBC 2225 5 207 0% train self-embedding models was set to 10, since a window
RE 8491 8 66 1.97% with a size smaller than 5 gives results that are more syntactic
OH in nature, and a window with a size larger than 5 gives more
5380 7 79 24.38%
semantic results.
20NG 18821 20 135 2.19%
WebKB 4199 4 98 4.23% B. Evaluation metrics
F1-measure. The F-measure is one of the common
measures to rate how successful a classifier is [37]. It is the
BBC data set [34], consists of 2225 documents from the harmonic average between precision and recall, which are
BBC news website covering stories in five topical areas from given by followed formulas:
2004-2005.
Reuters (RE) data set contains articles from the Reuters 𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 = , (12)
news feed. This paper uses the R8 partition of the Reuters data
set, which contains 8491 documents.
20Newsgroups (20NG) is a collection of 18,821  𝑅𝑒𝑐𝑎𝑙𝑙 =  
documents distributed into 20 different newsgroup categories
with an approximate uniform distribution. ( )⋅ ⋅
 𝐹 − 𝑠𝑐𝑜𝑟𝑒 =  

OHSUMED (OH) data set is a subset of clinical paper
abstracts from the Medline database. It consists of the 23
Medical Subject Headings (MeSH) diseases categories. In this Here, C is the class label, True Positive TP is the count of
work, a partition of 5380 documents was used. documents correctly labeled by the classifier to be in class C,
and False Positive FP is the number of documents incorrectly
WebKB data set contains web pages from various sections labeled by the classifier to be in class C. Meanwhile, False
of computer science collected by the World Wide Knowledge Negative FN is the number of documents that belong to class
Base (Web->Kb) project of the CMU text learning group [35]. C and which are incorrectly identified by the classifier not to
These web pages have been divided into 7 different classes: be in class C, and True Negative TN is the number of
students, faculty, staff, etc. In this work, a preprocessed documents does not belong to class C correctly labelled by the
WebKB data set was used, which contains 4 different classes classifier not to be in class C.
and a total of 4199 documents.
The β coefficient can be adjusted to provide either
The same preprocessing steps were applied to all data sets Precision or Recall of the clustering algorithm. If β is greater
by converting the text to lowercase and breaking it into the than 1, Precision is evaluated more strongly in the calculation,
longest non-empty alphanumeric character sequences that if β is less than 1, Recall is evaluated more strongly. β is 1.0
contain at least three-letter characters. Numbers, white spaces, by default.
punctuations, accent marks, stop words, rare words (that occur
in corpus less than 20 times), and words without embeddings Due to the imbalance and different sizes of the used data
(out-of-vocabulary) have been removed. sets, F1-score measure with micro-average parameter was
used in this research [38], and it is defined by the following
Both types of word embedding models were used: the pre- formula:
trained models and the models trained on the same data set,
referred to in this work as self-embedding. 𝑀𝑖𝑐𝑟𝑜 𝑎𝑣𝑒𝑟𝑎𝑔𝑒𝑑 𝐹 − 𝑠𝑐𝑜𝑟𝑒 =
The lemmatization was performed with self-trained
embeddings only, in order to reduce the number of out-of- ( )⋅∑
   
vocabulary words when using the pre-trained models. On the ( )⋅∑ ∑ ∑
other hand, stemming was not performed because it is only a
heuristic process that cuts off the ends of words, and often V-measure. For the clustering task, knowing the ground-
includes the removal of derivational affixes. However, a truth labels, an intuitive V-measure is determined using
preprocessed (stemmed and lemmatized) version of WebKB conditional entropy analysis. V-measure is an entropy-based
data set was used because the original data set could not be measure, that explicitly measures how successfully the criteria
retrieved. for homogeneity H, when each cluster contains only members
We have used three of the most efficient embedding of one class, and completeness C, when all members of a given
models: glove, word2vec, and FastText. As a pre-trained class belong to the same cluster, were satisfied. The V-
models, we have used, FastText model [36] with 1-million- measure is computed as the harmonic mean for various
word vectors trained with sub-word information on Wikipedia estimates of homogeneity and completeness, just as Precision
2017, UMBC webbase corpus, and statmt.org news data set and Recall are usually combined into an F-measure [31].
(16B tokens); word2vec model 2 which was trained with
2
We used the GloVe and word2vec models from genism
https://radimrehurek.com/gensim/models/word2vec.html
𝑉 − 𝑚𝑒𝑎𝑠𝑢𝑟𝑒 =
( )⋅ ⋅
  D. Results and discussions
( ⋅ )
The results of comparing the proposed BoWC with the
The beta coefficient is set exactly as in the F1-score considered baselines in the documents classification and
formula. In this research the default value of one was adopted. clustering tasks are presented in Table 2 and Table 3.

C. Documents vectorization As can be seen for the Reuters data set classification task,
our method outperformed all baselines with a significant
For the implementation of the BoWC method, first the pre- difference in performance (0.05-1% minimum). In the BBC
processing of documents was carried out and then the concept data set classification task, our method outperformed BOC,
dictionary was built according to Algorithm 1. For this Bag of Words, and self-averaged word2vec with only 100
purpose, all the unique terms in the corpus were extracted and concepts, while it performed similarly to the TF-IDF for 200
the words representation generated using an embedding model concepts. However, BoWC could not outperform VLAC
with word vectors of size 300. We get the so-called dictionary (9000) on this data set.
of terms. Based on this dictionary, a concept dictionary was
built by applying the spherical K-mean algorithm with a Moreover, it is noted that all methods achieve high
similarity threshold of 0.6 for word clustering. classification accuracy in this data set, which, in the authors'
opinion, is due to the fact that the average number of in-
To study and analyze the efficiency of the proposed vocabulary words is relatively large (see Table 1), which
method, several BoWC implementations were tested against means that in each document there are enough words to extract
each other in two text mining tasks; classification and clear semantics about each concept, thus ensuring the
clustering. generation of a more quality concept vector.
For the classification task, the number of concepts was Knowing that in this work a preprocessed version of the
systematically increased from 10 to 300, and the threshold WebKB data set was accessed (where the words were
value Ɵ from 0.2 to 0.8. Linear Support Vector Machines stemmed in a way that is not compatible with our method), our
SVM was used in these experiments as a classifier on top of method achieved the best performance (87.77%) after TF-IDF
the feature generation methods. For our method, the “POLY” (89.5%). This is an additional point that distinguishes our
kernel was best suited for resulted vectors. For the other method from the others, which is that, regardless of the
methods, the results were validated with all SVM kernel types complexity of the used data set, BoWC achieves very good
(POLY, Linear, and RBF) and the highest value among them results, competing with strong baselines.
was selected. Moreover, 10-fold cross-validation was applied
in each prediction instance to decrease the chance of TABLE II. CLASSIFICATION RESULTS, MEASURED BY MICRO-
overfitting the data and creating biased results. AVERAGE F1.

Correspondingly for the clustering task, the previous Methods BBC RE OH 20NG WebKB
experiment was repeated with 100 then 200 concepts with an BoWCglove (100) 97.01 92.86 62.997 69.00 76.25
increasing similarity threshold from 0.2 to 0.8 for each data
BoWCglove (200) 97.61 91.23 65.35 74.42 81.07
set. k-means algorithm was used as the clustering algorithm.
For the VLAC 3 method, the attached API was used to re- BoWCw2v (100) 96.86 92.4 65.99 68.58 78.55
implement experiments on data sets due to differences in the BoWCw2v (200) 96.65 93.15 63.00 74.31 83.13
data source and evaluation criteria. Thirty concepts (9000
BoWCFastText (100) 95.51 91.2 66.23 66.19 77.74
features) were accepted, taking into account that a larger
number of concepts requires large computational costs [9]. BoWCFastText (200) 96.26 92.02 68.94 68.48 81.92
VLAC was implemented with pre-trained GloVe embedding Self BoWCw2v (100) 96.84 93.49 64.11 66.02 85.25
as it performed best among all other embedding types [9]. Self BoWCw2v (200) 97.29 94.26 68.28 73.18 86.08
Considering that BOC is a special case of the proposed Self BoWCglove (100) 95.8 92.29 64.78 84.95 43.44
BoWC method, it was implemented through the code of the Self BoWCglove (200) 96.26 92.72 68.62 85.43 52.33
BoWC method. As shown in [8] that Bag-of-Concepts, in Self BoWCFastText
comparison with TF-IDF, require at least 100 concepts to (200)
96.99 93.65 68.36 70.07 86.24
achieve competitive results. So, to maximize Bag-of-Concepts Self BoWCFastText
97.89 94.75 69.95 77.07 87.77
accuracy, the number of concepts was set to 200. The (200)
similarity threshold value used with BOC was not mentioned TF-IDF 97.89 94.7 74.31 91.49 89.5
by the author, so it was empirically determined and set to 0.4 BoW 96.98 94.15 69.89 83.85 85.56
for all experiments. For the Averaged word embeddings Averaged GloVe
method, pre-trained GloVe and self word2vec word 97.75 93.32 67.39 78.7 80.22
(300)
embeddings were used. self-averaged w2v
91.599 87.81 56.78 62.37 83.27
(300)
All methods were realized and implemented using Python
BOCCF_IDF (200) 95.48 90.9 48.8 69.44 78.26
language with Google Collab environment. Asus notebook
with Intel Core i7 processor 1.99 GHz, was used. VLAC (9000) 98.2 94.06 69.37 86.4 85.54
a.
Underlined values are the highest results for the proposed method, whereas bold values are the best
results compared to all other methods for a single data set.

3
We used the implementation provided by authors on
https://github.com/MaartenGr/VLAC
In the OHSUMED series of experiments, it is interesting With so many differences between data sets it is hard to
that BOC performs the worst among all other baselines, pin point the exact reason for the differences in performance.
achieving a micro-average F1 value of 48.8% versus 69.95% One obvious reason could be the number of out-of-vocabulary
for our method. This is a very important point, especially since words per document, the proportion of which varies among
our method follows the same approach used in BOC, and this the data sets used (see Table 1).
confirms the efficiency of our weighting function and its
As mentioned above, the BoWC method is dependent on
ability to capture semantics at the terms and concepts level at
the same time. it is also worth noting that, our method the word embeddings type, the number of concepts and the
similarity threshold used to detect the document concepts.
outperformed Bag-of-Concepts on all data sets by a significant
percentage (between 2-20% depending on the dataset) with Following we discuss in detail the effect of each of these
parameters on the performance of the method. We illustrate
50% savings in the number of features.
the dependence relationship using only one data set (Reuters),
For 20Newsgroups, which is a relatively large data set since it shows a general character on the whole.
compared to the others, all concept-based vectorization
methods performed poorly. Since 20Newsgroups documents The similarity threshold and the type of embeddings. The
role of the similarity threshold is to verify the occurrence of a
are categorized into categories and subcategories, i.e., there is
concept within the document so that when the similarity ratio
a similarity between categories, which requires a large number
between a document word and the center of the cluster is
of concepts to clearly separate the categories. For this reason,
greater than the threshold, M clusters words closest to the
the TF-IDF method has achieved the best performance (91%),
since it is based on terms, not concepts. However, Although center will be matched with that word. Accordingly, the higher
the threshold, the fewer matching operations (Fig. 2). In other
the VLAC method works at the conceptual level, its vectors
(9,000 features) contain detailed information about the words, the higher the threshold value, the less likely a concept
is to appear in a document, which results in irrational weights
concept words, and this explains why it has achieved an
acceptable result (85.6%). of concepts and thus affects the accuracy of
classification/clustering.
Clustering is a more complex task that requires vectors to
be highly expressive so that they can clearly separate classes. In general, it is observed that for threshold values higher
than 0.5 the accuracy decreases significantly over all data sets.
However, depending on the word embeddings that were used,
our BoWC method for only 100 concepts was able to However, by looking at each type of embeddings separately,
we see that for the word2vec and FastText embeddings, the
outperform all other baselines on most data sets, except for
threshold values (from 0.2 to 0.7) all give close results
20Newsgroups where 200 concepts were required (Table 3).
The VLAC clustering experiment with the 20Newsgroups according to the micro average F1-measure.
data set was ruled out due to its high computational costs, as Moreover, the effect of the threshold value also depends
the clustering algorithm failed due to a full load of 12.72 GB on the type of embedding, and this is normal, given that the
of RAM. representational ability of the embedding vectors varies from
one model to another according to the nature and size of the
TABLE III. CLUSTERING RESULTS, MEASURED BY V-MEASURE data on which the embedding model was trained.
Methods BBC RE OH 20NG WebKB Fig. 4 and Fig. 5 confirm this result, as it is noted that the
BoWCglove (100) 0.728 0.449 0.104 0.373 0.109 overall performance of the BoWC is significantly affected by
the type of word embedding used, and that the self-trained
BoWCglove (200) 0.733 0.45 0.118 0.385 0.149
word embeddings used for BoWC performed significantly
BoWCw2v (100) 0.666 0.445 0.16 0.345 0.128 better than pre-trained word embeddings.
BoWCw2v (200) 0.616 0.452 0.127 0.338 0.123
BoWCFastText (100) 0.515 0.379 0.143 0.275 0.166
BoWCFastText (200) 0.625 0.375 0.155 0.301 0.206
Self BoWCw2v
0.54 0.505 0.123 0.211 0.337
(100)
Self BoWCw2v (200) 0.536 0.493 0.125 0.232 0.333
Self BoWCglove
0.794 0.574 0.114 0.311 0.329
(100)
Self BoWCglove
0.810 0.574 0.117 0.396 0.340
(200)
Self BoWCFastText
0.768 0.591 0.129 0.316 0.318
(200)
Self BoWCFastText
0.771 0.545 0.140 0.391 0.328
(200)
Fig. 2. Dependence of Reuters classification accuracy (in terms of micro-
TF-IDF 0.663 0.513 0.122 0.362 0.313
average F1 score) on the similarity threshold values
BoW 0.209 0.248 0.027 0.021 0.021
Averaged GloVe
0.774 0.481 0.109 0.381 0.219
(300)
self-averaged w2v
0.631 0.557 0.115 0.325 0.324
(300)
BOCCF_IDF (200) 0.743 0.527 0.1 0.394 0.088
VLAC (9000) 0.808 0.456 0.118 ꟷ 0.286
the use of contextual word embeddings models, such as,
Embeddings from Language Models (ELMo) [39] and
Bidirectional Encoder Representations from Transformers
(BERT) [40] that create, for a single word, different word
embeddings depending on its contexts. The use of such
models makes the concept clusters purer since only words
with similar meanings and context will be grouped. On the
other hand, the document words will be represented with
better accuracy and it becomes possible to apply the condition
Fig. 3. Dependence of the five data sets classification accuracy (in terms of
that the word belongs to only one concept (hard attribution).
micro-average F1 score) on the type of embedding model
VI. CONCLUSION
Increasing the efficiency of various applications and
information systems for engineering education is associated
with the development and use of text analysis methods to
solve such tasks as the classification and clustering of text,
creation of question-answering systems or recommending
content, etc., to enhance learning and meet the needs of
students and teachers.
Fig. 4. Dependence of the five data sets clustering accuracy (in terms of V1 In connection with the importance of these techniques, this
score) on the type of embedding model paper has introduced a novel method for the generation of
textual features, namely Bag of Weighted Concepts (BOWC).
The number of concepts. To analysis the dependence BoWC vectorizes a document according to the concepts’
relationship between the number of concepts (for different information it contains, by clustering word vectors to create
threshold values) and classification accuracy (in terms of concepts, then uses the frequencies of these concept clusters,
Micro-average score F1), a series of Reuters classification enriching them with similarity information of clusters’ words
experiments were carried out using BoWC with self FastText. to represent the document vectors. The method is
The number of concepts has been gradually increased from 10 characterized by a new concept weighting function that
to 300 with the similarity threshold values increased from 0.2 combines term level semantics with concept level semantics,
to 0.6. It is noticeable that BoWC begins to provide allowing the production of more valuable features and low-
comparable performance for only 50 concepts. Precisely, it dimensional vectors which mean better accuracy and lower
outperformed BOC by 1% for just 20 concepts, and it also computational cost when applied with text mining algorithms
outperformed averaged GloVe for 90 concepts. We can also such as classification and clustering.
notice that for a number of concepts greater than 200, the
accuracy begins to decrease slightly, due to the fact that the In two experiments the performance of BoWC was
concepts become overlapping and therefore non- measured and benchmarked based on Reuters,
discriminatory. 20Newsgroups, BBC, OHSUMED, and WebKB data sets
Another important point related to word embedding, is that using SVM and KNN classifiers. It was tested against several
in the proposed method implementation algorithm, a word baselines including Bag-of-words, TF-IDF, Bag-of-Concepts,
from a document can belong to more than one concept at the averaged word embeddings, and VLAC. On average, BoWC
same time (soft attribution). This is because the embedding was shown to outperform most baselines in terms of the
models used in this research generate a single vector of the minimum number of features and maximum classification and
word regardless of its context, and therefore, it cannot be clustering accuracy. While this work has focused on text
definitively judged that a word belongs to a concept and not classification and clustering, many other tasks, such as
to another. Thus to check that a concept occurs in a particular document retrieval, topic extraction, and educational content
document, every word in the document has to be matched with recommendation can be solved by BoWC.
all concepts. Future research is planned with the aim of reducing
computational cost, by developing a mechanism to reduce the
number of matches between document words and concepts.
One planned action is to carry out the 'feature selection' phase
by applying a term weighting process in order to represent a
document by only its most characteristic words. Another
direction could focus on testing contextual word embedding
models such as ELMo and BERT, which rely on word
contexts when creating embeddings. This could allow the
application of hard matching so that each document word can
belong to only one concept and thus the formation of clusters
with better representation ability.
ACKNOWLEDGMENT
Fig. 5. Dependence of accuracy (in terms of micro-average F1 score) on the The study was performed by the grant from the Russian
number of concepts. Science Foundation № 22-21-00316,
https://rscf.ru/project/22-21-00316/, in the Southern Federal
However, it may be possible to control the number of
University.
comparisons between concepts and document words through
View publication stats

REFERENCES [24] A. Aizawa, "An information-theoretic perspective of tf–idf


measures," Information Processing Management, vol. 39, no. 1, pp.
[1] C. Romero, S. J. W. I. R. D. M. Ventura, and K. Discovery, 45-65, 2003.
"Educational data mining and learning analytics: An updated survey," [25] J. Pennington, R. Socher, and C. D. Manning, "Glove: Global vectors
vol. 10, no. 3, p. e1355, 2020. for word representation," in Proceedings of the 2014 conference on
[2] R. S. Baker and P. S. Inventado, "Educational data mining and empirical methods in natural language processing (EMNLP), 2014,
learning analytics," in Learning analytics: Springer, 2014, pp. 61-75. pp. 1532-1543.
[3] L. Araujo, F. López-Ostenero, J. Martínez-Romo, and L. Plaza, [26] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, "Enriching word
"Deep-Learning Approach to Educational Text Mining and vectors with subword information," Transactions of the Association
Application to the Analysis of Topics’ Difficulty," IEEE Access, vol. for Computational Linguistics, vol. 5, pp. 135-146, 2017.
8, pp. 218002-218014, 2020. [27] M. Kusner, Y. Sun, N. Kolkin, and K. Weinberger, "From word
[4] N. Zanini and V. Dhawan, "Text Mining: An introduction to theory embeddings to document distances," in International conference on
and some applications," Research Matters, vol. 19, pp. 38-45, 2015. machine learning, 2015, pp. 957-966.
[5] A. Ortigosa, J. M. Martín, and R. M. Carro, "Sentiment analysis in [28] H. K. Kim, H.-j. Kim, and Cho, "Bag-of-concepts: Comprehending
Facebook and its application to e-learning," Computers in human document representation through clustering words in distributed
behavior, vol. 31, pp. 527-541, 2014. representation," Neurocomputing, vol. 266, pp. 336-352, 2017.
[6] K. Z. Aung, N. N. J. I. A. t. I. C. o. C. Myo, and I. Science, "Sentiment [29] J. Yang, Y.-G. Jiang, A. G. Hauptmann, and C.-W. Ngo, "Evaluating
analysis of students' comment using lexicon based approach," pp. 149- bag-of-visual-words representations in scene classification," in
154, 2017. Proceedings of the international workshop on Workshop on
[7] J. Z. Sukkarieh, S. G. Pulman, and N. Raikes, "Auto-marking: using multimedia information retrieval, 2007, pp. 197-206.
computational linguistics to score short, free text responses. Paper [30] Y. Li, D. McLean, Z. Bandar, J. O'Shea, K. Crockett, and D.
presented at the 29th annual conference," in of the International Engineering, "Sentence similarity based on semantic nets and corpus
Association for Educational Assessment (IAEA, 2003: Citeseer. statistics," IEEE Transactions on Knowledge, vol. 18, pp. 1138-1150,
[8] S. P. Leeman-Munk, E. N. Wiebe, and J. C. Lester, "Assessing 2006.
elementary students' science competency with text analytics," in [31] C. J. L. a. o. i. Van Rijsbergen, "Information retrieval 2nd edition
Proceedings of the fourth international conference on learning butterworths," 1979.
analytics and knowledge, 2014, pp. 143-147. [32] A. Cardoso-Cachopo and A. L. Oliveira, "Semi-supervised single-
[9] K. Xylogiannopoulos, P. Karampelas, and R. Alhajj, "Text mining for label text categorization using centroid-based classifiers," in
plagiarism detection: multivariate pattern detection for recognition of Proceedings of the 2007 ACM symposium on Applied computing,
text similarities," in 2018 IEEE/ACM International Conference on 2007, pp. 844-851.
Advances in Social Networks Analysis and Mining (ASONAM), 2018, [33] J. Rennie and K. J. A. i. w. p. Lang, "The 20 Newsgroups data set,"
pp. 938-945: IEEE. 2008.
[10] B. Agarwal, H. Ramampiaro, H. Langseth, and M. Ruocco, "A deep [34] D. Greene and P. Cunningham, "Practical solutions to the problem of
network model for paraphrase detection in short text messages," diagonal dominance in kernel document clustering," in Proceedings
Information Processing Management, vol. 54, no. 6, pp. 922-937, of the 23rd international conference on Machine learning, 2006, pp.
2018. 377-384.
[11] B. Bengforth, R. Bilbro, T. J. M. l. Ojeda, and b. n. l. p. a. S. P. Peter, [35] M. Craven et al., "Learning to construct knowledge bases from the
"Applied analysis of text data in Python," 2019. World Wide Web," Artificial intelligence, vol. 118, no. 1-2, pp. 69-
[12] A. I. Spivak, S. V. Lapshin, and I. S. Lebedev, "CLASSIFICATION 113, 2000.
OF SHORT MESSAGES USING ELMO-BASED [36] T. Mikolov, E. Grave, P. Bojanowski, C. Puhrsch, and A. Joulin,
VECTORIZATION," Izvestia Tula State University. Technical "Advances in pre-training distributed word representations," arXiv
science, no. 10, 2019. preprint arXiv:.09405, 2017.
[13] Z. Liu, Y. Lin, and M. Sun, Representation Learning for Natural [37] T. Sabbah et al., "Modified frequency-based term weighting schemes
Language Processing. Springer Nature, 2020. for text classification," Applied Soft Computing, vol. 58, pp. 193-206,
[14] S. Gillani Andleeb, "From text mining to knowledge mining: An 2017.
integrated framework of concept extraction and categorization for [38] M. Lan, C. L. Tan, J. Su, and Y. Lu, "Supervised and traditional term
domain ontology," Budapesti Corvinus Egyetem, 2015. weighting methods for automatic text categorization," IEEE
[15] P. Mahalakshmi and N. S. Fatima, "An art of review on Conceptual transactions on pattern analysis machine intelligence, vol. 31, no. 4,
based Information Retrieval," Webology Journal, vol. 18, pp. 51-61, pp. 721-735, 2008.
2021. [39] M. E. Peters et al., "Deep contextualized word representations," arXiv
[16] C. Musat, J. Velcin, M.-A. Rizoiu, and S. Trausan-Matu, "Concept- preprint arXiv:.05365, 2018.
based topic model improvement," in Emerging intelligent [40] H. Jégou, M. Douze, C. Schmid, and P. Pérez, "Aggregating local
technologies in industry: Springer, 2011, pp. 133-142. descriptors into a compact image representation," in 2010 IEEE
[17] L. Huang, D. Milne, E. Frank, and I. H. Witten, "Learning a concept‐ computer society conference on computer vision and pattern
based document similarity measure," Journal of the American Society recognition, 2010, pp. 3304-3311: IEEE.
for Information Science Technology, vol. 63, no. 8, pp. 1593-1608,
2012.
Mansour Ali Mahmoud
[18] F. N. AL-Aswadi, H. Y. Chan, and K. H. J. a. p. a. Gan, "Extracting Federal State-Owned Autonomous Educational Establishment of
Semantic Concepts and Relations from Scientific Publications by Higher Education “Southern Federal University”. The Department of
Using Deep Learning," 2020.
Computer Aided Design, Postgraduate student. E-mail:
[19] M. Grootendorst and J. Vanschoren, "Beyond Bag-of-Concepts: mansur@sfedu.com. Address: 44, Nekrasovskiy lane, Taganrog,
Vectors of Locally Aggregated Concepts," in Joint European 347928, Russia. Phone: +7(988)015-8697.
Conference on Machine Learning and Knowledge Discovery in
Mohammad Juman Hussain
Databases, 2019, pp. 681-696: Springer. Federal State-Owned Autonomous Educational Establishment of
[20] K. Li, H. Zha, Y. Su, and X. J. I. I. C. o. D. M. Yan, "Concept Mining Higher Education “Southern Federal University”. The Department of
via Embedding," pp. 267-276, 2018.
Computer Aided Design, Postgraduate student. E-mail:
[21] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, zmohammad@sfedu.ru. Address: 44, Nekrasovskiy lane, Taganrog,
"Distributed representations of words and phrases and their 347928, Russia. Phone: +7(918)543-3526.
compositionality," in Advances in neural information processing
Kravchenko Yury Alekseevich
systems, 2013, pp. 3111-3119. Federal State-Owned Autonomous Educational Establishment of
[22] A. M. Mansour, J. H. Mohammad, and Y. A. Kravchenko, "TEXT Higher Education “Southern Federal University”. The Department of
VECTORIZATION USING DATA MINING METHODS," Izvestia
Computer Aided Design, associate professor. E-mail:
SFedU. Technical science., no. 2, 2021. yakravchenko@sfedu.ru.
[23] Y. Zhang, R. Jin, and Z.-H. Zhou, "Understanding bag-of-words
Address: 44, Nekrasovskiy lane, Taganrog, 347928, Russia.
model: a statistical framework," International Journal of Machine
Phone: +7 928 908 0151.
Learning Cybernetics, vol. 1, no. 1-4, pp. 43-52, 2010.

You might also like