


2019 International Conference on Advanced Computing and Applications (ACOMP)
DOI: 10.1109/ACOMP.2019.00019

Article Classification using Natural Language Processing and Machine Learning

Tran Thanh Dien, Bui Huu Loc, Nguyen Thai-Nghe
Can Tho University, Can Tho city, Vietnam
[email protected], [email protected], [email protected]
Abstract- Text classification is an important task that can save people considerable time and effort. This work proposes an approach to text classification, especially for articles. The proposed method can automatically extract information and categorize articles into suitable topics. The input data were pre-processed, extracted, vectorized, and classified using machine learning techniques including Support Vector Machines, Naïve Bayes, and k-Nearest Neighbors. Experiments carried out on two datasets of articles showed that, with an accuracy of over 91%, the combination of natural language processing and the support vector machines technique is feasible for developing an automatic article classification system.

Keywords- Article classification, information extraction, natural language processing, support vector machines.

I. INTRODUCTION

With the explosive development of information and the simultaneous development of automatic computing, data classification, especially text classification, has become particularly important [1]. Text classification is a supervised learning technique widely used in practice, e.g., for automatic metadata extraction [2]. In machine learning and natural language processing (NLP), text classification is a classical text processing problem, aimed at assigning a new text to a group of given texts based on its similarity to the group [3]. According to [23], text classification is the assignment of labels to a new text based on the text's similarity to the labeled texts in the training set. Automated text classification makes storing and searching information faster. In addition, with a large number of texts, sorting each of them manually takes a lot of time and effort, not to mention the possibility of inaccurate categorization due to human subjectivity. Applications of text classification include spam email filtering, news classification by topic in online newspapers, knowledge management, and support for search tools on the Internet [1].

Another important text classification application is article classification. An issue of concern to both authors and the editorial boards of magazines/journals is how to classify a submitted manuscript into a relevant topic of the magazine/journal. For example, a submission system can automatically classify texts and extract relevant information when manuscripts are uploaded. For a magazine/journal with a large number of topics, such as the Association for Computing Machinery (ACM) with more than 2,000 topics, an author may take a lot of time to determine the topic(s) of an article. Therefore, automatic classification of an article's topic is necessary.

Text classification has been addressed by many researchers with different approaches. Among these, machine learning is commonly used, with many algorithms such as k-nearest neighbors (kNN), Naïve Bayes, support vector machines (SVM), decision trees, and artificial neural networks applied to the text classification problem ([4], [5], [8], [13], [14], [15], [16], [24], [25], [26]).

This study proposes an approach to automated classification of the articles submitted via an online submission system. When a manuscript file (e.g., *.doc(x), *.pdf) is submitted, the system extracts the author's information, title, and abstract, and in particular determines the article's topic (e.g., information technology, environment, fisheries, etc.).

II. REVIEW OF LITERATURE AND RELATED STUDIES

A. Text classification

In the context of the rapid development of digital information, text mining techniques play an important role in information and knowledge management and attract the attention of researchers [1]. Automated text classification (or categorization) is the division of an input textual dataset into two or more categories, in which each text can belong to one or more categories. Text classification is carried out for the purpose of assigning a predefined label or class to a text. For instance, a new article posted on an online newspaper can be assigned one of the categories such as technology, sports, or entertainment; each article published in a journal can be automatically classified into a topic such as information technology, environment, or fisheries.

From a set of documents D = {d1, ..., dn} called a training set, in which each document di is labeled with a category cj from the set of categories C = {c1, ..., cm}, a classification model is determined for classifying any document dk into an appropriate category of set C. In other words, the problem is to find the function f:

f : D x C -> Boolean

f(d, c) = true if document d belongs to class c, and false if it does not.

B. Text classification algorithms

There are many text classification algorithms. In this study, the kNN, Naïve Bayes, and SVM algorithms are used, and their classification performances are evaluated and compared.

1) K-nearest neighbors technique

The kNN algorithm classifies an object based on the distances between the object and its nearest neighbors in the training set [20]. Accordingly, when a new document needs classifying, the algorithm calculates the distances from all documents in the training set to the new document to build the set of nearest neighbors. To classify a class-unknown document x, the kNN classifier first calculates the distances from x to all documents in the training set, thereby determining the set N(x, D, k) consisting of the k examples closest to x.

The kNN algorithm is described as follows:
1. Determine the value of the parameter k (the number of nearest neighbors).

2. Calculate the distances between the object and all examples in the training set (often using the Manhattan or Euclidean distance).

3. Sort the distances in ascending order and determine the k elements with the smallest distances (the k nearest neighbors).

4. Count the frequency of each class among the k neighbors: the most common class is assigned to the object.
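As an illustration of these four steps, a minimal sketch in Python with NumPy follows (our illustration only; the paper does not publish code, and the names X_train and y_train, as well as the choice of Euclidean distance, are assumptions):

import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=5):
    """Classify vector x by majority vote among its k nearest neighbors.

    X_train: (n_docs, n_terms) array of document vectors (e.g., TF*IDF rows).
    y_train: sequence of topic labels, one per training document.
    k: step 1, the number of nearest neighbors.
    """
    # Step 2: distances from x to every training document
    # (Euclidean here; Manhattan would also fit the description above).
    dists = np.linalg.norm(X_train - x, axis=1)
    # Step 3: indices of the k smallest distances.
    nearest = np.argsort(dists)[:k]
    # Step 4: the most frequent class among the k neighbors wins.
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]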
2) Naïve Bayes technique

The Naïve Bayes algorithm [17] is a common machine learning algorithm, evaluated as one of the most effective methods for text classification in [16]. The typical idea of this approach is to use the conditional probabilities between words (or phrases) and topics to predict the topic probability of a text to be classified.

The Naïve Bayes algorithm is based on Bayes' theorem, which is stated as follows:

P(Y | X) = P(X | Y) P(Y) / P(X)

Using Bayes' theorem, we can find the probability of Y happening given that X has occurred. Here, X is the evidence and Y is the hypothesis.

The general idea of Naïve Bayes:

1. Represent a document X as a set of (w, frequency of w) pairs.

2. For each label y, build a probabilistic model P(X | Y = y) of the documents in class y.

3. To classify, select the label y that is most likely to generate X:

y_hat = argmax_y P(X | y) * P(y)

Applying this to article classification:

- Vectorized dataset D = (d1, d2, ..., dn)

- Set of classes C = (C1, C2, ..., Cm)

Then,

P(Ci | X) = P(X | Ci) P(Ci) / P(X)

P(X | Ci) = prod_k P(xk | Ci)

where P(xk | Ci) is the probability that the k-th attribute takes the value xk given that X belongs to class Ci.
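The decision rule y_hat = argmax_y P(X | y) * P(y) can be sketched as follows (a minimal multinomial Naïve Bayes with Laplace smoothing; our illustration under our own assumptions, not the authors' implementation):

import numpy as np

def train_nb(X, y, n_classes, alpha=1.0):
    """Estimate log P(y) and log P(w | y) from a count matrix.

    X: (n_docs, n_terms) NumPy array of word counts per document.
    y: NumPy array of integer class indices, one per document.
    """
    log_prior = np.zeros(n_classes)
    log_like = np.zeros((n_classes, X.shape[1]))
    for c in range(n_classes):
        Xc = X[y == c]
        log_prior[c] = np.log(len(Xc) / len(X))
        counts = Xc.sum(axis=0) + alpha          # Laplace smoothing
        log_like[c] = np.log(counts / counts.sum())
    return log_prior, log_like

def predict_nb(x, log_prior, log_like):
    """y_hat = argmax_y log P(y) + sum_k x_k * log P(w_k | y)."""
    return int(np.argmax(log_prior + log_like @ x))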
3) Support vector machine technique

The SVM algorithm was first introduced by [11]. SVM is quite effective for problems with high-dimensional data such as text representation vectors, and it is considered one of the most accurate classifiers for text classification [19], owing to its fast and effective classification.

For this method, a given training set is represented in a vector space, where each text is considered a point in this space. The method finds a hyperplane h that best divides the points of this space into two corresponding separate classes, called the positive (+) and negative (-) classes. The SVM classifier is then the hyperplane that separates the positive examples from the negative examples with the largest margin. The margin is determined by the distance from the separating hyperplane to the nearest (+) and (-) examples (Fig. 1). The larger this distance, the more clearly the examples of the two classes are separated, which means a better classification result. The objective of the SVM algorithm is therefore to find the maximum-margin hyperplane.

Fig. 1: Maximum hyperplane margin in two-dimensional space [11]

The hyperplane over vectors x in the object space is given by w.x + b = 0, where w is the weight vector and b is the bias. The direction of the hyperplane and its distance from the origin change as w and b change.

The SVM classifier is defined as f(x) = sign(w.x + b), where:

f(x) = +1 if w.x + b >= 0
f(x) = -1 if w.x + b < 0

Each label yi takes the value +1 or -1: if yi = +1, xi belongs to class (+); if yi = -1, xi belongs to class (-). The two hyperplanes that separate the examples are described by the equations w.x + b = 1 and w.x + b = -1. By geometry, the distance between these two hyperplanes is 2/||w||.

To obtain the maximum margin, it is necessary to find the smallest value of ||w|| while preventing data points from falling inside the margin, which requires the constraints:

w.xi + b >= +1 for examples of class (+)
w.xi + b <= -1 for examples of class (-)

which can be rewritten as yi(w.xi + b) >= 1 for all i in (1, n).

Finding the hyperplane h is thus equivalent to minimizing ||w|| with w and b satisfying: for all i in (1, n), yi(w.xi + b) >= 1.

From this binary formulation, SVM can be extended to the multi-class scenario.
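In practice the constrained optimization is solved by a library rather than by hand. A sketch with scikit-learn's LinearSVC (an implementation choice of ours; the paper does not name a library), using one-vs-rest for the multi-class extension mentioned above:

from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# LinearSVC minimizes ||w||^2 / 2 plus a hinge-loss penalty, a soft-margin
# relaxation of the constraints y_i (w.x_i + b) >= 1 described above.
clf = OneVsRestClassifier(LinearSVC(C=1.0))
clf.fit(X_train, y_train)     # X_train: document vectors; y_train: topic labels
predicted = clf.predict(X_test)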
4) Algorithm evaluation

Several common indicators are used to evaluate a machine learning algorithm. Supposing we evaluate a classifier over two classes, temporarily called positive and negative, we have:

- TP (true positives) is the number of positive elements classified as positive; FN (false negatives) is the number of positive elements classified as negative; TN (true negatives) is the number of negative elements classified as negative; and FP (false positives) is the number of negative elements classified as positive.

- Precision is defined as the ratio of the number of TP elements to all elements classified as positive (TP + FP):

Precision = TP / (TP + FP)

- Recall is defined as the ratio of the number of TP elements to the elements that are actually positive (TP + FN):

Recall = TP / (TP + FN)

High precision means that the accuracy of the retrieved elements is high. High recall means that the rate of missed actually-positive elements is low.

F1 (or F-score) balances precision and recall. If precision and recall are both high and balanced, F1 is high; if precision and recall are low or unbalanced, F1 is low. Thus, the higher the F1, the better the classifier. When both recall and precision equal 1 (the best possible), F1 = 1. When both are low, for example 0.1, F1 = 0.1.

F1 = 2 * Precision * Recall / (Precision + Recall)
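Computed directly from these definitions, the per-class indicators look as follows (a small self-contained sketch; our illustration):

def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 for one class treated as 'positive'."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1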
C. Related studies on text classification

Many studies on text classification have been applied to solve practical problems. For example, [21] applied SVM and decision trees to the text classification problem and compared their effectiveness with that of a classical decision tree algorithm. In addition, singular value decomposition was applied to the SVM algorithm to reduce the dimensionality of the feature space and reduce noise, making classification more effective. In the pre-processing phase, the maximum matching segmentation (MMSEG) algorithm [9] was used to segment words. After segmentation, the text was modeled in vector form using TF*IDF weighting, then classified using the SVM and decision tree implementations in the Weka software. With a dataset of 7,842 texts on 10 different topics, 500 texts per topic were randomly chosen for training, and the remaining texts were used for independent verification. The results showed that classification by SVM was clearly better than by decision tree. Moreover, using singular value decomposition to analyze and reduce the dimensionality of the feature space helped improve the effectiveness of the SVM classifier.

In a study on classification of Vietnamese documents with the Naïve Bayes algorithm, [22] developed a word segmentation module based on an N-gram model, then modeled the segmented text with TF*IDF vectors. After being modeled as vectors, the dataset was classified using the Naïve Bayes method. A classification program, integrated with functions for managing, editing, and deleting articles, was developed to conduct experiments on a dataset of 281 scientific articles on the topic of information technology. The classification result was quite satisfactory; however, the study was still limited in dataset size, and there was no comparison of the Naïve Bayes classifier with other classification methods.

Another study, on semantic relation extraction and classification in scientific paper abstracts, was proposed by [27]. The authors presented the setup and results of semantic relation extraction and classification in scientific papers. The task was divided into three subtasks: classification on clean data, classification on noisy data, and a combined extraction and classification scenario. They also presented the dataset used for the challenge: a subset of abstracts of published papers in the ACL Anthology Reference Corpus, annotated for domain-specific entities and semantic relations.

In this work, we propose an approach to automated classification of the articles submitted via an online submission system. Accordingly, when an article (e.g., *.doc(x), *.pdf) is submitted, the system extracts the author's information, title, and abstract, and in particular classifies the article's topic. In this approach, natural language processing is first used to pre-process the data; then machine learning techniques are used for topic classification.

III. PROPOSED METHOD

A. Classification system model

The proposed overall system for extracting information from and classifying articles is modeled in Fig. 2.

Fig. 2: Model of extraction and classification of articles (a submitted article is fed to the proposed system, which extracts the author, title, and abstract and classifies the article into topic(s))

With this model, when a new article (.doc(x), .pdf) is submitted to the system, information such as the author's name, title, and abstract is automatically extracted, and the article is classified into an appropriate topic based on the trained machine learning model.

Since each article follows a pre-formatted template, it is easy to extract the author's information, title, and abstract. Therefore, this study focuses only on the problem of topic classification for the articles submitted to the system.

B. Steps of article classification

The automated classification of articles is divided into two phases. In the training phase, a classification model is generated from the collected dataset using machine learning algorithms, as described in Fig. 3.

Fig. 3: Training phase (training dataset -> pre-processing -> vectorization -> classification model)

In the testing phase, based on the classification model generated in the training step, the articles in the testing dataset are classified. This stage is described in Fig. 4.

Fig. 4: Testing phase (testing dataset -> pre-processing -> vectorization -> testing based on the classification model)
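As an end-to-end sketch of the two phases (our illustration, assuming scikit-learn and joblib, neither of which the paper names; train_texts, train_topics, and test_texts stand for the pre-processed data):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
import joblib

# Training phase: pre-processed articles -> TF*IDF vectors -> classifier.
model = Pipeline([("tfidf", TfidfVectorizer()), ("svm", LinearSVC())])
model.fit(train_texts, train_topics)
joblib.dump(model, "article_classifier.joblib")   # persist the trained model

# Testing phase: reload the model and classify unseen articles.
model = joblib.load("article_classifier.joblib")
predicted_topics = model.predict(test_texts)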
1) Data pre-processing

a) File format conversion and word standardization: Because the dataset consists of .doc(x) files, it is necessary to convert them to plain text (.txt) for easy use with most algorithms and libraries serving automated classification. The format conversion of an input article is based on Apache POI. Accordingly, Apache POI is used to read the .doc(x) file and then write the readable content to a .txt file. After converting the file format, word standardization is performed to convert all text characters to lowercase and remove redundant spaces.

For example, the sentence "Xử Lý Ngôn Ngữ Tự nhiên là 1 nhánh của Trí tuệ nhân tạo" ("Natural language processing is a branch of artificial intelligence") is standardized to "xử lý ngôn ngữ tự nhiên là 1 nhánh của trí tuệ nhân tạo".

b) Word segmentation: In Vietnamese, spaces separate syllables, not words. Therefore, the segmentation phase is quite important in NLP. Currently, many tools have been developed to segment Vietnamese words with relatively high accuracy. In this study, the VnTokenizer segmentation tool by [18] was used. The tool was developed based on the integrated methods of maximum matching, weighted finite-state transducers, and regular expression parsing, using a dataset of Vietnamese syllables and a Vietnamese vocabulary dictionary. This automated Vietnamese segmentation tool segments Vietnamese text into vocabulary units (words, names, numbers, dates, and other regular expressions) with over 95% accuracy. The process of word segmentation using VnTokenizer is described in Fig. 5.

Fig. 5: Process of word segmentation using VnTokenizer [18] (dataset of articles -> word segmentation, using a dictionary -> vocabulary units)

For example, the sentence "xử lý ngôn ngữ tự nhiên là 1 nhánh của trí tuệ nhân tạo" is segmented into "xử_lý ngôn_ngữ tự_nhiên là 1 nhánh của trí_tuệ nhân_tạo".

c) Removing stop words: Stop words are words that commonly appear in all texts of all categories in the dataset, or words that appear in only one or a few texts. This means that stop words carry no meaning or information worth using. In text classification, the appearance of stop words (e.g., "thì", "là", "mà", "và", "hoặc", "bởi", etc.) not only does not help the classification but also adds noise and reduces the accuracy of the classification process.

For example, after stop words are removed, the sentence "xử_lý ngôn_ngữ tự_nhiên là 1 nhánh của trí_tuệ nhân_tạo" becomes "xử_lý ngôn_ngữ tự_nhiên nhánh trí_tuệ nhân_tạo".

In this study, after converting the articles from .doc(x) to .txt format and segmenting words, a stop word dictionary was used to remove stop words.
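The paper's pipeline uses Apache POI (Java) and VnTokenizer; as a rough Python equivalent only (our assumption, substituting python-docx for Apache POI and the underthesea tokenizer for VnTokenizer; the stop word set shown is a toy stand-in for the dictionary):

from docx import Document              # pip install python-docx
from underthesea import word_tokenize  # pip install underthesea

def preprocess(path, stop_words):
    """Convert a .docx article to lowercase, word-segmented text
    with stop words removed."""
    text = "\n".join(p.text for p in Document(path).paragraphs)
    text = " ".join(text.lower().split())            # lowercase, trim spaces
    segmented = word_tokenize(text, format="text")   # "xử lý" -> "xử_lý"
    return " ".join(t for t in segmented.split() if t not in stop_words)

stop_words = {"thì", "là", "mà", "và", "hoặc", "bởi"}  # toy list; use a dictionary
clean_text = preprocess("article.docx", stop_words)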
set.
2) Text vectorization

There are a number of text representation models, e.g., the vector space model based on frequency weighting, the bag-of-words model, and graph-based models. In this study, the vector space model [10] was applied. The vector space model can represent an unformatted text document in a simple and formulaic notation; because of this advantage, much research on the vector space model is actively being carried out [28]. According to this model, each article is represented as a vector; each component of the vector corresponds to a separate term and is assigned a value that is the weight of that term.

The problem of text representation by the vector space model is as follows: the input is a set of j documents in the application domain D, with D = {d1, d2, ..., dj}, and m terms (or words) in each document, T = {t1, t2, ..., tm}. In the output, the weight of each term is determined in turn, producing the weighting matrix w_ij, where w_ij is the weight of term ti in document dj in D.

To determine the weight of term ti in document dj, TF (term frequency) and IDF (inverse document frequency) are commonly used.

TF estimates the frequency of occurrence of a term in a given document. Because each document has its own length and number of words, the raw occurrence frequencies of terms differ across documents. To measure the weight of a term, the number of occurrences of the term is divided by the length of the document (its number of words):

TF(ti, dj) = (number of occurrences of ti in dj) / (total number of terms in dj)

The values w_ij are calculated based on the occurrence frequency of the term in the document. Given f_ij, the number of occurrences of term ti in document dj, w_ij can be calculated by one of the following basic formulas:

w_ij = f_ij
w_ij = 1 + log(f_ij)
w_ij = sqrt(f_ij)

A boolean weighting can also be used: w_ij = 1 if ti occurs in document dj, and w_ij = 0 if it does not.

IDF estimates the importance of a term within the document set. When calculating TF alone, all terms are treated as equally important. However, not all terms in a dataset are important: terms of low importance include connecting words (such as "nhưng", "bên cạnh đó", "vì thế", etc.), determiners ("kìa", "đó", "ấy", etc.), and prepositions ("trên", "trong", "ngoài", etc.). It is necessary to reduce the importance of such terms by calculating IDF as follows:

IDF(ti, D) = log(total number of documents in D / number of documents including ti)

Terms that commonly occur in many documents thus receive low weight. With df_i denoting the number of documents containing ti and m the total number of documents, the weights can be determined as:

w_ij = log(m / df_i) = log(m) - log(df_i), if TF_ij >= 1
w_ij = 0, if TF_ij = 0

TF*IDF integrates TF and IDF: this common method calculates the TF*IDF value of a term through its importance in a document belonging to a document set.
Terms with high TF*IDF values occur frequently in a given document but rarely in other documents. Through this method, it is possible to filter out common words and retain high-value words:

TF*IDF(ti, dj, D) = TF(ti, dj) x IDF(ti, D)

The weight calculation formula used in this study is:

w_ij = (1 + log(f_ij)) * log(N / df_i), if f_ij >= 1
w_ij = 0, if f_ij = 0

where N is the number of documents and df_i is the number of documents containing term ti.
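This weighting can be computed directly from the formula above (a small self-contained sketch, assuming docs is a list of pre-segmented, stop-word-free strings; our illustration):

import math
from collections import Counter

def tfidf_matrix(docs):
    """w_ij = (1 + log f_ij) * log(N / df_i) for pre-segmented documents."""
    n = len(docs)
    vocab = sorted({t for d in docs for t in d.split()})
    # df[t]: number of documents containing term t.
    df = Counter(t for d in docs for t in set(d.split()))
    matrix = []
    for d in docs:
        f = Counter(d.split())
        matrix.append([(1 + math.log(f[t])) * math.log(n / df[t])
                       if f[t] >= 1 else 0.0
                       for t in vocab])
    return vocab, matrix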
IV. EXPERIMENTAL RESULTS

This study used two experimental datasets: 680 scientific articles (10 topics) and 10,000 newsletter articles.

A. Experimental dataset 1: the scientific articles

The first experimental dataset consists of 680 scientific articles published in the Can Tho University Journal of Science from 2016 to 2018, belonging to the 10 topics described in Table 1.

The articles were pre-processed by converting the .doc(x) files to .txt, then word-segmented [18]. Stop words were removed using the stop-word-dictionary-based approach, after which 4,095 terms (words) remained. Each document was modeled as a weighted word vector, so the modeled dataset was a TF*IDF matrix of 680 x 4,095 elements.

TABLE 1: DISTRIBUTION OF SCIENTIFIC ARTICLES IN 10 TOPICS

No  Topics                          Training samples  Testing samples  Total (articles)
1   Technology                      45                5                50
2   Environment                     54                6                60
3   Natural Sciences                54                6                60
4   Animal husbandry                36                4                40
5   Biotechnology                   27                3                30
6   Agriculture                     90                10               100
7   Fisheries                       135               15               150
8   Education                       36                4                40
9   Social sciences and Humanities  72                8                80
10  Economics                       63                7                70
    Total                           612               68               680

After pre-processing and vectorization, the dataset of articles was used to train the automated text classification algorithms SVM, Naïve Bayes, and kNN. The dataset was automatically split: 612 articles (90%) were used as the training set and the remaining 68 articles (10%) as the testing set. The evaluation, based on precision, recall, and F1 score, is shown in Table 2.

TABLE 2: COMPARISON OF CLASSIFICATION RESULTS FOR SCIENTIFIC ARTICLES AMONG THE SVM, NAÏVE BAYES, AND KNN ALGORITHMS (P = precision, R = recall)

Topics                          SVM (P / R / F1)        Naïve Bayes (P / R / F1)  kNN (P / R / F1)
Technology                      0.857 / 0.857 / 0.857   0.857 / 0.857 / 0.857     0.400 / 0.571 / 0.471
Environment                     1.000 / 0.333 / 0.500   0.400 / 0.333 / 0.364     0.667 / 0.333 / 0.444
Natural Sciences                0.750 / 1.000 / 0.857   0.667 / 0.667 / 0.667     0.600 / 1.000 / 0.750
Animal husbandry                1.000 / 1.000 / 1.000   1.000 / 0.500 / 0.667     1.000 / 0.500 / 0.667
Biotechnology                   1.000 / 0.500 / 0.667   1.000 / 0.500 / 0.667     1.000 / 0.500 / 0.667
Agriculture                     0.786 / 1.000 / 0.880   0.846 / 1.000 / 0.917     0.733 / 1.000 / 0.846
Fisheries                       0.947 / 1.000 / 0.973   0.857 / 1.000 / 0.923     0.947 / 1.000 / 0.973
Education                       1.000 / 1.000 / 1.000   1.000 / 0.500 / 0.667     1.000 / 1.000 / 1.000
Social sciences and Humanities  1.000 / 1.000 / 1.000   0.600 / 0.750 / 0.667     0.600 / 0.750 / 0.667
Economics                       1.000 / 1.000 / 1.000   0.900 / 0.818 / 0.857     1.000 / 0.545 / 0.706
Average accuracy rate           91.2%                   80.9%                     76.5%

It can be seen from Table 2 that the classification performances of the algorithms show their relative effectiveness. The SVM algorithm gives the best classification results, with an accuracy above 91%, making it feasible for developing an automated classification system for scientific articles and contributing to a faster and more accurate article classification process. This result is also consistent with the studies [6], [7], [12], and [23], which found that the SVM method gives text classification results equivalent or significantly better than those of other classifiers.

However, not all topics are classified well. Even with the best classification algorithm (SVM), some topics are still classified poorly, such as "Environment" and "Biotechnology" with F1 < 67%, compared to other topics with F1 > 85%. The reason is that the recall of these two topics is quite low (<= 0.5); in other words, the rate of articles correctly predicted to belong to these two topics is not high. This is explained by the overlap of the topics: an article can belong to two similar topics (e.g., "Agriculture" and "Fisheries"), a situation that calls for multi-label classification. The remaining topics have relatively high precision and recall, indicating their distinction from the others.

B. Experimental dataset 2: the newsletter articles

The second experimental dataset consists of 10,000 articles from Vnexpress newsletters collected in 2018, belonging to the 10 topics described in Table 3.

TABLE 3: DISTRIBUTION OF VNEXPRESS NEWSLETTER ARTICLES IN 10 TOPICS

No  Topics    Training samples  Testing samples  Total (articles)
1   Business  800               200              1,000
2   Culture   800               200              1,000
3   Health    800               200              1,000
4   IT        800               200              1,000
5   Law       800               200              1,000
6   Life      800               200              1,000
7   Politics  800               200              1,000
8   Science   800               200              1,000
9   Sports    800               200              1,000
10  World     800               200              1,000
    Total     8,000             2,000            10,000

For experimental dataset 2, the pre-processing steps were carried out as for experimental dataset 1. The dataset was also automatically split: 8,000 articles (80%) were used as the training set and the remaining 2,000 articles (20%) as the testing set.

The stop words were removed using the dictionary-based approach; for the training set, 3,485 terms remained.
Each document was modeled as a weighted word vector, so the modeled training dataset was a TF*IDF matrix of 8,000 x 3,485 elements. For the testing set, 3,961 terms remained, giving a TF*IDF matrix of 2,000 x 3,961 elements.

We used the k-fold cross-validation method with k = 3 (a commonly used cross-validation setting). The three machine learning algorithms above were used for classification. The evaluation, based on precision, recall, and F1 score, is shown in Table 4.
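A sketch of this evaluation setup (our illustration with scikit-learn; texts and topics stand for the pre-processed newsletter articles and their labels):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

X = TfidfVectorizer().fit_transform(texts)
scores = cross_val_score(LinearSVC(), X, topics, cv=3)  # 3-fold cross-validation
print("mean accuracy: %.3f" % scores.mean())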
TABLE 4: COMPARISON OF CLASSIFICATION RESULTS FOR VNEXPRESS NEWSLETTER ARTICLES AMONG THE SVM, NAÏVE BAYES, AND KNN ALGORITHMS (P = precision, R = recall)

Topics                 SVM (P / R / F1)        Naïve Bayes (P / R / F1)  kNN (P / R / F1)
Business               0.775 / 0.915 / 0.839   0.807 / 0.880 / 0.842     0.513 / 0.766 / 0.615
Culture                0.922 / 0.950 / 0.936   0.919 / 0.910 / 0.915     0.655 / 0.660 / 0.658
Health                 0.968 / 0.910 / 0.938   0.911 / 0.870 / 0.890     0.784 / 0.621 / 0.693
IT                     0.906 / 0.920 / 0.913   0.887 / 0.940 / 0.913     0.461 / 0.600 / 0.521
Law                    0.989 / 0.865 / 0.923   0.854 / 0.880 / 0.867     0.825 / 0.561 / 0.668
Life                   0.973 / 0.910 / 0.941   0.917 / 0.825 / 0.868     0.687 / 0.811 / 0.744
Politics               0.776 / 0.850 / 0.811   0.679 / 0.815 / 0.741     0.582 / 0.453 / 0.509
Science                0.918 / 0.845 / 0.880   0.896 / 0.815 / 0.853     0.623 / 0.435 / 0.512
Sports                 0.985 / 0.975 / 0.980   0.989 / 0.935 / 0.961     0.938 / 0.753 / 0.835
World                  0.917 / 0.935 / 0.926   0.946 / 0.870 / 0.906     0.530 / 0.670 / 0.592
Average accuracy rate  90.1%                   87.6%                     46.7%
The SVM algorithm again achieved consistently high precision and recall, and therefore F1, with a classification accuracy greater than 90%. The second most effective algorithm was Naïve Bayes, with an accuracy of 87.6%. The kNN classifier, however, gave a low accuracy of 46.7%. kNN normally performs better with a small number of features than with a large one: as the number of features grows, the dimensionality increases, which leads to overfitting.

The above results show that the combination of natural language processing and a machine learning algorithm (e.g., SVM) is effective for developing an automated article classification system in general.
V. CONCLUSIONS

Based on natural language processing and machine learning algorithms, this study proposed a solution for automated classification of articles to help authors and editors save time and effort when processing articles in the system. The data pre-processing steps are significant for putting the classification dataset into a standardized format for running the three algorithms SVM, Naïve Bayes, and kNN. The results showed that the SVM algorithm gives better classification performance than the remaining classifiers.

With the proposed model, it is feasible to extract information from, and automatically classify, the articles submitted to a classification system. Experiments on larger datasets should be conducted in the future.
VI. ACKNOWLEDGEMENT

This study is funded in part by the Can Tho University Improvement Project VN14-P6, supported by a Japanese ODA loan.

VII. REFERENCES

[1] K. Thaoroijam, "A Study on Document Classification using Machine Learning Techniques", IJCSI International Journal of Computer Science Issues, vol. 11, no. 2, 2014.
[2] Y. Li, L. Zhang, Y. Xu, Y. Yao, R. Y. K. Lau and Y. Wu, "Enhancing Binary Classification by Modeling Uncertain Boundary in Three-Way Decisions", IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 7, pp. 1438-1451, 2017.
[3] F. Sebastiani, "Machine learning in automated text categorization", ACM Computing Surveys, vol. 34, no. 1, pp. 1-47, 2002.
[4] C. Aggarwal and C. Zhai, "A Survey of Text Classification Algorithms", in Mining Text Data, pp. 163-222, 2012.
[5] M. Bijaksana, Y. Li and A. Algarni, "A Pattern Based Two-Stage Text Classifier", in Machine Learning and Data Mining in Pattern Recognition, pp. 169-182, 2013.
[6] B. Boser, I. Guyon and V. Vapnik, "A training algorithm for optimal margin classifiers", Proceedings of the Fifth Annual Workshop on Computational Learning Theory (COLT '92), 1992.
[7] C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition", Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121-167, 1998.
[8] J. Chen, H. Huang, S. Tian and Y. Qu, "Feature selection for text classification with Naïve Bayes", Expert Systems with Applications, vol. 36, no. 3, pp. 5432-5435, 2009.
[9] C.-H. Tsai, "MMSEG: A Word Identification System for Mandarin Chinese Text Based on Two Variants of the Maximum Matching Algorithm". Available: http://technology.chtsai.org/mmseg/.
[10] C. S. Perone, "Machine Learning: Cosine Similarity for Vector Space Models (Part III)", 2013. Available: http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/.
[11] C. Cortes and V. Vapnik, "Support-vector networks", Machine Learning, vol. 20, no. 3, pp. 273-297, 1995.
[12] S. Dumais, J. Platt, D. Heckerman and M. Sahami, "Inductive learning algorithms and representations for text categorization", Proceedings of the Seventh International Conference on Information and Knowledge Management (CIKM '98), 1998.
[13] M. Haddoud, A. Mokhtari, T. Lecroq and S. Abdeddaïm, "Combining supervised term-weighting metrics for SVM text classification with extended term representation", Knowledge and Information Systems, vol. 49, no. 3, pp. 909-931, 2016.
[14] G. H. John and P. Langley, "Estimating continuous distributions in Bayesian classifiers", Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence (UAI '95), 1995.
[15] B. Liu, Y. Dai, X. Li, W. S. Lee and P. S. Yu, "Building text classifiers using positive and unlabeled examples", Third IEEE International Conference on Data Mining, Melbourne, FL, USA, pp. 179-186, 2003.
[16] A. McCallum and K. Nigam, "A Comparison of Event Models for Naive Bayes Text Classification", AAAI-98 Workshop on Learning for Text Categorization, pp. 41-48, 1998.
[17] T. Mitchell, Machine Learning, McGraw-Hill, 1997. Available: https://dl.acm.org/citation.cfm?id=541177.
[18] N. T. M. Huyen, V. X. Luong and L. H. Phuong, "VnTokenizer". Available: http://vntokenizer.sourceforge.net/.
[19] S. Chakrabarti, Mining the Web. Burlington: Morgan Kaufmann.
[20] P. Tan, M. Steinbach and V. Kumar, "Introduction to Data Mining". Available: https://www-users.cs.umn.edu/~kumar001/dmbook/index.
[21] T. C. De and P. N. Khang, "Text classification with support vector machines and decision trees", Can Tho University Journal of Science, vol. 21a, 2012 (in Vietnamese).
[22] T. T. T. Thao and V. T. Chinh, "Development of a Vietnamese document classification system", report of scientific research, 2012 (in Vietnamese).
[23] Y. Yang and J. O. Pedersen, "A Comparative Study on Feature Selection in Text Categorization", Proceedings of the Fourteenth International Conference on Machine Learning (ICML '97), pp. 412-420, 1997.
[24] L. Zhang, Y. Li, C. Sun and W. Nadee, "Rough Set Based Approach to Text Classification", 2013 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT), 2013.
[25] M. Haddoud, A. Mokhtari, T. Lecroq and S. Abdeddaïm, "Combining supervised term-weighting metrics for SVM text classification with extended term representation", Knowledge and Information Systems, vol. 49, no. 3, pp. 909-931, 2016.
[26] N. Thai-Nghe and Q. D. Truong, "An Approach for Building An Automatic Online Consultancy System", Proceedings of the International Conference on Advanced Computing and Applications (ACOMP 2015), pp. 51-58, ISBN-13: 978-1-4673-8234-2, IEEE Xplore, 2015.
[27] K. Gábor, D. Buscaldi, A. Schumann, B. QasemiZadeh, H. Zargayouna and T. Charnois, "SemEval-2018 Task 7: Semantic Relation Extraction and Classification in Scientific Papers", Proceedings of The 12th International Workshop on Semantic Evaluation, 2018.
[28] J. Chang and I. Kim, "Analysis and Evaluation of Current Graph-Based Text Mining Researches", Advanced Science and Technology Letters, 2013.