Article Classification using Natural Language Processing and Machine Learning
The general idea of Naive Bayes is as follows:
1. Represent a document X as a set of (w, frequency of w) pairs, i.e. each word w together with its frequency.
2. For each label y, build a probabilistic model P(X | Y = y) of the documents in class y.
3. To classify, select the label y which is most likely to generate X:

$\hat{y} = \arg\max_{y} P(X \mid y) \, P(y)$

We apply this to article classification:
- Vectorized data set D = (d1, d2, ..., dn)
- Set of classes C = (C1, C2, ..., Cm)

Then,

$P(C_i \mid X) = \frac{P(X \mid C_i) \, P(C_i)}{P(X)}$

$P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)$

where P(x_k | C_i) is the probability that the k-th attribute takes the value x_k given that X belongs to class C_i.

3) Support vector machine technique
The SVM algorithm was first introduced by [11]. SVM is quite effective for solving problems with multi-dimensional data such as text representation vectors. It is also considered the most accurate classifier for the text classification problem [19], owing to its fast and effective classification.

In this method, a given training set is represented in a vector space, where each text is considered as a point in this space. The method finds a hyperplane h that best divides the points in this space into two parts. The hyperplane contains the vectors x of the object space satisfying $w \cdot x + b = 0$, where w is the weight vector and b is the bias. The direction of the hyperplane and its distance from the origin of the coordinate system change when w and b are changed.

The SVM classifier is defined as f(x) = sign(w.x + b), where

$f(x) = \begin{cases} +1, & \text{if } w \cdot x + b \ge 0 \\ -1, & \text{if } w \cdot x + b < 0 \end{cases}$

Each example is given a label y_i with value +1 or -1: if y_i = +1, x_i belongs to class (+); if y_i = -1, x_i belongs to class (-). The two hyperplanes that separate the examples into two parts are described by the equations $w \cdot x + b = 1$ and $w \cdot x + b = -1$.

By geometry, the distance between these two hyperplanes is $\frac{2}{\lVert w \rVert}$.

For the maximum margin, it is necessary to find the smallest value of $\lVert w \rVert$. At the same time, to prevent data points from falling inside the margin, the following constraints are needed:

$\begin{cases} w \cdot x_i + b \ge +1, & \text{for class } (+) \\ w \cdot x_i + b \le -1, & \text{for class } (-) \end{cases}$

This can be rewritten as $y_i (w \cdot x_i + b) \ge 1$, with $i \in (1, n)$.

Then, finding the hyperplane h is equivalent to finding $\min \lVert w \rVert$ with w and b satisfying the condition $\forall i \in (1, n): y_i (w \cdot x_i + b) \ge 1$.

From the SVM for binary classification described above, we can extend SVM to the multi-class scenario.
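To make the two decision rules concrete, the following minimal sketch (an illustration assuming scikit-learn and a toy corpus, not the implementation used in this study) trains a Naive Bayes model and a linear SVM on frequency vectors: Naive Bayes picks the label with the highest posterior P(y | X), while the SVM thresholds the value of w.x + b.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.svm import LinearSVC

    # Toy, already-segmented documents with topic labels (placeholders, not the paper's data)
    docs = ["nuoi ca_tra thuong_pham", "ky_thuat nuoi tom_su",
            "tang_truong kinh_te", "lam_phat va lai_suat ngan_hang"]
    labels = ["Fisheries", "Fisheries", "Economics", "Economics"]

    vec = CountVectorizer()
    X = vec.fit_transform(docs)            # documents as (word, frequency) vectors

    nb = MultinomialNB().fit(X, labels)    # estimates P(X | y) and P(y) for each label y
    svm = LinearSVC().fit(X, labels)       # learns the separating hyperplane w.x + b = 0

    q = vec.transform(["xuat_khau ca_tra"])
    print(nb.predict(q), nb.predict_proba(q))        # label with the highest posterior P(y | X)
    print(svm.predict(q), svm.decision_function(q))  # sign and value of w.x + b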
4) Algorithm evaluation
Several common indicators are used to evaluate a machine learning algorithm. Supposing we evaluate a classifier whose two classes are temporarily called positive and negative, there are:

- TP (True positive) is the number of positive elements classified as positive; FN (False negative) is the number of positive elements classified as negative; TN (True negative) is the number of negative elements classified as negative; and FP (False positive) is the number of negative elements classified as positive.

- Precision is defined as the ratio of the number of TP elements to the number of elements classified as positive (TP + FP):

$Precision = \frac{TP}{TP + FP}$

- Recall is defined as the ratio of the number of TP elements to the number of elements that are actually positive (TP + FN):

$Recall = \frac{TP}{TP + FN}$

High precision means that the accuracy of the found elements is high. High recall means that the rate of missed actually-positive elements is low.

F1 (or F-score) is the balance between precision and recall. If precision and recall are high and balanced, F1 is high; if precision and recall are low or unbalanced, F1 is low. Thus, the higher F1 is, the better the classifier is. When both recall and precision are equal to 1 (the best possible case), F1 = 1. When both recall and precision are low, for example 0.1, then F1 = 0.1.

$F_1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$

C. Related studies on text classification
Many studies on text classification have been applied to solve problems in practice. For example, [21] applied SVM and decision trees to the text classification problem and compared their effectiveness with that of the classical decision tree algorithm. In addition, the singular value decomposition technique was applied to the SVM algorithm to reduce the dimension of the feature space and reduce noise, making the classification process more effective. In the pre-processing phase, the maximum matching segmentation (MMSEG) algorithm [9] was used to segment words. After segmentation, the text was modeled in vector form using TF*IDF, then classified using the two algorithms of SVM and decision trees in the Weka software. With a dataset of 7,842 texts in 10 various topics, 500 texts of each topic were randomly chosen for training and the remaining texts were used for independent verification. The results showed that classification by SVM is clearly better than that by decision tree. Moreover, using singular value decomposition to analyze and reduce the dimension of the feature space helped improve the effectiveness of the SVM classifier.

In a study on the classification of Vietnamese documents with the Naïve Bayes algorithm, [22] developed a word segmentation module based on an N-gram model, then modeled the segmented text with TF*IDF vectors. After being modeled into vectors, the dataset was classified using the Naïve Bayes method. A classification software was developed, integrated with functions for managing, editing and deleting articles, to conduct experiments on a dataset of 281 scientific articles in the topic of information technology. The classification result was quite satisfactory; however, the study was still limited in its dataset, and there was no comparison of the Naïve Bayes classifier with other classification methods.

Another study, related to semantic relation extraction and classification in scientific paper abstracts, was proposed by [27]. The authors presented the setup and results of semantic relation extraction and classification in scientific papers. The task is divided into three subtasks: classification on clean data, classification on noisy data, and a combined extraction and classification scenario. They also presented the dataset used for the challenge: a subset of abstracts of published papers in the ACL Anthology Reference Corpus, annotated for domain-specific entities and semantic relations.

In this work, we propose an approach for the automated classification of articles submitted via an online submission system. Accordingly, when an article (e.g. *.doc(x), *.pdf) is submitted, the system extracts the author's information, title, and abstract, and in particular classifies the article's topic. In this approach, natural language processing is first used to pre-process the data, then machine learning techniques are used for topic classification.

III. PROPOSED METHOD

A. Classification system model
The proposed overall system for extracting information and classifying articles is modeled in Fig. 2.

[Fig. 2 diagram: Submitted Article → Proposed system → Extract (Author, Title, Abstract) and Classify (Topic(s))]

Fig. 2: Model of extraction and classification of articles

With this model, when a new article (.doc(x), .pdf) is submitted to the system, information such as the author's name, title, and abstract is automatically extracted, and the article is classified into an appropriate topic based on the trained machine learning model.

Since each article follows a pre-formatted template, it is easy to extract the author's information, title, and abstract. Therefore, this study only focuses on solving the problem of topic classification of the articles submitted to the system.

B. Steps of article classification
The automated classification of articles is divided into two phases. In the training phase, relying on the collected dataset and the machine learning algorithms, the classification model is generated, as described in Fig. 3.

[Fig. 3 diagram: Training dataset → Pre-processing → Vectorization → Classification model]

Fig. 3: Training phase

In the testing phase, based on the classification model generated at the training step, the articles in the testing dataset are classified. This stage is described in Fig. 4.

[Fig. 4 diagram: Testing dataset → Pre-processing → Vectorization → Testing based on Classification model]

Fig. 4: Testing phase
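Putting the two phases together, here is a minimal sketch (assuming scikit-learn; the documents, topics and step names are placeholders rather than the system's actual code): the training phase fits a vectorization-plus-classification pipeline on the training dataset (Fig. 3), and the testing phase applies the fitted pipeline to the testing dataset and reports the precision, recall and F1 defined above (Fig. 4). The individual pre-processing and vectorization steps of this pipeline are detailed in the subsections below.

    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.metrics import classification_report

    # Pre-processed (segmented, stop-word-removed) articles and topics -- placeholders only
    train_docs = ["nuoi ca_tra thuong_pham", "giong lua chiu_man",
                  "ky_thuat nuoi tom_su", "phan_bon cho cay lua"]
    train_topics = ["Fisheries", "Agriculture", "Fisheries", "Agriculture"]
    test_docs = ["xuat_khau ca_tra sang chau_au"]
    test_topics = ["Fisheries"]

    model = Pipeline([
        ("vectorize", TfidfVectorizer()),  # text vectorization (TF*IDF)
        ("classify", LinearSVC()),         # classification model
    ])
    model.fit(train_docs, train_topics)    # training phase (Fig. 3)
    predicted = model.predict(test_docs)   # testing phase (Fig. 4)
    print(classification_report(test_topics, predicted, zero_division=0))  # precision, recall, F1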
1) Data pre-processing
a) File format conversion and word standardization: Because the dataset consists of .doc(x) files, it is necessary to convert them to plain text (.txt) for easy use in most algorithms and in libraries serving automated classification. The format conversion of an input article is based on Apache POI. Accordingly, Apache POI is used to perform read operations on the .doc(x) file and then write the readable content to a .txt file. After converting the file format, word standardization is performed to convert all text characters into lowercase and delete redundant spaces. For example, the sentence "Xử Lý Ngôn Ngữ Tự nhiên là 1 nhánh của Trí tuệ nhân tạo" ("Natural language processing is a branch of artificial intelligence") is standardized to "xử lý ngôn ngữ tự nhiên là 1 nhánh của trí tuệ nhân tạo".

b) Word segmentation: In Vietnamese, spaces do not separate words but syllables. Therefore, the segmentation phase is quite important in NLP. Currently, many tools have been successfully developed to segment Vietnamese words with relatively high accuracy. In this study, the VnTokenizer segmentation tool by [18] was used. The tool was developed based on the integrated methods of maximum matching, weighted finite-state transducers and regular expression parsing, using a dataset of Vietnamese syllables and a Vietnamese vocabulary dictionary. This automated Vietnamese segmentation tool segments Vietnamese text into vocabulary units (words, names, numbers, dates and other regular expressions) with over 95% accuracy. The process of word segmentation using VnTokenizer is described in Fig. 5.

Fig. 5: Process of word segmentation using VnTokenizer [18]

For example, the sentence "xử lý ngôn ngữ tự nhiên là 1 nhánh của trí tuệ nhân tạo" is segmented into "xử_lý ngôn_ngữ tự_nhiên là 1 nhánh của trí_tuệ nhân_tạo".

c) Removing stop words: Stop words are words that commonly appear in all texts of all categories in the dataset, or words that appear in only one or a few texts. This means that stop words carry no meaning or do not contain information worth using. In text classification, the appearance of stop words (e.g. "thì", "là", "mà", "và", "hoặc", "bởi", etc.) not only does not help the classification but also adds noise and reduces the accuracy of the classification process.

For example, after stop words are removed, the sentence "xử_lý ngôn_ngữ tự_nhiên là 1 nhánh của trí_tuệ nhân_tạo" becomes "xử_lý ngôn_ngữ tự_nhiên nhánh trí_tuệ nhân_tạo".

In this study, after converting the article from .doc(x) to .txt format and segmenting words, a stop word dictionary was used to remove stop words.

2) Text vectorization
There are a number of text representation models, e.g. the vector space model based on frequency weighting, the bag-of-words model, and graph-based models. In this study, the vector space model [10] was applied. The vector space model can represent an unformatted text document in a simple and formulaic notation. Because of this advantage, much research on the vector space model is being actively carried out [28]. According to this model, each article is represented as a vector; each component of the vector is a separate term and is assigned a value that is the weight of this term.

The problem of text representation by the vector space model is as follows: the input is a set of j documents in an application domain D, with D = {d1, d2, ..., dj} and m terms (or words) in each document, T = {t1, t2, ..., tm}. In the output, the weight of each term is determined in turn, producing the weighting matrix w_ij, where w_ij is the weight of term t_i in document d_j ∈ D.

For determining the weight of term t_i in text d_j, TF (term frequency) and IDF (inverse document frequency) are commonly used.

TF is used to estimate the occurrence frequency of a term in a certain document. Because each document has its own length and number of words, the appearance frequency of terms differs. To measure the weight of a term, the number of occurrences of the term is divided by the length of the document (the number of words):

$TF(t_i, d_j) = \frac{\text{number of occurrences of term } t_i \text{ in document } d_j}{\text{total number of terms in document } d_j}$

The values w_ij are calculated based on the occurrence frequency (number of times) of a term in a document. Given that f_ij is the number of occurrences of term t_i in document d_j, w_ij is calculated by one of three basic weighting formulas.

IDF is used to estimate the importance of a term in a document set. When calculating TF, all terms have the same importance. However, not all terms in a dataset are important. Terms that do not have a high degree of importance are connecting words (such as "nhưng", "bên cạnh đó", "vì thế", etc.), determiners ("kìa", "đó", "ấy", etc.), and prepositions ("trên", "trong", "ngoài", etc.). It is necessary to reduce the importance of those terms by calculating IDF with the following formula:

$IDF(t_i, D) = \log \frac{\text{total number of documents in } D}{\text{number of documents including } t_i}$

Terms that commonly occur in many documents thus have low weight. The weights can be determined by the formula:

$w_{ij} = \begin{cases} \log\left(\frac{m}{df_i}\right) = \log(m) - \log(df_i), & \text{if } TF_{ij} \ge 1 \\ 0, & \text{if } TF_{ij} = 0 \end{cases}$

TF*IDF is the combination of TF and IDF. This common method calculates the TF*IDF value of a term according to its importance in a document belonging to a document set. Terms with a high TF*IDF value commonly occur in a certain document but occur less in other documents. Through this method, it is possible to filter out common words and retain high-value words.

$TF{*}IDF(t_i, d_j, D) = TF(t_i, d_j) \times IDF(t_i, D)$

The weight calculation formula is:

$w_{ij} = \begin{cases} \left(1 + \log(f_{ij})\right) \cdot \log\left(\frac{N}{df_i}\right), & \text{if } f_{ij} \ge 1 \\ 0, & \text{if } f_{ij} = 0 \end{cases}$

where N is the total number of documents and df_i is the number of documents containing term t_i.
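As a small illustration of these formulas, the following sketch computes TF, IDF and TF*IDF in plain Python (illustrative only; the corpus is a toy example, and the natural logarithm is an assumption, since the log base is not fixed above):

    import math

    def tf(term, doc_tokens):
        # occurrences of the term divided by the total number of terms in the document
        return doc_tokens.count(term) / len(doc_tokens)

    def idf(term, corpus):
        # log of (total documents in D / number of documents containing the term)
        df = sum(1 for doc in corpus if term in doc)
        return math.log(len(corpus) / df) if df else 0.0

    def tf_idf(term, doc_tokens, corpus):
        return tf(term, doc_tokens) * idf(term, corpus)

    corpus = [["xu_ly", "ngon_ngu", "tu_nhien"],
              ["phan_loai", "van_ban", "ngon_ngu"],
              ["phan_loai", "bai_bao", "khoa_hoc"]]
    print(tf_idf("xu_ly", corpus[0], corpus))     # rare term: higher weight
    print(tf_idf("ngon_ngu", corpus[0], corpus))  # term appearing in two documents: lower weight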
IV. EXPERIMENTAL RESULTS

This study used two experimental datasets: 680 scientific articles (10 topics) and 10,000 newsletter articles.

A. Experimental dataset 1: the scientific articles
The experimental dataset consists of 680 scientific articles published in the Can Tho University Journal of Science from 2016 to 2018, belonging to 10 topics as described in Table 1.

TABLE 1: DISTRIBUTION OF SCIENTIFIC ARTICLES IN 10 TOPICS
No | Topics                         | Training samples | Testing samples | Total samples (articles)
1  | Technology                     | 45  | 5  | 50
2  | Environment                    | 54  | 6  | 60
3  | Natural Sciences               | 54  | 6  | 60
4  | Animal husbandry               | 36  | 4  | 40
5  | Biotechnology                  | 27  | 3  | 30
6  | Agriculture                    | 90  | 10 | 100
7  | Fisheries                      | 135 | 15 | 150
8  | Education                      | 36  | 4  | 40
9  | Social sciences and Humanities | 72  | 8  | 80
10 | Economics                      | 63  | 7  | 70
   | Total                          | 612 | 68 | 680

The articles were pre-processed by converting .doc(x) to .txt files, and then word segmentation was performed [18]. The removal of stop words was done using a stop-word-dictionary-based approach, after which 4,095 terms (words) remained. In the modeling process, each document was a weighted word vector; therefore, the modeled dataset was a TF*IDF matrix of terms with a size of 680 * 4,095 elements.

After pre-processing and vectorization, the dataset of articles was trained with the automated text classification algorithms SVM, Naïve Bayes, and kNN. The dataset was automatically split, in which 612 articles (90%) were used as the training set and the 68 remaining articles (10%) as the testing set. The three machine learning algorithms of SVM, Naïve Bayes, and kNN were used to perform text classification. The evaluation based on precision, recall and F1 score is shown in Table 2.

TABLE 2: COMPARISON OF CLASSIFICATION RESULTS FOR SCIENTIFIC ARTICLES AMONG THE ALGORITHMS SVM, NAÏVE BAYES, AND KNN
Topics                         | SVM (Precision / Recall / F1) | Naïve Bayes (Precision / Recall / F1) | kNN (Precision / Recall / F1)
Technology                     | 0.857 / 0.857 / 0.857 | 0.857 / 0.857 / 0.857 | 0.400 / 0.571 / 0.471
Environment                    | 1.000 / 0.333 / 0.500 | 0.400 / 0.333 / 0.364 | 0.667 / 0.333 / 0.444
Natural Sciences               | 0.750 / 1.000 / 0.857 | 0.667 / 0.667 / 0.667 | 0.600 / 1.000 / 0.750
Animal husbandry               | 1.000 / 1.000 / 1.000 | 1.000 / 0.500 / 0.667 | 1.000 / 0.500 / 0.667
Biotechnology                  | 1.000 / 0.500 / 0.667 | 1.000 / 0.500 / 0.667 | 1.000 / 0.500 / 0.667
Agriculture                    | 0.786 / 1.000 / 0.880 | 0.846 / 1.000 / 0.917 | 0.733 / 1.000 / 0.846
Fisheries                      | 0.947 / 1.000 / 0.973 | 0.857 / 1.000 / 0.923 | 0.947 / 1.000 / 0.973
Education                      | 1.000 / 1.000 / 1.000 | 1.000 / 0.500 / 0.667 | 1.000 / 1.000 / 1.000
Social sciences and Humanities | 1.000 / 1.000 / 1.000 | 0.600 / 0.750 / 0.667 | 0.600 / 0.750 / 0.667
Economics                      | 1.000 / 1.000 / 1.000 | 0.900 / 0.818 / 0.857 | 1.000 / 0.545 / 0.706
Average accuracy rate          | 91.2% | 80.9% | 76.5%

It can be seen from Table 2 that the classification performances of the algorithms show their relative effectiveness. The SVM algorithm gives the best classification results, with an accuracy greater than 91%; it is therefore feasible for developing an automated classification system for scientific articles, contributing to a faster and more accurate process of classifying articles. This result is also consistent with many studies, by [6], [7], [12] and [23], in which the SVM method gives text classification results equivalent to or significantly better than those of other classifiers.

However, not all topics are well classified. Even with the best classification algorithm (SVM), there are still some poorly classified topics, such as "Environment" and "Biotechnology" with F1 < 67%, compared to other topics (F1 > 85%). The reason is that the recall of these two topics is quite low (<= 0.5); in other words, the rate of articles accurately predicted to belong to these two topics is not high in reality. This is explained by the overlap of the topics; in other words, an article can belong to two similar topics (e.g. "Agriculture" and "Fisheries"), which was referred to as multi-class classification. The remaining topics have relatively high precision and recall, indicating their distinction from the others.
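To diagnose this kind of topic overlap, a confusion matrix over the test predictions is helpful. The following is a minimal sketch (assuming scikit-learn; y_true and y_pred are placeholders standing in for the test-set labels and the SVM predictions, not the paper's actual results):

    from sklearn.metrics import confusion_matrix

    topics = ["Environment", "Biotechnology", "Agriculture", "Fisheries"]
    y_true = ["Environment", "Environment", "Biotechnology", "Agriculture", "Fisheries", "Fisheries"]
    y_pred = ["Environment", "Agriculture", "Fisheries", "Agriculture", "Fisheries", "Fisheries"]

    matrix = confusion_matrix(y_true, y_pred, labels=topics)
    for topic, row in zip(topics, matrix):
        print(f"{topic:15s}", row)  # row i shows where the true articles of topic i were assigned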
B. Experimental dataset 2: the newsletter articles
The experimental dataset consists of 10,000 articles from VnExpress newsletters collected in 2018, belonging to 10 topics as described in Table 3.

TABLE 3: DISTRIBUTION OF ARTICLES OF VNEXPRESS NEWSLETTERS IN 10 TOPICS
No | Topics   | Training samples | Testing samples | Total samples (articles)
1  | Business | 800   | 200   | 1,000
2  | Culture  | 800   | 200   | 1,000
3  | Health   | 800   | 200   | 1,000
4  | IT       | 800   | 200   | 1,000
5  | Law      | 800   | 200   | 1,000
6  | Life     | 800   | 200   | 1,000
7  | Politics | 800   | 200   | 1,000
8  | Science  | 800   | 200   | 1,000
9  | Sports   | 800   | 200   | 1,000
10 | World    | 800   | 200   | 1,000
   | Total    | 8,000 | 2,000 | 10,000

In experimental dataset 2, the pre-processing steps were carried out as for experimental dataset 1. The dataset was also automatically split, in which 8,000 articles (90%) were used as the training set and the 2,000 remaining articles (10%) as the testing set.

The stop words were removed using the dictionary-based approach. For the training set, 3,485 terms remained; in the modeling process, each document was a weighted word vector, so the modeled training dataset was a TF*IDF matrix of terms with a size of 8,000 * 3,485 elements. For the testing set, 3,961 words remained; each document was likewise modeled as a weighted word vector, so the modeled testing dataset was a TF*IDF matrix of terms with a size of 2,000 * 3,961 elements.

We used the k-fold cross-validation method with k = 3 (a commonly used cross-validation setting). The three machine learning algorithms above were used for classification. The evaluation based on precision, recall and F1 score is shown in Table 4.

TABLE 4: COMPARISON OF CLASSIFICATION RESULTS FOR ARTICLES OF VNEXPRESS NEWSLETTERS AMONG THE ALGORITHMS SVM, NAÏVE BAYES, AND KNN
Topics   | SVM (Precision / Recall / F1) | Naïve Bayes (Precision / Recall / F1) | kNN (Precision / Recall / F1)
Business | 0.775 / 0.915 / 0.839 | 0.807 / 0.880 / 0.842 | 0.513 / 0.766 / 0.615
Culture  | 0.922 / 0.950 / 0.936 | 0.919 / 0.910 / 0.915 | 0.655 / 0.660 / 0.658
Health   | 0.968 / 0.910 / 0.938 | 0.911 / 0.870 / 0.890 | 0.784 / 0.621 / 0.693
IT       | 0.906 / 0.920 / 0.913 | 0.887 / 0.940 / 0.913 | 0.461 / 0.600 / 0.521
Law      | 0.989 / 0.865 / 0.923 | 0.854 / 0.880 / 0.867 | 0.825 / 0.561 / 0.668
Life     | 0.973 / 0.910 / 0.941 | 0.917 / 0.825 / 0.868 | 0.687 / 0.811 / 0.744
Politics | 0.776 / 0.850 / 0.811 | 0.679 / 0.815 / 0.741 | 0.582 / 0.453 / 0.509
Science  | 0.918 / 0.845 / 0.880 | 0.896 / 0.815 / 0.853 | 0.623 / 0.435 / 0.512
Sports   | 0.985 / 0.975 / 0.980 | 0.989 / 0.935 / 0.961 | 0.938 / 0.753 / 0.835
World    | 0.917 / 0.935 / 0.926 | 0.946 / 0.870 / 0.906 | 0.530 / 0.670 / 0.592
Average accuracy rate | 90.1% | 87.6% | 46.7%

The SVM algorithm again obtained relatively high values of precision and recall, and thus F1; its classification accuracy was greater than 90%. The second most effective algorithm was Naïve Bayes, with an accuracy of 87.6%. However, the kNN classifier gave low accuracy, at 46.7%. Normally, kNN performs better with a small number of features than with a large number of features; as the number of features increases, the dimensionality grows, which leads to the problem of overfitting.

The above results show that the combination of natural language processing and a machine learning algorithm (e.g., SVM) is effective for developing an automated article classification system in general.

V. CONCLUSIONS

Based on natural language processing and machine learning algorithms, this study proposed a solution for the automated classification of articles to help authors and editors save time and effort when processing articles in the system. The data pre-processing steps are significant for putting the classification dataset into a standardized format for running the three algorithms of SVM, Naïve Bayes, and kNN. The results showed that the SVM algorithm gives better classification performance than the remaining classifiers.

With the proposed model, it is feasible to extract information from and automatically classify articles submitted to a classification system. Experiments on larger datasets should be conducted in the future.

VI. ACKNOWLEDGEMENT

This study is funded in part by the Can Tho University Improvement Project VN14-P6, supported by a Japanese ODA loan.

VII. REFERENCES

[1] K. Thaoroijam, "A Study on Document Classification using Machine Learning Techniques", IJCSI International Journal of Computer Science Issues, vol. 11, no. 2, 2014.
[2] Y. Li, L. Zhang, Y. Xu, Y. Yao, R. Y. K. Lau and Y. Wu, "Enhancing Binary Classification by Modeling Uncertain Boundary in Three-Way Decisions", IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 7, pp. 1438-1451, 2017.
[3] F. Sebastiani, "Machine learning in automated text categorization", ACM Computing Surveys, vol. 34, no. 1, pp. 1-47, 2002.
[4] C. Aggarwal and C. Zhai, "A Survey of Text Classification Algorithms", Mining Text Data, pp. 163-222, 2012.
[5] M. Bijaksana, Y. Li and A. Algarni, "A Pattern Based Two-Stage Text Classifier", Machine Learning and Data Mining in Pattern Recognition, pp. 169-182, 2013.
[6] B. Boser, I. Guyon and V. Vapnik, "A training algorithm for optimal margin classifiers", Proceedings of the Fifth Annual Workshop on Computational Learning Theory - COLT '92, 1992.
[7] C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition", Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121-167, 1998.
[8] J. Chen, H. Huang, S. Tian and Y. Qu, "Feature selection for text classification with Naïve Bayes", Expert Systems with Applications, vol. 36, no. 3, pp. 5432-5435, 2009.
[9] C.-H. Tsai, "MMSEG: A Word Identification System for Mandarin Chinese Text Based on Two Variants of the Maximum Matching Algorithm". Available: http://technology.chtsai.org/mmseg/.
[10] C. S. Perone, "Machine Learning: Cosine Similarity for Vector Space Models (Part III)", 2013. Available: http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/.
[11] C. Cortes and V. Vapnik, "Support-vector networks", Machine Learning, vol. 20, no. 3, pp. 273-297, 1995.
[12] S. Dumais, J. Platt, D. Heckerman and M. Sahami, "Inductive learning algorithms and representations for text categorization", Proceedings of the Seventh International Conference on Information and Knowledge Management - CIKM '98, 1998.
[13] M. Haddoud, A. Mokhtari, T. Lecroq and S. Abdeddaïm, "Combining supervised term-weighting metrics for SVM text classification with extended term representation", Knowledge and Information Systems, vol. 49, no. 3, pp. 909-931, 2016.
[14] G. H. John and P. Langley, "Estimating continuous distributions in Bayesian classifiers", Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence (UAI'95), 1995.
[15] B. Liu, Y. Dai, X. Li, W. S. Lee and P. S. Yu, "Building text classifiers using positive and unlabeled examples", Third IEEE International Conference on Data Mining, Melbourne, FL, USA, pp. 179-186, 2003.
[16] A. McCallum and K. Nigam, "A Comparison of Event Models for Naive Bayes Text Classification", AAAI-98 Workshop on Learning for Text Categorization, pp. 41-48, 1998.
[17] T. Mitchell, "Machine Learning", 1997. Available: https://dl.acm.org/citation.cfm?id=541177.
[18] N. T. M. Huyen, V. X. Luong and L. H. Phuong, "VnTokenizer". Available: http://vntokenizer.sourceforge.net/.
[19] S. Chakrabarti, Mining the Web. Burlington: Morgan Kaufmann, 2002.
[20] P. Tan, M. Steinbach and V. Kumar, "Introduction to Data Mining". Available: https://www-users.cs.umn.edu/~kumar001/dmbook/index.
[21] T. C. De and P. N. Khang, "Text classification with support vector machine and decision tree", Can Tho University Journal of Science, vol. 21a, 2012 (in Vietnamese).
[22] T. T. T. Thao and V. T. Chinh, "Development of a Vietnamese document classification system", Report of scientific research, 2012 (in Vietnamese).
[23] Y. Yang and J. O. Pedersen, "A Comparative Study on Feature Selection in Text Categorization", Proceedings of the Fourteenth International Conference on Machine Learning (ICML '97), pp. 412-420, 1997.
[24] L. Zhang, Y. Li, C. Sun and W. Nadee, "Rough Set Based Approach to Text Classification", 2013 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT), 2013.
[25] M. Haddoud, A. Mokhtari, T. Lecroq and S. Abdeddaïm, "Combining supervised term-weighting metrics for SVM text classification with extended term representation", Knowledge and Information Systems, vol. 49, no. 3, pp. 909-931, 2016.
[26] N. Thai-Nghe and Q. D. Truong, "An Approach for Building an Automatic Online Consultancy System", Proceedings of the International Conference on Advanced Computing and Applications (ACOMP 2015), pp. 51-58, ISBN-13: 978-1-4673-8234-2, IEEE Xplore, 2015.
[27] K. Gábor, D. Buscaldi, A. Schumann, B. QasemiZadeh, H. Zargayouna and T. Charnois, "SemEval-2018 Task 7: Semantic Relation Extraction and Classification in Scientific Papers", Proceedings of the 12th International Workshop on Semantic Evaluation, 2018.
[28] J. Chang and I. Kim, "Analysis and Evaluation of Current Graph-Based Text Mining Researches", Advanced Science and Technology Letters, 2013.