Text Classification Research Based On Bert Model and Bayesian Network
Abstract—The Bert model is a pre-trained model based on deep learning. On its release it set new best results on eleven NLP tasks, and it has a wide range of applications. The text data of people's livelihood governance is huge and contains a large amount of unstructured data, which increases the space-time complexity of traditional text analysis and mining techniques, so it is very important to choose an appropriate text classification technique. Here we propose to combine the Bert model with a Bayesian network to achieve a new classification of texts. That is, the Bayesian network first performs a two-category classification, giving the approximate category range of each text, and the Bert model then classifies each text into a specific category. The combination of these two methods can greatly reduce the errors caused by the classification defects of either method used alone, thereby improving the accuracy of text classification.

Keywords—Bert model, Bayesian network, text classification

I. INTRODUCTION

The people's livelihood service hotline is one of the important channels through which the general public conveys public opinion to the government, and its data contains a great deal of information about local social hotspots. To fully exploit this information, it is necessary to propose a feasible and reasonable classification method for the texts of the people's livelihood hotline, so that the social issues the public cares about can be identified quickly, providing scientific and reliable support for the government's social governance and people's livelihood management.

Text classification is an important research topic in the field of natural language processing, and there are many theoretical approaches. One of them is k-Nearest Neighbor (kNN), by Cover et al., a machine learning algorithm with a solid theoretical foundation that is widely used in the field of text classification [1]. The Naive Bayes (NB) algorithm originated from the Bayesian probability formula of statistics [2]; Eyheramendy et al. used it in text categorization and achieved good results [3]. Support Vector Machine (SVM) [4], proposed by Vapnik et al., whose basic idea is to find the optimal classification hyperplane, is also widely used in the field of text mining [5,6].

The above algorithms mainly belong to traditional machine learning. In recent years, with the development of algorithms and techniques, text classification has begun to use deep learning methods. Yujun Zhou et al. [7] proposed a hybrid model based on LSTM; by integrating character features into word features, it compensates for the loss of semantic information caused by word segmentation errors, thus improving classification performance. Liu Zejin et al. [8] proposed a short text classification method based on random forests and convolutional neural networks, which fully extracts the features of short texts with convolutional neural networks over double word vectors and then uses random forests as the higher-level classifier. The experimental results show that the method can really improve the classification effect.

With the research of text classification, the development trend can be summarized in three aspects. The first is the emergence of new classification algorithms, which depends on the emergence and development of new algorithms and technologies in the fields of deep learning and machine learning. The second is the further improvement of traditional methods of text classification, including improvements to the principles and algorithms of natural language processing. The third is to improve classification performance by synthesizing methods from various fields according to the actual application scenario, such as combining machine learning and deep learning. This paper presents such a classification system, combining the Bert model with a Bayesian network: the Bayesian network's powerful reasoning ability over uncertain events and the Bert model's highly accurate semantic matching are used together to judge the categories of the texts effectively.

II. BERT AND BERT CLASSIFIER

A. Brief Analysis of the Bert Model

Bert is a large-scale pre-trained language model based on the bidirectional transformer. A pre-trained language representation model is first trained on a large dataset that can be independent of the final task; the learned representation is then fine-tuned with a task-related dataset so that it becomes applicable to the target task.

The input to Bert can be a single sentence or a pair of sentences (for example, [question, answer]). The input must be represented as a linear sequence, and for each word in the input, the input representation is the sum of three embeddings. A visual representation of the embeddings is shown in Fig. 1 below [9].

Fig. 1. Schematic diagram of the BERT model input.

Among them:

• Token embeddings represent the word vectors; the first token is the CLS flag, which is used in classification tasks and can be ignored in non-classification tasks.
that, the Bayesian network's variable-reasoning ability is very powerful. So we can consider using a Bayesian network with the characteristic words as nodes to perform probabilistic reasoning that distinguishes the "livelihood inquiry" class, and then classify the texts of the other categories by other methods.
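The idea above can be sketched with Bayes' rule, treating a few characteristic words as evidence nodes and computing the posterior over the two classes. All priors and word probabilities here are invented toy numbers, not values estimated from the hotline data:

```python
# Minimal sketch of two-category reasoning over characteristic words.
# Priors and per-class word probabilities are made-up toy numbers.
priors = {"livelihood inquiry": 0.3, "other": 0.7}

# P(word appears | class) for a few characteristic word nodes.
likelihoods = {
    "livelihood inquiry": {"subsidy": 0.60, "pension": 0.50, "noise": 0.05},
    "other":              {"subsidy": 0.05, "pension": 0.04, "noise": 0.40},
}

def posterior(observed_words):
    """Posterior over the two categories given which words were observed."""
    scores = {}
    for cls, prior in priors.items():
        p = prior
        for word, present in observed_words.items():
            pw = likelihoods[cls][word]
            p *= pw if present else (1.0 - pw)
        scores[cls] = p
    z = sum(scores.values())  # normalize so the posteriors sum to 1
    return {cls: p / z for cls, p in scores.items()}

post = posterior({"subsidy": True, "pension": True, "noise": False})
print(max(post, key=post.get))  # the more probable category
```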
III. A CLASSIFIER BASED ON THE COMBINATION OF A BAYESIAN NETWORK AND THE BERT MODEL

A. Bayesian Network

The Bayesian network is one of the most typical classification models in text classification. Its probabilistic theoretical basis gives the classification model good accuracy, and it is one of the most widely applied theories.

From the point of view of graph theory, a Bayesian network model is a directed acyclic graph whose nodes are variables with certain causal dependencies and whose edges represent the relationships between them. The strength of the correlation between nodes is described by conditional probability distributions given by the probability parameters; by mining and analyzing the causal dependencies between these variables, the dependencies between the data in the relevant fields are obtained, and the result can then be used for reasoning and prediction on data in those fields.

Let G = (I, E) denote a directed acyclic graph, where I is the set of all nodes in the graph and E is the set of directed edges, and let X = (X_i), i ∈ I, where X_i is the random variable represented by node i. If the joint probability of X can be expressed as

P(x) = \prod_{i \in I} P(x_i \mid x_{pa(i)})    (2)

then X is a Bayesian network relative to the directed acyclic graph G, where pa(i) denotes the "causes" of node i, that is, the parents of i.

In addition, for any set of random variables, the joint probability can be obtained by multiplying the respective local conditional probability distributions:

P(x_1, \ldots, x_n) = P(x_n \mid x_1, \ldots, x_{n-1}) \cdots P(x_2 \mid x_1) P(x_1)    (3)

Figure 3 below shows a simple Bayesian network structure. X = {x_1, x_2, \ldots, x_7} consists of 7 nodes representing 7 random variables, and the set of directed edges E = {e_14, e_15, e_24, e_34, e_35, e_46, e_47, e_57} represents the dependencies between the variables, that is, the conditional probabilities. The joint probability of the nodes x_1, x_2, \ldots, x_7 is then

P(x_1, \ldots, x_7) = P(x_1) P(x_2) P(x_3) P(x_4 \mid x_1, x_2, x_3) \cdots P(x_7 \mid x_4, x_5)    (4)

Fig. 3. Bayesian network topology.

From the concept of the Bayesian network, it can be seen that the first step is to determine the interdependence between the variables, and the second step is to determine the conditional probabilities of these interdependencies according to the obtained Bayesian network structure. The steps to build the network topology are as follows:

1) Defining variables: selecting appropriate variables, or selecting the important factors among them.

2) Structure learning: constructing a directed acyclic graph that reflects the dependence or independence between the variables.

3) Parameter learning: after the network structure of the Bayesian network has been obtained, the parameters of the local distributions can be estimated.

After the Bayesian network has been constructed, it can be used for reasoning: inferring the state of one variable when the states of other variables are given as evidence.
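The factorization of formula (4) can be checked numerically. In the sketch below, the parent sets (x4 from {x1, x2, x3}, x5 from {x1, x3}, x6 from {x4}, x7 from {x4, x5}) are read off Fig. 3, and all conditional probabilities are invented toy numbers; multiplying the local conditionals yields a properly normalized joint distribution:

```python
import itertools

# Toy check of the factorization in formula (4) for a 7-node network.
# All probabilities below are made-up illustrative numbers.
p1, p2, p3 = 0.6, 0.3, 0.5                      # P(x1=1), P(x2=1), P(x3=1)

def bern(p, v):                                  # P(x=v) for a Bernoulli(p)
    return p if v == 1 else 1.0 - p

# Conditional probabilities P(node=1 | parent values), keyed by parent tuple.
p4 = {pa: 0.1 + 0.25 * sum(pa) for pa in itertools.product((0, 1), repeat=3)}
p5 = {pa: 0.2 + 0.30 * sum(pa) for pa in itertools.product((0, 1), repeat=2)}
p6 = {(0,): 0.3, (1,): 0.8}
p7 = {pa: 0.1 + 0.40 * sum(pa) for pa in itertools.product((0, 1), repeat=2)}

def joint(x1, x2, x3, x4, x5, x6, x7):
    """P(x1,...,x7) as the product of each node's local conditional."""
    return (bern(p1, x1) * bern(p2, x2) * bern(p3, x3)
            * bern(p4[(x1, x2, x3)], x4)
            * bern(p5[(x1, x3)], x5)
            * bern(p6[(x4,)], x6)
            * bern(p7[(x4, x5)], x7))

# The joint distribution defined this way sums to 1 over all 2^7 states.
total = sum(joint(*xs) for xs in itertools.product((0, 1), repeat=7))
print(round(total, 10))  # 1.0
```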
B. Bayesian Network Combined with Bert's Classifier

According to the actual situation of social governance texts, this paper puts forward a two-level classification system. First, a Bayesian network classifies the category tendency of each text, mainly into the "livelihood inquiry" and "non-livelihood inquiry" classes, because the Bert model cannot distinguish the "livelihood inquiry" texts very well; Bert then classifies the specific categories under "non-livelihood inquiry". This method can overcome the classification shortcoming of the Bert model (it does not recognize the characteristic words of the "livelihood inquiry" class well) and also avoid the low accuracy caused by using only a Bayesian network, so the classification results should improve compared with either algorithm alone. This section focuses on the classification process of social governance texts by the classifier that combines the Bert model with a Bayesian network.

The main idea of the classifier based on Bert and a Bayesian network is hierarchical classification, which has two stages: the first stage is Bayesian network classification and the second is Bert model classification, after which the two classification results are combined to determine the category of every text. The model framework is shown in Figure 4 below:
Fig. 4. Bayesian network combined with Bert's classifier.

In the Bayesian network stage, all the texts are classified into two camps, realizing the classification of the approximate category range of each text; the Bert stage then further classifies the texts under "non-livelihood inquiry". This method theoretically makes the classification of each text more reasonable, avoiding the errors caused by the shortcomings of a single classifier.
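The two-stage framework can be sketched as follows. Both stage functions are stand-in dummies (keyword rules over made-up words) in place of the trained Bayesian network and Bert classifiers described in the paper:

```python
# Sketch of the two-stage framework: a stand-in "Bayesian network" stage that
# flags livelihood-inquiry texts, and a stand-in "Bert" stage that assigns a
# fine-grained category to the rest. Both stages here are dummy keyword rules.
LIVELIHOOD_WORDS = {"subsidy", "pension"}        # made-up characteristic words

def stage1_is_livelihood(text):
    """Binary stage: does the text belong to the livelihood-inquiry class?"""
    return any(w in text for w in LIVELIHOOD_WORDS)

def stage2_fine_category(text):
    """Fine-grained stage over the non-livelihood texts (dummy rule)."""
    return "environment" if "noise" in text else "transport"

def classify(text):
    if stage1_is_livelihood(text):
        return "livelihood inquiry"
    return stage2_fine_category(text)

print(classify("question about my pension subsidy"))  # livelihood inquiry
print(classify("noise complaint near the school"))    # environment
```

The point of the structure is that stage 2 never sees the texts stage 1 already claimed, which is exactly how the paper keeps Bert away from the class it handles poorly.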
The steps for constructing the Bayesian network classifier are as follows:

1) First, the data are processed by word segmentation, after which words are used as the nodes of the Bayesian network. However, the number of words in the social governance text set may be quite large, so the text-word matrix will be large and dimensionality reduction is necessary; a small number of characteristic words must therefore be selected first.

2) The category node is used as the parent node of the non-category nodes, and the structure learning of the Bayesian network is then carried out.

3) The obtained text-node matrix is used to carry out the parameter learning of the Bayesian network.

4) The test texts are predicted using the constructed Bayesian network.
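The four steps above can be sketched on a tiny made-up, pre-segmented corpus. With the category node as the sole parent of every word node (as in step 2, with no word-word edges), parameter learning reduces to estimating P(word | category), and prediction multiplies the local conditionals:

```python
from collections import defaultdict

# Sketch of steps 1)-4): segmented texts, a text-node matrix, parameter
# learning for a network where the category node is the parent of every
# word node, and prediction. The tiny corpus is made up.
corpus = [
    (["pension", "subsidy"], "livelihood inquiry"),
    (["pension", "office"], "livelihood inquiry"),
    (["noise", "street"], "other"),
    (["noise", "subsidy"], "other"),
]
nodes = sorted({w for words, _ in corpus for w in words})   # selected word nodes

# Text-node matrix: one 0/1 row per text (the input of step 3).
matrix = [[int(n in words) for n in nodes] for words, _ in corpus]

# Parameter learning with Laplace smoothing: P(word present | category).
counts, totals = defaultdict(lambda: defaultdict(int)), defaultdict(int)
for (words, cat), row in zip(corpus, matrix):
    totals[cat] += 1
    for n, present in zip(nodes, row):
        counts[cat][n] += present
p_word = {cat: {n: (counts[cat][n] + 1) / (totals[cat] + 2) for n in nodes}
          for cat in totals}

def predict(words):
    """Step 4: score each category by its product of word-node conditionals."""
    def score(cat):
        s = totals[cat] / len(corpus)                       # class prior
        for n in nodes:
            p = p_word[cat][n]
            s *= p if n in words else 1.0 - p
        return s
    return max(totals, key=score)

print(predict(["pension"]))  # livelihood inquiry
```

A full Bayesian network would also learn edges between word nodes in step 2; this sketch keeps only the category-to-word edges to stay short.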
Fig. 5. Classification flow chart of the Bayesian network.

The success of the Bayesian network construction is closely related to the selection of nodes. The following focuses on the selection method and process of the nodes (that is, the selection of words).

The nodes mainly include the sparse-filtered words and the featured words of each category. The sparse-filtered words are the words that remain after sparse filtering; the featured words are the words unique to each category and absent from the other categories. If the number of such words is too large, they should be sorted by term-frequency value and an appropriate number of them selected. The words obtained above are used as the nodes to construct the Bayesian network. The following are the classification accuracies of the Bayesian network using different words.
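The featured-word selection rule just described can be sketched as follows; the per-category token lists are made up for illustration:

```python
from collections import Counter

# Sketch of the node-selection rule: a "featured word" occurs in exactly one
# category; if there are too many, keep the most frequent ones.
category_tokens = {
    "livelihood inquiry": ["pension", "pension", "subsidy", "office"],
    "other": ["noise", "noise", "street", "office"],
}

def featured_words(cat, k):
    """Words unique to `cat`, sorted by term frequency, truncated to k."""
    others = {w for c, toks in category_tokens.items() if c != cat for w in toks}
    tf = Counter(w for w in category_tokens[cat] if w not in others)
    return [w for w, _ in tf.most_common(k)]

print(featured_words("livelihood inquiry", 2))  # ['pension', 'subsidy']
```

Here "office" is excluded from both categories' featured lists because it occurs in more than one category.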
TABLE III.

Number of nodes (excluding category name)   Sparse-filtered words   Featured words   Accuracy
80                                          29                      51               67.3%
90                                          37                      53               76.9%
90                                          29                      61               79.7%
100                                         29                      71               87.2%
100                                         37                      63               89.5%
110                                         29                      81               91.5%
110                                         37                      73               92.2%
110                                         51                      59               93.1%
…                                           …                       …                …

As the table above shows, the selection of the two kinds of words (the sparse-filtered words and the featured words of each category) is adjusted little by little. After dozens of tests, we find that when the number of nodes is 110 and the numbers of the two kinds of words are 51 and 59, respectively, the accuracy converges to 93.1%, with per-category accuracies of 94.6% and 91.6%. We therefore choose this set of nodes for the later Bayesian experiments.

The Bayesian network is used to distinguish the "livelihood inquiry" texts in the data; the remaining data are the "non-livelihood inquiry" texts, which Bert then classifies further. The Bert classification steps have already appeared in the second section of this paper and are not repeated here. Finally, the accuracy is calculated. To verify the feasibility of the method proposed in this paper, a comparative experiment is performed below.

C. Comparison of Experimental Results

Three experiments were carried out separately: the Bayesian network was used to classify the texts into 28 categories (note: this is a 28-category classification, not the previous 2-category one), Bert was used to classify the texts into 28 categories, and the Bayesian network combined with Bert's classifier was used to classify the texts into 28 categories. The dataset size is 50,000. The results are shown in Table IV below:

TABLE IV.

Classifier                                          Accuracy
Bayesian network                                    76.53%
Bert model                                          91.36%
Bayesian network combined with Bert's classifier    94.59%

The experimental results show that the classification accuracy of the Bayesian network combined with the Bert model is 94.59%, which is 18.06% higher than that of the Bayesian network alone and 3.23% higher than that of the Bert model alone. As a traditional machine learning method, the Bayesian network focuses on the causality between characteristic words and categories while neglecting the semantic and structural information of the context, whereas the deep-learning-based Bert model takes the semantic and contextual information of the text into account, so its accuracy is 14.83% higher than
that of the traditional Bayesian network. The combination of the Bert model and the Bayesian network not only considers the semantic information of the context but also grasps the causality between the characteristic words and the categories; since it combines the advantages of the two methods, the accuracy is further improved.

The following verifies that the method still works well even when the size of the dataset changes. The three methods used the same training and test sets for datasets of 10000, 30000, 50000, 70000 and 100000 texts, respectively. The classification accuracies are shown in the following table:
TABLE V.

Number of texts   Bayesian network   Bert     Bayesian network + Bert
10000             69.72%             90.57%   92.88%
30000             73.63%             91.00%   93.26%
50000             76.53%             91.36%   94.59%
70000             76.22%             91.38%   94.36%
100000            77.71%             91.39%   95.01%

As shown in Table V above, the size of the dataset has a certain impact on the classification accuracy: the larger the dataset, the better the classification effect.
Fig. 6. Trends in classification accuracy as the size of the dataset changes.

As shown in Figure 6 above, the experimental accuracy of the Bayesian network combined with the Bert model is always the best under the different dataset sizes. So we can say that this method is superior to the other two methods on these data, which further proves its feasibility.
For traditional methods, some relatively mature text classification algorithms have the following problems. The LDA topic model is a typical bag-of-words model and cannot grasp the context semantics. An SVM-based classifier may fail to select a correct and effective kernel function for text classification, which affects its efficiency. kNN places high demands on the quality of the text set, so its classification performance is poor on social governance texts with irregular grammar. Neural network algorithms classify short texts with high sparsity poorly. For these reasons, they are not suitable for the classification of social governance texts. The method proposed in this paper is a combination of deep learning and machine learning; it can not only grasp the context well but also make use of the inference ability of the Bayesian network, realizing a classifier for social governance texts with excellent classification performance.

IV. CONCLUSION

This paper realizes a method that combines a Bayesian network with the Bert model to classify social governance texts. The accuracy of the Bayesian network in multi-class classification is not high, while that of the Bert model is much higher; but the Bert model has a fatal disadvantage: when social governance texts are classified, the "livelihood inquiry" texts cannot be distinguished accurately, so a large number of such texts are misclassified into other categories. Therefore, using the "word-category reasoning" of the Bayesian network, the texts of this category are extracted first, and the other texts are then classified by Bert. In this way, the "livelihood inquiry" texts are, to a certain extent, prevented from being classified into other categories by the Bert model, and the accuracy of text classification is improved.

The experimental results confirm that the classification method of the Bayesian network combined with the Bert model performs very well, and that its accuracy is improved compared with the use of either method alone, which indicates that the classification method is feasible and effective.

ACKNOWLEDGMENT

I would like to express my gratitude to all those who helped me during the writing of this thesis. I gratefully acknowledge the help of my supervisor; without his patience, encouragement and professional instructions, it would not have been possible for me to complete this thesis. I also owe much to my friends and classmates for their valuable suggestions and critiques, which were of great help and importance in making this thesis a reality.

REFERENCES

[1] Maillo J, Ramírez S, Triguero I, et al. kNN-IS: An Iterative Spark-based design of the k-Nearest Neighbors Classifier for Big Data[J]. Knowledge-Based Systems, 2017.
[2] Tang B, He H, Baggenstoss P M, et al. A Bayesian Classification Approach Using Class-Specific Features for Text Categorization[J]. IEEE Transactions on Knowledge & Data Engineering, 2016, 28(6):1602-1606.
[3] Bielza C. Discrete Bayesian Network Classifiers: A Survey[J]. ACM Computing Surveys, 2014, 47(1):1-43.
[4] Kanj S, Abdallah F, Denoeux T, et al. Editing training data for multi-label classification with the k-nearest neighbor rule[J]. Pattern Analysis and Applications, 2016, 19(1):145-161.
[5] Zhang X, Zhao J, LeCun Y. Character-level convolutional networks for text classification[C]// International Conference on Neural Information Processing Systems. MIT Press, 2015:649-657.
[6] Cui L, Meng F, Shi Y, et al. A Hierarchy Method Based on LDA and SVM for News Classification[C]// IEEE International Conference on Data Mining Workshop. IEEE, 2015:60-64.
[7] Zhou Y, Xu B, Xu J, et al. Compositional Recurrent Neural Networks for Chinese Short Text Classification[C]// IEEE/WIC/ACM International Conference on Web Intelligence. IEEE, 2017:137-144.
[8] Liu Z J, Wang J. A short-text classification method based on convolution neural network and random forest: CN 107066553 A[P]. 2017.
[9] Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding[J]. arXiv preprint arXiv:1810.04805, 2018.