
Research_on_Web_text_classification_algorithm_based_on_improved_CNN_and_SVM

This paper presents a novel web text classification algorithm that combines an improved Convolutional Neural Network (CNN) and Support Vector Machine (SVM) to enhance classification accuracy. The proposed method addresses shortcomings in existing models by optimizing the CNN structure and effectively utilizing pre-trained word vectors. Experimental results demonstrate a significant improvement in classification performance compared to traditional methods, achieving higher precision and F-measure metrics.


2017 17th IEEE International Conference on Communication Technology

Research on Web Text Classification Algorithm Based on Improved CNN and SVM

Zhiquan Wang, Zhiyi Qu


School of Information Science & Engineering, Lanzhou University
Lanzhou, China
e-mail: wangzhiquan870109@126.com, quzy@lzu.edu.cn

Abstract—Web text classification is one of the research focuses and core technologies in Web information retrieval and data mining, and it has received wide attention and developed rapidly in recent years. The convolutional neural network (CNN), as a kind of deep learning model, can extract the features of text data accurately while reducing model complexity. The support vector machine (SVM) has long been regarded as effective and stable among traditional machine learning algorithms. Based on the characteristics of CNN and SVM, this paper proposes a new method of Web text classification based on an improved CNN and SVM, using a CNN model with a five-layer network structure to extract text features and then an SVM to classify and predict. The method achieves good results on a mixed text data set.

Keywords-web text classification; deep learning; CNN; SVM

I. INTRODUCTION

With the rapid development of computer technology, the popularity of the Internet and the rapid increase in electronic information on the Web, how to retrieve the required content from a great deal of information quickly and accurately has become a common concern. Most online information exists in the form of text, so text classification has become the key to Web information retrieval and information filtering. In recent years, text classification technology has been widely used in various fields. It has become an important method for processing massive text information and has broad development prospects. At present, many scholars at home and abroad have studied text classification using two main approaches: traditional machine learning and the currently popular deep learning.

Compared with other traditional machine learning algorithms, the support vector machine (SVM) has always had the advantages of being effective and stable [1], [2], but it also has problems, such as the difficulty of applying it to large-scale training samples for multi-class data. Deep learning (DL) is a new field of machine learning research, which aims to establish neural networks that simulate the human brain for analysis and learning. As a leading deep learning model, the convolutional neural network (CNN) has become one of the research focuses in many fields such as image recognition [3], [4], speech analysis [5], [6] and natural language processing [7], [8]. Kim [7] combined word vectors with the convolutional neural network and applied them to many natural language processing tasks such as semantic analysis and topic classification. The results were good. However, he did not re-train the word vectors for the data set, so many words did not have their own word vectors, which resulted in an incomplete description of the original data features. Johnson et al. [8] trained the CNN model directly from the original data and then carried out the convolution operation. Although they used representations similar to bags of words for the input data to save space and reduce the number of parameters the network must learn, the final effect was not very satisfactory.

To address the problems and shortcomings above, this paper combines an improved CNN with SVM to complete the Web text classification task in natural language processing. Experiments show that the Web text classification algorithm proposed in this paper greatly improves the classification accuracy compared with the traditional methods.

II. RELATED WORK

A. Convolutional Neural Network (CNN)
The convolutional neural network (CNN) is a typical artificial neural network. In this kind of network, the output of each layer is used as the input of the neurons in the next layer. Multi-layer convolution operations transform the results of each layer nonlinearly until the output layer. In general, the convolutional neural network model used in text analysis is shown in Fig. 1, which includes four parts: embedding layer, convolutional layer, pooling layer and fully connected layer. Compared with the traditional models for image analysis, the difference is that the input layer of the CNN model used in text analysis is the word vector.

Figure 1. Typical CNN model structure

The model structure is described as follows:
• Embedding layer. The embedding layer is a matrix in which the word vectors corresponding to the words

978-1-5090-3944-9/17/$31.00 ©2017 IEEE 1958


Authorized licensed use limited to: K K Wagh Inst of Engg Education and Research. Downloaded on August 28,2024 at 06:21:33 UTC from IEEE Xplore. Restrictions apply.
in the sentence are arranged in order from top to bottom. If the sentence has m words and the dimension of the word vector is n, then the matrix is m × n.
• Convolutional layer. Several feature maps are obtained from the embedding layer by the convolution operation, in which the convolutional window is k × n. Here k is the number of consecutive words covered by the window (its vertical extent), while n is the dimension of the word vector. Because the window spans the full width n, each resulting feature map has a single column.
• Pooling layer. The pooling layer is also called the sub-sampling layer, which reduces the size of the input data. There are many ways of pooling in CNN, but the most common one is max-pooling.
• Fully connected layer. The last layer is usually connected to one or more fully connected layers, and the output of the fully connected layer is the final output.

The three obvious characteristics of convolutional neural networks are local connection, spatial sampling and weight sharing. In a CNN, local connection is used among the neurons, which greatly reduces the number of parameters of the network architecture. In principle, all the convolution features could be used for training and classification, but this would produce a great amount of computation. Therefore it is necessary to reduce the dimension of the features by spatial sampling after obtaining the convolution features of the text, which not only reduces the computational complexity of the upper hidden layers, but also enhances the robustness to displacement. In addition, the weight-sharing network structure of CNN makes it more similar to a biological neural network, and it avoids the complex feature extraction and data reconstruction process of traditional machine learning algorithms. Thus it has been widely used in the field of natural language processing.

B. Support Vector Machine (SVM)
Support Vector Machine (SVM) is a classical machine learning algorithm based on a linear model, whose basic idea is to transform the input space into a high-dimensional feature space by a nonlinear transformation and to find the optimal linear separating surface in the new space. In general, a higher dimension increases the computational complexity, but the SVM algorithm solves this problem by introducing the kernel function, which not only does not increase the computational complexity, but also avoids the "curse of dimensionality".

Different kernel functions can be used to construct different SVM classifiers, and therefore the choice of kernel function is very important. There are three common kernel functions:
• Linear kernel function:

$K(x_1, x_2) = x_1 \cdot x_2$

• Polynomial kernel function:

$K(x_1, x_2) = [(x_1 \cdot x_2) + C]^{\mu}$

• Radial basis function:

$K(x_1, x_2) = \exp\left(-\|x_1 - x_2\|^2 / 2\sigma^2\right)$

III. THE METHOD BASED ON IMPROVED CNN AND SVM

Based on the structure of the traditional convolutional neural network, the proposed model is optimized in five parts: embedding layer, convolutional layer, activation layer, pooling layer and fully connected layer. First, the word embedding of each word is trained on the dataset. It is then used as an input feature of the CNN model and trained iteratively together with the other network parameters. Finally, the extracted features are fed into the SVM classifier for training, which outputs the final result.

A. Embedding Layer Input
The convolutional neural network was first used for image recognition, where the input is two-dimensional image data. Therefore, we first need to preprocess the text data and assemble it into a two-dimensional data matrix to feed into the model.

The text preprocessing is divided into three steps. The first step is to convert the original text into a sequence of words. The second step is to load the word vectors pre-trained with Google's open-source tool word2vec, and then convert the sequence of words into a sequence of indices, where each word has a unique index in the vocabulary. The third step is to expand each word in the sequence into its word vector, creating a word vector matrix so that each word has its corresponding word vector.

B. Model Structure
The model structure used in this paper is shown in Fig. 2. The first layer is the data input layer and the second one is the embedding layer, which together perform the text preprocessing. Then there are 5 parallel CNN models, each of which consists of a double convolution, that is, a combination of convolutional layer and pooling layer. In order to fully consider the context before and after each word and to extract local features at different scales, this paper designs five convolution structures with window sizes of 4 × 100, 5 × 100, 6 × 100, 7 × 100 and 8 × 100 by changing the window length k, and their outputs serve as the input of the next layer. After the convolutional layer, the features are better suited to text classification. On this basis, the pooling layer further screens the features from a global perspective, using max pooling. In addition, the flattening layer makes multidimensional inputs one-dimensional.
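The parallel windows, max pooling and flattening described above can be illustrated with a minimal sketch. It is not the authors' implementation: it uses a single convolution and pooling step per branch instead of the "double convolution", one randomly initialised filter per window length, a hypothetical sentence length of 20 words, and tanh as the convolution activation. The names `conv_feature_map` and `extract_features` are illustrative only.

```python
import math
import random


def conv_feature_map(X, W, b):
    """Slide a k x n filter W down the m x n sentence matrix X.

    Returns the m - k + 1 activations y_i = tanh(W . X[i:i+k] + b),
    i.e. a feature map with a single column.
    """
    k, m = len(W), len(X)
    feature_map = []
    for i in range(m - k + 1):
        total = b
        for r in range(k):
            total += sum(w * x for w, x in zip(W[r], X[i + r]))
        feature_map.append(math.tanh(total))
    return feature_map


def extract_features(X, filters):
    """Run every branch, max-pool each feature map, and flatten the result."""
    return [max(conv_feature_map(X, W, b)) for W, b in filters]


random.seed(0)
m, n = 20, 100  # assumed sentence length; n = 100 matches the k x 100 windows
X = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(m)]

# One random filter per window length (a trained model would learn many filters).
filters = []
for k in (4, 5, 6, 7, 8):
    W = [[random.uniform(-0.05, 0.05) for _ in range(n)] for _ in range(k)]
    filters.append((W, 0.0))

features = extract_features(X, filters)
print(len(features))  # one pooled value per branch
```

Because each window spans the full embedding width, every branch yields a one-column feature map whose length depends on k. Max pooling removes that dependence, which is what allows the five branches to be concatenated into one flat vector.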

Figure 2. The model structure in this paper

The convolution kernel ω performs the convolution operation in a window of length k, and the output features are:

$y_i = f(\omega \cdot x_{i:i+k-1} + b)$

where $x_{i:i+k-1}$ is the part of the input matrix X within a filter sliding window and ω is the weight matrix. Moreover, b is the bias term and f is the activation function. There are many activation functions in neural networks, such as the sigmoid function, the tanh function and so on. In this experiment, the tanh function is used as the general activation function.

These are followed by the merge layer, the fully connected layers and the batch normalization layer. Two fully connected layers are used in this paper. One is used to weight the features of the previous data, and the other is used to extract the output features. The specific calculation is as follows:

$V' = f(\omega V + b)$

where V is the output vector of the previous layer and b is the bias term. Here f is the advanced activation function Leaky ReLU, whose role is to add non-linearity to the model and enhance its expressive power, while removing redundant data and retaining as many data features as possible. In addition, the batch normalization layer is set up to accelerate convergence and prevent overfitting.

Finally, the text data in a sliding window is converted into a fixed-length vector V'', and then the extracted feature V'' is processed by the softmax function and input into the SVM classifier for training, which outputs the classification result.

C. Model Training
The training goal of this model is to minimize the categorical cross-entropy loss. To make the model converge faster, mini-batch gradient descent is used for training; that is, only a small number of samples participate in each update of the weights, which helps the search for the optimal solution while accelerating training. After repeated experiments, the batch size is set to 165, which gives better results.

IV. EXPERIMENT AND RESULT ANALYSIS

A. Data Set
This experiment used a mixed text dataset. One part was selected from the 20-NewsGroup corpus, which is composed of 19997 messages posted by Internet users on Usenet, distributed roughly evenly over 20 different categories with about 1000 messages in each. The other part consisted of news pages from the Internet, which were downloaded with the web crawler Heritrix and subjected to simple preprocessing. A total of 11968 samples were divided into 10 categories, including 7180 training samples and 4788 test samples. In addition, 10% of the data were randomly taken as validation samples during training.

B. Evaluation Criteria
To evaluate the classification effect, we adopt the most common evaluation measures: Precision (P), Recall (R) and F-Measure ($F_\theta$). The formulas are as follows:

$P = C / A$

$R = C / B$

$F_\theta = (\theta^2 + 1) P R / (\theta^2 P + R)$

where A is the total number of texts the model assigns to a category, B is the total number of texts actually belonging to that category, and C is the number of texts correctly assigned to that category. The parameter θ usually takes the value 1. In addition, accuracy is used to evaluate the overall experimental effect; it measures the proportion of test-set inputs that the classifier labels correctly.

C. Experimental Design
Based on the improved CNN model, this paper introduces an SVM classifier and embeds pre-trained word vectors. The specific experimental design is as follows:
• CNN+SVM. First, we used word2vec to train the word vectors, then used the improved CNN model to extract features, and finally adopted a linear-kernel SVM classifier to train and output the results.
• CNN. The traditional 3-layer network structure was used for this CNN model, with the same pre-trained word vector embedding and all other conditions unchanged. The purpose of this experiment was to verify the performance of the improved CNN and SVM by comparing with the results of the new model.

• Traditional machine learning model. On the same dataset, two traditional machine learning algorithms, SVM and KNN, were used for comparison to demonstrate the advantages of the proposed model in the final results.

D. Experimental Results and Analysis
The accuracy, F-measure and loss of the model during training are shown in Fig. 3 and Fig. 4, where acc, fmeasure and loss denote the accuracy, F-measure and loss on the training set, while val_acc, val_fmeasure and val_loss denote the accuracy, F-measure and loss on the validation set. As can be seen from Fig. 3, the curves of the four indicators are basically stable after 25 epochs. Meanwhile, it can be seen from Fig. 4 that the loss has basically been reduced to a minimum level after 25 epochs.

Figure 3. The accuracy and f-measure in training

Figure 4. The loss in training

The final results of the four models are shown in Table I. As can be seen from the table, the proposed algorithm achieves the best results. Compared with the traditional CNN model, the accuracy of the algorithm increases from 87.6% to 92.5% and the F-measure increases from 87.9% to 93.2%. It also performs much better than the SVM algorithm, which is relatively good among the traditional machine learning algorithms.

TABLE I. EXPERIMENTAL RESULTS

Model      Precision  Recall  F-Measure  Accuracy
KNN        0.674      0.662   0.668      0.664
SVM        0.823      0.820   0.821      0.815
CNN        0.884      0.874   0.879      0.876
CNN+SVM    0.934      0.931   0.932      0.925

V. CONCLUSIONS

Text classification has always been one of the important tasks in natural language processing. Deep learning has become one of the research focuses in the field of artificial intelligence since its establishment, and its most representative model is the convolutional neural network. This paper tries to solve the problem of text classification by combining deep learning and traditional machine learning algorithms. First, we embed pre-trained word vectors in the improved CNN model. Second, we use the improved CNN and SVM algorithms to complete the classification task. Experiments show that the proposed algorithm greatly improves the classification accuracy and F-measure compared with the other methods. In future work, users' emotional factors will be considered based on this model, and the model will be verified on public data sets to get better results.

REFERENCES
[1] J. Cao, Z. Fang, D. Zhang, and G. Qu, “Network Traffic Classification Using Feature Selection and Parameter Optimization,” Journal of Communications, vol. 10, no. 10, pp. 828–835, 2015.
[2] F. Yang et al., “A Novel Method for Wireless Communication Signal Modulation Recognition in Smart Grid,” Journal of Communications, vol. 11, no. 9, pp. 813–818, 2016.
[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” in Advances in Neural Information Processing Systems 25 (NIPS 2012), 2012, pp. 1–9.
[4] L. Zhang, Z. He, and Y. Liu, “Deep object recognition across domains based on adaptive extreme learning machine,” Neurocomputing, vol. 239, pp. 194–203, 2017.
[5] G. Hinton et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
[6] O. Abdel-Hamid et al., “Convolutional Neural Networks for Speech Recognition,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 22, no. 10, pp. 1533–1545, 2014.
[7] Y. Kim, “Convolutional Neural Networks for Sentence Classification,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), 2014, pp. 1746–1751.
[8] R. Johnson and T. Zhang, “Effective Use of Word Order for Text Categorization with Convolutional Neural Networks,” in Human Language Technologies: The 2015 Annual Conference of the North American Chapter of the ACL, 2015, pp. 103–112.
