Article
Text3D: 3D Convolutional Neural Networks for Text Classification
Jinrui Wang, Jie Li * and Yirui Zhang
School of Computer Science (National Pilot Software Engineering School), Beijing University of Posts
and Telecommunications, Beijing 100876, China; [email protected] (J.W.); [email protected] (Y.Z.)
* Correspondence: [email protected]
1. Introduction
Text classification is a fundamental problem in Natural Language Processing (NLP).
Many studies can be defined as text classification tasks, such as topic categorization,
sentiment analysis and relation extraction. Deep neural networks have been widely used
in text classification tasks since word order information can be utilized and more semantic
features can be captured, compared with traditional bag-of-words/n-grams models [1].

A CNN is a feedforward network consisting of convolution layers and pooling layers. While simple and shallow convolutional neural networks (CNNs) [2,3] were proposed earlier, deeper and more complex neural networks have also been studied. Examples are deep character-level CNNs [4,5], a complex combination of CNNs and recurrent neural networks (RNNs) [6] and RNNs in a word-sentence hierarchy [7]. The Convolutional Neural Network family has several influential architectures. Essentially, the convolution layer converts every small patch of data into a vector at every location, which can be done in parallel. The data can be either the original text/image or the output of the previous layer, and a location can be, for example, a three-word window around each word. There have been several recent studies of CNNs for text categorization in large-data settings. For example, Conneau et al. [5] found that very deep 32-layer character-level CNNs outperform deep 9-layer character-level CNNs. However, very shallow 1-layer word-level CNNs [3] were shown to be more accurate and much faster than the very deep 32-layer character-level CNNs. Zhang et al. [8] extended Kim's [2] model and empirically identified good hyperparameter settings. Kim's TextCNN [2] inspired part of the method proposed in this paper. DPCNN [9] deepens a word-level network to capture a global representation of the text, proposing a deep pyramid convolutional neural network that achieves excellent accuracy by increasing network depth. ResNet [10], a notable Convolutional Neural Network proposed by Microsoft Research, performs well in image-classification and object-detection tasks. ResNet is easy to optimize and improves accuracy by considerably increasing depth; its internal residual blocks use skip connections, which alleviate the vanishing-gradient problem caused by increasing network depth. VGG [11], proposed by the Visual Geometry Group at Oxford, focuses on the impact of convolutional network depth on the accuracy of large-scale image recognition; its highlight is replacing large convolution kernels with stacks of small (3×3) convolutional filters, which reduces the number of parameters. The original motivation of Google's Inception [12] architecture can be summarized as "wanting it all": instead of choosing between many filters of the same size or a pooling layer, as in a regular CNN, Inception aggregates all of these operations in a single layer.
Going further than the 2D CNNs mentioned above, 3D CNNs can learn an additional feature dimension. Previous work [13] has confirmed that 3D CNNs are good feature-learning machines in the field of computer vision because of their spatiotemporal feature learning. However, 3D CNNs are little used in text classification because it is not obvious what the feature of the third dimension should be.
For classification tasks, the linguistic features provided to the classifier are crucial
as they encode relevant information about the input in the embedding space. Recently,
large pretraining language models such as ELMO [14], BERT [15], XLNet [16] and so on
have also shown their outstanding performance in all kinds of NLP tasks, including text
classification. Transformer-based architectures [17] and BERT [15] in particular achieve
remarkable results in natural language processing. Experiments have shown that BERT outperforms previous state-of-the-art models on eleven NLP tasks, including the GLUE benchmark [18],
by a significant margin. For classification tasks, the most common approach is to feed
the [CLS] representation obtained from the last layer (the hidden state of the [CLS] token
from the last transformer encoder layer) to softmax [19], which outputs the probability of
a label. In addition to BERT, many researchers also focus on other pretrained language
models. Pre-training or fine-tuning BERT on downstream datasets can be computationally expensive because it contains a large number of parameters, and the reasons for its effectiveness remain somewhat unclear. ALBERT [20] proposed a lightweight version of BERT that uses parameter-reduction techniques to lower memory consumption, and it shows consistent improvements on tasks with multi-sentence inputs. RoBERTa [21] presented a study of BERT pretraining, applying alternative pretraining objectives and tasks and showing their impact (along with that of the pretraining data) on performance. XLNet [16] improved on BERT with autoregressive pretraining, learning bidirectional contexts via an expected-likelihood objective and addressing the limitation of BERT that dependencies between masked positions are ignored. SMART [22] proposed a learning framework that uses smoothness-inducing regularization and Bregman proximal point optimization to achieve effective fine-tuning and better generalization. SMART+RoBERTa [22] and
SMART+BERT [22] show performance improvements over BERT [15] and RoBERTa [21] in
various NLP tasks.
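As a concrete point of reference for the [CLS]-based fine-tuning approach described above, the following minimal sketch shows how a sentence is typically classified with BERT; it assumes the Hugging Face transformers library and the bert-base-uncased checkpoint, and it is a baseline illustration, not the method proposed in this paper.

```python
# Minimal sketch of the standard BERT [CLS] classification baseline
# (assumes the Hugging Face `transformers` package; not the Text3D model).
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
classifier = torch.nn.Linear(bert.config.hidden_size, 2)  # e.g., 2 sentiment labels

inputs = tokenizer("the movie was surprisingly good", return_tensors="pt")
outputs = bert(**inputs)
cls_hidden = outputs.last_hidden_state[:, 0]           # hidden state of the [CLS] token
probs = torch.softmax(classifier(cls_hidden), dim=-1)   # probability distribution over labels
```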
However, the BERT-based methods and techniques proposed in recent years in natural language processing have not yet been combined with the 3D Convolutional Neural Networks that are widely used in computer vision. In previous text classification work, the word embeddings learned by the last layer of BERT are fed directly to a simple linear classifier, ignoring the structural information about language that BERT has learned. Jawahar et al. [23] have found that the layers of BERT encode a rich hierarchy of linguistic information about the input words, with surface features in the bottom layers, syntactic features in the middle layers and semantic features in the top layers.
We propose an effective approach for hierarchical feature learning in text classification using a three-dimensional convolutional network (Text3D). Text3D has a multi-filter structure that focuses on the contexts between words and the connections between hierarchies. Text3D utilizes word order, word embeddings and the hierarchy information of the BERT encoder layers as its three feature dimensions. We empirically show that these learned features, combined with a simple linear classifier, can yield good performance on text classification tasks. The
contributions of this paper are summarized as follows:
• We propose a three-dimensional convolutional network (Text3D) that uses word order, word embeddings and the hierarchy information of the BERT encoder layers as its three feature dimensions.
• We utilize a well-designed 3D convolution mechanism and multiple filters to capture text representations structured in these three dimensions, as produced by the pretrained language model BERT.
• We conduct extensive experiments on datasets of different scales and types. The
results show that our proposed model produces significant improvements on baseline
models. Furthermore, we conduct additional experiments to confirm that hierarchy
features of different BERT layers encode different types of word knowledge.
2. Our Approach
Overview of Text3D: We try to identify a good architecture for 3D ConvNets. Our
Text3D classification model is composed of two parts: text representation and a 3D CNN
model. The overall structure can be seen in Figure 1. We fix the word-receptive field of each kernel to h × k × s, where h is the number of words covered, k is the word-embedding dimension (spanned in full) and s is the depth of the kernel along the BERT-layer dimension. The text representation is produced by multiple layers of the BERT encoder. The key constructions of Text3D are as follows:
• A text representation with three dimensions: word order, word embedding and the hierarchy information of BERT (a minimal construction sketch is given after Figure 1).
• A well-designed 3D convolution mechanism with filters that extract features along these three dimensions from the text representation.
Figure 1. The structure of Text3D classification model, which is composed of text representation and
a 3D CNN model.
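The following minimal sketch illustrates how such a three-dimensional text representation can be assembled by stacking the hidden states of several BERT encoder layers. It assumes the Hugging Face transformers library, and the function name build_text3d_input is a hypothetical illustration of the idea rather than the authors' released implementation.

```python
# Sketch: stack the hidden states of several BERT encoder layers into a
# (layers x tokens x embedding) tensor, i.e., the three feature dimensions of Text3D.
# Assumes Hugging Face `transformers`; the function name is illustrative only.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

def build_text3d_input(sentence, num_layers=12, max_len=128):
    enc = tokenizer(sentence, padding="max_length", truncation=True,
                    max_length=max_len, return_tensors="pt")
    with torch.no_grad():
        hidden_states = bert(**enc).hidden_states  # tuple: embeddings + 12 encoder layers
    # Keep the top `num_layers` encoder layers as the hierarchy dimension.
    stacked = torch.stack(hidden_states[-num_layers:], dim=1)  # (1, layers, tokens, dim)
    return stacked  # treated as a single-channel 3D "volume"

x = build_text3d_input("the plot is thin but the acting is great")
print(x.shape)  # torch.Size([1, 12, 128, 768])
```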
c_i = f(W · x_{i:i+h−1} + b)    (3)

Here, b ∈ R is a bias term and f is a non-linear function such as the hyperbolic tangent. This kernel is applied to each possible window of words in the sentence, {x_{1:h}, x_{2:h+1}, ..., x_{n−h+1:n}}, to produce a feature map

c = [c_1, c_2, ..., c_{n−h+1}]    (4)

with c ∈ R^{(n−h+1)×(d−s+1)}. We then apply a max-over-time pooling operation [25] over the feature map and take the maximum value ĉ = max(c) as the feature corresponding to this particular kernel. The idea is to capture the most important feature, the one with the highest value, for each feature map.
The model uses multiple filters (with varying cube sizes) to obtain multiple features.
These features form the penultimate layer and are fed to a fully connected softmax layer
whose output is the probability distribution over labels. In the single-channel architecture illustrated in Figure 1, the filters are applied and the results are summed to calculate c_i in Equation (3). For regularization, we employ dropout on the penultimate layer with a constraint on the l2-norms of the weight vectors [26]. Dropout prevents the co-adaptation of hidden units by randomly dropping out (i.e., setting to zero) a proportion p of the hidden units during forward and backward propagation.
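The sketch below puts the pieces of this section together: 3D convolution filters spanning h words, the full embedding width and s BERT layers (Equation (3)), max-over-time pooling (Equation (4)), multiple filter sizes, dropout and a softmax classifier. Module names, shapes and hyperparameter values are illustrative assumptions made for the sketch, not the authors' released code.

```python
# Sketch of a Text3D-style classifier head over the stacked BERT layers.
# Shapes and hyperparameters are illustrative; see Equations (3) and (4).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Text3DHead(nn.Module):
    def __init__(self, num_layers=12, emb_dim=768, num_classes=2,
                 window_sizes=(2, 3, 4), depth=4, num_filters=100, dropout=0.5):
        super().__init__()
        # Each 3D filter covers `depth` layers, `h` words and the full embedding width.
        self.convs = nn.ModuleList([
            nn.Conv3d(in_channels=1, out_channels=num_filters,
                      kernel_size=(depth, h, emb_dim))
            for h in window_sizes
        ])
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(num_filters * len(window_sizes), num_classes)

    def forward(self, x):
        # x: (batch, layers, tokens, emb_dim) -> add a channel dimension for Conv3d.
        x = x.unsqueeze(1)
        pooled = []
        for conv in self.convs:
            c = torch.tanh(conv(x))             # feature maps, Equation (3)
            c = c.flatten(start_dim=2)          # merge the remaining spatial dims
            pooled.append(c.max(dim=2).values)  # max-over-time pooling, Equation (4)
        feats = self.dropout(torch.cat(pooled, dim=1))
        return F.log_softmax(self.fc(feats), dim=-1)

head = Text3DHead()
logits = head(torch.randn(8, 12, 128, 768))  # e.g., a batch of 8 stacked-BERT inputs
print(logits.shape)  # torch.Size([8, 2])
```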
3. Experiments
We evaluate our proposed Text3D on two different text classification tasks, including
sentiment analysis and topic categorization. We report the experiments with Text3D in
comparison with previous models and baselines. We will make the code publicly available.
Table 2. Accuracy on all datasets. The best results are shown in bold.
Experimental results on the four text classification datasets are shown in Table 2, which reports classification accuracy on the sentiment analysis and topic categorization tasks. Our proposed Text3D achieves the best performance among all baseline models on three of the four datasets. On the Yelp P. dataset, Text3D ranks second, with performance comparable to DPCNN [9], a deep pyramid convolutional neural network for text categorization.
Compared with CNN-based models, Text3D outperforms TextCNN, shallow (word/character-level) CNNs and the deep CNN (DPCNN) by a significant margin on topic categorization tasks. The reason is that Text3D is able to capture structural features through the extra dimension that 2D CNNs do not have. Compared with the deep 2D convolutional neural network, utilizing an extra dimension of features is more efficient on small datasets than simply increasing the number of layers. The shallow convolutional network also has advantages in model architecture, computational complexity and computational efficiency.
Compared with the fine-tuned BERT model, Text3D clearly outperforms fine-tuned BERT on all four datasets and both text classification tasks (sentiment analysis and topic categorization). Text3D exceeds the baseline BERT in accuracy by 0.51%, 1.13%, 1.17% and 0.60%, respectively. Owing to its feature-extraction ability, the 3D convolutional neural network makes full use of the representations from different layers of BERT, which is why Text3D outperforms fine-tuned BERT on text classification. Structural information generated by different BERT layers can be integrated through the filters of the 3D CNN, so the 3D convolutional neural network makes more use of the BERT embeddings. A feature-extraction mechanism that uses both the token order and the semantic representations of BERT performs better than a single classifier placed directly after BERT.
TextCNN and the other CNN-based baselines (character-level CNN, etc.) are influential 2D convolutional models for text classification. The test performance indicates that the improvements over these baseline models are significant. For the small dataset AGNews, our method outperforms TextCNN by 5.71% and exceeds DPCNN by a margin of 0.66%. On the larger Yahoo dataset, our method achieves the best performance over the baselines, outperforming the CNNs by 5.75%. Text3D increases accuracy on DBpedia by 3.09% compared with TextCNN. Text3D also produces substantial improvements over the accuracy of the pretrained BERT model, e.g., 1.27% on Yelp P., which verifies the effectiveness of the proposed framework.
4. Discussion
In this section, we present more in-depth analyses and discussions on Text3D. We
study the influence of some important hyper-parameters on the performance of Text3D,
including different hierarchies of BERT layers and the kernel depth. We analyze the effect of
hierarchy information from BERT in an ablation study.
Figure 2. Accuracy (%) of text classification using different hierarchies of BERT layers.
For the small dataset AGNews, the optimal kernel depth is 2. For large datasets such as Yahoo, a depth of 6 achieves the best performance. A larger filter depth integrates more layers, so the feature map contains more linguistic information about each word, which may increase the influence of word knowledge. As Figure 3 shows, for the smaller dataset AGNews, filters with smaller depth perform better; further increasing the amount of extracted feature information increases model complexity and causes a performance drop due to overfitting. However, for the larger dataset Yahoo, filters with larger depth perform better; in this situation, filters with larger depth can be applied to learn more linguistic information about the words. The results also show that using different depths for the multiple filters within one convolution brings little improvement: filters varying in more than one dimension (depth and height) offer little advantage for classification, owing to overfitting caused by model complexity.
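For concreteness, the kernel depth discussed here corresponds to a single hyperparameter of the hypothetical Text3DHead module sketched in Section 2; the instantiation below is purely illustrative of that choice and is not a reported configuration.

```python
# Illustrative only: selecting the filter depth s per dataset scale,
# mirroring the observations above (small data -> shallow kernels).
# `Text3DHead` is the hypothetical module sketched in Section 2.
head_agnews = Text3DHead(depth=2, num_classes=4)   # AGNews: 4 topic classes
head_yahoo  = Text3DHead(depth=6, num_classes=10)  # Yahoo! Answers: 10 classes
```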
Figure 4. Accuracy (%) of text classification using GloVe and BERT embeddings separately.
Table 5. Ablation study on the contributions of BERT and the 3D convolution network architecture.
Comparison between normal (static) word embeddings and BERT tokens.
We replace the BERT embeddings with normal GloVe [27] word embeddings, in which the vector for each token in the dictionary is fixed. Accuracy drops greatly on all datasets. On the larger Yahoo dataset, which contains more sentences, the fixed token embeddings perform particularly poorly: a fixed embedding for a given word is limited and captures little contextual knowledge. This demonstrates the advantage of utilizing semantic embeddings for text. The linguistic signals encoded by BERT contain richer word knowledge, which can improve the effectiveness of text classification.
Author Contributions: J.L. contributed to the conception of the study; J.W. performed the experiment,
contributed significantly to analysis and wrote the manuscript; Y.Z. helped perform the analysis with
constructive discussions. All authors have read and agreed to the published version of the manuscript.
Funding: This research was funded by the National Natural Science Foundation of China, grant number U22B2019.
Data Availability Statement: The data that support the findings of this study are openly available in public repositories.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Wang, S.; Manning, C. Baselines and bigrams: Simple, good sentiment and topic classification. In Proceedings of the 50th Annual
Meeting of the Association for Computational Linguistics, Jeju Island, Republic of Korea, 8 July 2012; pp. 90–94.
2. Kim, Y. Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 Conference on Empirical Methods
in Natural Language Processing, Doha, Qatar, 25 October 2014; pp. 1746–1751.
3. Johnson, R.; Zhang, T. Effective use of word order for text categorization with convolutional neural networks. In Proceedings of
the North American Chapter of the Association for Computational Linguistics Human Language Technologies, Denver, CO, USA,
31 May 2015.
4. Zhang, X.; Zhao, J.; LeCun, Y. Character-level convolutional networks for text classification. In Proceedings of the Annual
Conference on Neural Information Processing Systems 28, Montreal, QC, Canada, 7 December 2015.
5. Conneau, A.; Schwenk, H.; Barrault, L.; LeCun, Y. Very deep convolutional networks for natural language processing. arXiv 2016,
arXiv:1606.01781v1.
6. Tang, D.; Qin, B.; Liu, T. Document modeling with gated recurrent neural network for sentiment classification. In Proceedings of
the Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17 September 2015.
7. Yang, Z.; Yang, D.; Dyer, C.; He, X.; Smola, A.; Hovy, E. Hierarchical attention networks for document classification. In Proceedings
of the North American Chapter of the Association for Computational Linguistics Human Language Technologies, San Diego, CA,
USA, 12 June 2016.
8. Zhang, Y.; Wallace, B. A sensitivity analysis of (and practitioners’ guide to) convolutional neural networks for sentence classifica-
tion. In Proceedings of the 8th International Joint Conference on Natural Language Processing, Taipei, Taiwan, 27 November 2016;
pp. 253–263.
9. Johnson, R.; Zhang, T. Deep Pyramid Convolutional Neural Networks for Text Categorization. In Proceedings of the 55th Annual
Meeting of the Association for Computational Linguistics, Vancouver, BC, Canada, 30 July 2017.
10. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27 June 2016.
11. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
12. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. arXiv 2015,
arXiv:1512.00567.
13. Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning Spatiotemporal Features with 3D Convolutional Networks. In
Proceedings of the 2015 IEEE International Conference on Computer Vision, Santiago, Chile, 7 December 2015; pp. 4489–4497.
14. Peters, M.E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; Zettlemoyer, L. Deep contextualized word representations.
arXiv 2018, arXiv:1802.05365.
15. Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding.
In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human
Language Technologies, Minneapolis, MN, USA, 2 June 2019; pp. 4171–4186.
16. Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.; Le, Q.V. XLNet: Generalized autoregressive pretraining for language
understanding. In Proceedings of the 2019 Conference on Neural Information Processing Systems, Vancouver, BC, Canada,
8 December 2019; pp. 5753–5763.
17. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. In
Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing
Systems, Long Beach, CA, USA, 4 December 2017; pp. 6000–6010.
18. Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; Bowman, S. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural
Language Understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural
Networks for NLP, Brussels, Belgium, 1 November 2018; pp. 353–355.
19. Sun, C.; Qiu, X.; Xu, Y.; Huang, X. How to fine-tune BERT for text classification? In Proceedings of the 18th China National
Conference (CCL 2019), Kunming, China, 18 October 2019; pp. 194–206.
20. Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. ALBERT: A lite BERT for self-supervised learning of language
representations. In Proceedings of the International Conference on Learning Representations, Online, 26 April 2020.
21. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach. arXiv 2019, arXiv:1907.11692.
22. Jiang, H.; He, P.; Chen, W.; Liu, X.; Gao, J.; Zhao, T. SMART: Robust and efficient finetuning for pre-trained natural language mod-
els through principled regularized optimization. In Proceedings of the 58th Annual Meeting of the Association for Computational
Linguistics, Online, 6 July 2020; pp. 2177–2190.
23. Jawahar, G.; Sagot, B.; Seddah, D. What does BERT learn about the structure of language? In Proceedings of the 57th Annual
Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July 2019; pp. 3651–3657.
24. Simonyan, K.; Zisserman, A. Two-stream convolutional networks for action recognition in videos. In Proceedings of the 27th
International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 8 December 2014; pp. 568–576.
25. Collobert, R.; Weston, J.; Bottou, L.; Karlen, M.; Kavukcuoglu, K.; Kuksa, P. Natural Language Processing (Almost) from Scratch.
J. Mach. Learn. Res. 2011, 12, 2493–2537.
26. Hinton, G.E.; Srivastava, N.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R.R. Improving neural networks by preventing
co-adaptation of feature detectors. arXiv 2012, arXiv:1207.0580.
27. Pennington, J.; Socher, R.; Manning, C.D. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference
on Empirical Methods in Natural Language Processing, Doha, Qatar, 26 October 2014; pp. 1532–1543.
28. Hendrycks, D.; Gimpel, K. Gaussian error linear units (gelus). arXiv 2016, arXiv:1606.08415.
29. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks
from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958.
30. Joulin, A.; Grave, E.; Bojanowski, P.; Mikolov, T. Bag of tricks for efficient text classification. arXiv 2016, arXiv:1607.01759v3.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.