Sub-Character Tokenization for Chinese Pretrained Language Models

Si, Chenglei; Zhang, Zhengyan; Chen, Yingfa; Qi, Fanchao; Wang, Xiaozhi; Liu, Zhiyuan; Wang, Yasheng; Liu, Qun; Sun, Maosong

arXiv:2106.00400 (cs)

[Submitted on 1 Jun 2021 (v1), last revised 14 Feb 2023 (this version, v3)]

Abstract: Tokenization is fundamental to pretrained language models (PLMs). Existing tokenization methods for Chinese PLMs typically treat each character as an indivisible token. However, they ignore the unique feature of the Chinese writing system where additional linguistic information exists below the character level, i.e., at the sub-character level. To utilize such information, we propose sub-character (SubChar for short) tokenization. Specifically, we first encode the input text by converting each Chinese character into a short sequence based on its glyph or pronunciation, and then construct the vocabulary based on the encoded text with sub-word segmentation. Experimental results show that SubChar tokenizers have two main advantages over existing tokenizers: 1) They can tokenize inputs into much shorter sequences, thus improving the computational efficiency. 2) Pronunciation-based SubChar tokenizers can encode Chinese homophones into the same transliteration sequences and produce the same tokenization output, hence being robust to homophone typos. At the same time, models trained with SubChar tokenizers perform competitively on downstream tasks. We release our code and models at this https URL to facilitate future work.
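
As a rough illustration of the first step described in the abstract (converting each character into a short pronunciation-based sequence), the sketch below uses a tiny hand-written lookup table. The table `TOY_PINYIN`, the separator `SEP`, and the function `encode_pronunciation` are hypothetical names introduced here for illustration; they are not taken from the authors' released code, and a real system would need a full character-to-transliteration dictionary followed by sub-word segmentation over the encoded text.

```python
# Hypothetical toy table: Chinese character -> toneless pinyin.
# Only a handful of characters are covered, purely for illustration.
TOY_PINYIN = {
    "他": "ta",   # "he"
    "她": "ta",   # "she" (homophone of 他)
    "是": "shi",  # "to be"
    "事": "shi",  # "matter" (homophone of 是)
    "人": "ren",  # "person"
}

# Assumed boundary marker appended after each character's transliteration,
# so that character boundaries stay recoverable in the encoded string.
SEP = "#"


def encode_pronunciation(text, table=TOY_PINYIN, sep=SEP):
    """Replace each known character with its transliteration plus a separator.

    Characters missing from the table are passed through unchanged.
    """
    pieces = []
    for ch in text:
        if ch in table:
            pieces.append(table[ch] + sep)
        else:
            pieces.append(ch)
    return "".join(pieces)


# A correct sentence and a variant containing two homophone typos.
correct = encode_pronunciation("他是人")
typo = encode_pronunciation("她事人")

print(correct)          # ta#shi#ren#
print(correct == typo)  # True
```

Because both inputs map to the same encoded string, any sub-word vocabulary built on top of this encoding would segment them identically, which is the robustness property the abstract describes for pronunciation-based SubChar tokenizers.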

Comments:	Accepted at TACL
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2106.00400 [cs.CL]
	(or arXiv:2106.00400v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2106.00400

Submission history

From: Chenglei Si
[v1] Tue, 1 Jun 2021 11:20:02 UTC (49 KB)
[v2] Thu, 23 Dec 2021 02:26:36 UTC (1,243 KB)
[v3] Tue, 14 Feb 2023 21:07:45 UTC (1,398 KB)
