VECO 2.0: Cross-lingual Language Model Pre-training with Multi-granularity Contrastive Learning

Zhang, Zhen-Ru; Tan, Chuanqi; Huang, Songfang; Huang, Fei

arXiv:2304.08205 (cs)

[Submitted on 17 Apr 2023]

Abstract: Recent studies have demonstrated the potential of cross-lingual transferability by training a unified Transformer encoder for multiple languages. In addition to the masked language model objective, existing cross-lingual pre-training works leverage sentence-level contrastive learning or plug in an extra cross-attention module to compensate for insufficient cross-lingual alignment. Nonetheless, synonym pairs in bilingual corpora are neither exploited nor aligned, even though aligning them matters more for token-level tasks than establishing sentence-level interdependence. In this work, we propose VECO 2.0, a cross-lingual pre-trained model based on contrastive learning with multi-granularity alignments. Specifically, sequence-to-sequence alignment is introduced to maximize the similarity of parallel pairs and minimize that of non-parallel pairs. Token-to-token alignment is then integrated to pull together synonymous tokens, mined with a thesaurus dictionary, and separate them from the other unpaired tokens in a bilingual instance. Experiments on the XTREME benchmark show the effectiveness of the proposed strategy for cross-lingual model pre-training.
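
The two alignment granularities described in the abstract can be illustrated with a small contrastive-loss sketch. The abstract does not give the loss formulation, so everything below is an assumption made for illustration: an InfoNCE-style loss over cosine similarities, a temperature of 0.1, in-batch negatives at the sequence level, and hypothetical function names (`info_nce`, `sequence_level_loss`, `token_level_loss`). It is not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for one anchor (assumed form, not from the paper).

    Returns -log softmax of the positive's scaled similarity against
    the positive plus all negatives.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with the max subtracted for numerical stability.
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denom - logits[0]


def sequence_level_loss(src_vecs, tgt_vecs, temperature=0.1):
    """Sequence-to-sequence alignment: tgt_vecs[i] is assumed to be the
    translation of src_vecs[i]; the other targets act as negatives."""
    losses = []
    for i, src in enumerate(src_vecs):
        negatives = [t for j, t in enumerate(tgt_vecs) if j != i]
        losses.append(info_nce(src, tgt_vecs[i], negatives, temperature))
    return sum(losses) / len(losses)


def token_level_loss(token_vecs, synonym_pairs, temperature=0.1):
    """Token-to-token alignment within one bilingual instance.

    synonym_pairs holds (i, j) index pairs of tokens assumed to be
    dictionary synonyms; all remaining tokens act as negatives.
    """
    if not synonym_pairs:
        return 0.0
    losses = []
    for i, j in synonym_pairs:
        negatives = [v for k, v in enumerate(token_vecs) if k not in (i, j)]
        losses.append(info_nce(token_vecs[i], token_vecs[j], negatives, temperature))
    return sum(losses) / len(losses)
```

In this sketch both granularities share the same loss and differ only in what counts as a positive and a negative: translated sentences against other sentences in the batch, or synonymous tokens against the remaining tokens of the same bilingual instance.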

Comments:	Technical Report for AliceMind's VECO 2.0 (ranked 1st on the XTREME leaderboard on March 17, 2023)
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2304.08205 [cs.CL]
	(or arXiv:2304.08205v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2304.08205

Submission history

From: Chuanqi Tan
[v1] Mon, 17 Apr 2023 12:23:41 UTC (271 KB)
