Utilizing Character and Word Embeddings for Text Normalization with Sequence-to-Sequence Models

Watson, Daniel; Zalmout, Nasser; Habash, Nizar

arXiv:1809.01534 (cs)

[Submitted on 5 Sep 2018]

Abstract: Text normalization is an important enabling technology for several NLP tasks. Recently, neural-network-based approaches have outperformed well-established models in this task. However, in languages other than English, there has been little exploration in this direction. Both the scarcity of annotated data and the complexity of the language increase the difficulty of the problem. To address these challenges, we use a sequence-to-sequence model with character-based attention which, in addition to its self-learned character embeddings, uses word embeddings pre-trained with an approach that also models subword information. This gives the neural model access to more linguistic information that is especially suitable for text normalization, without requiring large parallel corpora. We show that providing the model with word-level features closes the gap, allowing the neural network approach to achieve a state-of-the-art F1 score on a standard Arabic language correction shared task dataset.
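
The abstract does not spell out how the character and word embeddings are combined. As an illustration only, the sketch below assumes one plausible scheme: each character's input vector is its character embedding concatenated with the pre-trained embedding of the word that contains it. The function name build_char_inputs, the lookup tables, and the zero-vector fallback for spaces and unknown items are hypothetical choices, not details taken from the paper.

```python
from typing import Dict, List

Vector = List[float]


def build_char_inputs(
    sentence: str,
    char_emb: Dict[str, Vector],
    word_emb: Dict[str, Vector],
    char_dim: int,
    word_dim: int,
) -> List[Vector]:
    """Return one vector per character: [char embedding ; word embedding].

    Assumption: unknown characters, unknown words, and separating spaces
    fall back to zero vectors. A real system would likely use learned
    unknown-token vectors or subword-based word vectors instead.
    """
    zero_char = [0.0] * char_dim
    zero_word = [0.0] * word_dim
    inputs: List[Vector] = []
    for i, word in enumerate(sentence.split(" ")):
        if i > 0:
            # The separating space belongs to no word: zero word vector.
            inputs.append(char_emb.get(" ", zero_char) + zero_word)
        w_vec = word_emb.get(word, zero_word)
        for ch in word:
            # List concatenation joins the two embeddings into one input.
            inputs.append(char_emb.get(ch, zero_char) + w_vec)
    return inputs
```

Under this scheme the encoder still reads one step per character, so the output has exactly as many vectors as the sentence has characters, while every step also carries word-level context.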

Comments:	Accepted in EMNLP 2018
Subjects:	Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
ACM classes:	I.2.6
Cite as:	arXiv:1809.01534 [cs.CL]
	(or arXiv:1809.01534v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.1809.01534

Submission history

From: Daniel Watson
[v1] Wed, 5 Sep 2018 16:44:04 UTC (60 KB)
