Parsimonious Morpheme Segmentation with an Application to Enriching Word Embeddings

El-Kishky, Ahmed; Xu, Frank; Zhang, Aston; Han, Jiawei


arXiv:1908.07832 (cs)

[Submitted on 18 Aug 2019 (v1), last revised 13 Nov 2019 (this version, v2)]


Abstract: Traditionally, many text-mining tasks treat individual word-tokens as the finest meaningful semantic granularity. However, in many languages and specialized corpora, words are composed by concatenating semantically meaningful subword structures. Word-level analysis cannot leverage the semantic information present in such subword structures. With regard to word embedding techniques, this leads not only to poor embeddings for infrequent words in long-tailed text corpora but also to weak capabilities for handling out-of-vocabulary words. In this paper we propose MorphMine for unsupervised morpheme segmentation. MorphMine applies a parsimony criterion to hierarchically segment words into the fewest morphemes at each level of the hierarchy. This leads to longer shared morphemes at each level of segmentation. Experiments show that MorphMine segments words in a variety of languages into human-verified morphemes. Additionally, we experimentally demonstrate that utilizing MorphMine morphemes to enrich word embeddings consistently improves embedding quality on a variety of embedding evaluations and a downstream language modeling task.
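To make the parsimony criterion concrete, the following is a minimal Python sketch of splitting a word into the fewest pieces drawn from a candidate morpheme set. It is not the MorphMine algorithm itself, which is unsupervised and hierarchical and is specified in the paper rather than in this abstract. The function name `fewest_morphemes` and the hand-picked vocabulary are assumptions made purely for illustration.

```python
def fewest_morphemes(word, vocab):
    """Split `word` into the fewest pieces that all appear in `vocab`.

    Illustrative only: assumes a known candidate morpheme set, which an
    unsupervised method would have to discover from the corpus.
    Returns a list of pieces, or None if no segmentation exists.
    """
    n = len(word)
    # best[i] is the shortest known segmentation of word[:i], or None.
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if best[start] is None or piece not in vocab:
                continue
            candidate = best[start] + [piece]
            # Parsimony: keep the segmentation with the fewest pieces.
            if best[end] is None or len(candidate) < len(best[end]):
                best[end] = candidate
    return best[n]


# Hypothetical candidate morphemes for the example.
vocab = {"un", "break", "able", "breakable"}
print(fewest_morphemes("unbreakable", vocab))  # ['un', 'breakable']
```

With this toy vocabulary, the two-piece split `un + breakable` is preferred over the three-piece split `un + break + able`, which mirrors the abstract's remark that minimizing the number of morphemes favors longer shared morphemes at a given level.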

Subjects:	Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Cite as:	arXiv:1908.07832 [cs.CL]
	(or arXiv:1908.07832v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.1908.07832

Submission history

From: Ahmed El-Kishky
[v1] Sun, 18 Aug 2019 00:45:16 UTC (835 KB)
[v2] Wed, 13 Nov 2019 22:18:44 UTC (829 KB)
