Clustering Comparable Corpora of Russian and Ukrainian Academic Texts: Word Embeddings and Semantic Fingerprints

Kutuzov, Andrey; Kopotev, Mikhail; Sviridenko, Tatyana; Ivanova, Lyubov

arXiv:1604.05372v1 (cs)

[Submitted on 18 Apr 2016]

Abstract: We present our experience in applying distributional semantics (neural word embeddings) to the problem of representing and clustering documents in a bilingual comparable corpus. Our data is a collection of Russian and Ukrainian academic texts whose topics are their academic fields. To build language-independent semantic representations of these documents, we train neural distributional models on monolingual corpora and learn the optimal linear transformation of vectors from one language to the other. The resulting vectors are then used to produce 'semantic fingerprints' of documents, which serve as input to a clustering algorithm.
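
The pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the linear transformation is fitted by ordinary least squares on a small seed dictionary of word pairs, and that a document's 'semantic fingerprint' is the normalized average of its word vectors. The function names, the two-dimensional toy vectors, and the word pairs are all hypothetical.

```python
import numpy as np

def learn_translation_matrix(src_vecs, tgt_vecs):
    """Fit W minimizing ||src_vecs @ W - tgt_vecs||^2 (ordinary least squares)."""
    W, *_ = np.linalg.lstsq(src_vecs, tgt_vecs, rcond=None)
    return W

def fingerprint(tokens, vectors, W=None):
    """Average the vectors of known tokens (assumed fingerprint definition).

    If W is given, the average is projected into the other language's space.
    The result is scaled to unit length so a dot product is a cosine similarity.
    """
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("document contains no known tokens")
    mean = np.mean(known, axis=0)
    if W is not None:
        mean = mean @ W
    return mean / np.linalg.norm(mean)

# Hypothetical toy embeddings: the Russian space is a rotation of the Ukrainian one.
true_W = np.array([[0.0, 1.0], [-1.0, 0.0]])
uk_vectors = {
    "наука": np.array([1.0, 0.0]),
    "мова": np.array([0.0, 1.0]),
    "текст": np.array([1.0, 1.0]),
}
ru_vectors = {
    "наука": uk_vectors["наука"] @ true_W,
    "язык": uk_vectors["мова"] @ true_W,
    "текст": uk_vectors["текст"] @ true_W,
}

# Seed dictionary of translation pairs (Ukrainian, Russian) used to fit W.
pairs = [("наука", "наука"), ("мова", "язык"), ("текст", "текст")]
X = np.array([uk_vectors[uk] for uk, _ in pairs])
Y = np.array([ru_vectors[ru] for _, ru in pairs])
W = learn_translation_matrix(X, Y)

# Fingerprints of a Ukrainian and a Russian document in the shared (Russian) space.
uk_doc = fingerprint(["наука", "мова"], uk_vectors, W)
ru_doc = fingerprint(["наука", "язык"], ru_vectors)
similarity = float(uk_doc @ ru_doc)
```

In this toy setup the two documents contain translations of the same words, so their fingerprints coincide once the Ukrainian one is projected. In a real setting the fingerprints of all documents would be stacked into a matrix and passed to a clustering algorithm of choice.
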
The presented method is compared to several baselines, including 'orthographic translation' with Levenshtein edit distance, and outperforms them by a large margin. We also show that language-independent 'semantic fingerprints' are superior to the multilingual clustering algorithms proposed in previous work, while requiring fewer linguistic resources.
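
For intuition about the 'orthographic translation' baseline, here is a small sketch. It assumes the baseline maps each Ukrainian word to the Russian vocabulary item with the smallest Levenshtein edit distance; the abstract does not spell out the details, so this is one plausible reading rather than the authors' procedure. The helper names and the tiny vocabulary are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def orthographic_translation(word, target_vocab):
    """Return the target-language word closest to `word` by edit distance."""
    return min(target_vocab, key=lambda w: levenshtein(word, w))

# Hypothetical Russian vocabulary and a Ukrainian query word.
ru_vocab = ["история", "теория", "язык"]
best = orthographic_translation("історія", ru_vocab)
```

A baseline of this kind needs no bilingual dictionary, but it can only succeed on cognates with similar spelling: a pair such as Ukrainian "мова" and Russian "язык" is out of its reach.
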

Comments:	To be presented at 9th Workshop on Building and Using Comparable Corpora, co-located with LREC-2016 (this https URL)
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:1604.05372 [cs.CL]
	(or arXiv:1604.05372v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.1604.05372

Submission history

From: Andrey Kutuzov
[v1] Mon, 18 Apr 2016 22:56:13 UTC (366 KB)
