How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation

Liu, Chia-Wei; Lowe, Ryan; Serban, Iulian V.; Noseworthy, Michael; Charlin, Laurent; Pineau, Joelle

Computer Science > Computation and Language

arXiv:1603.08023 (cs)

[Submitted on 25 Mar 2016 (v1), last revised 3 Jan 2017 (this version, v2)]

Title:How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation

Authors:Chia-Wei Liu, Ryan Lowe, Iulian V. Serban, Michael Noseworthy, Laurent Charlin, Joelle Pineau

View PDF

Abstract:We investigate evaluation metrics for dialogue response generation systems where supervised labels, such as task completion, are not available. Recent works in response generation have adopted metrics from machine translation to compare a model's generated response to a single target response. We show that these metrics correlate very weakly with human judgements in the non-technical Twitter domain, and not at all in the technical Ubuntu domain. We provide quantitative and qualitative results highlighting specific weaknesses in existing metrics, and provide recommendations for future development of better automatic evaluation metrics for dialogue systems.

Comments:	First 4 authors had equal contribution. 13 pages, 5 tables, 6 figures. EMNLP 2016
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Cite as:	arXiv:1603.08023 [cs.CL]
	(or arXiv:1603.08023v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.1603.08023

Submission history

From: Ryan Lowe T. [view email]
[v1] Fri, 25 Mar 2016 20:32:21 UTC (787 KB)
[v2] Tue, 3 Jan 2017 18:28:32 UTC (723 KB)

Computer Science > Computation and Language

Title:How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation

Submission history

Access Paper:

References & Citations

DBLP - CS Bibliography

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Computer Science > Computation and Language

Title:How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation

Submission history

Access Paper:

References & Citations

DBLP - CS Bibliography

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators