-
RAG-QA Arena: Evaluating Domain Robustness for Long-form Retrieval Augmented Question Answering
Authors:
Rujun Han,
Yuhao Zhang,
Peng Qi,
Yumo Xu,
Jenyuan Wang,
Lan Liu,
William Yang Wang,
Bonan Min,
Vittorio Castelli
Abstract:
Question answering based on retrieval augmented generation (RAG-QA) is an important research topic in NLP and has a wide range of real-world applications. However, most existing datasets for this task are either constructed using a single source corpus or consist of short extractive answers, which fall short of evaluating large language model (LLM) based RAG-QA systems on cross-domain generalization. To address these limitations, we create Long-form RobustQA (LFRQA), a new dataset comprising human-written long-form answers that integrate short extractive answers from multiple documents into a single, coherent narrative, covering 26K queries and large corpora across seven different domains. We further propose RAG-QA Arena by directly comparing model-generated answers against LFRQA's answers using LLMs as evaluators. We show via extensive experiments that RAG-QA Arena and human judgments on answer quality are highly correlated. Moreover, only 41.3% of the most competitive LLM's answers are preferred to LFRQA's answers, demonstrating RAG-QA Arena as a challenging evaluation platform for future research.
Submitted 18 July, 2024;
originally announced July 2024.
-
From Instructions to Constraints: Language Model Alignment with Automatic Constraint Verification
Authors:
Fei Wang,
Chao Shang,
Sarthak Jain,
Shuai Wang,
Qiang Ning,
Bonan Min,
Vittorio Castelli,
Yassine Benajiba,
Dan Roth
Abstract:
User alignment is crucial for adapting general-purpose language models (LMs) to downstream tasks, but human annotations are often not available for all types of instructions, especially those with customized constraints. We observe that user instructions typically contain constraints. While assessing response quality in terms of the whole instruction is often costly, efficiently evaluating the satisfaction rate of constraints is feasible. We investigate common constraints in NLP tasks, categorize them into three classes based on the types of their arguments, and propose a unified framework, ACT (Aligning to ConsTraints), to automatically produce supervision signals for user alignment with constraints. Specifically, ACT uses constraint verifiers, which are typically easy to implement in practice, to compute the constraint satisfaction rate (CSR) of each response. It samples multiple responses for each prompt and automatically collects preference labels based on their CSR. Subsequently, ACT adapts the LM to the target task through a ranking-based learning process. Experiments on fine-grained entity typing, abstractive summarization, and temporal question answering show that ACT is able to enhance LMs' capability to adhere to different classes of constraints, thereby improving task performance. Further experiments show that the constraint-following capabilities are transferable.
Submitted 10 March, 2024;
originally announced March 2024.
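
The CSR-based preference labelling described in the ACT abstract above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the two example verifiers, and the sample responses are all hypothetical, and the abstract does not say how tied responses are handled, so skipping equal-CSR pairs is an assumption made here.

```python
from itertools import combinations
from typing import Callable, List, Tuple

# A verifier checks one constraint and returns True when it is satisfied.
Verifier = Callable[[str], bool]


def constraint_satisfaction_rate(response: str, verifiers: List[Verifier]) -> float:
    """Fraction of the constraint verifiers that the response satisfies."""
    if not verifiers:
        raise ValueError("at least one verifier is required")
    return sum(1 for verify in verifiers if verify(response)) / len(verifiers)


def preference_pairs(responses: List[str], verifiers: List[Verifier]) -> List[Tuple[str, str]]:
    """Build (preferred, rejected) pairs; the higher-CSR response is preferred.

    Assumption: responses with equal CSR produce no pair.
    """
    scored = [(constraint_satisfaction_rate(r, verifiers), r) for r in responses]
    pairs = []
    for (csr_a, a), (csr_b, b) in combinations(scored, 2):
        if csr_a > csr_b:
            pairs.append((a, b))
        elif csr_b > csr_a:
            pairs.append((b, a))
    return pairs


# Hypothetical constraints for a summarization-style prompt.
verifiers = [
    lambda text: len(text.split()) <= 5,  # length constraint
    lambda text: text.endswith("."),      # formatting constraint
]
samples = ["Short summary.", "A much longer summary that rambles on", "Short one"]
pairs = preference_pairs(samples, verifiers)
```

Pairs produced this way could then feed a ranking-based objective of the kind the abstract mentions; which objective ACT uses is not specified in the listing.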
-
NewsQs: Multi-Source Question Generation for the Inquiring Mind
Authors:
Alyssa Hwang,
Kalpit Dixit,
Miguel Ballesteros,
Yassine Benajiba,
Vittorio Castelli,
Markus Dreyer,
Mohit Bansal,
Kathleen McKeown
Abstract:
We present NewsQs (news-cues), a dataset that provides question-answer pairs for multiple news documents. To create NewsQs, we augment a traditional multi-document summarization dataset with questions automatically generated by a T5-Large model fine-tuned on FAQ-style news articles from the News On the Web corpus. We show that fine-tuning a model with control codes produces questions that are judged acceptable more often than the same model without them as measured through human evaluation. We use a QNLI model with high correlation with human annotations to filter our data. We release our final dataset of high-quality questions, answers, and document clusters as a resource for future work in query-based multi-document summarization.
Submitted 15 June, 2024; v1 submitted 28 February, 2024;
originally announced February 2024.
-
Aplicacion de Robots Humanoides como Guias Interactivos en Museos: Una Simulacion con el Robot NAO (Application of Humanoid Robots as Interactive Guides in Museums: A Simulation with the NAO Robot)
Authors:
Hiago Sodre,
Pablo Moraes,
Monica Rodriguez,
Victor Castelli,
Pamela Barboza,
Martin Mattos,
Guillermo Vivas,
Bruna de Vargas,
Tobias Dörnbach,
Ricardo Grando
Abstract:
This article presents an application that evaluates the feasibility of humanoid robots as interactive guides in art museums. The application entails programming a NAO robot and a chatbot to provide information about art pieces in a simulated museum environment. In this controlled scenario, the learning employees interact with the robot and the chatbot. The results show skilled participation in the interactions, and the robot and chatbot effectively communicate the basic details of the art objects. The interactions observed between the students and the robot were natural and fluid. This suggests that adding humanoid robots to museums may provide a better experience for visitors, but also that further work is needed to optimize the quality of interaction. This study contributes to understanding the possibilities and requirements of applying humanoid technologies in a cultural context.
Submitted 25 October, 2023;
originally announced October 2023.
-
Few-Shot Data-to-Text Generation via Unified Representation and Multi-Source Learning
Authors:
Alexander Hanbo Li,
Mingyue Shang,
Evangelia Spiliopoulou,
Jie Ma,
Patrick Ng,
Zhiguo Wang,
Bonan Min,
William Wang,
Kathleen McKeown,
Vittorio Castelli,
Dan Roth,
Bing Xiang
Abstract:
We present a novel approach for structured data-to-text generation that addresses the limitations of existing methods that primarily focus on specific types of structured data. Our proposed method aims to improve performance in multi-task training, zero-shot and few-shot scenarios by providing a unified representation that can handle various forms of structured data such as tables, knowledge graph triples, and meaning representations. We demonstrate that our proposed approach can effectively adapt to new structured forms, and can improve performance in comparison to current methods. For example, our method resulted in a 66% improvement in zero-shot BLEU scores when transferring models trained on table inputs to a knowledge graph dataset. Our proposed method is an important step towards a more general data-to-text generation framework.
Submitted 9 August, 2023;
originally announced August 2023.
-
Generate then Select: Open-ended Visual Question Answering Guided by World Knowledge
Authors:
Xingyu Fu,
Sheng Zhang,
Gukyeong Kwon,
Pramuditha Perera,
Henghui Zhu,
Yuhao Zhang,
Alexander Hanbo Li,
William Yang Wang,
Zhiguo Wang,
Vittorio Castelli,
Patrick Ng,
Dan Roth,
Bing Xiang
Abstract:
The open-ended Visual Question Answering (VQA) task requires AI models to jointly reason over visual and natural language inputs using world knowledge. Recently, pre-trained Language Models (PLM) such as GPT-3 have been applied to the task and shown to be powerful world knowledge sources. However, these methods suffer from low knowledge coverage caused by PLM bias -- the tendency to generate certain tokens over other tokens regardless of prompt changes, and high dependency on the PLM quality -- only models using GPT-3 can achieve the best result.
To address the aforementioned challenges, we propose RASO: a new VQA pipeline that deploys a generate-then-select strategy guided by world knowledge for the first time. Rather than following the de facto standard to train a multi-modal model that directly generates the VQA answer, RASO first adopts PLM to generate all the possible answers, and then trains a lightweight answer selection model for the correct answer. As proved in our analysis, RASO expands the knowledge coverage from in-domain training data by a large margin. We provide extensive experimentation and show the effectiveness of our pipeline by advancing the state-of-the-art by 4.1% on OK-VQA, without additional computation cost. Code and models are released at http://cogcomp.org/page/publication_view/1010
Submitted 30 May, 2023;
originally announced May 2023.
-
Benchmarking Diverse-Modal Entity Linking with Generative Models
Authors:
Sijia Wang,
Alexander Hanbo Li,
Henry Zhu,
Sheng Zhang,
Chung-Wei Hang,
Pramuditha Perera,
Jie Ma,
William Wang,
Zhiguo Wang,
Vittorio Castelli,
Bing Xiang,
Patrick Ng
Abstract:
Entities can be expressed in diverse formats, such as texts, images, or column names and cell values in tables. While existing entity linking (EL) models work well on per modality configuration, such as text-only EL, visual grounding, or schema linking, it is more challenging to design a unified model for diverse modality configurations. To bring various modality configurations together, we constructed a benchmark for diverse-modal EL (DMEL) from existing EL datasets, covering all three modalities including text, image, and table. To approach the DMEL task, we proposed a generative diverse-modal model (GDMM) following a multimodal-encoder-decoder paradigm. Pre-training GDMM with rich corpora builds a solid foundation for DMEL without storing the entire KB for inference. Fine-tuning GDMM builds a stronger DMEL baseline, outperforming state-of-the-art task-specific EL models by 8.51 F1 score on average. Additionally, extensive error analyses are conducted to highlight the challenges of DMEL, facilitating future research on this task.
Submitted 26 May, 2023;
originally announced May 2023.
-
UNITE: A Unified Benchmark for Text-to-SQL Evaluation
Authors:
Wuwei Lan,
Zhiguo Wang,
Anuj Chauhan,
Henghui Zhu,
Alexander Li,
Jiang Guo,
Sheng Zhang,
Chung-Wei Hang,
Joseph Lilien,
Yiqun Hu,
Lin Pan,
Mingwen Dong,
Jun Wang,
Jiarong Jiang,
Stephen Ash,
Vittorio Castelli,
Patrick Ng,
Bing Xiang
Abstract:
A practical text-to-SQL system should generalize well on a wide variety of natural language questions, unseen database schemas, and novel SQL query structures. To comprehensively evaluate text-to-SQL systems, we introduce a UNIfied benchmark for Text-to-SQL Evaluation (UNITE). It is composed of publicly available text-to-SQL datasets, containing natural language questions from more than 12 domains, SQL queries from more than 3.9K patterns, and 29K databases. Compared to the widely used Spider benchmark, we introduce $\sim$120K additional examples and a threefold increase in SQL patterns, such as comparative and boolean questions. We conduct a systematic study of six state-of-the-art (SOTA) text-to-SQL parsers on our new benchmark and show that: 1) Codex performs surprisingly well on out-of-domain datasets; 2) specially designed decoding methods (e.g. constrained beam search) can improve performance for both in-domain and out-of-domain settings; 3) explicitly modeling the relationship between questions and schemas further improves the Seq2Seq models. More importantly, our benchmark presents key challenges towards compositional generalization and robustness issues -- which these SOTA models cannot address well. Our code and data processing script are available at https://github.com/awslabs/unified-text2sql-benchmark
Submitted 14 July, 2023; v1 submitted 25 May, 2023;
originally announced May 2023.
-
Pre-training Intent-Aware Encoders for Zero- and Few-Shot Intent Classification
Authors:
Mujeen Sung,
James Gung,
Elman Mansimov,
Nikolaos Pappas,
Raphael Shu,
Salvatore Romeo,
Yi Zhang,
Vittorio Castelli
Abstract:
Intent classification (IC) plays an important role in task-oriented dialogue systems. However, IC models often generalize poorly when training without sufficient annotated examples for each user intent. We propose a novel pre-training method for text encoders that uses contrastive learning with intent pseudo-labels to produce embeddings that are well-suited for IC tasks, reducing the need for manual annotations. By applying this pre-training strategy, we also introduce Pre-trained Intent-aware Encoder (PIE), which is designed to align encodings of utterances with their intent names. Specifically, we first train a tagger to identify key phrases within utterances that are crucial for interpreting intents. We then use these extracted phrases to create examples for pre-training a text encoder in a contrastive manner. As a result, our PIE model achieves up to 5.4% and 4.0% higher accuracy than the previous state-of-the-art text encoder for the N-way zero- and one-shot settings on four IC datasets.
Submitted 13 November, 2023; v1 submitted 24 May, 2023;
originally announced May 2023.
-
Taxonomy Expansion for Named Entity Recognition
Authors:
Karthikeyan K,
Yogarshi Vyas,
Jie Ma,
Giovanni Paolini,
Neha Anna John,
Shuai Wang,
Yassine Benajiba,
Vittorio Castelli,
Dan Roth,
Miguel Ballesteros
Abstract:
Training a Named Entity Recognition (NER) model often involves fixing a taxonomy of entity types. However, requirements evolve and we might need the NER model to recognize additional entity types. A simple approach is to re-annotate the entire dataset with both existing and additional entity types and then train the model on the re-annotated dataset. However, this is an extremely laborious task. To remedy this, we propose a novel approach called Partial Label Model (PLM) that uses only partially annotated datasets. We experiment with 6 diverse datasets and show that PLM consistently performs better than most other approaches (0.5 - 2.5 F1), including in novel settings for taxonomy expansion not considered in prior work. The gap between PLM and all other approaches is especially large in settings where there is limited data available for the additional entity types (as much as 11 F1), thus suggesting a more cost-effective approach to taxonomy expansion.
Submitted 22 May, 2023;
originally announced May 2023.
-
Comparing Biases and the Impact of Multilingual Training across Multiple Languages
Authors:
Sharon Levy,
Neha Anna John,
Ling Liu,
Yogarshi Vyas,
Jie Ma,
Yoshinari Fujinuma,
Miguel Ballesteros,
Vittorio Castelli,
Dan Roth
Abstract:
Studies in bias and fairness in natural language processing have primarily examined social biases within a single language and/or across few attributes (e.g. gender, race). However, biases can manifest differently across various languages for individual attributes. As a result, it is critical to examine biases within each language and attribute. Of equal importance is to study how these biases compare across languages and how the biases are affected when training a model on multilingual data versus monolingual data. We present a bias analysis across Italian, Chinese, English, Hebrew, and Spanish on the downstream sentiment analysis task to observe whether specific demographics are viewed more positively. We study bias similarities and differences across these languages and investigate the impact of multilingual vs. monolingual training data. We adapt existing sentiment bias templates in English to Italian, Chinese, Hebrew, and Spanish for four attributes: race, religion, nationality, and gender. Our results reveal similarities in bias expression such as favoritism of groups that are dominant in each language's culture (e.g. majority religions and nationalities). Additionally, we find an increased variation in predictions across protected groups, indicating bias amplification, after multilingual finetuning in comparison to multilingual pretraining.
Submitted 18 May, 2023;
originally announced May 2023.
-
Dr.Spider: A Diagnostic Evaluation Benchmark towards Text-to-SQL Robustness
Authors:
Shuaichen Chang,
Jun Wang,
Mingwen Dong,
Lin Pan,
Henghui Zhu,
Alexander Hanbo Li,
Wuwei Lan,
Sheng Zhang,
Jiarong Jiang,
Joseph Lilien,
Steve Ash,
William Yang Wang,
Zhiguo Wang,
Vittorio Castelli,
Patrick Ng,
Bing Xiang
Abstract:
Neural text-to-SQL models have achieved remarkable performance in translating natural language questions into SQL queries. However, recent studies reveal that text-to-SQL models are vulnerable to task-specific perturbations. Previous curated robustness test sets usually focus on individual phenomena. In this paper, we propose a comprehensive robustness benchmark based on Spider, a cross-domain text-to-SQL benchmark, to diagnose the model robustness. We design 17 perturbations on databases, natural language questions, and SQL queries to measure the robustness from different angles. In order to collect more diversified natural question perturbations, we utilize large pretrained language models (PLMs) to simulate human behaviors in creating natural questions. We conduct a diagnostic study of the state-of-the-art models on the robustness set. Experimental results reveal that even the most robust model suffers from a 14.0% performance drop overall and a 50.7% performance drop on the most challenging perturbation. We also present a breakdown analysis regarding text-to-SQL model designs and provide insights for improving model robustness.
Submitted 28 January, 2023; v1 submitted 20 January, 2023;
originally announced January 2023.
-
Importance of Synthesizing High-quality Data for Text-to-SQL Parsing
Authors:
Yiyun Zhao,
Jiarong Jiang,
Yiqun Hu,
Wuwei Lan,
Henry Zhu,
Anuj Chauhan,
Alexander Li,
Lin Pan,
Jun Wang,
Chung-Wei Hang,
Sheng Zhang,
Marvin Dong,
Joe Lilien,
Patrick Ng,
Zhiguo Wang,
Vittorio Castelli,
Bing Xiang
Abstract:
Recently, there has been increasing interest in synthesizing data to improve downstream text-to-SQL tasks. In this paper, we first examined the existing synthesized datasets and discovered that state-of-the-art text-to-SQL algorithms did not further improve on popular benchmarks when trained with augmented synthetic data. We observed two shortcomings: illogical synthetic SQL queries from independent column sampling and arbitrary table joins. To address these issues, we propose a novel synthesis framework that incorporates key relationships from schema, imposes strong typing, and conducts schema-distance-weighted column sampling. We also adopt an intermediate representation (IR) for the SQL-to-text task to further improve the quality of the generated natural language questions. When existing powerful semantic parsers are pre-finetuned on our high-quality synthesized data, our experiments show that these models have significant accuracy boosts on popular benchmarks, including new state-of-the-art performance on Spider.
Submitted 16 December, 2022;
originally announced December 2022.
-
Novel Chapter Abstractive Summarization using Spinal Tree Aware Sub-Sentential Content Selection
Authors:
Hardy Hardy,
Miguel Ballesteros,
Faisal Ladhak,
Muhammad Khalifa,
Vittorio Castelli,
Kathleen McKeown
Abstract:
Summarizing novel chapters is a difficult task due to the input length and the fact that sentences that appear in the desired summaries draw content from multiple places throughout the chapter. We present a pipelined extractive-abstractive approach where the extractive step filters the content that is passed to the abstractive component. Extremely lengthy input also results in a highly skewed dataset towards negative instances for extractive summarization; we thus adopt a margin ranking loss for extraction to encourage separation between positive and negative examples. Our extraction component operates at the constituent level; our approach to this problem enriches the text with spinal tree information which provides syntactic context (in the form of constituents) to the extraction model. We show an improvement of 3.71 Rouge-1 points over best results reported in prior work on an existing novel chapter dataset.
Submitted 9 November, 2022;
originally announced November 2022.
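
The margin ranking loss mentioned in the abstract above can be illustrated with the standard hinge form, max(0, margin - (s_pos - s_neg)). This is a sketch under assumptions: the abstract does not give the exact formulation, margin value, or how positives and negatives are paired, so the all-pairs averaging, the default margin of 1.0, and the example scores below are illustrative choices only.

```python
from typing import Sequence


def margin_ranking_loss(
    pos_scores: Sequence[float],
    neg_scores: Sequence[float],
    margin: float = 1.0,
) -> float:
    """Mean hinge loss over every (positive, negative) score pair.

    Each pair contributes max(0, margin - (s_pos - s_neg)), which is zero once
    the positive outscores the negative by at least `margin`.
    """
    losses = [
        max(0.0, margin - (pos - neg))
        for pos in pos_scores
        for neg in neg_scores
    ]
    if not losses:
        raise ValueError("need at least one positive and one negative score")
    return sum(losses) / len(losses)


# Hypothetical extraction scores: one positive constituent, two negatives.
loss = margin_ranking_loss(pos_scores=[2.0], neg_scores=[0.5, 1.5], margin=1.0)
```

In this example the first negative is already separated from the positive by more than the margin and contributes nothing, so only the second negative is penalized. That is why the loss suits a dataset heavily skewed towards negative instances.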
-
Synthetic Target Domain Supervision for Open Retrieval QA
Authors:
Revanth Gangi Reddy,
Bhavani Iyer,
Md Arafat Sultan,
Rong Zhang,
Avirup Sil,
Vittorio Castelli,
Radu Florian,
Salim Roukos
Abstract:
Neural passage retrieval is a new and promising approach in open retrieval question answering. In this work, we stress-test the Dense Passage Retriever (DPR) -- a state-of-the-art (SOTA) open domain neural retrieval model -- on closed and specialized target domains such as COVID-19, and find that it lags behind standard BM25 in this important real-world setting. To make DPR more robust under domain shift, we explore its fine-tuning with synthetic training examples, which we generate from unlabeled target domain text using a text-to-text generator. In our experiments, this noisy but fully automated target domain supervision gives DPR a sizable advantage over BM25 in out-of-domain settings, making it a more viable model in practice. Finally, an ensemble of BM25 and our improved DPR model yields the best results, further pushing the SOTA for open retrieval QA on multiple out-of-domain test sets.
Submitted 20 April, 2022;
originally announced April 2022.
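
The abstract above reports that an ensemble of BM25 and the improved DPR model gives the best results, but it does not say how the two are combined. One common way to combine a sparse and a dense retriever is to min-max normalize each retriever's scores and take a weighted sum. The sketch below shows that approach only as an assumption; the function names, the weight, and the passage scores are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; a constant score list maps to all zeros."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {pid: 0.0 for pid in scores}
    return {pid: (s - lo) / (hi - lo) for pid, s in scores.items()}


def fuse_rankings(
    sparse: Dict[str, float],
    dense: Dict[str, float],
    sparse_weight: float = 0.5,
) -> List[str]:
    """Rank passage ids by a weighted sum of normalized sparse and dense scores.

    A passage missing from one retriever's list gets 0 from that retriever.
    Ties are broken by passage id so the output is deterministic.
    """
    norm_sparse, norm_dense = min_max(sparse), min_max(dense)
    fused = {
        pid: sparse_weight * norm_sparse.get(pid, 0.0)
        + (1.0 - sparse_weight) * norm_dense.get(pid, 0.0)
        for pid in set(norm_sparse) | set(norm_dense)
    }
    return sorted(fused, key=lambda pid: (-fused[pid], pid))


# Hypothetical retrieval scores for one question.
bm25_scores = {"p1": 12.0, "p2": 8.0, "p3": 4.0}
dpr_scores = {"p2": 0.9, "p3": 0.7, "p4": 0.5}
ranking = fuse_rankings(bm25_scores, dpr_scores)
```

Normalizing first matters because BM25 scores and dense inner products live on different scales; without it, one retriever would dominate the sum.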
-
Predicting Performance of SLAM Algorithms
Authors:
Matteo Luperto,
Valerio Castelli,
Francesco Amigoni
Abstract:
Among the abilities that autonomous mobile robots should exhibit, map building and localization are definitely recognized as fundamental. Consequently, countless algorithms for solving the Simultaneous Localization And Mapping (SLAM) problem have been proposed. Currently, their evaluation is performed ex-post, according to outcomes obtained when running the algorithms on data collected by robots in real or simulated environments. In this paper, we present a novel method that allows the ex-ante prediction of the performance of a SLAM algorithm in an unseen environment, before it is actually run. Our method collects the performance of a SLAM algorithm in a number of simulated environments, builds a model that represents the relationship between the observed performance and some geometrical features of the environments, and exploits this model to predict the performance of the algorithm in an unseen environment starting from its features.
Submitted 6 September, 2021;
originally announced September 2021.
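
The collect, fit, and predict loop described in the abstract above can be illustrated with a one-feature least-squares line. This is a sketch under assumptions: the abstract does not name the model family or the geometrical features used, so the linear model, the single feature, and all the numbers below are hypothetical and serve only to show the workflow.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (feature, performance) pairs")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("feature values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(slope: float, intercept: float, feature: float) -> float:
    """Predicted performance for an unseen environment with this feature value."""
    return slope * feature + intercept


# Hypothetical data: one geometrical feature per simulated environment and the
# localization error observed when running the SLAM algorithm there.
features = [1.0, 2.0, 3.0, 4.0]
observed_error = [0.3, 0.5, 0.7, 0.9]

slope, intercept = fit_line(features, observed_error)
predicted_error = predict(slope, intercept, feature=5.0)
```

The prediction step needs only the features of the new environment, which is what makes the estimate available before the algorithm is run there.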
-
Towards Robust Neural Retrieval Models with Synthetic Pre-Training
Authors:
Revanth Gangi Reddy,
Vikas Yadav,
Md Arafat Sultan,
Martin Franz,
Vittorio Castelli,
Heng Ji,
Avirup Sil
Abstract:
Recent work has shown that commonly available machine reading comprehension (MRC) datasets can be used to train high-performance neural information retrieval (IR) systems. However, the evaluation of neural IR has so far been limited to standard supervised learning settings, where they have outperformed traditional term matching baselines. We conduct in-domain and out-of-domain evaluations of neural IR, and seek to improve its robustness across different scenarios, including zero-shot settings. We show that synthetic training examples generated using a sequence-to-sequence generator can be effective towards this goal: in our experiments, pre-training with synthetic examples improves retrieval performance in both in-domain and out-of-domain evaluation on five different test sets.
Submitted 15 April, 2021;
originally announced April 2021.
-
End-to-End QA on COVID-19: Domain Adaptation with Synthetic Training
Authors:
Revanth Gangi Reddy,
Bhavani Iyer,
Md Arafat Sultan,
Rong Zhang,
Avi Sil,
Vittorio Castelli,
Radu Florian,
Salim Roukos
Abstract:
End-to-end question answering (QA) requires both information retrieval (IR) over a large document collection and machine reading comprehension (MRC) on the retrieved passages. Recent work has successfully trained neural IR systems using only supervised question answering (QA) examples from open-domain datasets. However, despite impressive performance on Wikipedia, neural IR lags behind traditional term matching approaches such as BM25 in more specific and specialized target domains such as COVID-19. Furthermore, given little or no labeled data, effective adaptation of QA systems can also be challenging in such target domains. In this work, we explore the application of synthetically generated QA examples to improve performance on closed-domain retrieval and MRC. We combine our neural IR and MRC systems and show significant improvements in end-to-end QA on the CORD-19 collection over a state-of-the-art open-domain QA baseline.
Submitted 2 December, 2020;
originally announced December 2020.
-
Answer Span Correction in Machine Reading Comprehension
Authors:
Revanth Gangi Reddy,
Md Arafat Sultan,
Efsun Sarioglu Kayi,
Rong Zhang,
Vittorio Castelli,
Avirup Sil
Abstract:
Answer validation in machine reading comprehension (MRC) consists of verifying an extracted answer against an input context and question pair. Previous work has looked at re-assessing the "answerability" of the question given the extracted answer. Here we address a different problem: the tendency of existing MRC systems to produce partially correct answers when presented with answerable questions. We explore the nature of such errors and propose a post-processing correction method that yields statistically significant performance improvements over state-of-the-art MRC systems in both monolingual and multilingual evaluation.
Submitted 6 November, 2020;
originally announced November 2020.
-
Improved Synthetic Training for Reading Comprehension
Authors:
Yanda Chen,
Md Arafat Sultan,
Vittorio Castelli
Abstract:
Automatically generated synthetic training examples have been shown to improve performance in machine reading comprehension (MRC). Compared to human annotated gold standard data, synthetic training data has unique properties, such as high availability at the possible expense of quality. In view of such differences, in this paper, we explore novel applications of synthetic examples to MRC. Our proposed pre-training and knowledge distillation strategies show significant improvements over existing methods. In a particularly surprising discovery, we observe that synthetic distillation often yields students that can outperform the teacher model.
Submitted 24 October, 2020;
originally announced October 2020.
-
Multi-Stage Pre-training for Low-Resource Domain Adaptation
Authors:
Rong Zhang,
Revanth Gangi Reddy,
Md Arafat Sultan,
Vittorio Castelli,
Anthony Ferritto,
Radu Florian,
Efsun Sarioglu Kayi,
Salim Roukos,
Avirup Sil,
Todd Ward
Abstract:
Transfer learning techniques are particularly useful in NLP tasks where a sizable amount of high-quality annotated data is difficult to obtain. Current approaches directly adapt a pre-trained language model (LM) on in-domain text before fine-tuning to downstream tasks. We show that extending the vocabulary of the LM with domain-specific terms leads to further gains. To greater effect, we utilize structure in the unlabeled data to create auxiliary synthetic tasks, which help the LM transfer to downstream tasks. We apply these approaches incrementally on a pre-trained RoBERTa-large LM and show considerable performance gains on three tasks in the IT domain: Extractive Reading Comprehension, Document Ranking and Duplicate Question Detection.
Submitted 12 October, 2020;
originally announced October 2020.
-
The TechQA Dataset
Authors:
Vittorio Castelli,
Rishav Chakravarti,
Saswati Dana,
Anthony Ferritto,
Radu Florian,
Martin Franz,
Dinesh Garg,
Dinesh Khandelwal,
Scott McCarley,
Mike McCawley,
Mohamed Nasr,
Lin Pan,
Cezar Pendus,
John Pitrelli,
Saurabh Pujar,
Salim Roukos,
Andrzej Sakrajda,
Avirup Sil,
Rosario Uceda-Sosa,
Todd Ward,
Rong Zhang
Abstract:
We introduce TechQA, a domain-adaptation question answering dataset for the technical support domain. The TechQA corpus highlights two real-world issues from the automated customer support domain. First, it contains actual questions posed by users on a technical forum, rather than questions generated specifically for a competition or a task. Second, it has a real-world size (600 training, 310 dev, and 490 evaluation question/answer pairs), thus reflecting the cost of creating large labeled datasets with actual data. Consequently, TechQA is meant to stimulate research in domain adaptation rather than being a resource to build QA systems from scratch. The dataset was obtained by crawling the IBM Developer and IBM DeveloperWorks forums for questions with accepted answers that appear in a published IBM Technote, a technical document that addresses a specific technical issue. We also release a collection of the 801,998 publicly available Technotes as of April 4, 2019 as a companion resource that might be used for pretraining, to learn representations of the IT domain language.
Submitted 7 November, 2019;
originally announced November 2019.
-
CFO: A Framework for Building Production NLP Systems
Authors:
Rishav Chakravarti,
Cezar Pendus,
Andrzej Sakrajda,
Anthony Ferritto,
Lin Pan,
Michael Glass,
Vittorio Castelli,
J. William Murdock,
Radu Florian,
Salim Roukos,
Avirup Sil
Abstract:
This paper introduces a novel orchestration framework, called CFO (COMPUTATION FLOW ORCHESTRATOR), for building, experimenting with, and deploying interactive NLP (Natural Language Processing) and IR (Information Retrieval) systems to production environments. We then demonstrate a question answering system built using this framework which incorporates state-of-the-art BERT based MRC (Machine Reading Comprehension) with IR components to enable end-to-end answer retrieval. Results from the demo system are shown to be high quality in both academic and industry domain specific settings. Finally, we discuss best practices when (pre-)training BERT based MRC models for production systems.
Submitted 19 June, 2020; v1 submitted 16 August, 2019;
originally announced August 2019.
-
A Sentence Compression Based Framework to Query-Focused Multi-Document Summarization
Authors:
Lu Wang,
Hema Raghavan,
Vittorio Castelli,
Radu Florian,
Claire Cardie
Abstract:
We consider the problem of using sentence compression techniques to facilitate query-focused multi-document summarization. We present a sentence-compression-based framework for the task, and design a series of learning-based compression models built on parse trees. An innovative beam search decoder is proposed to efficiently find highly probable compressions. Under this framework, we show how to integrate various indicative metrics such as linguistic motivation and query relevance into the compression process by deriving a novel formulation of a compression scoring function. Our best model achieves statistically significant improvement over the state-of-the-art systems on several metrics (e.g., 8.0% and 5.4% improvements in ROUGE-2, respectively) for the DUC 2006 and 2007 summarization tasks.
Submitted 23 June, 2016;
originally announced June 2016.
-
Query-Focused Opinion Summarization for User-Generated Content
Authors:
Lu Wang,
Hema Raghavan,
Claire Cardie,
Vittorio Castelli
Abstract:
We present a submodular function-based framework for query-focused opinion summarization. Within our framework, relevance ordering produced by a statistical ranker and information coverage with respect to topic distribution and diverse viewpoints are both encoded as submodular functions. Dispersion functions are utilized to minimize redundancy. We are the first to evaluate different metrics of text similarity for submodularity-based summarization methods. By experimenting on community QA and blog summarization, we show that our system outperforms state-of-the-art approaches in both automatic evaluation and human evaluation. A human evaluation task is conducted at scale on Amazon Mechanical Turk, and shows that our systems are able to generate summaries of high overall quality and information diversity.
Submitted 17 June, 2016;
originally announced June 2016.