Authors:
Leonardo Moraes 1,2; Pedro Jardim 1 and Cristina Dutra Aguiar 1
Affiliations:
1 Department of Computer Science, University of São Paulo, São Carlos, Brazil
2 Machine Learning & Artificial Intelligence, Sinch, Stockholm, Sweden
Keyword(s):
Question Answering, Big Data, Software Reference Architecture, Design Principles.
Abstract:
Companies continuously produce many documents containing valuable information for users. However, querying these documents is challenging, mainly because of their heterogeneity and volume. In this work, we investigate the challenge of developing a Big Data Question Answering system, i.e., a system that provides a unified, reliable, and accurate way to query documents through natural-language questions. We define a set of design principles and introduce BigQA, the first software reference architecture to meet these design principles. The architecture consists of high-level layers and is independent of programming language, technology, and querying and answering algorithms. BigQA was validated through a pharmaceutical case study managing over 18k documents drawn from Wikipedia articles and FAQs about Coronavirus. The results demonstrated the applicability of BigQA to real-world applications. In addition, we conducted 27 experiments on three open-domain datasets and compared the recall of the well-established BM25, TF-IDF, and Dense Passage Retriever algorithms to find the most appropriate generic querying algorithm. According to the experiments, BM25 provided the highest overall performance.