

Study of Question Answering on Legal Software Document
using BERT based models
Ernesto Quevedo Caballero, Mushfika Rahman, Tomas Cerny,
Pablo Rivas, and Gissella Bejarano
Baylor University
{ernesto_quevedo1,mushfika_rahman1,
tomas_cerny,pablo_rivas,gissella_bejaranonic}@baylor.edu

Abstract

The transformer-based architectures have achieved remarkable success in several Natural Language Processing tasks, such as Question Answering. Our research focuses on the performance of different transformer-based language models on a Question Answering dataset specialized to the software development legal domain, and compares it with their performance on the general-purpose Question Answering task. We experimented with the PolicyQA dataset, which is composed of documents regarding users' data handling policies and therefore falls into the software legal domain. We used BERT, ALBERT, RoBERTa, DistilBERT, and LEGAL-BERT as base encoders and compared their performance on the Question Answering benchmark dataset SQuAD V2.0 and on PolicyQA. Our results indicate that the performance of these models as contextual embedding encoders is significantly lower on PolicyQA than on SQuAD V2.0. Furthermore, we show that, surprisingly, general-domain BERT-based models like ALBERT and BERT obtain better performance than a model trained on more domain-specific text, LEGAL-BERT.

1 Introduction

Question Answering (QA) systems are an automated method for retrieving correct answers to questions posed by humans in natural language (Dwivedi and Singh, 2013). A subclass of Question Answering systems is machine reading comprehension, whose primary goal is to retrieve the answer to a given question from a single paragraph of text. The task has achieved remarkable success in the general domain using transformer-based architectures. However, previous research has demonstrated that training on in-domain text can yield larger benefits than general-domain language models in specialized disciplines such as biology (Lee et al., 2020). Thus, one can infer that using legal-domain text could similarly improve Question Answering performance on legal questionnaires.

Legal documents are challenging to understand appropriately without a legal background. The challenges also extend to software companies and software privacy policies and regulations. The exponential growth of applications worldwide and the monitoring of our environments, decisions, tastes, and more make it increasingly important to be aware of how our data is being managed, shared, and used. Companies must include this information in the privacy policies of every application. The difficulty lies in the characteristics of legal software documents: longevity, ambiguity, and complexity. A high-performance Question Answering system for the legal documents of software systems (privacy and policy rules) can have various applications. One example is that any person could quickly check whether their questions are answered in such a long document before signing or agreeing to the terms and conditions.

The large transformer-based BERT model has obtained strong results on the general-purpose datasets SQuAD version 1 (V1) and 2 (V2) (Rajpurkar et al., 2016, 2018). However, there is limited research on the performance of BERT variants on legal software datasets such as PolicyQA (Ahmad et al., 2020; Martinez-Gil, 2021), for which only results with the original BERT model have been reported.

In this paper, we provide a study of the performance of BERT (Devlin et al., 2018), LEGAL-BERT (Chalkidis et al., 2020), ALBERT (Lan et al., 2019), DistilBERT (Sanh et al., 2019), and RoBERTa (Liu et al., 2019) on the SQuAD and PolicyQA datasets. Furthermore, we compare and analyze these results, which allows us to pick the model that obtains the best results on the PolicyQA dataset using only pre-trained models. Our results indicate that the performance of these models as contextual embedding encoders is significantly lower on PolicyQA than on SQuAD V2.0. Furthermore, we show that, surprisingly, general-domain BERT-based models like ALBERT and BERT obtain better performance than a model trained on more domain-specific text, LEGAL-BERT.
The paper is organized as follows. Section 2 presents related work. Section 3 elaborates on the methodology followed. Section 4 presents the experiments performed, Section 5 discusses the results, and the final section concludes the work.

2 Related Work

Researchers have devoted significant effort to Question Answering systems in the legal domain in recent times. According to the survey by Martinez-Gil (2021), Deep Learning models have achieved the best results. The most recent successes in Question Answering and Legal Question Answering (LQA) have come from neural attentive text representations, few-shot learning in the legal domain, and diverse applications of the successful BERT model (Devlin et al., 2018; Martinez-Gil, 2021).

There are various datasets for the Question Answering task in the general-purpose domain, and SQuAD is the most recognized because of its benchmark results (Rajpurkar et al., 2018). In the legal domain, JEC-QA (Zhong et al., 2020), ResPubliQA (Peñas et al., 2009), and the JRC-ACQUIS Multilingual Parallel Corpus (Steinberger et al., 2006) are well recognized. Among Question Answering datasets in the legal domain of software development, PolicyQA (Ahmad et al., 2020) is one of the best known and is compatible with the SQuAD dataset format. However, studies of how significant general Question Answering benchmarks perform on these domain-specific datasets are limited (Martinez-Gil, 2021).

To the best of our knowledge, this is the first work to study performance on the PolicyQA dataset using variants of the BERT-based encoder beyond the original BERT. The best results reported so far are from Ahmad et al. (2020) using the original BERT model, with 29.5 Exact Match (EM) and 56.11 F1. Our study shows that ALBERT is a better encoder and obtains the best results on the PolicyQA dataset.

3 Methodology

We studied and compared the performance of several BERT-related models on two Question Answering datasets: SQuAD V2.0, from the general domain, and PolicyQA, from the specific domain of legal text related to software development.

3.1 Datasets

The SQuAD V2.0 dataset is a reading comprehension dataset consisting of more than 100,000 questions posed by crowdworkers on a set of Wikipedia articles. The answer to each question is a segment of text from the corresponding reading passage (Rajpurkar et al., 2016, 2018).

The PolicyQA dataset is a reading comprehension dataset that contains 25,017 reading comprehension style examples curated from an existing corpus of 115 website privacy policies. PolicyQA provides 714 human-annotated questions for a wide range of privacy practices (Ahmad et al., 2020).

Both datasets are designed for extractive Question Answering, where the answer is a span of text in the passage. The passage may also be unrelated to the question and not contain the answer at all.

3.2 Models

We selected a set of the most widely used BERT-related models with outstanding performance on the SQuAD dataset: ALBERT, RoBERTa, and the classic BERT. Additionally, we included DistilBERT because it is a cheaper and smaller model with competitive capabilities compared to larger BERT-based models, which makes it a feasible choice when inference speed and usability on devices matter (Sanh et al., 2019). Moreover, we tested the LEGAL-BERT model, a version of the original BERT model trained from scratch on legal documents (Chalkidis et al., 2020). We compared it with the other general-purpose models to see whether a legal-domain BERT would obtain better results on the PolicyQA dataset than models trained on general text.

4 Experiments

We conducted our experiments using the pretrained versions of the models BERT, ALBERT, RoBERTa, DistilBERT, and LEGAL-BERT. The Question Answering (QA) benchmark is evaluated on the EM (Exact Match) and F1 metrics (Jurafsky and Martin).
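For reference, the following is a minimal sketch of the two metrics as used in SQuAD-style evaluation: EM checks whether the normalized predicted span matches the normalized gold span exactly, and F1 measures token overlap between the two spans. The normalization mirrors the official SQuAD evaluation script in spirit; the function names are illustrative, not taken from the paper's code.

```python
import re
import string
from collections import Counter

def normalize(text):
    """SQuAD-style normalization: lowercase, drop punctuation and articles, squeeze spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold):
    """EM: 1 if the normalized strings are identical, 0 otherwise."""
    return int(normalize(prediction) == normalize(gold))

def f1_score(prediction, gold):
    """Token-level F1 between the predicted and gold answer spans."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("the third parties", "third parties"))                  # 1 after normalization
print(round(f1_score("shared with third parties", "third parties"), 2))   # 0.67
```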
Dataset (epochs)      BERT            ALBERT          LEGAL-BERT      RoBERTa         DistilBERT
                      EM      F1      EM      F1      EM      F1      EM      F1      EM      F1
SQuAD V2.0 (5)        71.6    77.39   73.9    77.91   73.5    77.01   76.9    80.1    65.47   69.27
PolicyQA (5)          29.5    56.11   28.76   57.36   28.08   54.66   27.23   54.88   25.42   52.34
SQuAD V2.0 (10)       71.7    75.39   72.71   77.21   71.5    75.30   75.18   74.30   64.79   68.93
PolicyQA (10)         29.6    57.02   29.7    58.43   28.45   55.01   27.85   54.91   25.81   52.48

Table 1: Results of the models on SQuAD V2.0 and PolicyQA (5 and 10 training epochs)
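All five encoders in Table 1 expose the same span-prediction interface through the huggingface/transformers library. Below is a minimal loading sketch; the checkpoint identifiers are our assumption of the standard base-size Hugging Face Hub names, since the paper does not list the exact checkpoints used.

```python
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

# Assumed standard Hub checkpoints (base-size variants); the paper does not
# name the exact identifiers it fine-tuned.
CHECKPOINTS = {
    "BERT": "bert-base-uncased",
    "ALBERT": "albert-base-v2",
    "RoBERTa": "roberta-base",
    "DistilBERT": "distilbert-base-uncased",
    "LEGAL-BERT": "nlpaueb/legal-bert-base-uncased",
}

models = {}
for name, checkpoint in CHECKPOINTS.items():
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    # Adds a randomly initialized span-prediction head (start/end logits) on
    # top of the pretrained encoder; fine-tuning on SQuAD V2.0 or PolicyQA
    # trains this head together with the encoder.
    model = AutoModelForQuestionAnswering.from_pretrained(checkpoint)
    models[name] = (tokenizer, model)
```

Because every model is loaded through the same Auto classes, swapping encoders amounts to changing one string, which matches the plug-in workflow described below.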

To run our experiments, we used the open-source code from huggingface/transformers¹ and built two notebooks that drive it; we provide an open repository with both notebooks². We made no changes to the original code: we only plugged in the PolicyQA dataset in place of the usual SQuAD, together with the particular model we wanted to test (a data-loading sketch is given after the Conclusions).

Our selected models ran for 5 and 10 epochs on the PolicyQA and SQuAD V2.0 datasets. Table 1 compares the models on both datasets and epoch settings; the metrics for measurement are EM (Exact Match) and F1.

5 Discussion

Our experimental results revealed some interesting aspects. First, the pre-trained BERT and ALBERT models performed better than the LEGAL-BERT model on both the SQuAD V2.0 and PolicyQA datasets. Since LEGAL-BERT was trained on legal text, our assumption was that it would outperform all the other models; however, this assumption did not hold. We believe there are notable separations among subdomains even within the legal domain. PolicyQA is legal text in the subdomain of software development and application privacy. Thus, it is plausible that a more general pre-trained model produces better results than a model trained on a legal subdomain far removed from the software development legal domain. Second, on PolicyQA, increasing the number of epochs improves the performance of the models to some degree, which suggests that training for more epochs might still improve the results. We will continue increasing the number of epochs; however, every experiment consumes a considerable amount of time.

Another essential aspect to note is the large gap between the results on SQuAD V2.0 and PolicyQA for every model. One reason is the difference in the size of the two datasets; another is the difficult language that comes with legal-domain text (Martinez-Gil, 2021).

From these results, our recommendations, which are at the same time our future work, are the following. First, train the BERT model from scratch only on data related to the software development legal domain and use that model as the base encoder. Furthermore, since ALBERT obtained the best performance, we believe the best option would be to train the ALBERT model from scratch directly. Finally, we recommend the use of ensemble methods, which have been shown to give significant improvements in Question Answering systems³.

6 Conclusions

Transformer-based language models have significantly advanced NLP tasks, including Question Answering. This work studies the performance of several BERT model variants on the SQuAD V2.0 and PolicyQA datasets. The results showed that LEGAL-BERT did not perform better than general pre-trained models like BERT and ALBERT. The ALBERT model achieved the top results, which makes it a proper choice at the moment as a base contextual embedding encoder for a more complex model design in the future. Furthermore, our work suggests that the most promising avenue to follow is to train the ALBERT model from scratch only on data related to the software development legal domain and use that model as the base encoder.

¹ https://github.com/huggingface/transformers
² https://github.com/Fidac/Legal-SE-BERT-Study
³ https://rajpurkar.github.io/SQuAD-explorer/
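To make the data plumbing described in Section 4 concrete, below is a minimal sketch of loading SQuAD V2.0 through the datasets library and inspecting the extractive format discussed in Section 3.1. The PolicyQA reference in the comments is a placeholder: PolicyQA is distributed as SQuAD-format JSON in its GitHub repository and, once converted to the same question/context/answers schema, plugs into the identical fine-tuning code.

```python
from datasets import load_dataset

# SQuAD V2.0 via the standard Hub loader.
squad = load_dataset("squad_v2")
example = squad["validation"][0]

print(example["question"])
print(example["context"][:120])
# Extractive format: gold answers are character-indexed spans inside the
# context. For unanswerable SQuAD V2.0 questions both lists are empty.
print(example["answers"])  # {'text': [...], 'answer_start': [...]}

# PolicyQA ships as SQuAD-format JSON (train/dev/test) in its repository
# (https://github.com/wasiahmad/PolicyQA); the path below is a placeholder.
# policyqa = load_dataset("json", data_files={"train": "PolicyQA/data/train.json"}, field="data")
```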
Acknowledgements

This work was funded in part by the National Science Foundation under grants CNS-2136961 and CNS-2210091.

References

Wasi Uddin Ahmad, Jianfeng Chi, Yuan Tian, and Kai-Wei Chang. 2020. PolicyQA: A reading comprehension dataset for privacy policies. arXiv preprint arXiv:2010.02557.

Ilias Chalkidis, Manos Fergadiotis, Prodromos Malakasiotis, Nikolaos Aletras, and Ion Androutsopoulos. 2020. LEGAL-BERT: The muppets straight out of law school. arXiv preprint arXiv:2010.02559.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Sanjay K Dwivedi and Vaishali Singh. 2013. Research and reviews in question answering system. Procedia Technology, 10:417–424.

Daniel Jurafsky and James H Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Jorge Martinez-Gil. 2021. A survey on legal question answering systems. arXiv preprint arXiv:2110.07333.

Anselmo Peñas, Pamela Forner, Richard Sutcliffe, Álvaro Rodrigo, Corina Forăscu, Iñaki Alegria, Danilo Giampiccolo, Nicolas Moreau, and Petya Osenova. 2009. Overview of ResPubliQA 2009: Question answering evaluation over European legislation. In Workshop of the Cross-Language Evaluation Forum for European Languages, pages 174–196. Springer.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. arXiv preprint arXiv:1806.03822.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

Ralf Steinberger, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaz Erjavec, Dan Tufis, and Dániel Varga. 2006. The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. arXiv preprint cs/0609058.

Haoxi Zhong, Chaojun Xiao, Cunchao Tu, Tianyang Zhang, Zhiyuan Liu, and Maosong Sun. 2020. JEC-QA: A legal-domain question answering dataset. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 9701–9708.

