AIP491 SP23AI08 Capstone Project Report
TEXT RETRIEVAL
by
DEPARTMENT OF ITS
THE FPT UNIVERSITY HO CHI MINH CITY
April 2023
ACKNOWLEDGMENTS
• FPT Cloud, for providing a GPU server (A30) at a reduced fee, which greatly
aided our research.
• Mr. Kiet Nguyen, for his support with UIT-ViQuAD (version 1.0), a
Vietnamese dataset for evaluating machine reading comprehension.
• Our friends and family for their love and support throughout our academic
journey.
We are grateful to all those who have helped us in ways both big and small,
and without whom this project would not have been possible. Thank you all.
AUTHOR CONTRIBUTIONS
Conceptualization, Gia Khang and Minh Nhat; methodology, Gia Khang; soft-
ware, Gia Khang; validation, Minh Nhat; formal analysis, Gia Khang and
Minh Nhat; investigation, Gia Khang; resources, Minh Nhat; data curation,
Gia Khang; writing—original draft preparation, Minh Nhat; writing—review
and editing, Gia Khang and Minh Nhat; visualization, Gia Khang and Minh
Nhat; supervision, Gia Khang; project administration. All authors have read
and agreed to the Final Capstone Project document.
ABSTRACT
CONTENTS
ACKNOWLEDGMENTS
AUTHOR CONTRIBUTIONS
ABSTRACT
1 INTRODUCTION
  1.1 Overview
    1.1.1 Question and Answer System
  1.2 Main Topic
    1.2.1 Vietnamese Law
    1.2.2 Vietnamese Legal Text Retrieval
  1.3 Specific Works
2 RELATED WORKS
  2.1 Transformer
    2.1.1 Introduction
    2.1.2 Attention Mechanism
    2.1.3 Encoder-Decoder Architecture
    2.1.4 Transformer Components
    2.1.5 Training and Inference
  2.2 Sparse Retrieval
  2.3 Dense Retrieval
  2.4 Cross-encoder approaches
  2.5 Dual-encoder approaches
  2.6 Sequence-to-Sequence (Seq2Seq) for question answering
  2.7 Beam search
  2.8 Contrastive learning in Information Retrieval
    2.8.1 Query-document matching
    2.8.2 Learn a good representation for queries and documents
    2.8.3 Contrastive learning
    2.8.4 Distinguish between positive and negative pairs of text
    2.8.5 Positive and negative pairs of text sampling
6 DISCUSSIONS
7 CONCLUSIONS
8 REFERENCES
9 APPENDIX
LIST OF FIGURES
1 Overview of their Legal Document Retrieval system in [3]
2 Their proposed pipeline for training in [4]
3 Crawled data information.
4 Example of in-domain data selection.
5 A sample in the dataset with highlighted parts in [5]
6 Training flow
7 Masked Language Modeling examples
8 SBERT architecture with objective function
9 Overview about four versions in Vietnamese Legal Text Retrieval
10 Overview about training loss
11 Overview about pretrain Masked Language Model
12 Overview about SB-Condenser-300MB
13 Overview about SB-Condenser-300MB
14 Overview about SB-Condenser-300MB-Full (Round 1)
15 Overview about SB-Condenser-300MB-Lite (Round 1)
16 Inference flow
LIST OF TABLES
1 INTRODUCTION
1.1 Overview
1.1.1 Question and Answer System
A Question and Answer (QA) system is a natural language processing (NLP)
technology designed to process and respond to user queries or questions in a con-
versational manner. This system has gained popularity in recent years with the
emergence of virtual assistants and chatbots that provide personalized customer
support and assistance. QA systems operate by using pre-existing data sources
or knowledge bases to generate responses to user inquiries. The knowledge
base contains information relevant to the specific domain in which the system
operates. The system uses natural language processing (NLP) algorithms to
understand the question posed by the user and retrieve the most appropriate
response from the knowledge base.
The primary goal of a QA system is to provide accurate and relevant re-
sponses to user questions in a conversational manner, allowing users to interact
with the system in natural language. This technology is becoming increasingly
popular in the areas of customer service and support, where users can obtain
quick and accurate responses to their queries. QA systems can also be uti-
lized in educational and training applications, where users can ask questions
and receive immediate feedback. The system can also be trained to understand
multiple languages, making it possible to reach a wider audience.
One of the key challenges in developing QA systems is accurately interpret-
ing user queries or questions. Users may pose questions in a variety of ways and
use informal language that may not conform to standard grammatical rules.
The system must be able to identify the underlying intent of the question and
retrieve the appropriate response from the knowledge base. NLP algorithms
are used to analyze the language used by the user and identify the important
concepts in the question. The system then uses this information to generate
a response that is relevant and accurate. Overall, QA systems are a powerful
application of natural language processing technology. These systems have the
potential to revolutionize the way we interact with machines and access infor-
mation. By providing users with a conversational interface and the ability to
pose questions in natural language, QA systems are making it easier than ever
before to obtain accurate and relevant information. The continued development
of natural language processing technology is likely to result in even more sophis-
ticated QA systems in the future, enabling us to engage in increasingly complex
conversations with machines.
or may not contain the answer. QA retrieval techniques often involve natural
language processing and machine learning algorithms to analyze the question
and retrieve relevant information from the corpus of documents. QA retrieval
has numerous applications, such as virtual assistants, customer support, and
educational resources.
cepts, which can help improve the accuracy and efficiency of legal text retrieval.
However, the development of machine learning algorithms for Vietnamese legal
text retrieval requires significant resources and expertise.
Despite many challenges, researchers are developing specialized techniques
and algorithms for analyzing Vietnamese legal text and retrieving relevant docu-
ments. These tools play a crucial role in legal research and e-discovery, enabling
legal professionals to quickly search through vast amounts of legal data to find
the information they need. As the legal system in Vietnam continues to evolve,
the demand for efficient and effective legal text retrieval tools will only continue
to grow.
2 RELATED WORKS
2.1 Transformer
2.1.1 Introduction
Definition The Transformer is an architecture for natural language process-
ing tasks that was introduced in the paper "Attention is All You Need" [8]. It
is a type of neural network that is designed to process sequential data, such as
text, and has achieved state-of-the-art results in a variety of NLP tasks. The
Transformer architecture is based on the concept of attention, which allows the
network to selectively focus on certain parts of the input sequence when mak-
ing predictions. Unlike traditional sequence models, such as recurrent neural
networks (RNNs) and convolutional neural networks (CNNs), the Transformer
does not rely on recurrence or convolutions to process sequences, which makes
it more parallelizable and efficient. The Transformer consists of an encoder
and a decoder, each of which contains multiple layers of self-attention and feed-
forward neural networks. The encoder processes the input sequence, and the
decoder generates the output sequence. The self-attention mechanism allows
the Transformer to capture long-range dependencies and contextual informa-
tion in the input sequence, while the feedforward networks enable it to model
complex nonlinear relationships between the input and output. Overall, the
Transformer architecture has significantly improved the performance of NLP
models, especially in tasks that require understanding of long-range dependen-
cies and complex relationships between the input and output.
• Compute the Query, Key, and Value vectors: The query, key, and value
vectors are computed for each element in the input and output sequences.
These vectors are used to compute the attention weights and the weighted
sum. Let x be the input sequence of length n, and y be the output sequence
of length m. We compute the query vector qi, key vector kj, and value
vector vj for each input element xi and output element yj as follows:

qi = Wq ∗ xi    (1)
kj = Wk ∗ yj    (2)
vj = Wv ∗ yj    (3)

Here, Wq, Wk, and Wv are learned weight matrices.
• Compute the Attention Weights: The attention weights are computed by
applying a softmax function to the dot product of the query vector and
the key vector for each input element. The attention weights ai,j are
computed for each input element xi and output element yj as follows:
ai,j = softmax((qi ∗ kj) / √dk)    (4)

where dk is the dimension of the key vectors
• Compute the Weighted Sum: The weighted sum of the input elements is
computed by multiplying each input element by its corresponding atten-
tion weight and summing the results. The weighted sum cj of the input
elements is computed as follows:

cj = Σi (ai,j ∗ vi)    (5)
• Generate the Output: The weighted sum is used to generate the output
element. The output element yj is generated using the weighted sum cj:

yj = Wo ∗ [cj ; yj−1]

where [cj ; yj−1] is the concatenation of the weighted sum and the previous
output element yj−1, and Wo is a learned weight matrix.
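To make the steps above concrete, the following is a minimal PyTorch sketch of this attention computation. It is an illustration only: the tensor names, the dimension dk = 64, and the randomly initialized weight matrices are our own assumptions, not the implementation in [8].

    import torch
    import torch.nn.functional as F

    def scaled_dot_product_attention(x, y, d_k=64):
        """Toy scaled dot-product attention.
        x: (n, d_model) input sequence, y: (m, d_model) output sequence."""
        d_model = x.size(-1)
        # Learned projection matrices Wq, Wk, Wv (randomly initialized here).
        W_q = torch.randn(d_model, d_k)
        W_k = torch.randn(d_model, d_k)
        W_v = torch.randn(d_model, d_k)

        q = x @ W_q            # queries, one per input element (Eq. 1)
        k = y @ W_k            # keys, one per output element (Eq. 2)
        v = y @ W_v            # values, one per output element (Eq. 3)

        # Attention weights a_{i,j} = softmax(q_i . k_j / sqrt(d_k)) (Eq. 4)
        scores = q @ k.T / (d_k ** 0.5)
        a = F.softmax(scores, dim=-1)

        # Weighted sum of values using the attention weights (Eq. 5)
        c = a @ v
        return c, a

    x = torch.randn(5, 512)   # n = 5 input tokens
    y = torch.randn(3, 512)   # m = 3 output tokens
    c, a = scaled_dot_product_attention(x, y)
    print(c.shape, a.shape)   # torch.Size([5, 64]) torch.Size([5, 3])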
Advantages of the Attention Mechanism over Other Methods: the attention
mechanism has several advantages over other methods for processing sequential
data, including:

• Flexibility: The attention mechanism allows the network to selectively focus
on different parts of the input when making predictions. This allows the
network to adapt to different input sequences and capture complex relationships
between different parts of the sequence.

• Efficiency: The attention mechanism allows the network to process the input
sequence in parallel, rather than sequentially. This can lead to significant
speedups in training and inference times.

• Interpretability: The attention weights provide a measure of the importance
of each input element for a given output element. This can help to explain the
network's predictions and provide insights into how it is processing the input
sequence.
through a stack of encoder layers, each of which applies self-attention and feed-
forward layers to the input sequence to generate a sequence of encoder outputs.
These encoder outputs contain information about the input sequence that is
relevant for generating the output sequence. The output sequence is then gen-
erated by the decoder network, which also consists of a stack of decoder layers.
At each time step, the decoder takes the previously generated tokens (or the
start-of-sequence token for the first time step) and the context vector generated
by the encoder as input, and generates the next token in the output sequence.
The context vector is generated by performing attention over the encoder out-
puts, weighted by the current state of the decoder. The attention mechanism
used in the Transformer allows the model to focus on different parts of the input
sequence at each time step, based on the current state of the decoder. This en-
ables the model to capture long-range dependencies and generate accurate pre-
dictions for a variety of sequence-to-sequence tasks. During training, the model
is typically trained to minimize a loss function that measures the difference be-
tween the predicted output sequence and the ground truth output sequence.
The model parameters are updated using gradient descent, which iteratively
adjusts the model parameters to minimize the loss function. The Transformer
architecture uses the encoder-decoder architecture to generate predictions for
sequence-to-sequence tasks, with the attention mechanism enabling the model
to capture long-range dependencies and generate accurate predictions.
where,
• x is the input token
• Embedding is a function that maps the input token to its embedding
vector
• √dmodel is a scaling factor used to prevent the embeddings from becoming
too small or too large
• P E is the positional encoding vector for the input token
• Einput is the final input embedding vector for the input token
The dimensionality of the embedding vector is typically a hyperparameter that
is set based on the size of the input vocabulary and the complexity of the
task. In the Transformer architecture, the embedding dimensionality is often
set to be the same as the hidden layer dimensionality, dmodel , which is also a
hyperparameter.
where the weights are determined by the attention scores. The Multi-Head
Attention layer can be expressed mathematically as:

MultiHead(Q, K, V) = Concat(head1, ..., headh) WO

where,
• Q is the Query matrix
• K is the Key matrix
• V is the Value matrix
• headi = Attention(QWiQ , KWiK , V WiV ) is the attention output for the
i-th attention head
• WiQ , WiK , WiV are learnable weight matrices for the i-th attention head.
These have been briefly introduced before in Eq. 1, Eq. 2 and Eq. 3
• W O is a learnable weight matrix used to combine the outputs of the at-
tention heads
• Concat is a function that concatenates the outputs of the attention heads
along the last dimension
• h is the number of attention heads
• Attention(Q, K, V) = softmax(QK^T / √dk) V is the attention mechanism
between the Query, Key, and Value matrices
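As a rough illustration of how these components fit together, here is a hedged PyTorch sketch of multi-head attention. The class, dimensions, and batch layout are illustrative assumptions; in practice one would typically rely on torch.nn.MultiheadAttention.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultiHeadAttention(nn.Module):
        """Minimal multi-head attention: h parallel attention heads whose
        outputs are concatenated and projected by W^O."""
        def __init__(self, d_model=512, h=8):
            super().__init__()
            assert d_model % h == 0
            self.h, self.d_k = h, d_model // h
            # W_Q, W_K, W_V for all heads packed into single linear layers.
            self.w_q = nn.Linear(d_model, d_model)
            self.w_k = nn.Linear(d_model, d_model)
            self.w_v = nn.Linear(d_model, d_model)
            self.w_o = nn.Linear(d_model, d_model)   # W^O

        def forward(self, q, k, v):
            B, T, _ = q.shape
            # Project and split into h heads: (B, h, T, d_k)
            def split(x):
                return x.view(B, -1, self.h, self.d_k).transpose(1, 2)
            q, k, v = split(self.w_q(q)), split(self.w_k(k)), split(self.w_v(v))
            # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V per head
            scores = q @ k.transpose(-2, -1) / (self.d_k ** 0.5)
            heads = F.softmax(scores, dim=-1) @ v
            # Concatenate the heads and apply W^O
            concat = heads.transpose(1, 2).reshape(B, T, self.h * self.d_k)
            return self.w_o(concat)

    x = torch.randn(2, 10, 512)                  # batch of 2, sequence length 10
    print(MultiHeadAttention()(x, x, x).shape)   # torch.Size([2, 10, 512])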
The Positionwise Feedforward layer in the Encoder applies two linear trans-
formations with a ReLU activation function to each position independently and
identically. The non-linear dependencies between the hidden states that the
Positionwise Feedforward layer is designed to capture refer to the complex re-
lationships between the embedded tokens in the sequence. Each token in the
sequence is initially represented as an embedding vector, and these embeddings
are then transformed through the multiple layers of the Transformer architec-
ture to produce a final output sequence. The Positionwise Feedforward layer
is an essential component of the Transformer architecture because it allows the
model to capture non-linear relationships between the embedded tokens in the
sequence. This is important because natural language is inherently complex,
and the relationships between tokens in a sentence can be highly non-linear and
difficult to model using linear techniques. By introducing non-linearity into the
model through the use of the ReLU activation function, and by allowing the
model to learn a non-linear mapping between the input and output sequences
through the use of the two linear transformations, the Positionwise Feedfor-
ward layer enables the Transformer architecture to capture complex patterns in
the data and achieve state-of-the-art performance on a wide range of natural
language processing tasks. It can be expressed mathematically as:

FFN(x) = max(0, x W1 + b1) W2 + b2

where,
• x is the input vector
• W1 , b1 , W2 , b2 are learnable weight matrices and biases
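A minimal sketch of this position-wise feed-forward layer follows, assuming d_model = 512 and an inner dimension d_ff = 2048 as in [8]; everything else is illustrative.

    import torch
    import torch.nn as nn

    class PositionwiseFeedForward(nn.Module):
        """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each position independently."""
        def __init__(self, d_model=512, d_ff=2048):
            super().__init__()
            self.linear1 = nn.Linear(d_model, d_ff)   # W1, b1
            self.linear2 = nn.Linear(d_ff, d_model)   # W2, b2

        def forward(self, x):
            return self.linear2(torch.relu(self.linear1(x)))

    x = torch.randn(2, 10, 512)                 # (batch, sequence, d_model)
    print(PositionwiseFeedForward()(x).shape)   # torch.Size([2, 10, 512])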
The output of each layer in the Encoder is passed as input to the next layer
in the stack, allowing for the encoding of increasingly complex and abstract
representations of the input sequence. The equations for the Encoder can be
summarized as:
H0 = Einput    (12)
Hi = MultiHead(Hi−1 WiQ, Hi−1 WiK, Hi−1 WiV) + Hi−1    (13)
Hi = FFN(Hi) + Hi    (14)
where,
• Einput is the input sequence after the Input Embedding and Positional
Encoding layers
learns the embeddings of the source tokens during training. In contrast, the De-
coder has to generate tokens in the target language, which are also represented
as a sequence of embeddings. Therefore, the Decoder also has an additional
Input Embedding layer that is responsible for embedding the tokens of the tar-
get sequence into a continuous vector space, where each token is represented
as a dense vector. This layer also learns the embeddings of the target tokens
during training. The embeddings of the target tokens generated by the Input
Embedding layer are then fed into the subsequent layers of the Decoder, such
as the Masked Multi-Head Attention layer, the Multi-Head Attention layer, and
the Positionwise Feedforward layer, to generate the output sequence.
Output Layer The output layer is the final layer in the Transformer archi-
tecture and is responsible for producing the final output sequence. The out-
put layer takes as input the final hidden state of the decoder, which has been
processed through the multiple layers of the Transformer architecture, and pro-
duces a probability distribution over the vocabulary of the target language.
The output layer typically consists of a linear transformation followed by a soft-
max activation function. The linear transformation projects the final hidden
state of the decoder onto a high-dimensional space, and the softmax function
normalizes the resulting vector to produce a probability distribution over the
vocabulary. The output layer is designed to produce a probability distribution
over the vocabulary of the target language, which allows the model to generate
predictions for each token in the output sequence. During training, the model
is typically trained to maximize the log-likelihood of the target sequence given
the input sequence, which involves minimizing the cross-entropy loss between
the predicted distribution and the true distribution over the target vocabulary.
In summary, the output layer is an essential component of the Transformer ar-
chitecture because it enables the model to generate predictions for each token
in the output sequence by producing a probability distribution over the target
vocabulary. The output layer is typically implemented using a linear transfor-
mation followed by a softmax activation function and is trained to maximize
the log-likelihood of the target sequence given the input sequence.
• Forward Pass: During the forward pass, the input sequence is fed into the
encoder, and the output sequence is generated by the decoder. The output
sequence is then compared to the target sequence using a loss function,
such as cross-entropy loss.
• Backward Pass: During the backward pass, the gradients of the loss with
respect to the model parameters are computed using the chain rule of
calculus. The gradients are then backpropagated through the layers of
the Transformer architecture, from the output layer to the input layer.
• Parameter Update: After computing the gradients, the model parameters
are updated in the direction of the negative gradient using an optimization
algorithm, such as stochastic gradient descent (SGD), Adam, or Adagrad.
The learning rate hyperparameter determines the step size of the param-
eter update.
• Repeat: Steps 1-3 are repeated for multiple epochs, until the model con-
verges to a satisfactory solution.
During training, the gradients are typically computed using a technique called
teacher forcing, in which the decoder is fed the true target sequence as input
at each time step, rather than the predicted sequence from the previous time
step. Teacher forcing can speed up convergence during training, but can lead
to suboptimal performance at inference time when the model is required to
generate sequences without access to the true target sequence.
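The following is a hedged sketch of a single training step with teacher forcing. The model interface, token layout, and padding id are hypothetical placeholders, not the exact setup used later in this project.

    import torch
    import torch.nn as nn

    def train_step(model, optimizer, src, tgt, pad_id=0):
        """One forward/backward/update step with teacher forcing.
        src: (B, S) source token ids, tgt: (B, T) target token ids."""
        model.train()
        # Teacher forcing: the decoder sees the gold tokens shifted right,
        # and is trained to predict the next gold token at every position.
        decoder_input = tgt[:, :-1]
        labels = tgt[:, 1:]

        logits = model(src, decoder_input)           # (B, T-1, vocab_size)
        loss = nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            labels.reshape(-1),
            ignore_index=pad_id,                     # do not penalize padding
        )

        optimizer.zero_grad()
        loss.backward()                              # backward pass (chain rule)
        optimizer.step()                             # parameter update
        return loss.item()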
The loss function of the Transformer is typically the cross-entropy loss, which
measures the dissimilarity between the predicted probability distribution over
the target vocabulary and the true probability distribution over the target vo-
cabulary. The cross-entropy loss is defined as:
L = −(1/N) Σi=1..N Σj=1..M yij log(pij)    (15)
where N is the number of training examples, M is the size of the target vocab-
ulary, yij is the true probability of the j-th token in the i-th target sequence,
and pij is the predicted probability of the j-th token in the i-th target sequence.
During training, the goal is to minimize the cross-entropy loss with respect to
the model parameters, which can be achieved using backpropagation and gradi-
ent descent. The gradients of the loss with respect to the model parameters can
be computed using the chain rule of calculus and are backpropagated through
the layers of the Transformer architecture to update the model parameters. The
cross-entropy loss is a commonly used loss function in natural language process-
ing (NLP) tasks, such as machine translation, text classification, and language
modeling. It is well-suited for these tasks because it provides a measure of the
dissimilarity between the predicted and true probability distributions over the
target vocabulary, which is the main objective of NLP tasks.
The update equation for a parameter θ at iteration t is:
θt+1 = θt − η · gt (16)
where η is the learning rate, and gt is the gradient of the loss with respect
to θ at iteration t. The learning rate determines how quickly the parameters
are updated in response to the gradients. A high learning rate can result in
large updates that overshoot the optimal parameter values, while a low learning
rate can result in slow convergence and getting stuck in local minima. By
iteratively repeating these steps on a large dataset of input-target pairs, the
model gradually learns to produce accurate translations.
high. Sparse representations are generally more efficient to store and process
than dense representations, especially when the number of dimensions is very
large, as is often the case with text data. However, they may not capture the
full complexity of the data and may not perform as well as dense representations
on certain tasks, such as natural language understanding and text generation.
Sparse retrieval methods are mainly based on the relevance or similarity between
two documents, or between a query and documents, computed from keywords that
appear in both. Sparse retrieval is known for classic algorithms such as BM25
and TF-IDF, of which BM25 is generally the strongest in this approach. The
biggest disadvantage of sparse retrieval is that the surface representation of
the vocabulary directly and significantly affects the performance of the
algorithms, a problem known as lexical mismatching. For example, "Civil law" and
"civil law" have the same meaning but are two different strings because the "c"
character is capitalized. In addition, to use these algorithms, the input string
must be preprocessed by removing stopwords, which means that the grammatical
structure of the sentence, a very important part of its semantic representation,
is not considered. In particular, compound words in Vietnamese change their
meaning if they are split into separate words. Therefore, depending on the
language, the sparse retrieval approach requires language-specific treatment.
BM25 BM25 (Best Match 25) is a ranking function used in information re-
trieval that is a variant of the TF-IDF weighting scheme. It is often used in
search engines to rank the relevance of documents to a particular search query.
Like TF-IDF, BM25 calculates a score for each document in a corpus based on
the frequency of its query terms in the document, but it also takes into account
document length and term frequency saturation. The formula for BM25 score
is given by:
BM25(q, d) = Σi=1..|q| IDF(qi) · [f(qi, d) · (k + 1)] / [f(qi, d) + k · (1 − b + b · |d| / avgdl)]    (20)
where
• q is the query
• d is a document in the corpus
• |q| is the length of the query
• IDF(qi ) is the IDF of the i-th query term qi
where the weight for each term is multiplied by the weight of the field it
appears in.
• Ranking: The documents are ranked in descending order based on their
scores, and the top-ranked documents are returned as the search results.
BM25+ can be a more effective ranking algorithm than basic BM25 when docu-
ments contain multiple fields, since it takes into account the varying importance
of each field for the query. However, BM25+ also has more tunable parame-
ters than basic BM25, since weights must be assigned to each field. BM25+
combines the basic BM25 formula with field weights. Here’s the formula for
computing the BM25+ score for a document:
BM25+(q, d) = Σi=1..|q| IDF(qi) · [wi · f(qi, d) · (k + 1)] / [f(qi, d) + k · (1 − b + b · |d| / avgdl)]    (21)
where
• q is the query
• d is a document in the corpus
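To make the scoring formula concrete, here is a small self-contained sketch of basic BM25 over a toy corpus. The whitespace tokenization and the parameter values k = 1.5, b = 0.75 are assumptions for illustration, and the field weights of BM25+ are omitted.

    import math
    from collections import Counter

    def bm25_scores(query, docs, k=1.5, b=0.75):
        """Score every document in `docs` against `query` with basic BM25."""
        tokenized = [d.lower().split() for d in docs]
        avgdl = sum(len(d) for d in tokenized) / len(tokenized)
        N = len(tokenized)

        def idf(term):
            n_t = sum(1 for d in tokenized if term in d)   # document frequency
            return math.log((N - n_t + 0.5) / (n_t + 0.5) + 1)

        scores = []
        for doc in tokenized:
            tf = Counter(doc)
            score = 0.0
            for term in query.lower().split():
                f = tf[term]
                denom = f + k * (1 - b + b * len(doc) / avgdl)
                score += idf(term) * f * (k + 1) / denom
            scores.append(score)
        return scores

    docs = ["luật dân sự quy định về hợp đồng",
            "luật hình sự quy định về tội phạm",
            "thông tư hướng dẫn thuế thu nhập"]
    print(bm25_scores("luật dân sự", docs))   # highest score for the first document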
the need for explicit supervision or labeled data. Contrastive learning is a self-
supervised learning method that learns representations by contrasting similar
and dissimilar pairs of data.
In contrastive learning, we start by creating pairs of similar and dissimilar
data points. For example, we can create a pair of similar data points by tak-
ing two different augmentations of the same image, or we can create a pair of
dissimilar data points by taking an image and a completely different image.
The idea is to learn a representation that maps similar data points close
together in the embedding space and dissimilar data points far apart. This is
achieved by optimizing a contrastive loss function that encourages the model to
minimize the distance between similar data points while maximizing the distance
between dissimilar data points.
Contrastive learning has shown to be effective in a variety of tasks, includ-
ing image classification, object detection, and natural language processing. It
has also been shown to be effective in learning representations for information
retrieval tasks, such as dense passage retrieval.
One reason why contrastive learning is popular is because it does not require
labeled data, which can be expensive and time-consuming to obtain. Instead,
it can use unlabeled data, which is typically abundant and easy to obtain.
Additionally, contrastive learning can leverage the vast amounts of unlabeled
data available on the internet to learn general-purpose representations that can
be fine-tuned on specific tasks. This makes contrastive learning an attractive
approach for learning representations for a wide range of applications, including
information retrieval.
similarity scores in the contrastive loss function. The choice of similarity score
depends on the nature of the data and the task at hand. Cosine similarity is a
commonly used similarity score in contrastive learning because it is a robust and
efficient way to measure the similarity between two vectors in high-dimensional
spaces. It is particularly effective when the data is sparse, and the magnitude of
the vectors is not important. However, depending on the task, other similarity
scores can be used. For example, in image retrieval, the Euclidean distance
or the Manhattan distance can be used instead of cosine similarity. In text
retrieval, the Jaccard similarity or the Dice similarity can be used instead of
cosine similarity. The choice of similarity score can be problem-specific and
may require experimentation to find the best option. In general, the similarity
score should be chosen based on its ability to capture the semantic similarity
between the data points and its compatibility with the chosen representation
function.
The loss function encourages the model to minimize the distance between the
embeddings of the query and relevant document while maximizing the distance
between the embeddings of the query and irrelevant documents. In other words,
it tries to pull the embeddings of positive pairs closer together while pushing the
embeddings of negative pairs farther apart. During training, the model learns
to distinguish between positive and negative pairs by optimizing the contrastive
loss function using stochastic gradient descent or other optimization algorithms.
The learned representation can then be used for dense passage retrieval by
computing the similarity between the query and all documents in the corpus
using the learned embeddings.
Based on the contrastive loss function in Eq. 36, qpos represents a positive
query, dpos represents a positive document that is relevant to the query, and
Dneg represents a set of negative documents that are irrelevant to the query.
The function f is a scoring function that takes as input a query-document pair
and outputs a relevance score. The loss function is computed as the negative
logarithm of the fraction

exp(f(qpos, dpos)) / (exp(f(qpos, dpos)) + Σ_{dneg ∈ Dneg} exp(f(qpos, dneg)))

The numerator is the exponentiated score of the positive pair, so minimizing the
loss pushes the score of the positive pair up relative to the scores of the
negative pairs.
• For each query and its relevant document, create a positive pair (question,
relevant document).
• For each query and negative document, create a negative pair (question,
irrelevant document).
• Repeat steps 3 and 4 for all queries q and their relevant and negative
documents.
The number of negative documents sampled for each query can be tuned
depending on the dataset size and the difficulty of the task. In general, a
larger number of negative documents can make the training more challenging
but also more effective. It is important to note that this approach assumes
the availability of labeled data, which is not always the case in information
retrieval. In such cases, unsupervised methods such as clustering or density-
based sampling can be used to generate positive and negative pairs based on
the similarity of the queries and documents in the corpus.
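As an illustration of this pair-construction procedure, the sketch below builds positive and negative (query, document) pairs from labeled data and computes the contrastive loss above with a dot-product scoring function. The encode function and data structures are hypothetical placeholders.

    import random
    import torch
    import torch.nn.functional as F

    def build_pairs(labeled_data, corpus, num_negatives=4):
        """labeled_data: list of (query, relevant_doc); corpus: list of all docs.
        Returns (query, positive_doc, [negative_docs]) triples."""
        triples = []
        for query, pos_doc in labeled_data:
            candidates = [d for d in corpus if d != pos_doc]
            negs = random.sample(candidates, num_negatives)   # random negatives
            triples.append((query, pos_doc, negs))
        return triples

    def contrastive_loss(encode, query, pos_doc, neg_docs):
        """-log( exp(f(q, d+)) / (exp(f(q, d+)) + sum exp(f(q, d-))) )
        where f is the dot product of the encoded query and document."""
        q = encode(query)                                      # (dim,)
        d_pos = encode(pos_doc)
        d_negs = torch.stack([encode(d) for d in neg_docs])    # (num_neg, dim)
        pos_score = q @ d_pos
        neg_scores = d_negs @ q
        logits = torch.cat([pos_score.unsqueeze(0), neg_scores])
        # Cross-entropy with target index 0 is exactly the negative log fraction.
        return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))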
• In the first few weeks, the main tasks were searching for literature and
studying methods from related articles to gain different perspectives and
details.
• In weeks 2 and 3, the team’s main task was collecting appropriate datasets
for the topic "Vietnamese legal document retrieval", specifically the Zalo Legal
2021 dataset, and also collecting data from reputable legal websites. The
data was then processed by filtering noise and rearranging to fit the next
training phase.
• From week 3 to 7, the team focused on analyzing query architectures and
evaluating their effectiveness such as DPR or Condenser. Additionally,
three experiments from versions V.0.1, V.0.2, and V.1.0 were completed to
gain a comprehensive overview of the chosen model which was Condenser
due to feasibility and efficiency on the Vietnamese language.
• From week 8 to 11, our work was to find hardware solutions such as renting
and setting up GPUs to ensure progress in our training. We also created
a small demo product for our project. Finally, we developed a "Question
Answering" system to increase efficiency and provide better visualization
of the results from previous models.
• In the remaining weeks, we evaluated the results and summarized our
report, as well as wrote a paper for the scientific conference in Indonesia,
ISICO 2023.
Timeline
Week 12 (5.1): Write the project report and summarize the best results for the
scientific conference paper.
Week 13 (5.2): Refine the report and create slides for the final project defense.
• Regarding the query model, Khang will be the main responsible person.
However, we will still study two query structures, DPR and Condenser.
Khang will develop the first version, V.0.1, and hand it over to Nhat for
further development in subsequent versions, V.0.2 and V.1.0.
• For the remaining parts, we will focus on researching and training the
"Question Answering" model, writing a thesis report, writing articles for
scientific conferences, and creating a prototype to introduce the product.
and convolutional neural networks to capture these properties. While the first
component is meant to extract important information from the input question
and legal documents, the second component is used to match and present the
most important parts.
In the paper "Miko Team: Deep Learning Approach for Legal Question
Answering in ALQAC 2022" [3], the authors fine-tuned RoBERTa with 4GB of legal
text data to enhance its performance in the legal domain. They used two distinct
approaches to gather this data: collecting it directly from two websites, and
extracting sentences close to legal topics from a news corpus. Firstly, Table
I provides the collected data for the first method, which includes the number of
articles and sentences. This number represents the number of sentences that are
retained following the selection process (for example, removing redundant and
non-Vietnamese sentences). Secondly, in order to extract legal documents from
the news corpus using the second approach, which is based on the work in [23],
we first create a collection of legal documents known as "in-domain" data. On
this in-domain dataset, we then construct a statistical language model (the base
language model). Using this model to score each sentence, we select sentences in
another corpus whose perplexity falls within a threshold. Perplexity is a
popular metric for evaluating language models, and the perplexity score tells us
how effective our language model is in our data domain. Since it was learned
from "in-domain" data, we assume that our base language model is adequate in
this instance.
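This in-domain selection step can be illustrated with a small self-contained sketch. For simplicity it uses a unigram language model with add-one smoothing built from the in-domain sentences rather than the actual base language model in [3], and the threshold value is an arbitrary placeholder.

    import math
    from collections import Counter

    def build_unigram_lm(in_domain_sentences):
        """Unigram probabilities with add-one smoothing, learned from in-domain text."""
        counts = Counter(tok for s in in_domain_sentences for tok in s.lower().split())
        total = sum(counts.values())
        vocab = len(counts) + 1
        return lambda tok: (counts[tok] + 1) / (total + vocab)

    def perplexity(prob, sentence):
        tokens = sentence.lower().split()
        log_prob = sum(math.log(prob(tok)) for tok in tokens)
        return math.exp(-log_prob / max(len(tokens), 1))

    # Select sentences from a general corpus whose perplexity under the
    # in-domain model falls below a chosen threshold.
    in_domain = ["điều này quy định về hợp đồng dân sự",
                 "bộ luật dân sự quy định quyền và nghĩa vụ"]
    general_corpus = ["bộ luật quy định nghĩa vụ của các bên",
                      "đội bóng giành chiến thắng trong trận đấu"]

    prob = build_unigram_lm(in_domain)
    threshold = 25.0                      # placeholder; tuned on held-out data
    selected = [s for s in general_corpus if perplexity(prob, s) < threshold]
    print(selected)                       # keeps only the legal-sounding sentence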
used for applications like computer vision and natural language processing. It
is open-source software that is available for free under a modified BSD license.
PyTorch features a C++ interface, even though the Python interface is more
refined and the main focus of development. Using PyTorch, a programmer can
easily create a sophisticated neural network because its primary data format,
Tensor, is a multi-dimensional array similar to Numpy arrays. Due to PyTorch’s
flexibility, speed, and ease of use, it is becoming more and more popular in both
the business world and among researchers. PyTorch is one of the best deep
learning tools. For a variety of reasons, PyTorch has emerged as one of the
top machine learning frameworks: wide availability, accelerated model
development, quick training, support for high-quality GPU training, and a
robust ecosystem.
4.1.3 Hardware
Cloud Service Free GPU-enabled platforms include Google Colab and Kag-
gle. However, there are still many limitations affecting the work progress.
Specifically, the regular version of Google Colab only allows the GPU to be used
for a limited amount of compute; the Retriever module can only be trained for a
period of 4-5 hours. For the Colab Pro version, the GPU provided is a T4 or a
P100, one of which is randomly allocated, and the amount of GPU compute is
significantly increased. Pretrain
it is important for users to be aware that their code and data may be accessi-
ble to other users who have access to the same virtual machine. With the free
version, we are supplied one T4 GPU. With the Pro version, we also have one T4
GPU but with up to 24 hours of active runtime. With the Pro+ version, we get one
A100 GPU, but it can only be used for a limited time until the computing units
run out, after which we fall back to the NVIDIA T4.
and convenient tool for data scientists and machine learning practitioners. It
offers a way to access powerful hardware resources and collaborate with oth-
ers on Jupyter notebooks, all without requiring expensive hardware or software
licenses.
FPT Cloud FPT Smart Cloud (FCI) – a member of FPT Corporation, the
leading provider of Artificial Intelligence (AI) Cloud Computing (Cloud Com-
rize existing notes, conduct daily standups, change the tone, translate, or check
material. It also includes AI capabilities and a library of free and fee-based
templates. For their Business and Enterprise tiers, security features include sin-
gle sign-on with Security Assertion Markup Language and private team areas.
SaaS tools including GitHub, GitLab, Zoom, Lucid Software, Cisco Webex, and
Typeform are all integrated with Notion.
4.2 Methods
Vietnamese Legal Text Retrieval is developed mainly around a Retriever module
whose input is a question represented as a string. There is a knowledge base,
which in this case consists of legal documents. The question is converted to an
embedding vector and compared with all of the embeddings in the knowledge base
(each text in the knowledge base has its own corresponding embedding) by
computing similarity scores such as the dot product, cosine similarity,
Euclidean distance, and so on. Relevant legal documents are then returned either
as a long text or just as the title of the circulars and decrees. In practice,
users need to look up the legal answer in a convenient way: returning only
titles is implicit and unsatisfying, while returning the full text of circulars
and decrees contains redundant information that is not necessary to answer the
user's question. Therefore, a question answering model is created to extract the
main idea from the returned circular and decree text. Figure 6 shows that this
project focuses on pretraining a language model in-domain with masked language
modeling, due to the lack of Vietnamese legal text retrieval data for training
the model, as explained in Section 4.2.2, while Section 4.2.1 shows how to
initialize the training dataset for the Retriever module. Instead of using
BERT-based models or any Transformer encoder-based models directly to build
Sentence Transformers, which excel at the semantic textual similarity task
explained in Section 4.2.5, this project uses Condenser and coCondenser,
explained in Sections 4.2.3 and 4.2.4, to improve the performance of any
Transformer encoder-based model.
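A hedged sketch of this embedding-based retrieval step is shown below. The encode function stands in for a sentence encoder that returns L2-normalized vectors; names and shapes are placeholders rather than the final system.

    import torch

    def retrieve(encode, question, corpus, corpus_embeddings=None, top_k=3):
        """Return the top_k most similar legal documents for a question.
        encode: callable mapping a string to an L2-normalized embedding tensor."""
        if corpus_embeddings is None:
            # Each document in the knowledge base has its own embedding,
            # computed once and reused across queries.
            corpus_embeddings = torch.stack([encode(doc) for doc in corpus])
        q = encode(question)
        # With normalized vectors the dot product equals cosine similarity.
        scores = corpus_embeddings @ q
        best = torch.topk(scores, k=min(top_k, len(corpus)))
        return [(corpus[i], scores[i].item()) for i in best.indices.tolist()]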
(Figure 6: Training flow. The legal corpus and the BM25+ negative pairs from
round 1 and round 2 feed a pipeline from PhoBERT through Condenser and
coCondenser to SentenceBERT, with checkpoints passed along the dataflow.)
(Figure 7: Masked Language Modeling example. The input "CLS Tôi học ở đại học
FPT" has 15% of its tokens randomly masked, giving "CLS Tôi học ở [MASK] học
FPT". The masked sequence is passed through the BERT encoder blocks and an
FFNN + Softmax head, which predicts the masked token, e.g. "đại" 10%,
"trường" 3%, "môn" 0.1%.)
[3] but less than it. In this way, we map an open-domain PhoBERT into an
in-domain legal language model. This is especially useful because we cannot
pretrain a full PhoBERT from scratch on a large legal dataset. After that, the
new legal PhoBERT is used to train Condenser in the next section.
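The masking procedure used for this in-domain MLM pretraining can be sketched as follows. This is a simplified stand-in for the actual data pipeline: the 15% masking rate and the 80/10/10 split follow the standard BERT recipe, while the token ids, mask token id, and vocabulary size are hypothetical.

    import torch

    def mask_tokens(input_ids, mask_token_id, vocab_size, special_ids, mlm_prob=0.15):
        """Randomly select 15% of tokens as MLM targets; of those, 80% become
        [MASK], 10% a random token, 10% stay unchanged (standard BERT recipe)."""
        labels = input_ids.clone()
        prob = torch.full(labels.shape, mlm_prob)
        for sid in special_ids:                       # never mask [CLS], [SEP], padding
            prob[input_ids == sid] = 0.0
        masked = torch.bernoulli(prob).bool()
        labels[~masked] = -100                        # ignored by the MLM loss

        replace = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked
        input_ids[replace] = mask_token_id
        random_tok = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked & ~replace
        input_ids[random_tok] = torch.randint(vocab_size, labels.shape)[random_tok]
        return input_ids, labels

    ids = torch.tensor([[0, 1205, 734, 88, 412, 734, 950, 2]])   # toy token ids
    print(mask_tokens(ids.clone(), mask_token_id=4, vocab_size=64000, special_ids=[0, 2]))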
L_mlm^condenser = Σ_{i∈M} CrossEntropy(ŷ_i^head, t_i)    (28)
This new architecture design utilizes the meaningful CLS embedding at the last
layer of an encoder model, and avoids losing information from the other tokens
by taking them from the middle layers. Condenser tries to make the CLS hidden
state more powerful. Because the head can only receive new information about the
token forms through the late CLS hidden state, the backbone must aggregate the
freshly generated information into that state, and the head must condition on
the late CLS representation to make its LM predictions. By linking the early
layers to the late layer, CLS is freed from having to encode local information
and the incoming text's syntactic structure, allowing CLS to concentrate on the
overall meaning of the text. This informational division is controlled by the
number of early and late layers.
The head layer helps Condenser encode the meaning of a sentence as well as
possible. However, it exists only in the pretraining phase and is dropped during
fine-tuning. When fine-tuning Condenser on Semantic Textual Similarity or any
related task, the late-layer CLS hidden state, strengthened during pretraining,
is now fine-tuned on the specific task, and gradients are backpropagated into
the backbone. In other words, Condenser acts as a regular Transformer encoder
model during fine-tuning; it still has the same architecture as the Transformer.
The difference is that Condenser adds an extra step to strengthen the CLS token,
the representative token of a sentence, by pretraining the LM backbone again
with some added layers, after which the model returns to the standard
Transformer architecture for fine-tuning and inference.
In this project, we chose to initialize Condenser with a PhoBERT that has
already been pretrained in-domain on the legal corpus, and to initialize the
head randomly. Our backbone PhoBERT has 12 encoder layers, which are divided
into the first 6 layers as early layers and the last 6 layers as late layers,
with 2 additional head layers created on top. By eliminating the enormous cost
of pretraining from scratch, this fits within our computing budget. The result
is used to train Sentence BERT, which is explained in Section 4.2.5. We impose a
semantic restriction by also running MLM on the backbone's late outputs, to stop
gradients backpropagated from the randomly initialized head from distorting the
backbone weights.
L_mlm^constrain = Σ_{i∈M} CrossEntropy(W h_i^late, t_i)    (29)

This restriction is justified by the assumption that encoding the per-token
representations h_i^late with i ≠ 0 and the sequence representation h_0^late,
where i = 0 is the position of the [CLS] token, work in a comparable way and
will not conflict. h_i^late with i ≠ 0 is therefore still applicable for LM
prediction. Therefore, the total loss is determined as the sum of the two MLM
losses:

L_total = L_mlm^condenser + L_mlm^constrain    (30)
The two MLM losses share the output projection matrix W, which lowers the
overall number of parameters and memory requirements.
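A hedged sketch of how the two losses in Eq. 28-30 could be combined is shown below, assuming the head outputs and backbone late outputs have already been computed; the function and tensor names are placeholders, not the actual Condenser implementation.

    import torch
    import torch.nn as nn

    def condenser_total_loss(head_hidden, late_hidden, W, labels):
        """L_total = L_mlm^condenser + L_mlm^constrain (Eq. 30).
        head_hidden: (B, T, d) outputs of the Condenser head,
        late_hidden: (B, T, d) late outputs of the backbone,
        W: shared output projection of shape (vocab_size, d),
        labels: (B, T) masked-token targets, -100 outside the masked set M."""
        loss_fn = nn.CrossEntropyLoss(ignore_index=-100)
        vocab_size = W.size(0)

        head_logits = head_hidden @ W.T          # shared projection matrix W
        late_logits = late_hidden @ W.T
        l_condenser = loss_fn(head_logits.view(-1, vocab_size), labels.view(-1))
        l_constrain = loss_fn(late_logits.view(-1, vocab_size), labels.view(-1))
        return l_condenser + l_constrain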
contrastive loss used in SimCLR [36], where the goal is to learn representations
by contrasting positive and negative pairs of augmented examples. In this case,
the positive pairs are the two spans from the same document, and the negative
pairs are the spans from different documents. The use of the contrastive loss
is a form of noise contrastive estimation (NCE), which is a technique used to
estimate the probability distribution of a random variable by contrasting it with
a noise distribution.
Noise Contrastive Estimation (NCE) is a technique for estimating the prob-
ability of a certain event or occurrence, based on a limited set of observations.
This is often used in machine learning and natural language processing, where
we want to estimate the probability of a certain word or phrase appearing in
a given context. The basic idea behind NCE is to train a model to distinguish
between "real" and "fake" examples of the event or occurrence we are interested
in. The model is given a set of "real" examples, and a larger set of "fake" exam-
ples, which are generated by adding some random noise to the real examples.
The model is then trained to predict whether a given example is real or fake,
based on its features. For example, let’s say we want to estimate the probability
of the word "cat" appearing in a sentence. We could train a model using NCE
by giving it a set of real examples, which are sentences that contain the word
"cat", and a larger set of fake examples, which are sentences that don’t con-
tain the word "cat", but are generated by randomly adding or changing some
words in the real examples. The model is then trained to distinguish between
real and fake sentences, based on their features (e.g. the presence or absence
of certain words). Once the model is trained, we can use it to estimate the
probability of the word "cat" appearing in a new sentence. We simply give
the model the features of the new sentence, and it outputs a probability score,
indicating how likely it is that the sentence contains the word "cat". he main
difference between NCE and training with hard negative or in-batch negative
when using contrastive learning is in how the negative samples are chosen. In
hard negative mining, the negative samples are chosen to be the hardest ones to
classify correctly among a set of randomly selected candidates. In other words,
the algorithm selects samples that are most similar to the positive sample, but
are still labeled as negative. In-batch negative mining, on the other hand, se-
lects negative samples from within the same batch as the positive sample. This
means that the negative samples are taken from the same set of data that the
model is currently training on. In contrast, NCE selects negative samples from
a noise distribution that is different from the training data. The noise distribu-
tion is designed to be easy to sample from, but has a different distribution than
the training data. By sampling negative examples from the noise distribution,
the model is forced to learn to discriminate between the true training data and
the noise distribution. This can help prevent the model from overfitting to the
training data and can lead to more generalizable representations. Here’s an ex-
ample to help illustrate the difference between the three approaches: Suppose
we have a dataset of images of cats and dogs, and we want to train a model to
classify them correctly. With hard negative mining, we would randomly select a
negative sample, and then choose the sample that is most similar to the positive
sample, but still labeled as negative. With in-batch negative mining, we would
select a negative sample from within the same batch of images that the posi-
tive sample is in. With NCE, we would sample negative examples from a noise
distribution, such as a distribution of random noise images or a distribution of
images of objects that are not cats or dogs.
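The difference between in-batch negatives and hard negatives can also be sketched in code. This is a toy illustration with randomly generated embeddings; in practice the embeddings would come from the trained encoders.

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    q = F.normalize(torch.randn(4, 128), dim=-1)       # 4 query embeddings
    d = F.normalize(torch.randn(4, 128), dim=-1)       # their 4 positive documents

    # In-batch negatives: every other document in the batch is a negative,
    # so the similarity matrix has the positives on its diagonal.
    sim = q @ d.T                                        # (4, 4)
    in_batch_loss = F.cross_entropy(sim, torch.arange(4))

    # Hard negatives: for each query, pick the most similar *non-positive*
    # document as an extra, harder negative example.
    masked = sim.clone()
    masked.fill_diagonal_(float("-inf"))
    hard_negative_idx = masked.argmax(dim=1)
    print(in_batch_loss.item(), hard_negative_idx.tolist())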
The authors of the coCondenser paper noted that the contrastive loss used
in their approach is similar to the one used in NCE, where the loss function aims
to distinguish between true samples (i.e., samples from the actual data distribu-
tion) and noise samples (i.e., samples generated from a noise distribution). In
the context of coCondenser, the authors use random span sampling as a form of
noise contrastive estimation, where random spans from the input documents are
used as negative samples during training. The contrastive loss function is then
used to encourage the model to differentiate between true positive samples (i.e.,
spans that are semantically related) and negative samples (i.e., random spans
that are not semantically related). Therefore, the coCondenser approach can
be seen as providing an NCE narrative, where the noise samples are generated
through random span sampling, and the contrastive loss is used to encourage
the model to distinguish between positive and negative samples. Let’s say we
have a document that talks about the benefits of exercise, and we want to use
coCondenser to generate embeddings for the different spans of text in this doc-
ument. For instance, one span could be "running is good for your heart" and
another span could be "lifting weights can help build muscle". These two spans
are semantically related, since they both talk about the benefits of different
types of exercise. On the other hand, a random span in the same document
that is not semantically related could be "the sky is blue". This span is not
related to the topic of exercise and would not be used in the training process.
During training, coCondenser takes pairs of spans from the same document and
tries to learn to distinguish between semantically related and unrelated spans
using the contrastive loss. This helps it to generate embeddings that capture the
semantic meaning of the text. The coCondenser model assumes that spans that
are not semantically related have a low probability of being selected in a random
sampling process. This is because the model uses a corpus-aware contrastive
loss that compares the similarity between the representations of different spans
across the entire corpus. In other words, if two spans are not semantically re-
lated, they are likely to have different representations across the corpus, and
therefore the contrastive loss will penalize the model for treating them as simi-
lar. Of course, this assumption is not perfect, and it is possible that two spans
that are not semantically related may have similar representations by chance.
However, the coCondenser model is designed to minimize the impact of these
cases by using a large corpus and random sampling of spans. The training data
for coCondenser is sampled from spans taken from different documents in a cor-
pus. The purpose of sampling from a corpus is to ensure that the model learns
to represent information in a generalizable way rather than just memorizing
specific instances in the training data. So, although the data used to train co-
Condenser comes from a corpus, it is not trained on the entire corpus at once,
and the training data is sampled in a way that promotes generalization.
The authors are explaining the approach they have taken in training their
coCondenser model. They start by stating that they use random spans as
surrogates of passages, which means that they randomly select spans of text
from different documents as their training data. They then enforce the dis-
tributional hypothesis through noise contrastive estimation (NCE), which is a
method commonly used in word embedding learning (e.g., Word2Vec) to learn
representations of words based on their co-occurrence patterns in a corpus. By
using NCE, the authors aim to ensure that the model learns to distinguish be-
tween semantically related and unrelated spans. They then go on to explain
that this approach can also be seen as a span-level language model objective,
similar to the popular "skip-gram" model used in word embedding learning. Fi-
nally, they state that the batch’s loss is defined as an average sum of MLM and
contrastive loss, which can also be seen as word and span LM loss, respectively.
The authors of coCondenser are drawing a comparison between their proposed
span-level language model objective and the popular "skip-gram" model used
in word embedding learning, such as in Word2Vec (Mikolov et al., 2013). In
the "skip-gram" model, the goal is to learn word embeddings by predicting the
context words surrounding a target word within a fixed window size. This can
be seen as a type of language modeling, where the target word is the input
and the context words are the output. Similarly, in coCondenser, the authors
propose to use spans as surrogates of passages and enforce the distributional
hypothesis through NCE, which is a form of contrastive learning. The MLM
loss of the span-level language model is similar to the word-level language model
in "skip-gram", and the contrastive loss helps to further refine the embedding
space by distinguishing related spans from unrelated ones. So, by making this
comparison to "skip-gram", the authors are highlighting the similarity of their
proposed approach to the widely-used technique in word embedding learning.
The authors of the coCondenser paper use two types of losses to train their
model: MLM loss and contrastive loss. The MLM loss is a standard loss used in
many language models, which aims to predict the masked tokens in a sequence
based on the context provided by the other tokens. In coCondenser, the MLM
loss is computed for each span separately, denoted as L_ij^mlm for span s_ij. The
contrastive loss, as we discussed earlier, aims to distinguish between semantically
related spans and randomly selected spans. The contrastive loss is computed
based on pairs of spans selected from the training corpus, and it penalizes the
model if the similarity between a semantically related pair is lower than that
of a randomly selected pair. The contrastive loss for a pair of spans is denoted
as L_ij^co. To combine the two losses, the authors define the batch's loss as an
average sum of the MLM loss and the contrastive loss. That is, for a batch of
spans, the total loss is calculated as follows:

L = (1 / 2n) Σ_{i=1..n} Σ_{j=1..2} [L_ij^mlm + L_ij^co]    (32)
In other words, the batch’s loss is a weighted average of the MLM loss for each
individual span in the batch, and the contrastive loss for pairs of spans in the
batch. This loss function encourages the coCondenser model to learn both the
semantic relationships between spans and the contextual information within
each span.
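A hedged sketch of this batch loss is given below, assuming the per-span MLM and contrastive losses have already been computed elsewhere; the shapes are illustrative and this is not the authors' code.

    import torch

    def cocondenser_batch_loss(mlm_losses, co_losses):
        """L = (1 / 2n) * sum_{i=1..n} sum_{j=1..2} [L_ij^mlm + L_ij^co]  (Eq. 32)
        mlm_losses, co_losses: tensors of shape (n, 2), one entry per span
        (two spans are sampled from each of the n documents in the batch)."""
        n = mlm_losses.size(0)
        return (mlm_losses + co_losses).sum() / (2 * n)

    # Example with n = 3 documents, 2 spans each.
    mlm = torch.rand(3, 2)
    co = torch.rand(3, 2)
    print(cocondenser_batch_loss(mlm, co))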
In the paper, the authors refer to "unsupervised factors" as the underlying
semantic structure that is present in unannotated text data. These factors
capture the relationships between words and phrases in the language, such as
synonyms, antonyms, and related concepts, and can be learned by a language
model trained on a large corpus of text data. The coCondenser model is designed
to capture these unsupervised factors by using a contrastive loss to learn to
distinguish between related and unrelated spans of text. By doing so, it can
create a more effective embedding space that can be used in downstream natural
language processing tasks.
Stochastic gradient estimators (SGEs) are methods used to estimate the
gradient of the loss function with respect to the model parameters using only
a subset (or a single example) of the training data at each iteration. The idea
is to randomly sample a subset of the training data, called a mini-batch, to
compute an approximate gradient and update the model parameters. This pro-
cess is repeated for multiple iterations until the model converges to a minimum
of the loss function. Gradient estimators are algorithms used to estimate the
gradient of a function with respect to its parameters. In machine learning, the
gradient is used to update the model parameters during training to minimize
the loss function. There are several gradient estimators, including the stochas-
tic gradient estimator, which is commonly used in deep learning. The authors
of the coCondenser paper mention "large-batch unsupervised pretraining" as
a way to construct effective stochastic gradient estimators for the contrastive
loss. This means that they first pretrain their model on a large dataset using
unsupervised methods, without any specific task or objective in mind. The
purpose of this pretraining is to teach the model to understand the structure
and patterns of language in a general sense, so that it can later be applied to
specific tasks more effectively. Once the pretraining is done, the model is then
fine-tuned on a specific task or dataset, such as a question-answering task or a
language generation task. The fine-tuning is done using small batches of data,
which allows for more efficient training and better generalization to new data.
The authors argue that the large-batch unsupervised pretraining step is cru-
cial for achieving good performance on downstream tasks, because it helps the
model learn useful representations of language that can be applied to a wide
range of tasks. They also suggest that this pretraining step can be done once
on a large dataset, and the resulting model can be reused or adapted for dif-
ferent downstream tasks, which saves time and resources compared to training
a new model from scratch for each task. The contrastive loss requires fitting
the large batch into GPU memory, which can be challenging with limited re-
sources, such as a machine with only four commercial GPUs. To overcome this
memory constraint and perform effective contrastive learning, they incorporate
a technique called gradient caching [cite]. Gradient caching involves storing
intermediate gradient values for the parameters during training, allowing the
system to perform more efficient backpropagation during subsequent epochs.
This reduces the memory usage during each batch and allows the system to
process larger batches, even with limited GPU resources. In essence, gradient
caching allows the system to approximate the gradient of the large batch with
a series of smaller batches, without losing accuracy or incurring a significant
computational overhead. Gradient caching and gradient checkpointing are both
techniques used to reduce the memory requirements of deep learning models
during training. Gradient caching refers to storing intermediate computations
of gradients during the forward and backward passes of the model, rather than
recomputing them during each iteration. This can reduce the amount of mem-
ory required during training, but can also result in slower training times due to
the additional overhead of caching. Gradient checkpointing, on the other hand,
involves recomputing intermediate activations during the backward pass, rather
than storing them in memory. This can reduce the memory requirements of the
model even further, at the cost of additional computation time. In the context
of the coCondenser paper, the authors use a variant of gradient caching called
"recompute-aware gradient caching". This involves recomputing intermediate
activations during the backward pass, but also caching a subset of activations
in memory to reduce the overhead of recomputation. This allows the model to
train effectively on machines with limited GPU memory.
Storing intermediate computations of gradients during the forward and back-
ward passes of the model can reduce the amount of memory required during
training because it allows the model to perform backpropagation and compute
gradients in a more memory-efficient way. During backpropagation, the model
computes gradients for each parameter by multiplying the gradient of the loss
function with respect to the output of the layer with the gradient of the layer’s
output with respect to its input. These gradients can be quite large and storing
them all in memory can quickly become unfeasible, especially for large models
or when training on GPUs with limited memory. By caching intermediate com-
putations of gradients, the model can free up memory by discarding unnecessary
computations that would otherwise need to be stored. This can help reduce the
memory footprint of the model and make it possible to train larger models or
use larger batch sizes without running out of memory. Gradient checkpoint-
ing is one way to implement gradient caching, where intermediate activations
are recomputed during the backward pass rather than stored in memory. This
allows the model to use less memory during training at the cost of increased
computational time. The memory referred to in this context is the GPU mem-
ory required to store intermediate computations of gradients during the training
process. Storing these intermediate computations allows the model to compute
gradients for larger batch sizes without running out of memory on the GPU.
Recomputing intermediate activations during the backward pass can reduce the
memory requirements of the model even further because it allows the model to
discard the intermediate activations after they have been used in the backward
pass. This means that the memory used to store the intermediate activations
can be freed up and used for other purposes, such as storing activations for the
forward pass or computing gradients. Without recomputing the intermediate
activations during the backward pass, the model would need to store all the
intermediate activations for each layer until the gradients are computed for the
final layer. This can be very memory-intensive, especially for large models or
models with many layers. By recomputing the intermediate activations during
the backward pass, the model can avoid storing these intermediate activations,
thereby reducing the overall memory requirements.
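To make this trade-off concrete, the following is a minimal sketch of activation checkpointing with PyTorch's torch.utils.checkpoint, assuming a toy stack of feed-forward blocks; it is illustrative only and not the coCondenser training code.

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    # A small residual feed-forward block standing in for one Transformer layer.
    def __init__(self, dim: int = 768):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.ff(x)

class CheckpointedStack(nn.Module):
    def __init__(self, num_blocks: int = 12, dim: int = 768):
        super().__init__()
        self.blocks = nn.ModuleList(Block(dim) for _ in range(num_blocks))

    def forward(self, x):
        for block in self.blocks:
            # Activations inside each block are discarded after the forward pass
            # and recomputed during backward, trading extra compute for less GPU memory.
            x = checkpoint(block, x, use_reentrant=False)
        return x

model = CheckpointedStack()
x = torch.randn(8, 128, 768, requires_grad=True)
model(x).mean().backward()  # intermediate activations are recomputed here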
The pretraining of coCondenser is done in two stages:
• Universal Condenser pretraining: In this stage, a Condenser is pretrained
using the same data as BERT, i.e., English Wikipedia and the BookCor-
pus. The Condenser is initialized with the pre-trained 12-layer BERTbase
weights and its backbone layers are warm-started using an equal split of
6 early layers and 6 late layers. The pretraining objective is to learn
a general-purpose representation of language that can be fine-tuned on
different downstream tasks.
• Corpus aware coCondenser pretraining: In this stage, the pre-trained Con-
denser from stage one is taken and its backbone and head layers are used
to warm-start the pretraining on the target corpus, which can be either
Wikipedia or MS-MARCO web collection. The pretraining objective is to
learn a corpus-specific representation of language that can be fine-tuned on
specific downstream tasks related to the target corpus. The architecture
of the Condenser is kept unchanged in this stage.
During both stages of pretraining, the coCondenser model uses the contrastive
learning framework and NCE objective to learn the representations of spans
sampled from the input corpus. The loss function used for pretraining is a
combination of MLM loss and contrastive loss, with MLM loss being used to
predict the original spans and contrastive loss being used to distinguish them
from negative spans. The negative spans are sampled randomly from the same
document as the original spans. The pretraining process is done in batches,
with each batch consisting of multiple spans sampled from different documents.
The gradients are calculated for each batch using backpropagation and used to
update the model parameters. The pretraining process continues for a fixed
number of epochs until the model converges and achieves the desired level of
performance. After pretraining, the coCondenser model can be fine-tuned on
specific downstream tasks by adding a task-specific head layer and training the
entire model end-to-end using supervised learning.
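As a rough illustration of the combined objective described above, the sketch below adds an MLM term and an in-batch contrastive term over two spans per document; the tensor shapes, the temperature, and the in-batch negative scheme are our assumptions, not the authors' exact implementation.

import torch
import torch.nn.functional as F

def pretraining_loss(mlm_logits, mlm_labels, span_emb, temperature=1.0):
    # mlm_logits: (batch, seq_len, vocab); mlm_labels: (batch, seq_len) with -100 on unmasked tokens.
    mlm_loss = F.cross_entropy(mlm_logits.flatten(0, 1), mlm_labels.flatten(), ignore_index=-100)
    # span_emb: (batch, 2, dim) holds two spans sampled from each document.
    a = F.normalize(span_emb[:, 0], dim=-1)
    b = F.normalize(span_emb[:, 1], dim=-1)
    sim = a @ b.t() / temperature              # (batch, batch) span similarity matrix
    targets = torch.arange(a.size(0))          # the matching span comes from the same document
    contrastive_loss = F.cross_entropy(sim, targets)
    return mlm_loss + contrastive_loss

mlm_logits = torch.randn(4, 32, 1000)
mlm_labels = torch.full((4, 32), -100, dtype=torch.long)
mlm_labels[:, 5] = 7                           # pretend one masked position per sequence
span_emb = torch.randn(4, 2, 768)
print(pretraining_loss(mlm_logits, mlm_labels, span_emb))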
On top of this structure, a feed-forward neural network classifies whether two sentences are similar, either by applying a sigmoid or by using a similarity score such as cosine similarity.
Comparing each pair of sentences makes the cross-encoder unscalable, although it produces very accurate similarity scores (better than the Sentence BERT described next). We would have to run the cross-encoder inference 100K times if we wanted to perform a similarity search for a query across even a tiny dataset of 100K sentences. Because the input of a cross-encoder is a pair of sentences separated by a special token, it is hard to use this model to generate a semantic embedding vector for each document in advance, store the vectors in a database, and compute similarity scores only when needed. Alternatively, the original BERT can create semantic sentence embeddings by averaging the token embeddings of the last hidden state of the Transformer encoder. Another option is the [CLS] token, which stands in front of every sequence when training a Transformer encoder model and can serve as a representation of the whole sequence for comparison. But regardless of which method is used, the accuracy is poor and worse than using averaged GloVe embeddings.
Sentence-BERT, also known as SBERT, was developed as a remedy for this lack of an accurate model with respectable latency. On the standard semantic textual similarity (STS) tasks, SBERT performs better than the prior state-of-the-art (SOTA) models. SBERT provides sentence embeddings, which is a blessing for scalability because it eliminates the need for a full inference computation for every sentence-pair comparison. In 2019, Reimers and Gurevych presented evidence of the sharp acceleration: with BERT, it took 65 hours to identify the most similar pair of sentences among 10K sentences, whereas creating the embeddings with SBERT takes around 5 seconds and the cosine similarity comparison takes about 0.01 seconds.
Many more sentence transformer models have been developed since the SBERT paper using the ideas that were used to train the original SBERT. They have all been trained on numerous pairs of sentences, both similar and dissimilar. These models are optimized to create similar embeddings for related sentences and dissimilar embeddings in all other cases, using a loss function such as softmax loss, multiple negatives ranking loss, or MSE margin loss.
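As an example of this style of training, the snippet below fine-tunes a bi-encoder with the sentence-transformers library and its multiple negatives ranking loss; the backbone name and the two toy sentence pairs are placeholders, not our actual training data or configuration.

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("vinai/phobert-base")   # assumed backbone checkpoint
train_examples = [
    InputExample(texts=["a legal question", "its relevant article"]),
    InputExample(texts=["another question", "another relevant article"]),
]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=2)

# In-batch negatives: every other passage in the batch is treated as a negative.
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_loader, train_loss)], epochs=1, warmup_steps=10)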
In this project, the Sentence BERT backbone is the legal Condenser explained in Section 4.2.4, called CoLegalPhoBERT, instead of a traditional Transformer encoder model such as BERT or RoBERTa as in [37]. Our Sentence BERT uses coCondenser and is fine-tuned as a Siamese network trained with a contrastive learning approach instead of the triplet networks used in the original Sentence
BERT paper. At the end of pre-training, the authors discard the Condenser
head, which includes the final prediction layer, and keep only the backbone lay-
ers. As a result, the model reduces to its backbone, or effectively a Transformer
Encoder. The weights of the backbone layers are then used to initialize the
query encoder and passage encoder in the downstream task of retrieval. Specif-
ically, the query encoder (Eq. 33) and document encoder (Eq. 34) are each
initialized with the weights of the backbone layers that output the last layer's [CLS] representation. This initialization allows the retrieval model to benefit from the pre-trained knowledge captured by the coCondenser model.
[Figure 8: overview of the SBERT architecture. The question q and the document d are each encoded by a BERT tower followed by a pooling layer, producing vectors u and v that are compared by the objective function f(u, v).]
Figure 8 shows an overview of the SBERT architecture. The pooling layer used in this project is a mean pooling layer, and the objective function f(u, v) is the contrastive loss. Questions and documents are encoded into semantic embedding vectors by:
q = CoLegalPhoBERT([CLS; question; SEP]) (33)
d = CoLegalPhoBERT([CLS; document; SEP]) (34)
The question and the document are wrapped with the beginning special token [CLS] and the ending special token [SEP] and passed into CoLegalPhoBERT to create semantic embedding vectors. In training, f(u, v) in Figure 8 denotes the contrastive ranking loss $l_c$, which is calculated with both the positive document $d_{pos}$ and negative documents $d_{neg}$ for the given question in order to contrastively train CoLegalPhoBERT:
where $D_{pos}$ is the positive document collection for the given question q, and $l_c(q, d_{pos}, D_{neg})$ is the contrastive loss function, which can be referenced from [38]:
$$ l_c(q, d_{pos}, D_{neg}) = -\log \frac{e^{f(q,\, d_{pos})}}{e^{f(q,\, d_{pos})} + \sum_{d_{neg} \in D_{neg}} e^{f(q,\, d_{neg})}} \quad (36) $$
where $D_{neg}$ is the collection of negative documents for the given question q, sampled with sparse retrieval methods; in this project we use BM25+.
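A minimal PyTorch sketch of Eq. (36) is given below, assuming precomputed embeddings and cosine similarity as f(u, v); the helper name and tensor sizes are hypothetical.

import torch
import torch.nn.functional as F

def contrastive_loss(q_emb, pos_emb, neg_embs):
    # q_emb, pos_emb: (dim,); neg_embs: (num_neg, dim), e.g. negatives sampled with BM25+.
    f_pos = F.cosine_similarity(q_emb, pos_emb, dim=0)
    f_neg = F.cosine_similarity(q_emb.unsqueeze(0), neg_embs, dim=1)   # (num_neg,)
    logits = torch.cat([f_pos.unsqueeze(0), f_neg])
    # -log( exp(f_pos) / (exp(f_pos) + sum over negatives of exp(f_neg)) ), i.e. Eq. (36)
    return -F.log_softmax(logits, dim=0)[0]

q, d_pos, d_negs = torch.randn(768), torch.randn(768), torch.randn(7, 768)
print(contrastive_loss(q, d_pos, d_negs))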
in one of the given contexts. Predicting a start position and an end position is not the original idea of using a language model to create an answer and respond to a question like a human would. A model is better suited to generating a span of text in which each token is produced based on the previous ones until the answer is complete. Predicting the beginning and ending tokens of the supposed answer and then slicing out that span of text is not always appropriate. This is shown below:
The combined and pre-built Vietnamese character sets are created to support typing and displaying Vietnamese characters on computers and other devices. They include all the characters in the Vietnamese alphabet, together with tone marks and punctuation marks. However, users need to pay attention to using the correct characters and marks in Vietnamese to avoid confusion or misunderstanding of the meaning of words.
• For example, "hoàng" and "hòang" are two different words: "hoàng" is a name, while "hòang" has no meaning in Vietnamese.
[Figure: training loss plotted against the learning rate (1e-5 scale) and against the epoch.]
were used in the parameter setup. Therefore, throughout the description of the implementation method, we present only the SB-Condenser-300MB model in the main section.
PhoBERT Large: 10 epochs, batch size 32
In setting the parameters for fine-tuning the Phobert Large language model,
we chose Epochs as 10 for training, and set the batch size for both the evaluation
and training sets as 8. Additionally, we set a parameter to increase the batch size
of the model to 32 by setting the Gradient Accumulation parameter to 4 (8*4).
The reason we increased the batch size through Gradient Accumulation instead
of directly increasing it is that directly increasing the batch size would lead to
an increase in GPU memory usage. In this training session, we used two types of GPUs, a T4 from Google Colab and a P100 from Kaggle Notebook, each with around 13 GB of GPU memory. This means that we cannot increase the batch size further without encountering memory overflow issues. Therefore,
setting Gradient Accumulation ensures that the batch size can increase up to a
maximum of 32. Additionally, we also wanted to ensure that the batch size was
not too low, as having too low of a batch size would result in issues with the
loss function during training.
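The following is a minimal sketch of gradient accumulation with an effective batch size of 32 built from micro-batches of 8 (8 * 4); the tiny linear model and random data are placeholders for the actual language model and dataset.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(10, 2)                                  # stand-in for the language model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
data = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))
loader = DataLoader(data, batch_size=8)                   # micro-batch of 8

accumulation_steps = 4                                    # 8 * 4 = effective batch size 32
optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = nn.functional.cross_entropy(model(x), y) / accumulation_steps
    loss.backward()                                       # gradients accumulate across micro-batches
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                                  # one parameter update per 4 micro-batches
        optimizer.zero_grad()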
[Figure 11: seaborn pairwise plots of the training parameters epoch, learning rate (1e-5 scale), loss, and step.]
model with over 340 million parameters, setting the epoch parameter to 10 can save time during training while still ensuring results comparable to those obtained with a larger number of epochs. To clarify this, we can refer back to Figure 10b, where we observe a significant decrease in the loss, especially during the first few epochs, where it reaches around 0.5, followed by a slight decrease from 0.5 to almost 0.2 during the subsequent epochs. Overall, the loss decreases throughout the training process with slight fluctuations and a notable decrease during the third epoch.
With regard to Figure 11, we use the seaborn library to visualize four main parameters: step, learning rate, loss, and epoch. From the charts in the figure, we can gain a deeper insight into the correlation and variation of these parameters over time during training. Focusing mainly on the learning rate relative to the other parameters, the model can be evaluated as relatively good because the loss keeps decreasing without varying too much. Furthermore, by using dots to mark the position of the losses over time, we can see a slight variation from epoch 3 to epoch 4. There is also a bar chart comparing the loss against itself to confirm that it still decreases and to illustrate the variation of this parameter more visually. Overall, setting the parameters for this language model training, specifically the PhoBERT Large model with 300MB of clean legal data and the training parameters mentioned above, produced an objective training outcome and ensures the quality of the next training modules, namely Condenser, coCondenser, and Rounds 1 and 2 of the Sentence Transformer.
PhoBERT Large: 8 epochs, batch size 32
[Figure: training loss over epochs. (a) Overview of F1 scores; (b) overview of cosine similarity accuracy, F1, and recall.]
[Figure: training loss over epochs. (a) Overview of F1 scores; (b) overview of cosine similarity accuracy, F1, and recall.]
PhoBERT Large: 8 epochs, batch size 16
$$ \text{cosine\_similarity}(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} \quad (37) $$
The resulting cosine similarity value will range between -1 and 1, where -1
indicates that the two vectors are completely dissimilar, 0 indicates that the two
vectors are orthogonal (i.e. perpendicular), and 1 indicates that the two vectors
are identical.
Cosine similarity is a useful measure for comparing the similarity of two
vectors in a high-dimensional space, where other distance measures such as Eu-
clidean distance may not be as effective. It is also useful in applications such as
collaborative filtering, where it can be used to recommend items to users based
on their similarity to other users or items.
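A short NumPy sketch of Eq. (37) follows; the vectors are arbitrary examples.

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(a, 2 * a))    # 1.0: same direction
print(cosine_similarity(a, -a))       # -1.0: opposite direction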
imbalanced, meaning one class has significantly more samples than the other.
F1 score ranges from 0 to 1, with higher values indicating better performance.
Recall: Recall, also known as sensitivity, is a measure of how well the model
correctly identifies positive samples. It is calculated by dividing the number of
true positives by the sum of true positives and false negatives.
Average Precision (AP): Average Precision is a metric commonly used in
object detection and image segmentation tasks. It measures the area under the
precision-recall curve, which shows the trade-off between precision and recall at
different classification thresholds. AP ranges from 0 to 1, with higher values
indicating better performance.
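These metrics can be computed with scikit-learn as sketched below; the label and score arrays are made up for illustration.

from sklearn.metrics import f1_score, recall_score, average_precision_score

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]                      # hard predictions for F1 and recall
y_score = [0.9, 0.2, 0.4, 0.8, 0.3, 0.7]         # ranking scores for average precision

print(f1_score(y_true, y_pred))                  # F1 score
print(recall_score(y_true, y_pred))              # recall
print(average_precision_score(y_true, y_score))  # area under the precision-recall curve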
PhoBERT Large (Round 1): 10 epochs, batch size 32
PhoBERT Large (Round 2): 5 epochs, batch size 32
In terms of the Sentence Transformer, we divide training into two rounds. For the first
round, we set the training parameters as follows: max length is 256. This is be-
cause for the Phobert Base and Phobert Large language models, the maximum
length per line is 256. Therefore, setting the max length to 256 ensures that
each line meets the language model’s requirements without causing errors. We
also set the batch size parameter to 32 to ensure that the model’s learning pro-
cess runs optimally. We cannot increase the batch size to 64 or higher because
our available resources are limited to the A30 with 24GB of memory. Increasing
the batch size beyond this limit may cause memory overflow. While gradient
accumulation can be used to increase the batch size without changing memory
requirements, it may slow down the process, so we do not use it in this case. We trained the model for 10 epochs because we found
that it was suitable in terms of time while still ensuring the model’s quality.
Additionally, we consulted other papers as [4] with similar epoch numbers and
achieved good results.
For the second round, we inherited the results from the last epoch checkpoint
of the first round and continued to use the same training parameters, with a
max length of 256 and a batch size of 32. However, this time we set the number
of epochs to 5 because in some of our experiments and in other papers that use
the Condenser architecture, the number of epochs ranges from 5-10. However,
we found that using more than 5 epochs within this range did not result in
significant changes. After round 2, we could continue training in the subsequent
rounds, but training the model in these rounds only improves local results, while
[Figure 15: (a) density of F1 scores; (b) accuracy, F1 score, and recall of the cosine similarity method over epochs.]
overall there may be significant differences. To clarify this, we can observe the
results we recorded in rounds 1 and 2, and from there, gain insight for subsequent
rounds if we continue training.
Regarding Figure 15, consider first Figure 15a: the "Overview of F1 scores" chart, which evaluates three main methods (cosine similarity, Manhattan distance, and Euclidean distance), shows that the variation of F1 scores during training differs completely across methods; however, the density around the 0.7 level is always the highest. Next, in Figure 15b, we focus on the F1 score, recall, and accuracy of the cosine similarity evaluation method, as it gives the best results and represents the range of the other evaluation methods most clearly. Throughout training, accuracy always stays above 0.95, which can be considered a measure ensuring that the model performs relatively well.
[Figure: (a) density of F1 scores; (b) accuracy, F1 score, and recall of the cosine similarity method over epochs.]
SB-Condenser-300MB-Lite
Cosine-Similarity 96.9 63.2 76.3
Manhattan-Distance 96.6 59.2 69.4
Euclidean-Distance 96.6 59.5 73.8
SB-Condenser-300MB-Full
Cosine-Similarity 96.9 63.8 72.7
Manhattan-Distance 96.4 54.3 68.1
Euclidean-Distance 96.4 54.5 67.5
The result: with regard to the summary in Table 8, we introduce four versions that we tested, based on the Condenser architecture, to solve the problem of "Vietnamese Legal Text Retrieval".
• In the first version (SB-Condenser-100MB): we use 100MB of data from the Zalo Legal Text 2021 competition to pretrain the Masked Language Model. Although this is not much data, it helps the language model understand how words in the legal field are linked together. In the following rounds, Condenser and coCondenser, we continue to use this data so the model learns the context of each sentence, the links between legal terms, and the separate content of each term.
• In the second version (SB-Condenser-300MB-Lite): we added 200MB of data collected from reputable legal websites in Vietnam to supplement the initial 100MB. First, we used the 300MB of data to fine-tune the PhoBERT language model. Then, we used the checkpoint obtained from fine-tuning to train the subsequent rounds, with 100MB of data retained for training in each round.
• In the third version (SB-Condenser-300MB-Full): we used the same 300MB of data as in the second version, but this time we trained all rounds with the 300MB, including Masked Language Model pretraining, Condenser pretraining, and coCondenser pretraining. For the final round, Sentence Transformer, we reused the 100MB because the task was to combine sparse retrieval and dense retrieval on the domain dataset to answer questions.
• In the fourth version (SB-Condenser-3GB): we added 2.9GB of data collected from reputable legal websites in Vietnam to supplement the initial 100MB. First, we used the 3GB of data to fine-tune the PhoBERT language model. Then, we used the checkpoint obtained from fine-tuning to train the subsequent rounds, with 100MB of data retained for training in each round. This training method is similar to the second version, which can give optimal training time while still ensuring results similar to using all 3GB for training in all rounds.
[Table: confusion matrix layout with predicted class (positive, negative) versus actual class.]
Precision: the ratio of true positives to the sum of true positives and false positives:
$$ \text{precision} = \frac{TP}{TP + FP} \quad (42) $$
Recall: the ratio of true positives to the sum of true positives and false negatives:
$$ \text{recall} = \frac{TP}{TP + FN} \quad (43) $$
SB-Condenser-300MB-Lite   F2   0.699
SB-Condenser-300MB-Full   F2   0.649
SB-Condenser-3GB          F2   0.723
In terms of Table 10, we evaluated our trial versions using the F2 score, which weights recall more heavily; in the field of law, recall plays a crucial role because it indicates the proportion of relevant articles that are actually retrieved. Our highest F2 score
was achieved with the SB-Condenser-3GB version, with a score of 0.723. The
Lite version trained on 300MB achieved a score of 0.699, while the Full version
achieved a score of 0.649.
For inference, as shown in Figure 16, the query is sent to the BM25+ module to calculate a BM25+ score with every document in the database (sparse retrieval). At the same time, the query is passed to the SentenceBERT model to compute a cosine similarity score with every document in the database (dense retrieval). Documents are ranked by the combined score BM25+_score * cosine_similarity, and those with the highest scores are chosen. Although the number of returned documents is fixed, there is an additional constraint: a relevant document's score must lie in the range from max_score - 2.6 to max_score. Finally, the documents and the query are given to the Question Answering model to extract the key point that the query asks about, avoiding redundant information in the documents returned by the retrieval step.
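A minimal sketch of this scoring rule is shown below; the score lists are placeholders for the real BM25+ and SentenceBERT outputs, and the margin of 2.6 follows the constraint described above.

def rank_documents(bm25_scores, cosine_scores, margin=2.6):
    # Combined score: BM25+ score multiplied by dense cosine similarity.
    combined = [b * c for b, c in zip(bm25_scores, cosine_scores)]
    max_score = max(combined)
    # Keep only documents whose combined score lies within [max_score - margin, max_score].
    return [i for i, s in enumerate(combined) if s >= max_score - margin]

bm25_scores = [12.3, 9.8, 15.1, 4.2]       # assumed sparse scores per document
cosine_scores = [0.82, 0.91, 0.88, 0.40]   # assumed dense scores per document
print(rank_documents(bm25_scores, cosine_scores))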
[Figure 16: inference pipeline. The query flows through the BM25+ module and the SentenceBERT model against the legal database, and the retrieved documents are passed with the query to the QA model to produce the key answer.]
$$ F_\beta = \frac{(1 + \beta^2) \cdot \text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}} \quad (44) $$
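For reference, here is a short sketch of Eq. (44) with beta = 2, which weights recall more heavily than precision; the TP, FP, and FN counts are invented for illustration.

def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

tp, fp, fn = 40, 10, 15                        # hypothetical counts
precision = tp / (tp + fp)                     # Eq. (42)
recall = tp / (tp + fn)                        # Eq. (43)
print(f_beta(precision, recall, beta=2.0))     # F2 score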
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used for the automatic evaluation of text summarization and
machine translation. It measures the similarity of the generated text with one or
more reference summaries or translations, using several scores such as ROUGE-
1, ROUGE-2, and ROUGE-L. These scores represent the n-gram overlap of the
generated text with the reference text at the unigram, bigram, and longest com-
mon subsequence levels. ROUGE metrics are widely used in natural language
processing and text summarization research to evaluate the quality of generated
summaries or translations.
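A minimal sketch of the longest-common-subsequence computation behind ROUGE-L follows; whitespace tokenization and the simple F1 combination are simplifications of the full metric.

def lcs_length(a, b):
    # Classic dynamic-programming LCS over token lists.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(prediction: str, reference: str) -> float:
    pred, ref = prediction.split(), reference.split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(pred), lcs / len(ref)
    return 2 * p * r / (p + r)

print(rouge_l_f1("miễn trách nhiệm hình sự", "hình sự"))  # example prediction vs. label from Table 12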
Vqa-ViT5   F1        0.646
Vqa-ViT5   EM        0.41
Vqa-ViT5   ROUGE-L   0.66
Dataset: In order to address the problem of limited training data for legal
question answering, the authors trained their legal Phobert model with UIT-
ViQuAD [39], a collection of 23,000 questions and answers created by humans
using passages from 174 Vietnamese Wikipedia entries. By doing this, the
extractive question answering model can first learn reading comprehension skills
before performing inference on the 520 legal questions and 1377 articles from the Automated Legal Question Answering Competition (ALQAC 2022). After checking the 520 questions, we found 9 questions whose answers are not pieces of information that can be extracted from the corresponding articles. Because we trained our Vqa-ViT5 to perform an extractive question answering task, we discarded these unsuitable questions and retained 511 samples.
The results: We used three evaluation methods, namely F1 Score, Exact
Match, and ROUGE, to evaluate the performance of our Vqa-ViT5 experiment,
which was trained on the dataset UIT-ViQuAD [39] with data spanning various
domains. For evaluation, we used 511 pairs of legal questions and answers from
the ALQAC-2022 competition, in contrast to the commonly used approach of
training on legal data and using a small set of validation data to evaluate results.
Table 11 shows that our Vqa-ViT5 model, trained on the comprehensive dataset and evaluated on all 511 questions, achieved an F1 score of 0.646. The more accuracy-demanding evaluation method, Exact Match (EM), yielded a score of 0.41. [3] reports a result of about 90% on ALQAC-2022; however, they mention that their F1 score is measured on a dev set obtained by randomly picking 15% of the official dataset, meaning their validation set (about 78 samples) is much smaller than ours (511 samples). Moreover, our Vqa-ViT5 is not trained on the ALQAC-2022 dataset at all, yet it still gives an acceptable result. Table 12 shows some examples from Vqa-ViT5. For the first question, the model predicts the answer correctly at the token level, so its EM is 1.0. For the second question, our model predicts a span of text that includes the actual label; although the prediction is not incorrect, the EM metric still returns 0.0, and most inference results fall into this situation, which is why we consider the results acceptable even though the EM score is not very high. Therefore, in Table 12 we also compute the ROUGE-L score, in order not to discard acceptable results the way EM does: ROUGE-L measures the similarity between a machine-generated answer and a reference answer by computing the longest common subsequence (LCS) between them, i.e., the longest sequence of words that appears in both. However, the ROUGE-L score is only an approximate indicator that some semantically correct predicted answers are discarded by EM; the ROUGE-L metric is not commonly used for the extractive question answering task because of the way answers are represented.
question: Người đã nhận làm gián điệp, nhưng không thực hiện
nhiệm vụ được giao và tự thú, thành khẩn khai báo
với cơ quan nhà nước có thẩm quyền, thì được miễn
trách nhiệm gì về tội gián điệp? (A person who has
accepted to act as a spy, but fails to perform the
assigned tasks and confesses and honestly declares
to the competent state agency, shall be exempt from
any responsibility for espionage charges?)
context: Tội gián điệp 1. Người nào có một trong các hành
vi... a) Hoạt động tình báo, ... thành khẩn khai báo
với cơ quan nhà nước có thẩm quyền, thì được miễn
trách nhiệm hình sự về tội này. (Crime of espionage
1. Any person who commits one of the acts... a)
Intelligence activities, ... sincerely declares to the
competent state agency, shall be exempt from criminal responsibility for this crime.)
prediction: miễn trách nhiệm hình sự (exempt from criminal li-
ability)
label: hình sự (Criminal)
6 DISCUSSIONS
The general topic of our thesis is "Vietnamese Legal Text Retrieval," where our
model aims to provide users with the most relevant and appropriate laws in
response to their query. However, to improve the answering capability of our
model, we have also developed a "Question Answering" task that extracts laws
related to the user’s query and presents a concise answer based on those laws.
This is to provide users with the most accurate and succinct answer possible,
tailored to their specific question.
The results of our work over the past 14 weeks will be presented at the
ISICO 2023 conference in July in Indonesia. Besides, we will present the three
main parts of our work, which are also documented in two papers: preprocessing
of Vietnamese legal texts, which we collected from two websites, "lawnet.vn"
and "vbpl.vn," with nearly 145,000 documents; training the Condenser archi-
tecture to achieve optimal results on large datasets; and using the ViT5 model
trained on multidisciplinary datasets to achieve top results when tested on legal
datasets.
7 CONCLUSIONS
Along with the experiments on the approaches, architectures, and proposals that we have built and evaluated over the past 14 weeks, we have achieved certain
results on two tasks, "Vietnamese Legal Text Retrieval" and "Question Answer-
ing," equivalent to the top results on these tasks in Vietnam in 2022. From that,
we have gained a better understanding of the opportunities and challenges in
developing projects in practice, such as the need to collect and process legal data quickly and in a timely manner, and the relatively high cost and time required to train complete models for both tasks. However, this topic still holds a lot of potential for practical applications, from individuals researching the law to protect their legitimate rights to foreign businesses wishing to invest in Vietnam but currently facing legal obstacles.
In the future, we will continue to develop this project with more accurate
data processing, as well as integrating new algorithms to further enhance the
effectiveness. Perhaps releasing an official version for user experience evaluation
will help us collect more accurate evaluation results and provide us with a more
comprehensive overview of the applicability of this topic in Vietnam today. In
addition, we will also consider research directions to ensure that the results of
the model are improved and continue to produce papers at conferences on Nat-
ural Language Processing in the future.
8 REFERENCES
References
[1] Luyu Gao and Jamie Callan. Condenser: a pre-training architecture for
dense retrieval. arXiv preprint arXiv:2104.08253, 2021.
[2] Dat Quoc Nguyen and Anh Tuan Nguyen. Phobert: Pre-trained language
models for vietnamese. arXiv preprint arXiv:2003.00744, 2020.
[3] Hieu Nguyen Van, Dat Nguyen, Phuong Minh Nguyen, and Minh
Le Nguyen. Miko team: Deep learning approach for legal question answer-
ing in alqac 2022. In 2022 14th International Conference on Knowledge
and Systems Engineering (KSE), pages 1–5. IEEE, 2022.
[4] Nhat-Minh Pham, Ha-Thanh Nguyen, and Trong-Hop Do. Multi-
stage information retrieval for vietnamese legal texts. arXiv preprint
arXiv:2209.14494, 2022.
[5] Phi Manh Kien, Ha-Thanh Nguyen, Ngo Xuan Bach, Vu Tran, Minh
Le Nguyen, and Tu Minh Phuong. Answering legal questions by learning
neural attentive text representation. In Proceedings of the 28th Interna-
tional Conference on Computational Linguistics, pages 988–998, 2020.
[6] Chieu-Nguyen Chau, Truong-Son Nguyen, and Le-Minh Nguyen. Vnlaw-
bert: A vietnamese legal answer selection approach using bert language
model. In 2020 7th NAFOSTED Conference on Information and Com-
puter Science (NICS), pages 298–301, 2020.
[7] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi
Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov.
Roberta: A robustly optimized bert pretraining approach. arXiv preprint
arXiv:1907.11692, 2019.
[8] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you
need, 2017.
[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert:
Pre-training of deep bidirectional transformers for language understanding.
arXiv preprint arXiv:1810.04805, 2018.
[10] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya
Sutskever, et al. Language models are unsupervised multitask learners.
OpenAI blog, 1(8):9, 2019.
[11] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang,
Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits
of transfer learning with a unified text-to-text transformer, 2020.
[12] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Ka-
plan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry,
Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger,
Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey
Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz
Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam
McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language
models are few-shot learners, 2020.
[13] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang.
Squad: 100,000+ questions for machine comprehension of text, 2016.
[21] Ruiyang Ren, Shangwen Lv, Yingqi Qu, Jing Liu, Wayne Xin Zhao, Qiao-
Qiao She, Hua Wu, Haifeng Wang, and Ji-Rong Wen. PAIR: Leveraging
passage-centric similarity relation for improving dense passage retrieval. In
Findings of the Association for Computational Linguistics: ACL-IJCNLP
2021. Association for Computational Linguistics, 2021.
[22] Yingqi Qu, Yuchen Ding, Jing Liu, Kai Liu, Ruiyang Ren, Wayne Xin
Zhao, Daxiang Dong, Hua Wu, and Haifeng Wang. Rocketqa: An opti-
mized training approach to dense passage retrieval for open-domain ques-
tion answering. arXiv preprint arXiv:2010.08191, 2020.
[23] Nicholas Monath, Manzil Zaheer, Kelsey Allen, and Andrew McCallum.
Improving dual-encoder training through dynamic indexes for negative min-
ing, 2023.
[24] Xuan Fu, Jiangnan Du, Hai-Tao Zheng, Jianfeng Li, Cuiqin Hou, Qiyu
Zhou, and Hong-Gee Kim. Ss-bert: A semantic information selecting ap-
proach for open-domain question answering. Electronics, 12(7):1692, 2023.
[25] Adam Roberts, Colin Raffel, and Noam Shazeer. How much knowledge can
you pack into the parameters of a language model?, 2020.
[26] Gautier Izacard and Edouard Grave. Leveraging passage retrieval with
generative models for open domain question answering, 2021.
[27] Peng Xu, Davis Liang, Zhiheng Huang, and Bing Xiang. Attention-guided
generative models for extractive question answering, 2021.
[28] Thanh Vu, Dat Quoc Nguyen, Dai Quoc Nguyen, Mark Dras, and Mark
Johnson. VnCoreNLP: A Vietnamese natural language processing toolkit.
In Proceedings of the 2018 Conference of the North American Chapter of
the Association for Computational Linguistics: Demonstrations, pages 56–
60, New Orleans, Louisiana, June 2018. Association for Computational
Linguistics.
[29] Dat Quoc Nguyen, Dai Quoc Nguyen, Thanh Vu, Mark Dras, and Mark
Johnson. A fast and accurate vietnamese word segmenter. In Nicoletta Cal-
zolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi,
Kôiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène
Mazo, Asunción Moreno, Jan Odijk, Stelios Piperidis, and Takenobu Toku-
naga, editors, Proceedings of the Eleventh International Conference on Lan-
guage Resources and Evaluation, LREC 2018, Miyazaki, Japan, May 7-12,
2018. European Language Resources Association (ELRA), 2018.
[30] Dat Quoc Nguyen, Thanh Vu, Dai Quoc Nguyen, Mark Dras, and Mark
Johnson. From word segmentation to POS tagging for Vietnamese. In
Proceedings of the Australasian Language Technology Association Workshop
2017, pages 108–113, Brisbane, Australia, December 2017.
[31] Feng Wang and Huaping Liu. Understanding the behaviour of contrastive
loss. CoRR, abs/2012.09740, 2020.
[32] Nhat-Minh Pham, Ha-Thanh Nguyen, and Trong-Hop Do. Multi-stage
information retrieval for vietnamese legal texts, 2022.
[33] Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Man-
ning. What does bert look at? an analysis of bert’s attention, 2019.
[34] Yingqi Qu, Yuchen Ding, Jing Liu, Kai Liu, Ruiyang Ren, Wayne Xin
Zhao, Daxiang Dong, Hua Wu, and Haifeng Wang. Rocketqa: An opti-
mized training approach to dense passage retrieval for open-domain ques-
tion answering, 2021.
[35] Luyu Gao and Jamie Callan. Unsupervised corpus aware language model
pre-training for dense passage retrieval. arXiv preprint arXiv:2108.05540,
2021.
[36] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton.
A simple framework for contrastive learning of visual representations, 2020.
[37] Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings
using siamese bert-networks, 2019.
[38] Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul Ben-
nett, Junaid Ahmed, and Arnold Overwijk. Approximate nearest neighbor
negative contrastive learning for dense text retrieval, 2020.
[39] Kiet Van Nguyen, Duc-Vu Nguyen, Anh Gia-Tuan Nguyen, and Ngan Luu-
Thuy Nguyen. A vietnamese dataset for evaluating machine reading com-
prehension, 2020.
9 APPENDIX
Appendix. Source code & Dataset
Source code: https://drive.google.com/drive/folders/13MKB2i29prZ8KN-Kv5dNsnvb8v2QpBEO?usp=sharing
Dataset: https://drive.google.com/drive/folders/1i08yDyb_Z-BoppN3VMk7Tham1rs_rlqe?usp=sharing