License: arXiv.org perpetual non-exclusive license
arXiv:2402.13028v1 [cs.CL] 20 Feb 2024

Heterogeneous Graph Reasoning for Fact Checking over Texts and Tables

Haisong Gong1,2, Weizhi Xu3, Shu Wu1,2, Qiang Liu1,2, Liang Wang1,2 Corresponding Author
Abstract

Fact checking aims to predict claim veracity by reasoning over multiple evidence pieces. It usually involves evidence retrieval and veracity reasoning. In this paper, we focus on the latter, reasoning over unstructured text and structured table information. Previous works have primarily relied on fine-tuning pretrained language models or training homogeneous-graph-based models. Despite their effectiveness, we argue that they fail to explore the rich semantic information underlying the evidence with different structures. To address this, we propose a novel word-level Heterogeneous-graph-based model for Fact Checking over unstructured and structured information, namely HeterFC. Our approach leverages a heterogeneous evidence graph, with words as nodes and thoughtfully designed edges representing different evidence properties. We perform information propagation via a relational graph neural network, facilitating interactions between claims and evidence. An attention-based method is utilized to integrate information, combined with a language model for generating predictions. We introduce a multitask loss function to account for potential inaccuracies in evidence retrieval. Comprehensive experiments on the large fact checking dataset FEVEROUS demonstrate the effectiveness of HeterFC. Code will be released at: https://github.com/Deno-V/HeterFC.

Introduction

Fact checking, or fact verification, predicts claim veracity using evidence. This task has practical applications in various domains like politics (Liu et al. 2023; Xu et al. 2022), news media (Zellers et al. 2019), public health (Naeem and Bhatti 2020; Krause et al. 2020), and science (Wright et al. 2022; Wadden et al. 2020), attracting extensive research. Prior efforts predominantly address unstructured fact checking, handling evidence and claims as plain text (Thorne et al. 2018; Wang 2017; Xu et al. 2023). However, real-world contexts often involve structured data like tables, creating a pressing need for fact checking across unstructured and structured information.

Fact checking mainly involves evidence retrieval and veracity reasoning, which are two independent tasks (Aly et al. 2021). Evidence retrieval aims to retrieve as much claim-related evidence (golden evidence) as possible; veracity reasoning aims to precisely predict the veracity of the claim based on the evidence retrieved in the former stage. In this paper, we mainly focus on the design of the veracity reasoning model.

Previous veracity reasoning models (Aly et al. 2021; Kotonya et al. 2021; Funkquist 2021; Hu et al. 2022; Bouziane et al. 2021; Zhao et al. 2020; Wu et al. 2022) have primarily relied on fine-tuning pretrained language models or training homogeneous-graph-based models. In the fine-tuning approach, tables are first transformed into sentences via heuristic linearization rules. Then, a pretrained language model (PLM), such as RoBERTa (Liu et al. 2019), is fine-tuned with the concatenation of all pieces of evidence as its input. In the homogeneous-graph-based approach, a homogeneous fully-connected evidence graph is constructed in which each node represents a piece of evidence. After that, a graph neural network (GNN) is utilized to propagate neighborhood information, which enables the semantic representations of different pieces of evidence to be aggregated.

While effective, existing approaches exhibit two key weaknesses. Firstly, fact checking necessitates capturing semantics among various evidence pieces, demanding intricate modeling of evidence relationships. Transformer-based methods often fall short as they merely concatenate evidence or deal with point-wise claim-evidence pairs, insufficiently exploring complex evidence interconnections. Secondly, prevalent graph-based methods construct sentence-level graphs with claim-evidence pairs as nodes, employing Pre-trained Language Models (PLMs) for node representations (Zhou et al. 2019; Liu et al. 2020; Kotonya et al. 2021). Although these models perform well in conventional fact checking, they falter in scenarios involving both structured and unstructured information. This is due to the limitations of sentence-level graphs in capturing fine-grained details such as entities and time phrases. Furthermore, assuming uniform relationships between node pairs overlooks the diverse properties inherent in table and sentence evidence.

To tackle the aforementioned problems, we propose a novel word-level Heterogeneous-graph-based model for Fact Checking over unstructured and structured information, HeterFC for brevity. Specifically, we first construct a graph whose nodes represent the words in all pieces of evidence, thereby achieving word-level granularity. Then, to capture the different relationships in structured and unstructured information, we design three kinds of connections on the graph, namely intra-sentence edges, intra-table edges, and inter-evidence edges. In detail, intra-sentence edges and intra-table edges are added between each word and its local contextual words within a fixed-size sliding window. Inter-evidence edges connect occurrences of the same keyword in different pieces of evidence, which allows important information to be aggregated across the evidence via these edges. We employ the relational graph convolutional network (R-GCN) to perform neighborhood propagation and read out the representation of each piece of evidence, followed by an attention mechanism to obtain information from all pieces of evidence. Combined with a language model, we get the final veracity prediction. To train the model, in addition to the cross-entropy loss for claim veracity, we propose a multitask loss to help the model discern between valid and invalid evidence, thereby enhancing its overall performance.

In a nutshell, our main contributions can be listed as follows,

  • We identify the inapplicability of previous homogeneous-graph-based methods, which were developed for traditional unstructured fact checking, to fact checking over both structured and unstructured information, and analyze the possible underlying reasons.

  • We propose a novel word-level heterogeneous-graph-based model, namely HeterFC, which is specially designed for fact checking over unstructured and structured information.

  • Extensive experiments on the large-scale FEVEROUS fact-checking dataset, which includes both structured and unstructured information, have demonstrated the effectiveness of our proposed model over several baselines.

Related Work

Fact Checking Over Unstructured Information

Fact checking, a form of natural language inference (NLI), involves predicting claim veracity by reasoning over multiple evidence pieces. Existing methods fall into two categories. The first category uses pretrained language models (PLMs), fine-tuning them for fact checking. They organize the input by concatenating all evidence and claim into a single sentence (Aly et al. 2021), or processing each evidence separately with the claim using aggregation techniques (Soleimani, Monz, and Worring 2020; Gi, Fang, and Tsai 2021). The second category employs graph neural networks (GNNs) to capture complex semantic interactions. GEAR (Zhou et al. 2019) constructs sentence-level fully-connected evidence graphs with GNNs. Zhao et al. (2020) use Transformer-XH for graph representation, while Liu et al. (2020) introduce KGAT with node and edge kernels. DREAM (Zhong et al. 2020) incorporates semantic role labeling for fine-grained semantic graphs. Chen et al. (2022) propose EvidenceNet with symmetrical interaction attention and gating on sentence-level evidence graphs.

Fact Checking Over Mixed-type Information

Unlike unstructured fact checking, fact checking over both structured and unstructured information requires handling a combination of structured tables and unstructured text. Existing methods include table linearization, where tables are converted into text, potentially losing structural information (Gi, Fang, and Tsai 2021; Kotonya et al. 2021; Malon 2021). Another approach is to combine sentence and table evidence using models like TAPAS (Funkquist 2021; Bouziane et al. 2021; Hu et al. 2022). In graph-based methods, previous works focused on homogeneous sentence-level evidence graphs (Kotonya et al. 2021). In contrast, our approach introduces word-level nodes and heterogeneous relations, making it more suitable for fact checking over structured and unstructured information.

Heterogeneous Graph Neural Networks

Heterogeneous graph neural networks (heterGNNs) are specialized GNNs designed for neighborhood propagation on graphs with different types of edges. R-GCN (Schlichtkrull et al. 2018) is a representative heterGNN that assigns trainable weight matrices to each relation. HeterGNNs have been successfully applied in various domains including recommender systems (Fan et al. 2019; Zhao et al. 2017; Yan et al. 2021) and question answering (Yu et al. 2019; Sun et al. 2018).

Method

In this section, we introduce the proposed method HeterFC in detail. The overall framework is shown in Figure 1.

Figure 1: Architecture of HeterFC. Inputs are a claim $c$ and an evidence set $\mathcal{E}$. There are five main parts: 1) Word-level evidence graph construction. Initial node embeddings are obtained by using PLM and subword mean pooling. Three types of edges are designed for the heterogeneous connection. 2) Heterogeneous information propagation. R-GCN is used to perform neighborhood aggregation on the word-level evidence graph. 3) Evidence-level representation readout. Evidence representations are obtained by pooling over subgraphs corresponding to each piece of evidence. 4) Attention-based claim-evidence interaction. Graph representation $\mathbf{o}_g$ is generated by claim-guided evidence combination. A supervised loss item, $Loss_e$, is computed based on the attention assignment. 5) Fused veracity prediction. The claim and evidence are concatenated and fed into PLM to obtain $\mathbf{o}_t$, which, when combined with the graph representation $\mathbf{o}_g$, forms the final representation. A fully-connected network takes the representation as input and generates the prediction $\hat{\mathbf{p}}$. HeterFC is trained using classification loss $Loss_c$ and assisted $Loss_e$.

Task Formulation

The aim of multi-structured fact checking is to predict the veracity of a claim according to several pieces of evidence, which contain both tables and texts. Mathematically, given a claim $c$ and a retrieved evidence set $\mathcal{E}=\{e_1, e_2, \ldots, e_M\}$, where each piece of evidence $e_i$ represents either a sentence or a table cell, we need to propose a model $\hat{\mathbf{p}}=f(c,\mathcal{E})$ to output the predicted veracity $\hat{\mathbf{p}}$.

Word-level Evidence Graph Construction

In this part, we elaborate on the design of nodes and edges on the evidence graph.

Node Construction

We treat each word in the evidence as a node on the evidence graph, since it contains more fine-grained semantic information than the sentence. (Footnote 1: We have tried token-level node construction, which is more fine-grained than word-level nodes. However, it is less effective; more analysis can be found in the experimental section.) To achieve this, we employ a PLM to process each claim-evidence pair to build the initial node representations.

Sentence evidence can be fed directly into a PLM. Table evidence, however, has a distinct structure from sentences and cannot be processed by a PLM directly. To address this, we adopt the cell linearization proposed by Hu et al. (2022). Specifically, each piece of table evidence is transformed into either “<column header> for <row header> is <cell value>” if it belongs to a general table or “<column header> :<row header> of <title> is <cell value>” if it comes from an infobox-type table.
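The two linearization templates can be sketched as a small helper. This is only an illustration: the function name `linearize_cell`, the use of an optional `title` argument to select the infobox template, and the example values are assumptions, not part of the original work or of Hu et al. (2022).

```python
from typing import Optional


def linearize_cell(
    column_header: str,
    row_header: str,
    cell_value: str,
    title: Optional[str] = None,
) -> str:
    """Turn one table cell into a sentence-like string.

    When `title` is given, the cell is treated as coming from an
    infobox-type table; otherwise the general-table template is used.
    Using `title` as the switch is an assumption made for this sketch.
    """
    if title is None:
        # General table: "<column header> for <row header> is <cell value>"
        return f"{column_header} for {row_header} is {cell_value}"
    # Infobox table; spacing around ':' follows the template as printed above.
    return f"{column_header} :{row_header} of {title} is {cell_value}"


# Made-up example cells, purely for illustration.
print(linearize_cell("Population", "Paris", "2,100,000"))
print(linearize_cell("Born", "Date", "1950", title="Jane Doe"))
```

After this step, every table cell is a plain string and can be paired with the claim in the same way as sentence evidence.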

By applying the cell linearization technique to table evidence, we are able to feed each claim-evidence pair to a PLM and obtain the embedding of each token (subword) in both table evidence and sentence evidence. Note that although the claim-evidence pair is fed into the PLM, the subwords in the claim are discarded and only the subwords in the evidence are kept. This can be expressed mathematically as follows:

\begin{align}
\mathbf{S}^{0}_{i} &= \text{PLM}([c, e_i]) \tag{1} \\
\mathbf{S}^{0}_{i} &:= \mathbf{s}^{0}_{i0} \,\|\, \mathbf{s}^{0}_{i1} \,\|\, \ldots \,\|\, \mathbf{s}^{0}_{ij} \tag{2}
\end{align}

where PLM is a RoBERTa model here, following previous works (Aly et al. 2021). $c$ denotes the claim and $e_i$ is the $i$-th evidence. $\mathbf{s}^{0}_{ij}$ represents the embedding of the $j$-th subword in the $i$-th evidence and $\|$ is the concatenation of vectors.

Then, we generate the embedding of a whole word by computing the mean of the embeddings of its corresponding subwords. In this way, we obtain the word embedding matrix $\mathbf{H}^{0}$ for all pieces of evidence, where $\mathbf{H}^{0}$ serves as the initial node representations since we treat each word in the evidence as a node.
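The subword-to-word mean pooling can be illustrated with NumPy. This is a minimal sketch under stated assumptions: the function name `pool_subwords_to_words` and the `word_ids` alignment list (one word index per subword, as a tokenizer might provide) are hypothetical, and the embeddings are toy values rather than PLM outputs.

```python
from typing import Sequence

import numpy as np


def pool_subwords_to_words(subword_emb: np.ndarray, word_ids: Sequence[int]) -> np.ndarray:
    """Average the embeddings of the subwords that belong to the same word.

    subword_emb: array of shape (num_subwords, d).
    word_ids:    word_ids[k] is the index of the word that subword k belongs to;
                 every index from 0 to num_words - 1 is assumed to occur.
    Returns an array of shape (num_words, d).
    """
    ids = np.asarray(word_ids)
    num_words = int(ids.max()) + 1
    sums = np.zeros((num_words, subword_emb.shape[1]))
    np.add.at(sums, ids, subword_emb)  # accumulate subword vectors per word
    counts = np.bincount(ids, minlength=num_words)[:, None]
    return sums / counts


# Three subwords: the first two form word 0, the third is word 1.
toy = np.array([[1.0, 1.0], [3.0, 3.0], [5.0, 7.0]])
print(pool_subwords_to_words(toy, [0, 0, 1]))
```

Stacking the pooled rows for all evidence pieces gives one row per graph node.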

Edge Construction

After constructing word-level nodes, the next step is to design the connections among them. The simplest way is to construct a fully-connected graph, where each node shares an edge with every other node on the graph. However, this approach may introduce too much noise, since only part of the information is related to and beneficial for a given node. In particular, the input evidence inevitably contains some unrelated information due to errors of the retrieval model. Therefore, a more elaborate design is required in this task.

We carefully design three types of edges to capture the heterogeneous information among several pieces of evidence. Specifically, the three types of edges are named inter-evidence edges, intra-sentence edges and intra-table edges. These edges are illustrated in Figure 1, where each type of edge is drawn in a different color, and we introduce them in detail as follows.

Inter-evidence edges $r_e$.   Aggregating relevant information from multiple pieces of evidence is crucial for accurate claim veracity prediction. In fact verification scenarios involving both texts and tables, it is common for the same entity to be referenced in different types of evidence. Thus, integrating information from both sources is necessary for a comprehensive understanding. To capture the multi-hop relationship between evidence, we construct edges connecting the same word in different pieces of evidence. By propagating information along these edges, we can capture the flow of information between related evidence. To ensure the quality of inter-evidence edges, we filter out stop words like “is,” “of,” and “the” to prevent constructing edges between frequently used but insignificant words.

Intra-sentence edges $r_s$.   A word in the sentence is usually associated with its local context for understanding the semantics (Mikolov et al. 2013). Therefore, we adopt this traditional technique and employ a sliding window with a fixed size $w$ to cover the local context. In this way, each word at the center of the window has edges to the rest of the words in the window, through which each word is connected with its context on the graph. Thus, the contextual information can be aggregated via a one-layer GNN, which is beneficial for learning the sentence-level semantics.

Intra-table edges $r_t$.   Tables have a completely different structure compared with sentences. The header and the cell form a key-value relationship. This structure is distinct from that of a sentence, which contains many stop words (is, the, etc.) to ensure fluency. Based on the analysis above, we assign edges among the cell, its row header, its column header and its page title. To achieve this, we reuse the cell linearization method mentioned in the node construction section to transform each table cell into a sequence and apply the fixed-size window again to connect words, just as when building intra-sentence edges.

Eventually, we construct a word-level evidence graph $\mathcal{G}$ via the aforementioned design, which involves three different relations $\mathcal{R}=\{r_s, r_t, r_e\}$. Next, we introduce the main model architecture.
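The three edge types can be sketched in plain Python. Several details here are assumptions made for illustration only: the function name `build_edges`, the interpretation of a window of size $w$ as linking each word to words at most $w // 2$ positions away, case-insensitive word matching, and the three-word stop list (taken from the examples in the text, not from the authors' actual list).

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

STOP_WORDS = {"is", "of", "the"}  # tiny illustrative list only


def build_edges(
    evidence: List[List[str]],
    is_table: List[bool],
    window: int = 3,
) -> Dict[str, Set[Tuple[int, int]]]:
    """Return undirected edges (sorted node-index pairs) for each relation.

    evidence: tokenised evidence pieces (table cells already linearised).
    is_table: is_table[m] is True when evidence m comes from a table.
    window:   sliding-window size w; a word is linked to words at most
              window // 2 positions away (assumed reading of the window).
    """
    edges: Dict[str, Set[Tuple[int, int]]] = {"r_s": set(), "r_t": set(), "r_e": set()}
    occurrences = defaultdict(list)  # lowercased word -> [(evidence idx, node idx)]
    offset = 0  # global node index of the first word of the current evidence
    half = window // 2
    for m, words in enumerate(evidence):
        relation = "r_t" if is_table[m] else "r_s"
        for pos, word in enumerate(words):
            node = offset + pos
            # Link to right-hand neighbours inside the window; left-hand
            # neighbours are added when they take their turn as the centre.
            for nxt in range(pos + 1, min(pos + half, len(words) - 1) + 1):
                edges[relation].add((node, offset + nxt))
            if word.lower() not in STOP_WORDS:
                occurrences[word.lower()].append((m, node))
        offset += len(words)
    # Inter-evidence edges: the same non-stop word in two different pieces.
    for nodes in occurrences.values():
        for a in range(len(nodes)):
            for b in range(a + 1, len(nodes)):
                if nodes[a][0] != nodes[b][0]:
                    edges["r_e"].add((nodes[a][1], nodes[b][1]))
    return edges


example = build_edges(
    [["Paris", "is", "the", "capital"], ["Population", "for", "Paris", "is", "large"]],
    is_table=[False, True],
)
print(example["r_e"])
```

In this toy input the only inter-evidence edge links the two occurrences of "Paris"; the shared word "is" is skipped because it is a stop word.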

Heterogeneous Information Propagation

The constructed evidence graph includes various edges, making homogeneous GNNs unsuitable. Thus, we employ relational graph convolutional networks (R-GCN) to capture distinct node relations. Formally, it can be written as,

\begin{equation}
\mathbf{h}_i^{l+1} = \sigma\left( \sum_{r\in\mathcal{R}} \sum_{j\in\mathcal{N}_i^r} \frac{1}{c_{i,r}} \mathbf{W}_r^l \mathbf{h}_j^l + \mathbf{W}_0^l \mathbf{h}_i^l \right) \tag{3}
\end{equation}

where $\mathcal{N}_i^r$ denotes the one-hop neighbors of the $i$-th node that are connected to it by edges of relation $r$, and $\mathbf{h}_i^l=\mathbf{H}^l[i,:]\in\mathbb{R}^{1\times d^l}$ ($d^l$ is the embedding dimension). $c_{i,r}=|\mathcal{N}_i^r|$ is the normalization term, $\sigma$ is the sigmoid activation function, and $\mathbf{W}_*^l$ are learnable weight matrices in the $l$-th layer.

After one-step neighborhood propagation via Eq. (3), we obtain the contextual information in the one-hop neighborhood. By stacking $k$ layers of R-GCN, we can aggregate the information from the $k$-hop neighborhood, where $k$ is a hyperparameter that will be discussed in the experimental section. We denote the final output of the $k$-layer R-GCN as $\mathbf{H}\in\mathbb{R}^{N\times d}$, where $N$ and $d$ are the number of nodes and the embedding dimension, respectively.
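A single propagation step of Eq. (3) can be written in a few lines of NumPy. This is a sketch rather than the authors' implementation: the function name `rgcn_layer`, the dense adjacency matrices, and the row-vector convention (features are right-multiplied by the weights, since each node embedding is a row of the feature matrix) are assumptions made for readability.

```python
import numpy as np


def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))


def rgcn_layer(H, adj_by_relation, W_rel, W_self):
    """One propagation step of Eq. (3), using row vectors.

    H:               (N, d_in) node features of the current layer.
    adj_by_relation: dict relation -> (N, N) 0/1 matrix with A[i, j] = 1
                     when j is a neighbour of i under that relation.
    W_rel:           dict relation -> (d_in, d_out) relation weight.
    W_self:          (d_in, d_out) self-connection weight (W_0 in Eq. 3).
    """
    out = H @ W_self
    for r, A in adj_by_relation.items():
        deg = A.sum(axis=1, keepdims=True)  # c_{i,r} = |N_i^r|
        # Row-normalise; nodes without neighbours under r contribute nothing.
        norm = np.divide(A, deg, out=np.zeros_like(A, dtype=float), where=deg > 0)
        out = out + norm @ (H @ W_rel[r])
    return sigmoid(out)


# Toy graph: node 0 has neighbours 1 and 2 under a single relation.
H0 = np.array([[1.0, 0.0], [0.0, 2.0], [4.0, 0.0]])
A = np.zeros((3, 3))
A[0, 1] = A[0, 2] = 1.0
H1 = rgcn_layer(H0, {"r_s": A}, {"r_s": np.eye(2)}, np.zeros((2, 2)))
print(H1)
```

Calling `rgcn_layer` repeatedly, each time with its own weights, corresponds to stacking layers so that information travels one hop further per layer.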

Evidence-level Representation Readout

The word-level node representations $\mathbf{H}$ are processed using a readout module to generate evidence-level embeddings. Inspired by work on graph classification (Ying et al. 2018), we employ both max pooling and mean pooling over the words within each evidence to produce its representation.

\begin{equation}
\mathbf{e}_m = \text{max}(\mathbf{H}_{i:j}) \,\|\, \text{mean}(\mathbf{H}_{i:j}) \tag{4}
\end{equation}

where $\mathbf{H}_{i:j}\in\mathbb{R}^{(j-i+1)\times d}$ denotes the segment of $\mathbf{H}$ spanning rows $i$ to $j$, corresponding to nodes from the $m$-th evidence. The pooling strategies are applied along the first dimension so as to obtain the embedding of each evidence $\mathbf{e}_m\in\mathbb{R}^{1\times 2d}$.

In contrast to the graph classification context where pooling is performed over all graph nodes, this evidence-wise readout scheme proves beneficial for our task due to the varying significance of different evidence pieces.
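The evidence-wise readout of Eq. (4) amounts to pooling over each evidence's slice of the node matrix. The sketch below assumes a hypothetical `spans` list giving the inclusive row range of each evidence piece; the function name and toy values are illustrative only.

```python
import numpy as np


def readout(H: np.ndarray, spans) -> np.ndarray:
    """Eq. (4): concatenate max- and mean-pooled node features per evidence.

    H:     (N, d) final node representations.
    spans: list of (start, end) row indices, end inclusive, one per evidence.
    Returns an array of shape (M, 2d).
    """
    rows = []
    for start, end in spans:
        segment = H[start : end + 1]  # nodes belonging to this evidence
        rows.append(np.concatenate([segment.max(axis=0), segment.mean(axis=0)]))
    return np.stack(rows)


# Evidence 0 owns nodes 0-1, evidence 1 owns node 2.
H_toy = np.array([[1.0, 4.0], [3.0, 2.0], [5.0, 6.0]])
print(readout(H_toy, [(0, 1), (2, 2)]))
```

Pooling per evidence rather than over the whole graph keeps one vector per evidence piece, which the following attention module can then weight individually.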

Attention-based Claim-evidence Interaction

Evidence representations $\{\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_M\}$ have varying significance for claim verification. For example, the imperfect upstream evidence retrieval model may recall some redundant, claim-unrelated evidence. Such evidence should be ignored in the reasoning model. Based on this observation, we introduce an attention-based claim-evidence interaction module. In detail, we compute the importance score $\alpha_m$ for the $m$-th evidence regarding the claim based on an attention mechanism,

\begin{align}
\mathbf{c} &= \text{PLM}(c) \tag{5} \\
g_m &= \mathbf{W}_{a1}\left(\operatorname{ReLU}\left(\mathbf{W}_{a0}\left(\mathbf{c} \,\|\, \mathbf{e}_m\right)\right)\right) \tag{6} \\
\alpha_m &= \operatorname{softmax}(g_m) = \frac{\exp(g_m)}{\sum_{i=1}^{M}\exp(g_i)} \tag{7}
\end{align}

where PLM denotes a RoBERTa model encoding the claim into embeddings. PLMs in this module and node construction share the same weights for efficiency. We take the [CLS] representation as the claim embedding $\mathbf{c}\in\mathbb{R}^{1\times d}$. We then obtain the graph representation of the whole evidence set $\mathbf{o}_g\in\mathbb{R}^{1\times 2d}$ via the attention-weighted summation of all evidence.

\begin{equation}
\mathbf{o}_g = \sum_{i=1}^{M} \alpha_i \mathbf{e}_i \tag{8}
\end{equation}
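Equations (6) to (8) can be sketched with NumPy as follows. The weight shapes, the omission of bias terms, the row-vector convention, and the function name `attend` are assumptions for this illustration; the claim embedding is a toy vector rather than an actual [CLS] output.

```python
import numpy as np


def attend(c, E, W_a0, W_a1):
    """Eqs. (6)-(8) with row vectors.

    c:    (d_c,) claim embedding.
    E:    (M, d_e) evidence embeddings, one row per evidence.
    W_a0: (d_c + d_e, hidden); W_a1: (hidden, 1). Shapes are assumed.
    Returns (alpha, o_g, g): attention weights, graph representation, raw scores.
    """
    pairs = np.concatenate([np.tile(c, (E.shape[0], 1)), E], axis=1)  # c || e_m
    g = (np.maximum(pairs @ W_a0, 0.0) @ W_a1).ravel()  # Eq. (6)
    z = np.exp(g - g.max())  # subtract the max for numerical stability
    alpha = z / z.sum()  # Eq. (7)
    o_g = alpha @ E  # Eq. (8): attention-weighted sum of evidence
    return alpha, o_g, g


# With all-zero weights every score is equal, so attention is uniform.
c_toy = np.ones(2)
E_toy = np.array([[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0]])
alpha, o_g, g = attend(c_toy, E_toy, np.zeros((6, 3)), np.zeros((3, 1)))
print(alpha, o_g)
```

The raw scores `g` are returned as well because the evidence-level loss described in the training section is computed from them.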

Fused Veracity Prediction

The graph construction method utilized here effectively captures the interaction between evidence, but it also weakens the integrity of the claim and evidence paragraphs. We found that relying solely on the graph representation may cause the model to overlook phrases such as negation words. Therefore, in addition to the graph representation, we generate an assisted representation $\mathbf{o}_t$ by feeding the linearized claim and evidence sequences directly to a PLM. Then, the graph representation $\mathbf{o}_g$ is concatenated with the assisted representation $\mathbf{o}_t$ and fed into a multi-layer fully-connected network, followed by softmax normalization, to produce the final prediction $\hat{\mathbf{p}}\in\mathbb{R}^{1\times C}$, where $C$ denotes the number of classes.

\begin{align}
\mathbf{o}_t &= \text{PLM}(\left[c, e_1, e_2, \cdots, e_M\right]) \tag{9} \\
\hat{\mathbf{p}} &= \operatorname{softmax}(\operatorname{MLP}(\mathbf{o}_g \,\|\, \mathbf{o}_t)) \tag{10}
\end{align}
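The fusion step of Eq. (10) is a concatenation followed by an MLP and a softmax. In the sketch below, the two-layer MLP with a ReLU in between, the layer sizes, and the function name `predict` are assumptions; the text only states that a multi-layer fully-connected network is used.

```python
import numpy as np


def predict(o_g, o_t, W1, b1, W2, b2):
    """Eq. (10): softmax(MLP(o_g || o_t)) with an assumed two-layer MLP."""
    x = np.concatenate([o_g, o_t])  # o_g || o_t
    hidden = np.maximum(x @ W1 + b1, 0.0)  # ReLU hidden layer (assumed)
    logits = hidden @ W2 + b2
    z = np.exp(logits - logits.max())  # numerically stable softmax
    return z / z.sum()


# Random toy inputs: 4-dim graph vector, 3-dim text vector, C = 3 classes.
rng = np.random.default_rng(0)
p_hat = predict(
    rng.normal(size=4), rng.normal(size=3),
    rng.normal(size=(7, 5)), np.zeros(5),
    rng.normal(size=(5, 3)), np.zeros(3),
)
print(p_hat)
```

The output is a probability vector over the veracity classes.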

Model Training

The cross-entropy objective is utilized to compute the veracity classification loss,

\begin{equation}
Loss_c = -\sum_{c=1}^{C} \mathbf{y}_c \log\left(\hat{\mathbf{p}}_c\right) \tag{11}
\end{equation}

where $\mathbf{y}$ denotes the one-hot label vector (e.g., [1, 0, 0] represents that the ground truth is the first class).

To counteract the effects of the upstream evidence retrieval model’s imperfections, we utilize the advantageous attributes of attention mechanisms in the attention-based claim-evidence interaction module. Specifically, we calculate a predicted evidence score $s_m$ for each evidence $e_m$ using the sigmoid function, and then compare it with the truth evidence label to generate a binary classification loss. The truth evidence label indicates whether a piece of evidence is relevant for verifying the claim. The assisted loss is obtained by averaging all binary classification losses over the evidence set $\mathcal{E}$:

\begin{align}
s_m &= \operatorname{sigmoid}(g_m) \tag{12} \\
Loss_e &= -\frac{1}{M}\sum_{m=1}^{M}\left[t_m \log(s_m) + (1-t_m)\log(1-s_m)\right] \tag{13}
\end{align}

where $g_m$ is the intermediate result computed in Equation 6 when obtaining the attention scores, and $t_m$ denotes the truth evidence label for evidence $e_m$.

To train the whole model, we combine the loss terms $Loss_c$ and $Loss_e$ with a hyperparameter $\beta$, which is discussed in the experimental section:

$$Loss = Loss_c + \beta \, Loss_e \tag{14}$$
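
As a concrete illustration of Equations 12 to 14, the following minimal Python sketch computes the assisted loss from raw scores and combines it with a veracity loss. It is not the authors' implementation: the function names (`sigmoid`, `assisted_loss`, `total_loss`), the clipping constant `eps`, and the example values are assumptions made for illustration, and the cross-entropy is written in the usual negative log-likelihood form so that a lower value means better evidence scores.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function (Equation 12)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def assisted_loss(g, t, eps=1e-12):
    """Mean binary cross-entropy between sigmoid(g_m) and labels t_m.

    g: raw scores g_m, one per evidence piece.
    t: truth evidence labels t_m (1 = relevant, 0 = not relevant).
    eps: hypothetical clipping constant that keeps log() finite.
    """
    if len(g) != len(t) or not g:
        raise ValueError("g and t must be non-empty and of equal length")
    total = 0.0
    for g_m, t_m in zip(g, t):
        s_m = min(max(sigmoid(g_m), eps), 1.0 - eps)
        total -= t_m * math.log(s_m) + (1 - t_m) * math.log(1.0 - s_m)
    return total / len(g)


def total_loss(loss_c: float, g, t, beta: float) -> float:
    """Weighted sum of the veracity loss and the assisted loss (Equation 14)."""
    return loss_c + beta * assisted_loss(g, t)
```

For instance, with scores $g = (0, 0)$ and labels $t = (1, 0)$, every predicted score is 0.5, so the assisted loss equals $\log 2 \approx 0.693$.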

Experiment

In this section, we conduct comprehensive experiments to answer the following research questions:

  • RQ1: How does the proposed method HeterFC perform compared with existing baselines?

  • RQ2: How does the word-level graph compare with the sentence-level graph and the token-level graph?

  • RQ3: How does the model perform with different edge designs?

  • RQ4: Which part of the model contributes most to the final result apart from the graph construction?

  • RQ5: How does the model performance change with different values of hyper-parameters?

  • RQ6: How does the model perform with a different retrieval model?

Experimental Setups

Dataset

Following prior research, we utilize the extensive FEVEROUS dataset in our experiments (Aly et al. 2021). The FEVEROUS task involves finding relevant evidence before claim verification. Each claim is manually annotated with one of three labels (Supported, Refuted, or Not Enough Information) and paired with a corresponding golden evidence set. The dataset is divided into a training set of 71,291 claims, a development set of 7,890 claims, and a blind test set available on an online judging system (https://eval.ai/web/challenges/challenge-page/1091/overview). The evidence sets include 38,941 with sentences only, 30,574 with tables only, and 25,395 with both sentences and tables.

Metrics

Two metrics gauge the model’s performance: the Feverous score and label accuracy. Label accuracy evaluates only the accuracy of claim veracity classification. The Feverous score is stricter: it counts a claim as correct only when the golden evidence set is successfully retrieved and the verdict is correctly predicted. It is therefore a comprehensive metric that reflects the performance of both the retrieval system and the veracity reasoning model.
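
The relationship between the two metrics can be made concrete with a small sketch. It assumes, as in the usual reading of the FEVEROUS scoring rule, that a claim counts towards the Feverous score when its label is correct and at least one golden evidence set is fully contained in the retrieved evidence. The dictionary keys and the evidence identifiers below are hypothetical, and this is not the official scorer.

```python
def label_accuracy(examples):
    """Fraction of claims whose predicted verdict matches the gold verdict."""
    if not examples:
        return 0.0
    correct = sum(ex["pred_label"] == ex["gold_label"] for ex in examples)
    return correct / len(examples)


def feverous_score(examples):
    """Fraction of claims with a correct verdict AND a fully retrieved golden set.

    Each example is a dict with (hypothetical) keys:
      pred_label, gold_label: verdict strings
      retrieved: iterable of retrieved evidence ids
      gold_sets: list of golden evidence sets (each an iterable of ids)
    """
    if not examples:
        return 0.0
    hits = 0
    for ex in examples:
        retrieved = set(ex["retrieved"])
        label_ok = ex["pred_label"] == ex["gold_label"]
        # Assumption: covering any one complete golden set is sufficient.
        evidence_ok = any(set(gold) <= retrieved for gold in ex["gold_sets"])
        if label_ok and evidence_ok:
            hits += 1
    return hits / len(examples)


examples = [
    {"pred_label": "SUPPORTS", "gold_label": "SUPPORTS",
     "retrieved": ["page_a_sentence_0", "page_a_cell_0_1_1"],
     "gold_sets": [["page_a_sentence_0"]]},
    {"pred_label": "REFUTES", "gold_label": "REFUTES",
     "retrieved": ["page_b_sentence_2"],
     "gold_sets": [["page_b_sentence_2", "page_b_cell_0_3_0"]]},
]
print(label_accuracy(examples), feverous_score(examples))
```

In this toy input both verdicts are correct, but the second claim misses one golden table cell, so label accuracy is 1.0 while the Feverous score is 0.5. This is why the Feverous score is always bounded above by label accuracy for a fixed retrieval output.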

| Type | Model | Dev Feverous | Dev Label Accuracy | Test Feverous | Test Label Accuracy |
|---|---|---|---|---|---|
| Transformer-based | $\text{RoBERTa-Pair}_{\text{mean}}$ | 0.3452 | 0.7117 | 0.3231 | 0.6074 |
| Transformer-based | $\text{RoBERTa-Pair}_{\text{max}}$ | 0.3550 | 0.7207 | 0.3347 | 0.6190 |
| Transformer-based | RoBERTa-Concat | 0.3549 | 0.7221 | 0.3344 | 0.6194 |
| Transformer-based | DCUF (Hu et al. 2022) | 0.3577 | 0.7291 | 0.3397 | 0.6321 |
| Graph-based | GEAR (Zhou et al. 2019) | 0.2640 | 0.5859 | 0.2483 | 0.4964 |
| Graph-based | KGAT (Liu et al. 2020) | 0.3293 | 0.6844 | 0.3043 | 0.5797 |
| | HeterFC | 0.3714 | 0.7352 | 0.3476 | 0.6329 |

Table 1: Comparison of models on the FEVEROUS task.

Baselines

  • $\textbf{RoBERTa-Pair}_{\text{mean/max}}$ Utilizes RoBERTa as backbone. Concatenates the claim with each evidence separately to form sentences. Mean or max pooling is applied over the embeddings of all evidence for the final prediction.

  • RoBERTa-Concat Concatenates claim with all evidence using [SEP] as separator. [CLS] token’s representation is used for classification.

  • GEAR (Zhou et al. 2019) Homogeneous coarse-grained graph-based method. Treats each claim-evidence pair as a node in a fully-connected evidence graph. A graph convolutional network and an evidence aggregator are used.

  • KGAT (Liu et al. 2020) Kernel graph attention model. Similar graph construction as GEAR. Employs node and edge kernels for fine-grained evidence propagation.

  • DCUF (Hu et al. 2022) Dual-channel approach. Converts evidence to sentence form and table form. Utilizes RoBERTa for sentence form, and TAPAS for table form. Integrates both channels for veracity prediction.

Evidence Retrieval

The primary objective of our paper is to present a model for veracity reasoning. For a fair comparison, we employ the evidence retrieval model of Hu et al. (2022) for all tested models. We retrieve up to 150 relevant Wikipedia pages per claim using entity matching and TF-IDF. The top 5 pages are selected based on SBERT (https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-12-v2) and BM25 rankings. Tables within these pages are flattened, and up to 5 sentences and 3 tables are selected per claim using DrQA (Chen et al. 2017). Relevant table cells are identified using a cell selector. Each claim’s retrieved evidence set comprises up to 5 sentences and 25 table cells, as required by the FEVEROUS task.

Implementation Details

To boost our model’s ability to extract relevant details from noisy evidence, we augment each claim with two sets of evidence: the golden evidence set and the retrieved evidence set. For claims paired with the golden evidence set, all truth evidence labels are set to positive, while for claims paired with the retrieved evidence set, only evidence shared with the golden evidence set receives a positive label. We consistently apply this augmentation to all baseline models. This approach differs from GEAR and KGAT’s original technique for the FEVER task (Thorne et al. 2018), which focuses on fact verification over unstructured text evidence. We use the Adam optimizer (Kingma and Ba 2015) with learning rates of 1e-5 for language model parameters and 1e-3 for others, employing a linear scheduler with a 20% warm-up rate. RoBERTa-Large serves as the PLM, with window size $w$ and R-GCN layer count $k$ set to 2. All experiments run on a server with an AMD EPYC 7742 (256) @ 2.250GHz CPU and one NVIDIA A100 GPU.
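
The augmentation and the learning-rate schedule described above can be sketched in plain Python. The function names, the dictionary layout, and the evidence identifiers are hypothetical, and the schedule assumes that the learning rate rises linearly from zero during warm-up and then decays linearly to zero, which is one common reading of "linear scheduler"; the original training code may differ.

```python
def build_training_instances(claim, golden_evidence, retrieved_evidence):
    """Create two training instances per claim, one per evidence set.

    Golden set: every evidence piece is labeled positive (1).
    Retrieved set: only pieces that also occur in the golden set are positive.
    """
    golden = set(golden_evidence)
    golden_instance = {
        "claim": claim,
        "evidence": list(golden_evidence),
        "labels": [1] * len(golden_evidence),
    }
    retrieved_instance = {
        "claim": claim,
        "evidence": list(retrieved_evidence),
        "labels": [1 if e in golden else 0 for e in retrieved_evidence],
    }
    return [golden_instance, retrieved_instance]


def linear_warmup_lr(step, total_steps, base_lr, warmup_ratio=0.2):
    """Learning rate at `step` with linear warm-up, then linear decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


instances = build_training_instances(
    "example claim", ["sent_0", "cell_1"], ["cell_1", "sent_7"]
)
# Separate base rates for the two parameter groups mentioned in the text.
plm_lr = linear_warmup_lr(step=10, total_steps=100, base_lr=1e-5)
other_lr = linear_warmup_lr(step=10, total_steps=100, base_lr=1e-3)
```

Here the retrieved instance receives the labels `[1, 0]`, because only `cell_1` also appears in the golden set. These labels are what the assisted loss uses as $t_m$.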

Overall Performance (RQ1)

We compare our proposed HeterFC against various baselines, spanning transformer-based and graph-based models. As shown in Table 1, HeterFC outperforms all baselines on both metrics, on the development set as well as the test set. Key observations from these results are as follows:

  • Among RoBERTa-based models, $\text{RoBERTa-Pair}_{\text{mean}}$ lags, while the other two exhibit similar performance levels. This may stem from RoBERTa-Pair’s limited ability to capture diverse relationships, since it processes evidence pieces separately. RoBERTa-Concat processes evidence together, but struggles to distinguish between golden and noisy evidence, which affects its performance. Thus, direct PLM use falls short, emphasizing the need for task-specific design.

  • DCUF, the top-performing transformer-based method, incorporates RoBERTa-Concat alongside TAPAS, contributing to its superior performance.

  • The graph-based models GEAR and KGAT, which were designed for the FEVER task, exhibit suboptimal performance due to task and data differences. $\text{HeterFC}_{\text{graph}}$ (see Table 2), a purely graph-based variant of our model, outperforms both substantially, affirming the efficacy of our hybrid fact verification design.

Ablation Study and Model Variants

Comparison of the Graph Granularity (RQ2)  

Figure 2: The performance comparison among models with different graph granularity. $\text{HeterFC}_{\text{sent}}$ represents the variant with sentence-level graph, while $\text{HeterFC}_{\text{token}}$ represents the variant with token-level graph.

Figure 3: The performance comparison between the proposed model HeterFC and its three variants: $\text{HeterFC}_{\text{w/o gl}}$ ignores global and local connection designs. $\text{HeterFC}_{\text{w/o Het}}$ ignores heterogeneous relations. $\text{HeterFC}_{\text{w/o both}}$ removes all special designs for a fully-connected homogeneous graph.

To validate our word-level graph design, we compare HeterFC against its sentence-level graph variant ($\textbf{HeterFC}_{\text{sent}}$) and token-level graph variant ($\textbf{HeterFC}_{\text{token}}$). For sentence-level graphs, each claim-paired sentence (with tables linearized into sentences) constitutes a node. Nodes from the same claim interconnect to form a fully-connected evidence graph, initialized with PLM [CLS] token embeddings. For token-level graphs, we omit subword mean pooling, resulting in nodes representing subwords.

From Figure 2, HeterFC surpasses both model variants across metrics. $\text{HeterFC}_{\text{sent}}$ significantly lags behind HeterFC and $\text{HeterFC}_{\text{token}}$. This gap suggests that the coarser granularity of $\text{HeterFC}_{\text{sent}}$ struggles to capture nuanced semantic relationships among evidence pieces. Compared with $\text{HeterFC}_{\text{token}}$, HeterFC achieves a better Feverous score and holds a slight edge in label accuracy. This suggests that the word-level graph in HeterFC enhances performance because its inter-evidence connections are more precise than those of the token-level graph. In token-level graphs, shared subwords may create extraneous inter-evidence links among unrelated words (e.g., “interesting” and “thing” sharing “ing”). Such links between non-keyword subwords add noise to the graph and undermine overall performance.
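
The difference between the two granularities can be illustrated with a short sketch. It shows subword mean pooling, which turns subword vectors into one vector per word, and a toy check of which units two words share at each granularity. The function names are hypothetical, and the subword segmentation of “interesting” and “thing” is hand-written to mirror the example in the text; it is not the output of a real tokenizer.

```python
def pool_subwords_to_words(subword_vectors, word_ids):
    """Average the vectors of all subwords that belong to the same word.

    subword_vectors: list of equal-length float lists, one per subword.
    word_ids: index of the word each subword belongs to (None for special tokens).
    Returns one vector per word, ordered by word index.
    """
    sums, counts = {}, {}
    for vec, wid in zip(subword_vectors, word_ids):
        if wid is None:  # skip special tokens that map to no word
            continue
        if wid not in sums:
            sums[wid] = [0.0] * len(vec)
            counts[wid] = 0
        sums[wid] = [a + b for a, b in zip(sums[wid], vec)]
        counts[wid] += 1
    return [[x / counts[w] for x in sums[w]] for w in sorted(sums)]


def shared_units(units_a, units_b):
    """Units (words or subwords) that would link two evidence pieces."""
    return set(units_a) & set(units_b)


# Token-level view: an illustrative segmentation with a common suffix.
token_links = shared_units(["interest", "ing"], ["th", "ing"])
# Word-level view: whole words are the nodes, so nothing is shared.
word_links = shared_units(["interesting"], ["thing"])

word_vectors = pool_subwords_to_words(
    [[1.0, 3.0], [3.0, 5.0], [2.0, 2.0]], word_ids=[0, 0, 1]
)
```

In this toy case the token-level view yields the spurious link `{"ing"}`, while the word-level view yields no link at all, which matches the noise argument made above.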


Comparison of Different Edge Construction Strategies (RQ3)   In HeterFC, unique edge constructions include heterogeneous relations, inter-evidence global connections, and intra-evidence local connections. This section ablates these strategies through model variants to assess their impact. For $\textbf{HeterFC}_{\text{w/o Het}}$, homogeneous edges replace heterogeneous relations, making all edges identical. $\textbf{HeterFC}_{\text{w/o gl}}$ disregards the specific local and global connection designs, yielding a fully-connected graph while retaining heterogeneous relations. $\textbf{HeterFC}_{\text{w/o both}}$ discards all special designs, leading to a fully-connected and homogeneous graph.

From Figure 3, HeterFC outperforms its variants, with $\text{HeterFC}_{\text{w/o both}}$ exhibiting the weakest performance. Furthermore, $\text{HeterFC}_{\text{w/o gl}}$ and $\text{HeterFC}_{\text{w/o Het}}$ show marked performance drops, emphasizing the effectiveness of both heterogeneous edges and local/global connection designs.

| Model | Feverous | Label Accuracy |
|---|---|---|
| HeterFC | 0.3714 | 0.7352 |
| $\text{HeterFC}_{\text{mean}}$ | 0.3613 | 0.7238 |
| $\text{HeterFC}_{\text{max}}$ | 0.3620 | 0.7226 |
| $\text{HeterFC}_{\text{graph}}$ | 0.3640 | 0.7243 |
| $\text{HeterFC}_{\text{single}}$ | 0.3641 | 0.7257 |

Table 2: Comparison between HeterFC and its variants: $\text{HeterFC}_{\text{mean/max}}$ substitutes attention-based claim-evidence interaction with mean/max pooling. $\text{HeterFC}_{\text{graph}}$ uses only graph representation for a purely graph-based model. $\text{HeterFC}_{\text{single}}$ ignores the assisted loss.

Investigating Attention Module, Veracity Prediction and Losses (RQ4)   In addition to the graph construction discussed above, we further evaluate the Attention-based Claim-evidence Interaction module, Fused Veracity Prediction, and the assisted loss design through several HeterFC variants:

  • $\textbf{HeterFC}_{\text{mean/max}}$: Substitutes the original Attention-based Claim-evidence Interaction with a mean/max pooling layer for evidence representation aggregation. No assisted loss is computed due to the absence of predicted evidence scores.

  • $\textbf{HeterFC}_{\text{graph}}$: Excludes $\text{o}_t$ in Fused Veracity Prediction, relying solely on the graph representation $\text{o}_g$ for a purely graph-based model.

  • $\textbf{HeterFC}_{\text{single}}$: Uses only the veracity classification loss $Loss_c$, ignoring the assisted loss $Loss_e$.

Table 2 displays development set results for these variants. Notably, HeterFC surpasses all variants. $\text{HeterFC}_{\text{mean/max}}$ perform worse than $\text{HeterFC}_{\text{single}}$. Since all three of these variants are trained without the assisted loss, the gap between them highlights the importance of the Attention-based Claim-evidence Interaction module. Comparing $\text{HeterFC}_{\text{single}}$ and $\text{HeterFC}_{\text{graph}}$, the assisted loss and Fused Veracity Prediction contribute similarly to the outcome.
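
To make the contrast between the aggregation strategies concrete, the sketch below implements mean pooling, max pooling, and a score-weighted aggregation over evidence vectors. The attention variant assumes that the weights are a softmax over the intermediate scores $g_m$; the exact form of Equation 6 is not reproduced here, so this is only an illustrative stand-in, and all function names are hypothetical.

```python
import math


def mean_pool(vectors):
    """Element-wise mean over evidence vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def max_pool(vectors):
    """Element-wise maximum over evidence vectors."""
    dim = len(vectors[0])
    return [max(v[d] for v in vectors) for d in range(dim)]


def attention_pool(vectors, scores):
    """Weighted sum of evidence vectors with softmax(scores) as weights.

    Assumption: the weights come from a softmax over the scores g_m.
    """
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]


evidence = [[1.0, 2.0], [3.0, 6.0]]
uniform = attention_pool(evidence, [0.0, 0.0])     # behaves like mean pooling
focused = attention_pool(evidence, [100.0, -100.0])  # close to the first vector
```

Unlike mean or max pooling, the weighted variant can down-weight evidence with a low score, and the same scores are what the assisted loss supervises. With pooling alone there are no per-evidence scores, which is why $\text{HeterFC}_{\text{mean/max}}$ cannot use the assisted loss.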

Figure 4: The influence of hyper-parameters on HeterFC’s performance on the development set.

Hyperparameter Sensitivity Analysis (RQ5)

We study the impact of two key hyperparameters: the number of R-GCN layers $k$, which determines the information aggregation range, and the parameter $\beta$, which balances $Loss_e$ and $Loss_c$.

Number of layers $k$.

In the left part of Figure 4, increasing the number of GNN layers enhances the metrics, but beyond $k=2$ the metrics drop due to over-smoothing. We select $k=2$ for HeterFC.

Balancing parameter $\beta$.

The right part of Figure 4 shows that the Feverous score peaks at $\beta=1.2$ before a slight decline, while label accuracy increases steadily with larger $\beta$ and eventually stabilizes. We select $\beta=1.2$ as optimal for HeterFC.

Evaluating the Robustness of HeterFC (RQ6)

| Model | Feverous | Label Acc. |
|---|---|---|
| $\text{RoBERTa-Pair}_{\text{mean}}$ | 0.2272 | 0.6689 |
| $\text{RoBERTa-Pair}_{\text{max}}$ | 0.2379 | 0.6829 |
| RoBERTa-Concat | 0.2352 | 0.6627 |
| DCUF | 0.2380 | 0.6895 |
| GEAR | 0.1783 | 0.5786 |
| KGAT | 0.2110 | 0.6117 |
| HeterFC | 0.2427 | 0.6977 |

Table 3: Comparison of model performance on the development set under a different retrieval method.

We extend our analysis beyond the retrieval method of Hu et al. (2022) to include the simpler approach of Aly et al. (2021). Although its retrieval is less effective, it lets us simulate scenarios with limited, noisy evidence. Results in Table 3 align with those in Table 1, showing that HeterFC maintains superior performance over the baseline models, which underscores its robustness even with a less effective retrieval technique.

Conclusion

In this paper, we introduce HeterFC, a novel word-level heterogeneous-graph-based model for fact checking that effectively combines unstructured and structured information. Our model employs a carefully designed graph structure with word-level nodes and diverse edge types. We integrate a heterogeneous information propagation module with attention-based claim-evidence interaction to capture the semantic relationships between claims and evidence. Additionally, we introduce an assisted loss based on attention scores to differentiate valid and invalid evidence. Extensive experiments confirm the superiority of HeterFC over diverse baseline models.

Acknowledgements

This work is jointly sponsored by the National Natural Science Foundation of China (U19B2038, 62372454, 62206291, 62236010).

References

  • Aly et al. (2021) Aly, R.; Guo, Z.; Schlichtkrull, M. S.; Thorne, J.; Vlachos, A.; Christodoulopoulos, C.; Cocarascu, O.; and Mittal, A. 2021. FEVEROUS: Fact Extraction and VERification Over Unstructured and Structured information. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
  • Bouziane et al. (2021) Bouziane, M.; Perrin, H.; Sadeq, A.; Nguyen, T.; Cluzeau, A.; and Mardas, J. 2021. FaBULOUS: Fact-checking Based on Understanding of Language Over Unstructured and Structured information. In Proceedings of the Fourth Workshop on Fact Extraction and VERification (FEVER), 31–39. Association for Computational Linguistics.
  • Chen et al. (2017) Chen, D.; Fisch, A.; Weston, J.; and Bordes, A. 2017. Reading Wikipedia to Answer Open-Domain Questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1870–1879. Association for Computational Linguistics.
  • Chen et al. (2022) Chen, Z.; Hui, S. C.; Zhuang, F.; Liao, L.; Li, F.; Jia, M.; and Li, J. 2022. EvidenceNet: Evidence Fusion Network for Fact Verification. In Proceedings of the ACM Web Conference 2022, 2636–2645.
  • Fan et al. (2019) Fan, S.; Zhu, J.; Han, X.; Shi, C.; Hu, L.; Ma, B.; and Li, Y. 2019. Metapath-guided Heterogeneous Graph Neural Network for Intent Recommendation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, Anchorage, AK, USA, August 4-8, 2019, 2478–2486. ACM.
  • Funkquist (2021) Funkquist, M. 2021. Combining sentence and table evidence to predict veracity of factual claims using TaPaS and RoBERTa. In Proceedings of the Fourth Workshop on Fact Extraction and VERification (FEVER), 92–100. Association for Computational Linguistics.
  • Gi, Fang, and Tsai (2021) Gi, I.-Z.; Fang, T.-Y.; and Tsai, R. T.-H. 2021. Verdict Inference with Claim and Retrieved Elements Using RoBERTa. In Proceedings of the Fourth Workshop on Fact Extraction and VERification (FEVER), 60–65. Association for Computational Linguistics.
  • Hu et al. (2022) Hu, N.; Wu, Z.; Lai, Y.; Liu, X.; and Feng, Y. 2022. Dual-Channel Evidence Fusion for Fact Verification over Texts and Tables. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 5232–5242. Association for Computational Linguistics.
  • Kingma and Ba (2015) Kingma, D. P.; and Ba, J. 2015. Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
  • Kotonya et al. (2021) Kotonya, N.; Spooner, T.; Magazzeni, D.; and Toni, F. 2021. Graph Reasoning with Context-Aware Linearization for Interpretable Fact Extraction and Verification. In Proceedings of the Fourth Workshop on Fact Extraction and VERification (FEVER), 21–30. Association for Computational Linguistics.
  • Krause et al. (2020) Krause, N. M.; Freiling, I.; Beets, B.; and Brossard, D. 2020. Fact-checking as risk communication: the multi-layered risk of misinformation in times of COVID-19. Journal of Risk Research, 23: 1052–1059.
  • Liu et al. (2023) Liu, Q.; Wu, J.; Wu, S.; and Wang, L. 2023. Out-of-distribution Evidence-aware Fake News Detection via Dual Adversarial Debiasing. arXiv preprint arXiv:2304.12888.
  • Liu et al. (2019) Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692.
  • Liu et al. (2020) Liu, Z.; Xiong, C.; Sun, M.; and Liu, Z. 2020. Fine-grained Fact Verification with Kernel Graph Attention Network. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 7342–7351. Association for Computational Linguistics.
  • Malon (2021) Malon, C. 2021. Team Papelo at FEVEROUS: Multi-hop Evidence Pursuit. In Proceedings of the Fourth Workshop on Fact Extraction and VERification (FEVER), 40–49. Association for Computational Linguistics.
  • Mikolov et al. (2013) Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, 3111–3119.
  • Naeem and Bhatti (2020) Naeem, S. B.; and Bhatti, R. 2020. The Covid-19 ‘infodemic’: a new front for information professionals. Health Information and Libraries Journal, 37: 233–239.
  • Schlichtkrull et al. (2018) Schlichtkrull, M.; Kipf, T. N.; Bloem, P.; Berg, R. v. d.; Titov, I.; and Welling, M. 2018. Modeling relational data with graph convolutional networks. In European semantic web conference, 593–607. Springer.
  • Soleimani, Monz, and Worring (2020) Soleimani, A.; Monz, C.; and Worring, M. 2020. BERT for evidence retrieval and claim verification. In European Conference on Information Retrieval, 359–366.
  • Sun et al. (2018) Sun, H.; Dhingra, B.; Zaheer, M.; Mazaitis, K.; Salakhutdinov, R.; and Cohen, W. W. 2018. Open domain question answering using early fusion of knowledge bases and text. arXiv preprint arXiv:1809.00782.
  • Thorne et al. (2018) Thorne, J.; Vlachos, A.; Christodoulopoulos, C.; and Mittal, A. 2018. FEVER: a Large-scale Dataset for Fact Extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 809–819. Association for Computational Linguistics.
  • Wadden et al. (2020) Wadden, D.; Lin, S.; Lo, K.; Wang, L. L.; van Zuylen, M.; Cohan, A.; and Hajishirzi, H. 2020. Fact or Fiction: Verifying Scientific Claims. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 7534–7550. Association for Computational Linguistics.
  • Wang (2017) Wang, W. Y. 2017. “Liar, Liar Pants on Fire”: A New Benchmark Dataset for Fake News Detection. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 422–426. Association for Computational Linguistics.
  • Wright et al. (2022) Wright, D.; Wadden, D.; Lo, K.; Kuehl, B.; Cohan, A.; Augenstein, I.; and Wang, L. 2022. Generating Scientific Claims for Zero-Shot Scientific Fact Checking. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2448–2460. Association for Computational Linguistics.
  • Wu et al. (2022) Wu, J.; Xu, W.; Liu, Q.; Wu, S.; and Wang, L. 2022. Adversarial contrastive learning for evidence-aware fake news detection with graph neural networks. arXiv preprint arXiv:2210.05498.
  • Xu et al. (2023) Xu, W.; Liu, Q.; Wu, S.; and Wang, L. 2023. Counterfactual Debiasing for Fact Verification. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 6777–6789.
  • Xu et al. (2022) Xu, W.; Wu, J.; Liu, Q.; Wu, S.; and Wang, L. 2022. Evidence-aware Fake News Detection with Graph Neural Networks. In Proceedings of the ACM Web Conference 2022.
  • Yan et al. (2021) Yan, Q.; Zhang, Y.; Liu, Q.; Wu, S.; and Wang, L. 2021. Relation-aware Heterogeneous Graph for User Profiling. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 3573–3577.
  • Ying et al. (2018) Ying, Z.; You, J.; Morris, C.; Ren, X.; Hamilton, W. L.; and Leskovec, J. 2018. Hierarchical Graph Representation Learning with Differentiable Pooling. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, 4805–4815.
  • Yu et al. (2019) Yu, W.; Zhou, J.; Yu, W.; Liang, X.; and Xiao, N. 2019. Heterogeneous Graph Learning for Visual Commonsense Reasoning. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, 2765–2775.
  • Zellers et al. (2019) Zellers, R.; Holtzman, A.; Rashkin, H.; Bisk, Y.; Farhadi, A.; Roesner, F.; and Choi, Y. 2019. Defending Against Neural Fake News. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, 9051–9062.
  • Zhao et al. (2020) Zhao, C.; Xiong, C.; Rosset, C.; Song, X.; Bennett, P. N.; and Tiwary, S. 2020. Transformer-XH: Multi-Evidence Reasoning with eXtra Hop Attention. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.
  • Zhao et al. (2017) Zhao, H.; Yao, Q.; Li, J.; Song, Y.; and Lee, D. L. 2017. Meta-Graph Based Recommendation Fusion over Heterogeneous Information Networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, August 13 - 17, 2017, 635–644. ACM.
  • Zhong et al. (2020) Zhong, W.; Xu, J.; Tang, D.; Xu, Z.; Duan, N.; Zhou, M.; Wang, J.; and Yin, J. 2020. Reasoning Over Semantic-Level Graph for Fact Checking. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 6170–6180. Association for Computational Linguistics.
  • Zhou et al. (2019) Zhou, J.; Han, X.; Yang, C.; Liu, Z.; Wang, L.; Li, C.; and Sun, M. 2019. GEAR: Graph-based Evidence Aggregating and Reasoning for Fact Verification. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 892–901. Association for Computational Linguistics.