Meta-Chunking

Corresponding author: [email protected]
ABSTRACT
1 INTRODUCTION
Retrieval-augmented generation (RAG), as a cutting-edge technological paradigm, aims to address
challenges faced by large language models (LLMs), such as data freshness (He et al., 2022), hal-
lucinations (Bénédict et al., 2023; Chen et al., 2023b; Zuccon et al., 2023; Liang et al., 2024), and
the lack of domain-specific knowledge (Li et al., 2023; Shen et al., 2023). This is particularly rel-
evant in knowledge-intensive tasks like open-domain question answering (Lazaridou et al., 2022).
By integrating two key components: the retriever and the generator, this technology enables more
precise responses to input queries (Singh et al., 2021; Lin et al., 2023). While the feasibility of
the retrieval-augmentation strategy has been widely demonstrated through practice, its effectiveness
heavily relies on the relevance and accuracy of the retrieved documents (Li et al., 2022; Tan et al.,
2022). The introduction of excessive redundant or incomplete information through retrieval not only
fails to enhance the performance of the generation model but may also lead to a decline in answer
quality (Shi et al., 2023; Yan et al., 2024).
In response to the aforementioned challenges, current research efforts mainly focus on two aspects:
improving retrieval accuracy (Besta et al., 2024; Zhuang et al., 2024; Sidiropoulos & Kanoulas,
2022; Guo et al., 2023) and enhancing the robustness of LLMs against toxic information (Longpre
et al.; Kim et al., 2024). However, in RAG systems, a commonly overlooked aspect is the chunked
processing of textual content, which directly impacts the quality of dense retrieval (Xu et al., 2023).
Traditional text chunking methods, often based on rules or semantic similarity (Zhang et al., 2021;
Langchain, 2023; Lyu et al., 2024), provide some structural segmentation but are inadequate in
capturing subtle changes in logical relationships between sentences. As illustrated in Figure 1,
example sentences exhibit a progressive relationship, yet their semantic similarity is low, which
may result in their complete separation. The LumberChunker (Duarte et al., 2024) offers a novel
solution by utilizing LLMs to receive a series of consecutive paragraphs and accurately identify
where content begins to diverge. However, it demands a high level of instruction-following ability
from LLMs, necessitating the use of the Gemini model, which incurs significant resource and time
costs. This raises a practical question: How can we fully utilize the powerful reasoning capabilities
of LLMs while efficiently accomplishing the text chunking task at a lower cost?
This paper introduces the concept of Meta-Chunking, which operates at a granularity between sen-
tences and paragraphs, aiming to enhance logical coherence in the process of text segmentation.
Meta-Chunking consists of sets of sentences within paragraphs that share deep linguistic and log-
ical connections. To address the limitations of traditional methods based on semantic similarity,
we leverage the powerful comprehension and reasoning capabilities of LLMs to devise two Meta-
Chunking strategies: Margin Sampling Chunking and Perplexity (PPL) Chunking. The former
approach determines whether consecutive sentences require segmentation by comparing the proba-
bility difference of a binary classification with a set threshold. The latter calculates the PPL of each
sentence based on its context and identifies text chunk boundaries by analyzing PPL distribution
characteristics. The Margin Sampling Chunking effectively reduces the dependence of text chunk-
ing on model size, enabling smaller language models with relatively weaker reasoning capabilities
to perform this task adequately. The PPL Chunking takes it a step further, significantly improving
processing efficiency and achieving both resource and time savings. This provides crucial support
for LLMs to handle text chunking in real-world scenarios.
To comprehensively evaluate proposed methods, extensive experiments were conducted on eleven
datasets across four benchmarks, involving both Chinese and English texts, ranging from brief to ex-
tensive documents, and measured through seven key metrics. In response to the inherent complexity
of different datasets, we propose a Meta-Chunking with dynamic combination strategy designed
to achieve a valid balance between fine-grained and coarse-grained text segmentation. Experimen-
tal results fully demonstrate that the Meta-Chunking strategy significantly improves performance
compared to traditional rule-based and semantic chunking. More importantly, compared to the cur-
rent LLM-based approach, the method proposed in this paper exhibits superior performance in terms of
efficiency and cost savings.
2 METHODOLOGY
Our main contribution is an innovative text segmentation technique named Meta-Chunking, which
leverages the capabilities of LLMs to flexibly partition documents into logically coherent, indepen-
dent chunks. Our approach is grounded in a core principle: allowing variability in chunk size to
more effectively capture and maintain the logical integrity of content. This dynamic adjustment of
granularity ensures that each segmented chunk contains a complete and independent expression of
ideas, thereby avoiding breaks in the logical chain during the segmentation process. This not only
enhances the relevance of document retrieval but also improves content clarity.
As illustrated in Figure 2, our method integrates the advantages of traditional text segmentation
strategies, such as adhering to preset chunk length constraints and ensuring sentence structural in-
tegrity, while enhancing the ability to guarantee logical coherence during the segmentation process.
The key lies in introducing a novel concept between sentence-level and paragraph-level text granu-
larity: Meta-Chunking. A meta chunk consists of a collection of sequentially arranged sentences
within a paragraph, where the sentences not only share semantic relevance but, more importantly,
contain deep linguistic logical connections, including but not limited to causal, transitional, parallel,
and progressive relationships. These relationships go beyond mere semantic similarity. In order to
achieve this goal, we have designed and implemented the following two strategies.
Figure 1: Overview of the RAG pipeline, together with examples of rule-based, similarity-based, and PPL-based segmentation. The same background color indicates that text belongs to the same chunk.

Margin Sampling Chunking: Given a text, the initial step involves segmenting it into a collection of sentences denoted as $(x_1, x_2, \ldots, x_n)$, with the ultimate goal being to further partition these sentences into several chunks, forming a new set $(X_1, X_2, \ldots, X_k)$, where each chunk comprises a coherent grouping of the original sentences. The method can be formulated as:

$$\mathrm{Margin}_M(x_i) = P_M\big(y = k_1 \mid \mathrm{Prompt}(x_i, X')\big) - P_M\big(y = k_2 \mid \mathrm{Prompt}(x_i, X')\big) \qquad (1)$$

where $(k_1, k_2)$ indicates a binary decision between yes and no for a segmentation judgment. $\mathrm{Prompt}(x_i, X')$ represents forming an instruction between $x_i \in \{x_l\}_{l=1}^{n}$ and $X'$, regarding whether they should be merged, where $X'$ encompasses either a single sentence or multiple sentences. Through the probability $P_M$ obtained by model $M$, we can derive the probability difference $\mathrm{Margin}_M(x_i)$ between the two options. Subsequently, by contrasting $\mathrm{Margin}_M(x_i)$ with the threshold $\theta$, a conclusion can be drawn regarding whether the two sentences should be segmented. For the setting of $\theta$, we initially assign it a value of 0 and then adjust it by recording historical $\mathrm{Margin}_M(x_i)$ values and calculating their average.
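To make the procedure concrete, below is a minimal sketch of Margin Sampling Chunking with a Hugging Face causal LM. The model name, the prompt wording, and the direction of the yes/no decision are illustrative assumptions rather than the paper's exact configuration (the actual instruction appears in Table 5 of the appendix).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choice; the experiments use models such as Qwen2-1.5B/7B.
MODEL_NAME = "Qwen/Qwen2-1.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def margin(prev_text: str, next_sentence: str) -> float:
    """Probability margin between the 'yes' and 'no' answers (Eq. 1)."""
    # Hypothetical instruction; the paper's exact prompt is given in its Table 5.
    prompt = (
        "Should the following two text pieces be placed in separate chunks? "
        "Answer with one word, yes or no.\n"
        f"Piece 1: {prev_text}\nPiece 2: {next_sentence}\nAnswer:"
    )
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids).logits[0, -1]          # next-token distribution
    probs = torch.softmax(logits, dim=-1)
    yes_id = tokenizer(" yes", add_special_tokens=False).input_ids[0]
    no_id = tokenizer(" no", add_special_tokens=False).input_ids[0]
    return (probs[yes_id] - probs[no_id]).item()

def margin_sampling_chunking(sentences, theta=0.0):
    """Greedy pass over consecutive sentences; theta tracks the running mean of margins."""
    chunks, current, history = [], [sentences[0]], []
    for sentence in sentences[1:]:
        m = margin(" ".join(current), sentence)
        history.append(m)
        if m > theta:                      # model leans towards "yes": start a new chunk
            chunks.append(" ".join(current))
            current = [sentence]
        else:                              # model leans towards "no": keep merging
            current.append(sentence)
        theta = sum(history) / len(history)   # adaptive threshold described above
    chunks.append(" ".join(current))
    return chunks
```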
Perplexity Chunking: Similarly, we split the text into sentences and use the model to calculate the PPL of each sentence $x_i$ based on the preceding sentences:

$$\mathrm{PPL}_M(x_i) = \frac{1}{K}\sum_{k=1}^{K} \mathrm{PPL}_M\big(t_k^{i} \mid t_{<k}^{i}, t_{<i}\big) \qquad (2)$$

where $K$ represents the total number of tokens in $x_i$, $t_k^{i}$ denotes the $k$-th token in $x_i$, and $t_{<i}$ signifies all tokens that precede $x_i$. To locate the key points of text segmentation, the algorithm further analyzes the distribution characteristics of $\mathrm{PPL}_{\mathrm{seq}} = (\mathrm{PPL}_M(x_1), \mathrm{PPL}_M(x_2), \ldots, \mathrm{PPL}_M(x_n))$, particularly focusing on identifying minima:

$$\mathrm{Minima}_{\mathrm{index}}(\mathrm{PPL}_{\mathrm{seq}}) = \Big\{\, i \;\Big|\; \min\big(\mathrm{PPL}_M(x_{i-1}), \mathrm{PPL}_M(x_{i+1})\big) - \mathrm{PPL}_M(x_i) > \theta, \ \text{or} \ \mathrm{PPL}_M(x_{i-1}) - \mathrm{PPL}_M(x_i) > \theta \ \text{and} \ \mathrm{PPL}_M(x_{i+1}) = \mathrm{PPL}_M(x_i) \,\Big\} \qquad (3)$$
These minima are regarded as potential chunk boundaries. If the text exceeds the processing range
of LLMs or device, we strategically introduce a key-value (KV) caching mechanism. Specifically,
the text is first divided into several parts according to tokens, forming multiple subsequences. As
the PPL calculation progresses, when the GPU memory is about to exceed the server configuration
or the maximum context length of LLMs, the algorithm appropriately removes KV pairs of previous
partial text, thus not sacrificing too much contextual coherence.
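A sketch of Perplexity Chunking under Eqs. (2) and (3) is given below. The model choice and the single-forward-pass implementation are simplifying assumptions; texts beyond the context window would additionally need the KV-cache handling described above, which is not implemented here.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2-1.5B-Instruct"   # illustrative; any causal LM can be substituted
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def sentence_ppls(sentences):
    """Per-sentence PPL conditioned on all preceding sentences (Eq. 2),
    computed with a single forward pass over the concatenated text."""
    encoded = [tokenizer(s, add_special_tokens=False).input_ids for s in sentences]
    input_ids = torch.tensor([sum(encoded, [])])
    with torch.no_grad():
        logits = model(input_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    # negative log-likelihood of every token given all tokens before it
    nll = -log_probs.gather(1, input_ids[0, 1:].unsqueeze(1)).squeeze(1)
    ppls, pos = [], 0
    for ids in encoded:
        span = nll[max(pos - 1, 0): pos + len(ids) - 1]
        ppls.append(torch.exp(span).mean().item() if len(span) > 0 else float("inf"))
        pos += len(ids)
    return ppls

def ppl_minima(ppls, theta=0.0):
    """Indices of local PPL minima treated as candidate chunk boundaries (Eq. 3)."""
    boundaries = []
    for i in range(1, len(ppls) - 1):
        left, mid, right = ppls[i - 1], ppls[i], ppls[i + 1]
        if (min(left, right) - mid > theta) or (left - mid > theta and right == mid):
            boundaries.append(i)
    return boundaries

# usage: sentences = split_into_sentences(document)   # any sentence splitter
#        cut_after = ppl_minima(sentence_ppls(sentences), theta=0.0)
```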
Figure 2: Overview of the entire process of Meta-Chunking. Each circle represents a complete sentence, and sentence lengths are not consistent. The vertical lines indicate where to segment. The two sides at the bottom of the figure show Margin Sampling Chunking and Perplexity Chunking. Circles with the same background color form a meta-chunk, and meta-chunks are dynamically combined so that the final chunk length meets user needs.

To address the diverse chunking needs of users, merely adjusting the threshold to control chunk size sometimes leads to uneven chunk sizes as the threshold increases, as shown in Sections 4.2.2 and 4.2.3. Therefore, we propose a strategy combining Meta-Chunking with dynamic merging, aiming to
flexibly respond to varied chunking requirements. Firstly, we set an initial threshold of 0 or a specific
value based on the PPL distribution and perform Meta-Chunking operations, preliminarily dividing
the document into a series of basic units $(c_1, c_2, \ldots, c_\alpha)$. Subsequently, according to the user-specified chunk length $L$, we iteratively merge adjacent meta-chunks until the total length satisfies or approximates the requirement. Specifically, if $\mathrm{len}(c_1, c_2, c_3) = L$, or $\mathrm{len}(c_1, c_2, c_3) < L$ while $\mathrm{len}(c_1, c_2, c_3, c_4) > L$, then $c_1, c_2, c_3$ are regarded as a complete chunk.
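The dynamic combination step reduces to a short greedy merge; the sketch below assumes len() is measured in characters (token counts would serve equally well).

```python
def dynamic_merge(meta_chunks, target_len):
    """Greedily merge adjacent meta-chunks until adding one more unit would
    exceed the user-specified chunk length target_len."""
    final_chunks, buffer = [], ""
    for unit in meta_chunks:
        if buffer and len(buffer) + len(unit) > target_len:
            final_chunks.append(buffer)   # e.g. c1..c3 close a chunk once c4 would overshoot
            buffer = unit
        else:
            buffer += unit
    if buffer:
        final_chunks.append(buffer)
    return final_chunks
```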
LLMs are designed to learn a distribution Q that approximates the empirical distribution P from
sample texts. To quantify the closeness between these two distributions, cross-entropy is typically
employed as a metric. Under the discrete scenario, cross-entropy of Q relative to P is formally
defined as follows:
$$H(P, Q) = \mathbb{E}_P[-\log Q] = -\sum_{x} P(x) \log Q(x) = H(P) + D_{\mathrm{KL}}(P \,\|\, Q) \qquad (4)$$

where $H(P)$ represents the empirical entropy, and $D_{\mathrm{KL}}(P \,\|\, Q)$ is the Kullback-Leibler (KL) divergence between $Q$ and $P$. The PPL of LLMs, mathematically speaking, is defined as:

$$\mathrm{PPL}(P, Q) = 2^{H(P, Q)} \qquad (5)$$
It is essential to note that, since $H(P)$ cannot be optimized and is bounded as shown in Appendix A.1, what truly impacts the discrepancy in PPL calculations across different LLMs is the KL divergence, which serves as a metric to assess the difference between distributions: the greater the KL divergence, the larger the disparity between the two distributions. Furthermore, a high PPL indicates cognitive hallucination of LLMs towards the real content, and such portions should not be segmented.
On the other hand, Shannon (1951) approximates the entropy of any language through the function

$$G_K = -\sum_{T_K} P(T_K) \log_2 P(t_K \mid T_{K-1}) = -\sum_{T_K} P(T_K) \log_2 P(T_K) + \sum_{T_{K-1}} P(T_{K-1}) \log_2 P(T_{K-1}) \qquad (6)$$
where $T_K$ represents $K$ consecutive tokens $(t_1, t_2, \ldots, t_K)$ in a text sequence; the entropy of the language can then be expressed as

$$H = \lim_{K \to \infty} G_K \qquad (7)$$

Then, based on the proof in Appendix A.1 that $G_{K+1} \leq G_K$ for all $K \geq 1$, we can derive

$$H = \lim_{K \to \infty} G_K \leq G_{K+1} \leq G_K \qquad (8)$$
By combining formulas (4) and (8), we observe that for large-scale text processing tasks, increasing
the context length tends to reduce the cross-entropy or PPL, a phenomenon that reflects the ability
of LLMs to make more effective logical inferences and semantic understandings after capturing
broader contextual information. Consequently, during PPL Chunking experiments, we maximize
the input of longer text sequences to LLMs, anticipating more substantial performance gains.
3 EXPERIMENT
3.2 BASELINES
We primarily compared Meta-Chunking with two types of methods, namely rule-based chunking and
dynamic chunking, noting that the latter incorporates both semantic similarity models and LLMs.
The original rule-based method simply divides long texts into fixed-length chunks, disregarding
sentence boundaries. However, the Llama index method (Langchain, 2023) offers a more nuanced
approach, balancing the maintenance of sentence boundaries while ensuring that token counts in
each segment are close to a preset threshold. On the other hand, similarity chunking (Xiao et al.,
2023) utilizes sentence embedding models to segment text based on semantic similarity, effectively
grouping highly related sentences together. Alternatively, LumberChunker (Duarte et al., 2024)
employs LLMs to predict optimal segmentation points within the text. Both methods exhibit unique
strengths in adapting to the context and structure of texts.
Table 1: Main experimental results are presented in five QA datasets. The first four datasets are
sourced from LongBench. sent. indicates whether it is suitable to separate two sentences, while
chunk signifies whether the latter sentence is appropriate to be merged with the preceding chunk.
comb. refers to the process of first segmenting the text using PPL Chunking with a threshold of 0,
followed by dynamic combination.
Table 2: Performance of different methods on CRUD QA datasets with overlapping chunks. ppl
represents direct PPL Chunking, with a threshold of 0.5. Precise chunk length and overlap length
results are included in Appendix A.3.
Chunking Method Overlap BLEU-1 BLEU-2 BLEU-3 BLEU-4 BLEU-Avg ROUGE-L BERTScore
Single-hop Query
Original Fixed 0.3330 0.2641 0.2214 0.1881 0.2410 0.4060 0.8425
Llama index Dynamic 0.3326 0.2645 0.2214 0.1890 0.2413 0.4039 0.8439
Qwen2-1.5Bppl Dynamic 0.3592 0.2888 0.2435 0.2081 0.2644 0.4332 0.8555
Qwen2-7Bppl Dynamic 0.3582 0.2898 0.2450 0.2097 0.2657 0.4308 0.8548
Baichuan2-7Bppl Dynamic 0.3656 0.2952 0.2497 0.2143 0.2705 0.4393 0.8549
Two-hop Query
Original Fixed 0.2251 0.1300 0.0909 0.0689 0.1114 0.2579 0.8747
Llama index Dynamic 0.2223 0.1282 0.0896 0.0677 0.1099 0.2555 0.8732
Qwen2-1.5Bppl Dynamic 0.2295 0.1331 0.0934 0.0709 0.1143 0.2609 0.8700
Qwen2-7Bppl Dynamic 0.2312 0.1353 0.0949 0.0719 0.1162 0.2638 0.8751
Baichuan2-7Bppl Dynamic 0.2336 0.1350 0.0940 0.0710 0.1154 0.2650 0.8754
Three-hop Query
Original Fixed 0.2384 0.1268 0.0832 0.0602 0.1066 0.2546 0.8823
Llama index Dynamic 0.2331 0.1250 0.0825 0.0598 0.1049 0.2517 0.8796
Qwen2-1.5Bppl Dynamic 0.2453 0.1319 0.0881 0.0643 0.1114 0.2599 0.8808
Qwen2-7Bppl Dynamic 0.2447 0.1330 0.0891 0.0651 0.1122 0.2618 0.8817
Baichuan2-7Bppl Dynamic 0.2463 0.1324 0.0887 0.0651 0.1120 0.2596 0.8811
chunking techniques, such as similarity-based segmentation and the LumberChunker. Further ana-
lyzing the performance of PPL Chunking under different model sizes, we found that although the
processing time of the 7B model increases compared to the 1.5B model, both can complete chunking
tasks within acceptable ranges. Therefore, in scenarios pursuing higher performance, it is preferable
to use the 7B model for chunking to maximize the predictive capabilities of the model.
How Weak Can the Weaker LLM Be? As a fundamental task, text chunking consumes a large
number of tokens when using LLMs like GPT-4 or Gemini, often leading to a significant imbalance
between resource utilization and task benefits. Therefore, using a lightweight model is a practical
choice. Since our method is applicable to both large and small models, in addition to testing 1.5B
and 7B models, we explored smaller models below 1B parameters. As the model size decreases,
the execution time of the text chunking task significantly reduces, reflecting the advantage of small
models in improving processing efficiency. However, this advantage often comes with a compro-
mise in performance. Furthermore, the limitations of small models in cross-lingual adaptability are
particularly prominent. Taking the Pythia model as an example, its current version mainly focuses
on English text, making it difficult to apply directly to multilingual text chunking. Therefore, when
pursuing the dual goals of high performance and high efficiency, medium-sized models (such as
those at the 1.5B parameter level) exhibit a more balanced performance.
4.2 ANALYSIS
As demonstrated in Table 2, the PPL Chunking overlap strategy shows particularly notable performance in multi-hop QA scenarios. Specifically, except for the BERTScore metric, the PPL Chunking overlap method achieves a performance gain of 2%–3% on the single-hop task. In the case of two-hop and three-hop tasks, although the rate of improvement slows slightly, a consistent gain of 0.3%–1% is
maintained. Additionally, the performance across all three models exhibits an upward trend with the
size of model parameters. Although the 1.5B model lags slightly behind the 7B model in terms of
overall performance, it still demonstrates notable improvement over traditional chunking methods,
further validating the effectiveness of PPL Chunking.
Figure 3: Performance of different methods on single-hop query in the CRUD QA dataset. ppl
represents direct PPL Chunking, with a threshold of 0.5. comb. indicates PPL Chunking with
dynamic combination, with a threshold of 0 when performing PPL Chunking. Precise chunk length
results and performance of remaining multi-hop scenarios are included in Appendix A.3.
Figure 4: Performance of different methods on CUAD QA datasets. ppl indicates direct PPL Chunk-
ing, with a threshold of 0.
Threshold selection should take into account the PPL distribution of texts: when the PPL distribution is relatively stable, it is more appropriate to
select a lower threshold (such as setting the threshold to 0 in HotpotQA, MuSiQue, and DuReader);
whereas when the PPL distribution exhibits large fluctuations, choosing a higher threshold (such as
setting the threshold to 0.4 in NarrativeQA) can effectively distinguish paragraphs with different in-
formation densities, improving the chunking effect. Therefore, when employing PPL for chunking,
it is crucial to comprehensively consider the dual factors of chunk length and text PPL distribution
to determine the relatively optimal configuration that maximizes performance.
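One simple way to operationalize this advice is to pick the threshold from the spread of the PPL sequence itself; the heuristic and all cutoff values below are illustrative assumptions, not the paper's procedure.

```python
import statistics

def pick_threshold(ppl_seq, flat_theta=0.0, spiky_theta=0.4, spread_cutoff=1.0):
    """Choose a PPL-chunking threshold from the spread of the PPL distribution:
    a flat distribution gets a low threshold, a fluctuating one a higher threshold."""
    spread = statistics.pstdev(ppl_seq)
    return flat_theta if spread < spread_cutoff else spiky_theta
```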
To explore the impact of chunking strategies on the RAG system, we evaluated the combination
of different chunking and re-ranking methods. Initially, a top-10 set of relevant texts was filtered
using a dense retriever. We then compared two re-ranking strategies: (1) the BgeRerank method,
leveraging the bge-reranker-large model (Xiao et al., 2023), and (2) the PPLRerank method with the
Qwen2-1.5B model, utilizing the re-ranking method mentioned in the coarse-grained compression
section in Jiang et al. (2023).
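For reference, a minimal sketch of the BgeRerank step with a cross-encoder interface is given below; the PPLRerank variant would instead score each candidate chunk by the perplexity the language model assigns to the question conditioned on that chunk, roughly following the coarse-grained ranking idea in Jiang et al. (2023). Both the interface and the scoring shown here are illustrative assumptions, not the exact implementations used in the experiments.

```python
from sentence_transformers import CrossEncoder

def bge_rerank(query, passages, top_k=10):
    """Re-rank retrieved passages with the bge-reranker-large cross-encoder."""
    reranker = CrossEncoder("BAAI/bge-reranker-large")
    scores = reranker.predict([(query, passage) for passage in passages])
    order = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)
    return [passages[i] for i in order[:top_k]]
```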
Figure 6: Performance of re-ranking strategies combined with different chunking methods in the
MultiHop-RAG benchmark. ppl represents direct PPL Chunking, with a threshold of 0.5. base denotes that no re-ranking strategy is applied. Precise chunk length results are included in Appendix A.5.
Experimental results (see Figure 6) revealed that PPL Chunking and PPLRerank achieved the best
overall performance across all metrics. Further analysis demonstrated that, compared to traditional
chunking, PPL Chunking not only provided performance gains independently but also significantly
enhanced the effectiveness of the subsequent re-ranking. Notably, while traditional chunking and
re-ranking strategies already deliver performance improvements, PPL Chunking resulted in even
greater re-ranking gains. For instance, in the Hits@8 metric, PPLRerank under the original chunking yielded a 1.42% improvement, whereas PPLRerank under PPL Chunking achieved a 3.59% improvement.
5 RELATED WORKS
Text Segmentation Text segmentation is a fundamental task in NLP, aimed at breaking down text content into its
constituent parts to lay the foundation for subsequent advanced tasks such as information retrieval
(Li et al., 2020) and text summarization (Lukasik et al., 2020; Cho et al., 2022). By conducting topic
modeling on documents, Kherwa & Bansal (2020) and Barde & Bainwad (2017) demonstrate the
identification of primary and sub-topics within documents as a significant basis for text segmenta-
tion. Numerous techniques exist for topic modeling, ranging from algorithms based on probabilistic
methods, such as Latent Dirichlet Allocation (Blei et al., 2003) and Probabilistic Latent Semantic
Analysis (Hofmann et al., 1999), to models that also consider semantic relationships between words
and sentences, like Top2Vec (Angelov, 2020) and BERTopic (Grootendorst, 2022). They are used
in combination with clustering methods like HDBSCAN (McInnes et al., 2017) and dimension re-
duction techniques like UMAP (McInnes et al., 2018). Additionally, Zhang et al. (2021) frames text
segmentation as a sentence-level sequence labeling task, utilizing BERT to encode multiple sen-
tences simultaneously. It calculates sentence vectors after modeling longer contextual dependencies
and finally predicts whether to perform text segmentation after each sentence. Langchain (2023)
provides flexible and powerful support for various text processing scenarios by integrating multiple
text segmentation methods, including character segmentation, delimiter-based text segmentation,
specific document segmentation, and recursive chunk segmentation. Although these methods bet-
ter respect the structure of the document, they have limitations in deep contextual understanding.
To address this issue, semantic-based segmentation (Kamradt, 2024) utilizes embeddings to aggre-
gate semantically similar text chunks and identifies segmentation points by monitoring significant
changes in embedding distances.
Text Chunking in RAG LLMs have demonstrated remarkable capabilities in language-related
tasks through their complex internal structures and reasoning mechanisms (Zheng et al., 2024).
By expanding the input space of LLMs through introducing retrieved text chunks (Guu et al., 2020;
Lewis et al., 2020), RAG significantly improves the performance of knowledge-intensive tasks (Ram
et al., 2023). Text chunking plays a crucial role in RAG, as ineffective chunking strategies can lead
to incomplete contexts or excessive irrelevant information, thereby hurting the performance of QA
systems (Yu et al., 2023). Besides typical granularity levels like sentences or paragraphs (Lyu et al.,
2024; Gao et al., 2023), there are other advanced methods available. Chen et al. (2023a) intro-
duced a novel retrieval granularity called Proposition, which is the smallest text unit that conveys
a single fact. This method excels in fact-based texts like Wikipedia. However, it may not perform
ideally when dealing with content that relies on flow and contextual continuity, such as narrative
texts, leading to the loss of critical information. On the other hand, Li et al. (2024) constructs an
end-to-end extraction model that performs chunking after the retrieval process, adaptively extract-
ing query-relevant content. Despite its flexibility in adjusting text chunking to specific questions, the
response speed is relatively slow, and the delay becomes more pronounced when processing exten-
sive data. Meanwhile, LumberChunker (Duarte et al., 2024) iteratively harnesses LLMs to identify
potential segmentation points within a continuous sequence of textual content, showing some poten-
tial for LLMs chunking. However, this method demands a profound capability of LLMs to follow
instructions and entails substantial consumption when employing the Gemini model.
6 CONCLUSION
This paper proposes the concept of Meta-Chunking along with its implementation strategies, namely
Margin Sampling Chunking and PPL Chunking, which enable a more precise capture of the inherent
logical structure of text, thereby providing a powerful tool for optimizing text segmentation within
the RAG pipeline. To balance the effectiveness of fine-grained and coarse-grained text segmentation,
we present a dynamic combination approach with Meta-Chunking to address the limitation when
dealing with diverse texts. Our comprehensive evaluation using multiple metrics on eleven datasets
demonstrates that Meta-Chunking significantly outperforms both rule-based and similarity-based
chunking, while also achieving a better balance between performance, time cost, and computational
cost compared to current LLM-based approaches.
REFERENCES
Dimo Angelov. Top2vec: Distributed representations of topics. arXiv preprint arXiv:2008.09470,
2020.
Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du,
Xiao Liu, Aohan Zeng, Lei Hou, et al. Longbench: A bilingual, multitask benchmark for long
context understanding. arXiv preprint arXiv:2308.14508, 2023.
Bhagyashree Vyankatrao Barde and Anant Madhavrao Bainwad. An overview of topic modeling
methods and tools. In 2017 International Conference on Intelligent Computing and Control Sys-
tems (ICICCS), pp. 745–750. IEEE, 2017.
Gabriel Bénédict, Ruqing Zhang, and Donald Metzler. Gen-ir@sigir 2023: The first workshop on
generative information retrieval. In Proceedings of the 46th International ACM SIGIR Conference
on Research and Development in Information Retrieval, pp. 3460–3463, 2023.
Maciej Besta, Ales Kubicek, Roman Niggli, Robert Gerstenberger, Lucas Weitzendorf, Mingyuan
Chi, Patrick Iff, Joanna Gajda, Piotr Nyczyk, Jürgen Müller, et al. Multi-head rag: Solving multi-
aspect problems with llms. arXiv preprint arXiv:2406.05085, 2024.
Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric
Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al.
Pythia: A suite for analyzing large language models across training and scaling. In International
Conference on Machine Learning, pp. 2397–2430. PMLR, 2023.
David M Blei, Andrew Y Ng, and Michael I Jordan. Latent dirichlet allocation. Journal of machine
Learning research, 3(Jan):993–1022, 2003.
Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui
Chen, Zhi Chen, Pei Chu, et al. Internlm2 technical report. arXiv preprint arXiv:2403.17297,
2024.
Tong Chen, Hongwei Wang, Sihao Chen, Wenhao Yu, Kaixin Ma, Xinran Zhao, Dong Yu, and
Hongming Zhang. Dense x retrieval: What retrieval granularity should we use? arXiv preprint
arXiv:2312.06648, 2023a.
Yuyan Chen, Qiang Fu, Yichen Yuan, Zhihao Wen, Ge Fan, Dayiheng Liu, Dongmei Zhang, Zhixu
Li, and Yanghua Xiao. Hallucination detection: Robustly discerning reliable answers in large
language models. In Proceedings of the 32nd ACM International Conference on Information and
Knowledge Management, pp. 245–255, 2023b.
Sangwoo Cho, Kaiqiang Song, Xiaoyang Wang, Fei Liu, and Dong Yu. Toward unifying text seg-
mentation and long document summarization. arXiv preprint arXiv:2210.16422, 2022.
SS Dragomir and CJ Goh. Some bounds on entropy measures in information theory. Applied
Mathematics Letters, 10(3):23–28, 1997.
André V Duarte, João Marques, Miguel Graça, Miguel Freire, Lei Li, and Arlindo L Oliveira.
Lumberchunker: Long-form narrative document segmentation. arXiv preprint arXiv:2406.17526,
2024.
Robert Friel, Masha Belyi, and Atindriyo Sanyal. Ragbench: Explainable benchmark for retrieval-
augmented generation systems. arXiv preprint arXiv:2407.11005, 2024.
Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, and
Haofen Wang. Retrieval-augmented generation for large language models: A survey. arXiv
preprint arXiv:2312.10997, 2023.
Maarten Grootendorst. Bertopic: Neural topic modeling with a class-based tf-idf procedure. arXiv
preprint arXiv:2203.05794, 2022.
Zhicheng Guo, Sijie Cheng, Yile Wang, Peng Li, and Yang Liu. Prompt-guided retrieval augmenta-
tion for non-knowledge-intensive tasks. arXiv preprint arXiv:2305.17653, 2023.
Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. Retrieval augmented
language model pre-training. In International conference on machine learning, pp. 3929–3938.
PMLR, 2020.
Hangfeng He, Hongming Zhang, and Dan Roth. Rethinking with retrieval: Faithful large language
model inference. arXiv preprint arXiv:2301.00303, 2022.
Thomas Hofmann et al. Probabilistic latent semantic analysis. In UAI, volume 99, pp. 289–296,
1999.
Chip Huyen. Evaluation metrics for language modeling. The Gradient, 40, 2019.
Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili
Qiu. Longllmlingua: Accelerating and enhancing llms in long context scenarios via prompt com-
pression. arXiv preprint arXiv:2310.06839, 2023.
Greg Kamradt. Semantic chunking. https://github.com/FullStackRetrieval-com/RetrievalTutorials,
2024.
P Kherwa and P Bansal. Topic modeling: A comprehensive review. EAI Endorsed Transactions on Scalable Information Systems, 7(24):1–16, 2020.
Youna Kim, Hyuhng Joon Kim, Cheonbok Park, Choonghyun Park, Hyunsoo Cho, Junyeob Kim,
Kang Min Yoo, Sang-goo Lee, and Taeuk Kim. Adaptive contrastive decoding in retrieval-
augmented generation for handling noisy contexts. arXiv preprint arXiv:2408.01084, 2024.
Langchain. https://github.com/langchain-ai/langchain, 2023.
Angeliki Lazaridou, Elena Gribovskaya, Wojciech Stokowiec, and Nikolai Grigorev. Internet-
augmented language models through few-shot prompting for open-domain question answering.
arXiv preprint arXiv:2203.05115, 2022.
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal,
Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented genera-
tion for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:
9459–9474, 2020.
Huayang Li, Yixuan Su, Deng Cai, Yan Wang, and Lemao Liu. A survey on retrieval-augmented
text generation. arXiv preprint arXiv:2202.01110, 2022.
Jing Li, Billy Chiu, Shuo Shang, and Ling Shao. Neural text segmentation and its application to
sentiment analysis. IEEE Transactions on Knowledge and Data Engineering, 34(2):828–842,
2020.
Xianzhi Li, Samuel Chan, Xiaodan Zhu, Yulong Pei, Zhiqiang Ma, Xiaomo Liu, and Sameena
Shah. Are chatgpt and gpt-4 general-purpose solvers for financial text analytics? a study on
several typical tasks. arXiv preprint arXiv:2305.05862, 2023.
Zhonghao Li, Xuming Hu, Aiwei Liu, Kening Zheng, Sirui Huang, and Hui Xiong. Refiner: Re-
structure retrieval content efficiently to advance question-answering capabilities. arXiv preprint
arXiv:2406.11357, 2024.
Xun Liang, Shichao Song, Zifan Zheng, Hanyu Wang, Qingchen Yu, Xunkai Li, Rong-Hua Li,
Feiyu Xiong, and Zhiyu Li. Internal consistency and self-feedback in large language models: A
survey. arXiv preprint arXiv:2407.14507, 2024.
Weizhe Lin, Rexhina Blloshmi, Bill Byrne, Adrià de Gispert, and Gonzalo Iglesias. Li-rage: Late
interaction retrieval augmented generation with explicit signals for open-domain table question
answering. In Proceedings of the 61st Annual Meeting of the Association for Computational
Linguistics (Volume 2: Short Papers), pp. 1557–1566, 2023.
S Longpre, G Yauney, E Reif, K Lee, A Roberts, B Zoph, D Zhou, J Wei, K Robinson, D Mimno, et al. A pretrainer's guide to training data: Measuring the effects of data age, domain coverage, quality, & toxicity, May 2023. URL http://arxiv.org/abs/2305.13169.
Michal Lukasik, Boris Dadachev, Gonçalo Simoes, and Kishore Papineni. Text segmentation by
cross segment attention. arXiv preprint arXiv:2004.14535, 2020.
Yuanjie Lyu, Zhiyu Li, Simin Niu, Feiyu Xiong, Bo Tang, Wenjin Wang, Hao Wu, Huanyong
Liu, Tong Xu, and Enhong Chen. Crud-rag: A comprehensive chinese benchmark for retrieval-
augmented generation of large language models. arXiv preprint arXiv:2401.17043, 2024.
Leland McInnes, John Healy, Steve Astels, et al. hdbscan: Hierarchical density based clustering. J.
Open Source Softw., 2(11):205, 2017.
Leland McInnes, John Healy, and James Melville. Umap: Uniform manifold approximation and
projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018.
Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, and
Yoav Shoham. In-context retrieval-augmented language models. Transactions of the Association
for Computational Linguistics, 11:1316–1331, 2023.
Claude E Shannon. Prediction and entropy of printed english. Bell system technical journal, 30(1):
50–64, 1951.
Xinyue Shen, Zeyuan Chen, Michael Backes, and Yang Zhang. In chatgpt we trust? measuring and
characterizing the reliability of chatgpt. arXiv preprint arXiv:2304.08979, 2023.
Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed H Chi, Nathanael
Schärli, and Denny Zhou. Large language models can be easily distracted by irrelevant context.
In International Conference on Machine Learning, pp. 31210–31227. PMLR, 2023.
Georgios Sidiropoulos and Evangelos Kanoulas. Analysing the robustness of dual encoders for
dense retrieval against misspellings. In Proceedings of the 45th International ACM SIGIR Con-
ference on Research and Development in Information Retrieval, pp. 2132–2136, 2022.
Devendra Singh, Siva Reddy, Will Hamilton, Chris Dyer, and Dani Yogatama. End-to-end training
of multi-document reader and retriever for open-domain question answering. Advances in Neural
Information Processing Systems, 34:25968–25981, 2021.
Chao-Hong Tan, Jia-Chen Gu, Chongyang Tao, Zhen-Hua Ling, Can Xu, Huang Hu, Xiubo Geng,
and Daxin Jiang. Tegtok: Augmenting text generation via task-specific and open-world knowl-
edge. arXiv preprint arXiv:2203.08517, 2022.
Y Tang and Y Yang. Multihop-rag: Benchmarking retrieval-augmented generation for multi-hop queries. arXiv preprint arXiv:2401.15391, 2024.
Shitao Xiao, Zheng Liu, Peitian Zhang, and Niklas Muennighoff. C-pack: Packaged resources to advance general chinese embedding. arXiv preprint arXiv:2309.07597, 2023.
Shicheng Xu, Liang Pang, Huawei Shen, and Xueqi Cheng. Berm: Training the balanced and
extractable representation for matching to improve generalization ability of dense retrieval. arXiv
preprint arXiv:2305.11052, 2023.
Shi-Qi Yan, Jia-Chen Gu, Yun Zhu, and Zhen-Hua Ling. Corrective retrieval augmented generation.
arXiv preprint arXiv:2401.15884, 2024.
Aiyuan Yang, Bin Xiao, Bingning Wang, Borong Zhang, Ce Bian, Chao Yin, Chenxu Lv, Da Pan,
Dian Wang, Dong Yan, et al. Baichuan 2: Open large-scale language models. arXiv preprint
arXiv:2309.10305, 2023.
An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li,
Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 technical report. arXiv preprint
arXiv:2407.10671, 2024.
Wenhao Yu, Hongming Zhang, Xiaoman Pan, Kaixin Ma, Hongwei Wang, and Dong Yu.
Chain-of-note: Enhancing robustness in retrieval-augmented language models. arXiv preprint
arXiv:2311.09210, 2023.
Qinglin Zhang, Qian Chen, Yali Li, Jiaqing Liu, and Wen Wang. Sequence model with self-adaptive
sliding window for efficient spoken document segmentation. In 2021 IEEE Automatic Speech
Recognition and Understanding Workshop (ASRU), pp. 411–418. IEEE, 2021.
Zifan Zheng, Yezhaohui Wang, Yuxin Huang, Shichao Song, Bo Tang, Feiyu Xiong, and Zhiyu Li.
Attention heads of large language models: A survey. arXiv preprint arXiv:2409.03752, 2024.
Ziyuan Zhuang, Zhiyang Zhang, Sitao Cheng, Fangkai Yang, Jia Liu, Shujian Huang, Qingwei Lin,
Saravan Rajmohan, Dongmei Zhang, and Qi Zhang. Efficientrag: Efficient retriever for multi-hop
question answering. arXiv preprint arXiv:2408.04259, 2024.
Guido Zuccon, Bevan Koopman, and Razia Shaik. Chatgpt hallucinates when attributing answers. In
Proceedings of the Annual International ACM SIGIR Conference on Research and Development
in Information Retrieval in the Asia Pacific Region, pp. 46–51, 2023.
A APPENDIX

A.1 THEORETICAL PROOF FOR PPL CHUNKING
Firstly, we illustrate the relationship between cross-entropy and the two distributions $P$ and $Q$ in another way. Based on the sequencing (rearrangement) inequality

$$\sum_{i=1}^{n} a_i b_i \;\geq\; \sum_{i=1}^{n} a_i b_{j(i)} \;\geq\; \sum_{i=1}^{n} a_i b_{n+1-i}$$

where $a_1 \geq a_2 \geq \cdots \geq a_n$, $b_1 \geq b_2 \geq \cdots \geq b_n$, and $(j(1), j(2), \ldots, j(n))$ is an arbitrary sorting of $(1, 2, \ldots, n)$, it can be observed that the sum of products of larger numbers paired together is the maximum, while the sum of products of larger numbers paired with smaller numbers is the
minimum. We desire the cross-entropy H(P, Q) to be as small as possible, which means that when
P (x) is relatively large, − log Q(x) should be relatively small, thereby resulting in Q(x) also being
relatively large. Therefore, a smaller cross-entropy indicates that the prediction is closer to the actual
label.
Afterwards, inspired by insights provided in Huyen (2019), a property of formula (8) is proved:
GK+1 ≤ GK for all K ≥ 1.
Proof.

$$\begin{aligned}
G_K - G_{K+1}
&= -\sum_{T_K} P(T_K) \log_a P(t_K \mid T_{K-1}) + \sum_{T_{K+1}} P(T_{K+1}) \log_a P(t_{K+1} \mid T_K) \\
&= \sum_{T_{K-1}} \Big[ \sum_{t_K, t_{K+1}} P(T_{K+1}) \log_a P(t_{K+1} \mid T_K) - \sum_{t_K} P(T_K) \log_a P(t_K \mid T_{K-1}) \Big] \\
&\geq \sum_{T_{K-1}} \Big[ \sum_{t_K, t_{K+1}} P(T_{K+1}) \log_a P(t_{K+1} \mid T_{K-1}) - \sum_{t_K} P(T_K) \log_a P(t_K \mid T_{K-1}) \Big] \\
&= \sum_{T_{K-1}} \Big[ \sum_{t_K, t_{K+1}} P(T_{K-1}, t_K, t_{K+1}) \log_a P(t_{K+1} \mid T_{K-1}) - \sum_{t_K} P(T_{K-1}, t_K) \log_a P(t_K \mid T_{K-1}) \Big] \\
&= \sum_{T_{K-1}} \Big[ \sum_{t_{K+1}} \log_a P(t_{K+1} \mid T_{K-1}) \sum_{t_K} P(T_{K-1}, t_K, t_{K+1}) - \sum_{t_K} P(T_{K-1}, t_K) \log_a P(t_K \mid T_{K-1}) \Big] \\
&= \sum_{T_{K-1}} \Big[ \sum_{t_{K+1}} P(T_{K-1}, t_{K+1}) \log_a P(t_{K+1} \mid T_{K-1}) - \sum_{t_K} P(T_{K-1}, t_K) \log_a P(t_K \mid T_{K-1}) \Big] \\
&= 0
\end{aligned}$$
The reason for the last equality is that tk+1 and tk belong to the same domain. Thus, the proof is
complete.
Proof. Let $P$ be a discrete random variable with a finite range of values denoted by $W := \{w_1, w_2, \ldots, w_l\}$. Set $p_i = P\{P = w_i\}$ for $i = 1, 2, \ldots, l$, and assume that $p_i > 0$ for all $i \in \{1, 2, \ldots, l\}$. According to Lemma 2 in Dragomir & Goh (1997), if

$$\gamma := \max_{i,j} \frac{\theta_i}{\theta_j} \leq \varphi(\varepsilon) := 1 + \varepsilon \ln c + \sqrt{\varepsilon \ln c\,(\varepsilon \ln c + 2)}$$

then

$$0 \leq \log_c\Big(\sum_{k=1}^{l} p_k \theta_k\Big) - \sum_{k=1}^{l} p_k \log_c \theta_k \leq \varepsilon$$

where $\theta_k \in (0, +\infty)$, $p_k \geq 0$ with $\sum_{k=1}^{l} p_k = 1$, and $c > 1$. Given that $\theta_k = 1/p_k$, the aforementioned inequality can be transformed into

$$0 \leq \log_c l - H_c(P) \leq \varepsilon$$

where $\varepsilon > 0$ satisfies the condition

$$\max_{i,j} \frac{p_i}{p_j} \leq \varphi(\varepsilon)$$

Furthermore, we can derive bounds for the entropy as $\log_c l - \varepsilon \leq H_c(P) \leq \log_c l$. The proof is concluded.
All language models utilized in this paper employ the chat or instruct versions where multiple ver-
sions exist, and are loaded in full precision (Float32). When chunking, the default parameter con-
figurations of the models are adopted. For evaluation, Qwen2-7B is employed with the following
settings: top_p = 0.9, top_k = 5, temperature = 0.1, and max_new_tokens = 1280. The vector database is constructed using Milvus, where the embedding model for English texts is bge-large-en-v1.5, and bge-base-zh-v1.5 for Chinese texts. When conducting QA, the system performs dense retrieval from the vector database, with top_k set to 8 for CRUD and RAGBench, 10 for MultiHop-RAG, and 5 for LongBench.
In experiments, we utilized a total of four benchmarks, and their specific configurations are detailed
as follows:
(a) Rule-based Chunking Methods
• Original: This method divides long texts into segments of a fixed length, such as two
hundred Chinese characters or words, without considering sentence boundaries.
• Llama index (Langchain, 2023): This method considers both sentence completeness and token counts during segmentation. It prioritizes maintaining sentence boundaries while ensuring that the number of tokens in each segment is close to a preset threshold. We use the SimpleNodeParser function from Llama index, adjusting the chunk_size parameter to control segment length. Overlaps are handled by dynamically overlapping segments using the chunk_overlap parameter, ensuring sentence completeness during segmentation and overlapping.
(b) Dynamic Chunking Methods
• Similarity Chunking (Xiao et al., 2023): Utilizes pre-trained sentence embedding mod-
els to calculate the cosine similarity between sentences. By setting a similarity thresh-
old, sentences with lower similarity are selected as segmentation points, ensuring that
sentences within each chunk are highly semantically related. This method employs the
SemanticSplitterNodeParser from Llama index. For English texts, we ex-
ploit the bge-large-en-v1.5 model, and for Chinese texts, the bge-base-zh-v1.5 model. The
size of the text chunks is controlled by adjusting the similarity threshold.
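A sketch of how the two Llama index baselines above can be invoked is shown below; import paths and some parameter names differ across llama_index releases, so treat the exact API (shown roughly as in the 0.9.x line) as an assumption to adapt.

```python
from llama_index import Document
from llama_index.node_parser import SimpleNodeParser, SemanticSplitterNodeParser
from llama_index.embeddings import HuggingFaceEmbedding

text = open("long_document.txt", encoding="utf-8").read()
docs = [Document(text=text)]

# Rule-based baseline: keep sentence boundaries while targeting a preset token budget.
rule_parser = SimpleNodeParser.from_defaults(chunk_size=256, chunk_overlap=50)
rule_nodes = rule_parser.get_nodes_from_documents(docs)

# Similarity baseline: cut where the embedding similarity between adjacent sentences drops.
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-large-en-v1.5")
sem_parser = SemanticSplitterNodeParser(
    embed_model=embed_model, breakpoint_percentile_threshold=95
)
sem_nodes = sem_parser.get_nodes_from_documents(docs)
```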
Table 3: Chunk length and corresponding threshold settings for different methods. - indicates no
relevant setting is involved. The first four datasets are sourced from LongBench. 0+comb. signifies
that an initial chunking is performed using a threshold of 0, followed by a dynamic combination
approach to derive the final chunks. In Llama index and Qwen2-72B, a(b) indicates that the chunk
size of a can be achieved by setting the chunking parameter to b. For other instances of a(b), it
represents the dynamic combination of chunks where setting the combination length to b results in
a final chunk size of a.
In the Margin Sampling Chunking method, we also use a prompt, which mainly consists of two parts:
instructions for guiding LLMs to perform chunking and two segmentation schemes. The specific
form is shown in Table 5.
In this experiment, we selected three QA datasets from the CRUD benchmark. Among them, the
single-hop QA dataset consists of questions focused on extracting factual information from a single
document. These questions typically require precise retrieval of specific details such as dates, in-
dividuals, or events from the provided text. The two-hop QA dataset, on the other hand, evaluates
integration capabilities and understanding of informational relationships between different docu-
ments. The more complex three-hop QA dataset often presents more intricate questions, demanding
LLMs to process a greater number of information sources to formulate a complete and accurate
response.
Before the chunking phase, we collected original news articles used in all types of QA tasks in
CRUD. Specifically, since CRUD provides evidence context snippets relied on by each QA pair, as
well as the original news library where the context snippets are extracted, we can obtain the original
news articles containing the context snippets through sentence matching. Taking the two-hop QA as
an example, CRUD provides two news snippets, news1 and news2, which are necessary to answer
questions. We then save the matched original news articles, matched_news1 and matched_news2, that
contain news1 and news2. Finally, from the original news library of 80,000 articles, we recall all
10,000 news articles containing context snippets as the initial text for chunking.
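The recall of source articles described above amounts to verbatim substring matching, sketched below under the assumption that every evidence snippet appears unchanged in the articles it was extracted from.

```python
def recall_source_articles(news_library, evidence_snippets):
    """Return every article that contains at least one evidence snippet verbatim."""
    matched = []
    for article in news_library:                     # e.g. the 80,000-article news library
        if any(snippet in article for snippet in evidence_snippets):
            matched.append(article)
    return matched
```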
We conducted two sets of experiments with overlapping and non-overlapping chunking on the
CRUD dataset, respectively in Section 4.2.1 and 4.2.2. The chunk length and overlap length are
shown in Table 6. Additionally, the specific values for the bar chart presented in Figure 3 are de-
tailed in Table 7.
Further analysis demonstrates that in the single-hop and two-hop query scenarios presented in Table 7, PPL Chunking achieved significant performance improvements over traditional chunking methods on the BLEU series metrics and ROUGE-L. This indicates that our methods enhance the accuracy and fluency of the generated text relative to the reference text. However, the relatively smaller margin
of improvement observed on the BERTScore, a BERT-based semantic similarity evaluation metric,
may reflect a lower sensitivity of deep semantic understanding to chunking, as well as the limitations
of the current BERTScore models in capturing precise semantics.
Finally, for three-hop query, although the performance of Qwen2-1.5B and Qwen2-7B using PPL
Chunking was slightly lower than traditional methods, Baichuan2-7B performed comparably. How-
ever, when chunk overlap is introduced, the PPL Chunking method exhibits positive changes (as
shown in Table 2). This suggests that the effectiveness of segmentation strategies may be jointly
influenced by query complexity and text characteristics.
Table 6: Settings of overlap length and chunk length for different chunking methods in the CRUD
dataset. ppl represents direct PPL Chunking, with a threshold of 0.5. comb. indicates PPL Chunking
with dynamic combination, with a threshold of 0 when performing PPL Chunking.
Chunking Method Overlap Length Chunk Length
Chunking with Overlap
Original 50 218
Llama index 48.78 217.03
Qwen2-1.5Bppl 49.97 212.79
Qwen2-7Bppl 50.41 217.53
Baichuan2-7Bppl 48.91 201.35
Chunking without Overlap
Original 0 179
Llama index 0 177.53
Qwen2-1.5Bppl 0 173.88
Qwen2-7Bppl 0 178.59
Baichuan2-7Bppl 0 162.56
Qwen2-1.5Bcomb. 0 177.95
Qwen2-7Bcomb. 0 178.09
Baichuan2-7Bcomb. 0 178.09
The corresponding chunk length and threshold settings are shown in Tables 8 and 9. Additionally, the specific values presented in Figures 4 and 5 correspond
to Tables 10 and 11.
According to Table 9, it can be observed that HotpotQA, MuSiQue, and DuReader achieve a suit-
able chunk length with a lower threshold, while NarrativeQA only reaches it when the threshold
is set to 1.34. This indicates that the PPL distribution of the first three datasets is relatively flat with
small oscillations, whereas NarrativeQA exhibits significant fluctuations. Considering the chunk-
ing performance presented in Table 11, it suggests that direct PPL Chunking is more suitable when
chunk length is small, while the combination of PPL Chunking and dynamic merging is preferable
for larger chunk lengths. Furthermore, regarding the approach of PPL Chunking with dynamic com-
bination, it is more appropriate to select a smaller threshold when the PPL amplitude is small, and a
larger threshold when the PPL amplitude is significant.
Tables 12 and 13 present the chunk lengths that need to be set for Figure 6 and the specific values used for plotting, respectively. In this batch of experiments, we first retrieve 10 relevant text chunks for each question through a dense retriever, and then apply various re-ranking methods for secondary sorting to analyze changes in recall performance.
Table 7: Performance of different methods on the CRUD QA dataset. ppl represents direct PPL
Chunking, with a threshold of 0.5. comb. indicates PPL Chunking with dynamic combination, with
a threshold of 0 when performing PPL Chunking.
Chunking Method BLEU-1 BLEU-2 BLEU-3 BLEU-4 BLEU-Avg ROUGE-L BERTScore
Single-hop Query
Original 0.3515 0.2788 0.2340 0.1997 0.2548 0.4213 0.8489
Llama index 0.3620 0.2920 0.2480 0.2134 0.2682 0.4326 0.8521
Qwen2-1.5Bppl 0.3714 0.3013 0.2569 0.2223 0.2778 0.4426 0.8563
Qwen2-7Bppl 0.3661 0.2935 0.2481 0.2127 0.2691 0.4379 0.8558
Baichuan2-7Bppl 0.3725 0.3011 0.2558 0.2207 0.2772 0.4429 0.8562
Qwen2-1.5Bcomb. 0.3760 0.3034 0.2577 0.2224 0.2797 0.4443 0.8586
Qwen2-7Bcomb. 0.3724 0.3012 0.2561 0.2206 0.2774 0.4445 0.8584
Baichuan2-7Bcomb. 0.3812 0.3091 0.2622 0.2259 0.2840 0.4494 0.8603
Two-hop Query
Original 0.2322 0.1324 0.0919 0.0695 0.1133 0.2613 0.8768
Llama index 0.2315 0.1321 0.0923 0.0697 0.1133 0.2585 0.8762
Qwen2-1.5Bppl 0.2328 0.1326 0.0918 0.0694 0.1133 0.2611 0.8749
Qwen2-7Bppl 0.2310 0.1323 0.0916 0.0691 0.1124 0.2597 0.8752
Baichuan2-7Bppl 0.2350 0.1341 0.0924 0.0695 0.1141 0.2637 0.8772
Qwen2-1.5Bcomb. 0.2372 0.1363 0.0950 0.0722 0.1164 0.2658 0.8743
Qwen2-7Bcomb. 0.2364 0.1360 0.0945 0.0713 0.1161 0.2661 0.8761
Baichuan2-7Bcomb. 0.2325 0.1329 0.0917 0.0689 0.1133 0.2623 0.8754
Three-hop Query
Original 0.2494 0.1317 0.0869 0.0636 0.1110 0.2595 0.8827
Llama index 0.2464 0.1327 0.0883 0.0644 0.1120 0.2596 0.8840
Qwen2-1.5Bppl 0.2402 0.1260 0.0827 0.0596 0.1054 0.2531 0.8802
Qwen2-7Bppl 0.2415 0.1266 0.0828 0.0597 0.1058 0.2549 0.8816
Baichuan2-7Bppl 0.2460 0.1293 0.0851 0.0615 0.1084 0.2568 0.8828
Qwen2-1.5Bcomb. 0.2449 0.1294 0.0855 0.0624 0.1086 0.2566 0.8828
Qwen2-7Bcomb. 0.2408 0.1274 0.0837 0.0610 0.1068 0.2551 0.8825
Baichuan2-7Bcomb. 0.2494 0.1324 0.0870 0.0632 0.1111 0.2613 0.8832
Table 8: Settings of overlap length and chunk length for different chunking methods in the CUAD
dataset. ppl represents direct PPL Chunking, with a threshold of 0.
Chunking Method Overlap Length Chunk Length
Original 0 98.00
Llama index 0 98.49
Qwen2-1.5Bppl 0 97.70
Qwen2-7Bppl 0 96.08
Baichuan2-7Bppl 0 97.59
Table 9: Chunk length and corresponding threshold settings for different chunking methods in four
long-text QA datasets of LongBench. - indicates no relevant setting. In Llama index, a(b) represents
that a chunk length of a can be obtained by setting the chunking parameter to b. The remaining a(b)
indicates that a final chunk length of a is obtained by setting the combination length to b.
Dataset HotpotQA MuSiQue NarrativeQA DuReader
Chunking Method Length Threshold Length Threshold Length Threshold Length Threshold
Original 87 - 90 - 71 - 262 -
Llama index 86.73(154) - 89.94(157) - 70.35(139) - 262.06(330) -
Qwen2-1.5Bppl 86.72 0.5 89.51 0.5 70.28 1.34 261.41 0.5
Qwen2-1.5Bcomb. 86.80(98) 0+comb. 89.59(103) 0+comb. 70.32(82) 0+comb. 261.34(213) 0+comb.
Qwen2-1.5Bcomb. 86.52(96) 0.1+comb. 89.60(100) 0.1+comb. 70.47(82) 0.1+comb. 261.98(200) 0.1+comb.
Qwen2-1.5Bcomb. 86.58(92) 0.2+comb. 89.75(96) 0.2+comb. 70.17(81) 0.2+comb. 261.92(189) 0.2+comb.
Qwen2-1.5Bcomb. 86.77(85) 0.3+comb. 89.60(88) 0.3+comb. 70.19(79) 0.3+comb. 261.06(170) 0.3+comb.
Qwen2-1.5Bcomb. 86.81(70) 0.4+comb. 89.68(75) 0.4+comb. 70.66(78) 0.4+comb. 261.48(140) 0.4+comb.
Table 10: Performance of different methods on CUAD QA datasets. ppl indicates direct PPL Chunk-
ing, with a threshold of 0.
Chunking Method BLEU-1 BLEU-2 BLEU-3 BLEU-4 BLEU-Avg ROUGE-L BERTScore
Original 0.6845 0.4496 0.2997 0.1798 0.3513 0.4217 0.8043
Llama index 0.6966 0.4573 0.3006 0.1730 0.3493 0.4137 0.8001
Qwen2-1.5Bppl 0.7098 0.4722 0.3180 0.1932 0.3677 0.4060 0.8006
Qwen2-7Bppl 0.7038 0.4670 0.3143 0.1911 0.3638 0.4070 0.8018
Baichuan2-7Bppl 0.7195 0.4738 0.3160 0.1884 0.3665 0.4111 0.8025
Table 11: Performance of different methods in four long-text QA datasets of LongBench. ppl rep-
resents direct PPL Chunking, and comb. indicates PPL Chunking with dynamic combination. Multi
represents threshold values of the parallel method in four datasets, which are 0.5, 0.5, 1.34, and 0.5
respectively, resulting in chunk lengths of 87, 90, 71, and 262 in sequence.
Dataset HotpotQA MuSiQue NarrativeQA DuReader
Chunking Method
Threshold F1 F1 F1 ROUGE-L
Original - 15.79 7.21 5.72 20.69
Llama index - 15.72 8.19 5.03 21.41
Qwen2-1.5Bppl Multi 17.74 8.39 6.12 20.77
Qwen2-1.5Bcomb. 0 17.47 8.08 4.93 20.77
Qwen2-1.5Bcomb. 0.1 17.19 7.48 4.91 20.33
Qwen2-1.5Bcomb. 0.2 17.70 7.31 5.20 20.95
Qwen2-1.5Bcomb. 0.3 17.46 7.92 5.08 21.22
Qwen2-1.5Bcomb. 0.4 16.44 8.05 5.80 21.65
Table 12: Chunk length and its corresponding threshold settings when exploring the impact of
chunking on re-ranking. - indicates no relevant setting.
Chunking and Re-ranking Chunk Length Threshold
Original 78 -
Original and BgeRerank 78 -
Original and PPLRerank 78 -
Qwen2-1.5Bppl 77.60 0.5
Qwen2-1.5Bppl and BgeRerank 77.60 0.5
Qwen2-1.5Bppl and PPLRerank 77.60 0.5
Table 13: Performance of re-ranking strategies combined with different chunking methods in the
MultiHop-RAG benchmark. ppl represents direct PPL Chunking, with a threshold of 0.5.
Chunking and Re-ranking Hits@8 Hits@6 Hits@4 Hits@2 MAP@10 MRR@10
Original 0.5627 0.5180 0.4523 0.3499 0.1512 0.3507
Original and BgeRerank 0.5818 0.5406 0.4741 0.3379 0.1486 0.3391
Original and PPLRerank 0.5769 0.5521 0.5055 0.4102 0.1849 0.4147
Qwen2-1.5Bppl 0.6838 0.6244 0.5503 0.4151 0.1954 0.4195
Qwen2-1.5Bppl and BgeRerank 0.6927 0.6435 0.5721 0.4381 0.2075 0.4413
Qwen2-1.5Bppl and PPLRerank 0.7197 0.6931 0.6568 0.5721 0.2590 0.5558