The Design of an LLM-powered Unstructured Analytics System
Eric Anderson, Jonathan Fritz, Austin Lee, Bohou Li, Mark Lindblad, Henry Lindeman, Alex Meyer,
Parthkumar Parmar, Tanvi Ranade, Mehul A. Shah, Benjamin Sowell, Dan Tecuci, Vinayak Thapliyal,
Matt Welsh
Aryn, Inc.
ABSTRACT
LLMs demonstrate an uncanny ability to process unstructured data, and as such, have the potential to go beyond search and run complex, semantic analyses at scale. We describe the design of an unstructured analytics system, Aryn, and the tenets and use cases that motivate its design. With Aryn, users specify queries in natural language and the system automatically determines a semantic plan and executes it to compute an answer from a large collection of unstructured documents. At the core of Aryn is Sycamore, a declarative document processing engine that provides a reliable distributed abstraction called DocSets. Sycamore allows users to analyze, enrich, and transform complex documents at scale. Aryn includes Luna, a query planner that translates natural language queries to Sycamore scripts, and DocParse, which takes raw PDFs and document images and converts them to DocSets for downstream processing. We show how these pieces come together to achieve better accuracy than RAG on analytics queries over real-world reports from the National Transportation Safety Board (NTSB). Also, given current limitations of LLMs, we argue that an analytics system must provide explainability to be practical, and show how Aryn's user interface does this to help build trust.

1 INTRODUCTION
Large language models have inspired the imagination of industry, and companies are starting to use LLMs for product search, customer support chatbots, code co-pilots, and application assistants. In enterprise settings, accuracy is paramount. To limit hallucinations, most of these applications are backed by semantic search architectures that answer queries based on data retrieved from a knowledge base, using techniques such as retrieval-augmented generation (RAG) [13].

Still, enterprises want to go beyond RAG and run semantic analyses that require complex reasoning across large repositories of unstructured documents. For example, financial services companies want to analyze research reports, earnings calls, and presentations to understand market trends and discover investment opportunities. Consumer goods firms want to improve their marketing strategies by analyzing interview transcripts to understand sentiment towards brands. In legal firms, investigators want to analyze legal case summaries to discover precedents for rule infringement and the actions taken across a broad set of companies and cases.

In addition to simple "hunt and peck" queries, for which RAG is tailored, these analyses often require "sweep and harvest" patterns. An example is a query like "What is the yearly revenue growth and outlook of companies whose CEO recently changed?" For this, one needs to sweep through large document collections, perform a mix of natural-language semantic operations (e.g., filter, extract, or summarize information) and structured operations (e.g., select, project, or aggregate), and then synthesize an answer. Going a step further, we see "data integration" patterns where users want to combine information from multiple collections or sources. For example, "list the fastest-growing companies in the BNPL market and their competitors," where the competitive information may involve a lookup in a database in addition to a sweep-and-harvest phase to gather the top companies. We also expect complex compositions of these patterns to become prevalent.

Aryn is an unstructured analytics platform, powered by LLMs, that is designed to answer these types of queries. We take inspiration from relational databases, from which we borrow the principles of declarative query processing. With Aryn, users specify what they want to ask in natural language, and the system automatically constructs a plan (the how) and executes it to compute the answer from unstructured data.

Aryn consists of several components (see Figure 1). The natural-language query planner, Luna, uses LLMs to translate queries to semantic query plans with a mix of structured and LLM-based semantic operators. Query plans are compiled to Sycamore, a document processing engine used both for ETL and query processing. Sycamore is built around DocSets, a reliable abstraction similar to Apache Spark DataFrames, but for hierarchical documents. Finally, DocParse uses vision models to convert complex documents with text, tables, and images into DocSets for downstream processing.

The main challenge for an analytics system built largely on AI is to give answers that are accurate and trustworthy. We use LLMs and vision models for different purposes throughout our stack and carefully compose them to provide answers to complex questions. Unfortunately, LLMs are inherently imprecise, making LLM output difficult to verify.

Aryn's database-inspired approach addresses this challenge in multiple ways. First, our experience working with customers has shown how essential ETL is for achieving good quality for both RAG and analytics use cases. By performing high-quality parsing and metadata extraction, we can provide the LLM with the context necessary to reduce the likelihood of hallucinations. Second, by dynamically breaking down complex questions into query plans composed of simple LLM-based operations, we can make the query plan as a whole more reliable than RAG-based approaches. Third, Aryn exposes the query plan and data lineage to users for better explainability. A key component of Aryn is its conversational user interface.
queryDatabase        Scans documents from an index based on keyword search over the element content or filters over the properties.
map, filter, flatMap Transforms documents using standard functional operators.
partition            Parses a document using DocParse.
explode              Unnests each element and makes it a top-level document.
reduceByKey          Standard reduce operation that can be used for a variety of grouping and aggregations on properties on the documents.
write                Writes a DocSet to a database.

(a) Structured operators in Sycamore

schema = {
    "us_state": "string",
    "probable_cause": "string",
    "weather_related": "bool"
}

ds = context.read.binary("/path/to/ntsb_data")
    .partition(DocParse())
    .llmExtract(
        OpenAIPropertyExtractor("gpt-4o", schema=schema))
    .explode()
    .embed(OpenAIEmbedder("text-embedding-3-small"))

Figure 4: Sample Sycamore script.
We still prefer to separate them out because they can behave very differently in practice, as LLMs are inherently non-deterministic and users often want to manually inspect the results of semantic operations. Sycamore supports a variety of LLMs, including those from OpenAI and Anthropic, and open-source models like Llama.

The code in Figure 4, which elides a few configuration parameters for readability, is an example of processing NTSB incident report documents using Sycamore. The code partitions documents using DocParse, described in Section 4. It then executes the llmExtract transform, which takes a JSON schema and attempts to extract those fields from each document using an LLM. As shown in Figure 5, this approach correctly extracts the state abbreviation and other fields from the document. Next, we use explode to break each document into a collection of document chunks, and then we generate an embedding vector for each chunk. At this point the DocSet is ready to be loaded into a database like OpenSearch for later querying (using write).
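Figure 5 itself is not reproduced in this excerpt, but the properties that llmExtract adds to a document would look something like the following sketch (the values shown are hypothetical):

{
  "us_state": "WA",
  "probable_cause": "The pilot's failure to maintain adequate airspeed, which resulted in an aerodynamic stall.",
  "weather_related": false
}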
Finally, queryDatabase and queryVectorDatabase support reading previously loaded DocSets from a data store. The queryDatabase operator is analogous to a standard database scan operator, and supports filters on the metadata as well as keyword search (depending on the capabilities of the data store). The queryVectorDatabase operator, in addition to those, also supports semantic search (i.e., vector similarity search) over the chunks. While indexing is done on chunks, Sycamore reassembles these chunks into documents before passing them to downstream operators.
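To make the read path concrete, here is a minimal sketch of a query-side pipeline. The entry point and parameters (context.queryVectorDatabase and its arguments) are illustrative assumptions rather than Sycamore's exact API; the behavior mirrors the operators described above.

# Hedged sketch of the query side; the signature below is an assumption
# for illustration, not Sycamore's exact API.
ds = (
    context.queryVectorDatabase(
        index="ntsb_reports",          # a previously loaded DocSet
        query="loss of engine power",  # semantic (vector) search over chunks
        filter={"us_state": "WA"},     # property-based filtering
    )
    # Matching chunks are reassembled into full documents before they
    # reach downstream operators.
)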
5.3 Execution
Sycamore adopts a Spark-like execution model where operations are pipelined and executed only when materialization is required. To assist with debugging and avoid redundant execution, Sycamore also supports a flexible materialize operation that can save the output of intermediate transformations to memory, disk, or cloud storage. Sycamore is built on top of the Ray compute framework [22], which provides primitives for running distributed Python-based dataflow workloads. We chose Ray because it is based on Python, which has become the language of choice for machine learning applications, and because it is well-integrated with existing ML libraries.
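As an illustration, the Figure 4 pipeline could checkpoint its parsed output before the LLM-based steps. The materialize argument shown is an assumption about the signature, not the exact API:

ds = (
    context.read.binary("/path/to/ntsb_data")
    .partition(DocParse())
    # Checkpoint parsed documents to disk so later runs and debugging
    # sessions can skip re-parsing (the path parameter is assumed).
    .materialize(path="/tmp/ntsb_parsed")
    .llmExtract(
        OpenAIPropertyExtractor("gpt-4o", schema=schema))
    .explode()
    .embed(OpenAIEmbedder("text-embedding-3-small"))
)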
6 LUNA
A hallmark of relational databases is declarative query processing, which hides the low-level details of how queries are executed and makes it easier for application developers to adapt to changing workloads and scale. LLMs make it possible to leverage declarative query processing for natural language queries over complex, unstructured data. We call this LLM-powered unstructured analytics, or Luna for short.

More specifically, Luna converts a natural language query into a query plan that runs over DocSets and returns either raw tabular results or natural language answers. Query plans are executed using Sycamore's DocSet operators. To aid explainability, Luna exposes the logical query plan, data lineage, and execution history, and allows users to modify any part of the plan to better align with their intention. The remainder of this section describes the system in detail.
6.1 Luna Architecture
Luna consists of a number of pieces that work together to provide an end-to-end natural-language query processing system over complex, unstructured data.

Data Inputs and Schema. Luna shares the Sycamore data model and executes queries against one or more DocSets that have been indexed in a database. During query planning, we provide the planner with the schema of each DocSet, which consists of the properties contained in the documents, with their data types and sample values, along with a special "text-representation" field representing the entire contents of each Document. The schema of DocSets can evolve over time, based on new semantic relationships discovered in the data, potentially driven by the query workload.

While Sycamore represents documents hierarchically with elements corresponding to document chunks, we found it more effective to hide this from the planner and always provide a schema for complete documents. The Sycamore engine handles splitting documents into chunks that fit into the context window of the LLM used for embedding, and reconstructing the full document during queries.

In our implementation, we primarily use OpenSearch for storing and querying DocSets, though other data management systems can be used as long as they support both "keyword" and "semantic search" (i.e., vector similarity queries) and basic filtering by properties.
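As a sketch of what the planner receives for a DocSet, the serialized schema might look like the following; the serialization format here is ours for illustration, with fields drawn from the Figure 4 schema and hypothetical descriptions and sample values:

{
  "us_state": {
    "type": "string",
    "description": "US state where the incident occurred",
    "examples": ["WA", "CA", "TX"]
  },
  "weather_related": {
    "type": "bool",
    "description": "Whether weather contributed to the incident",
    "examples": [true, false]
  },
  "text_representation": {
    "type": "string",
    "description": "The entire text contents of the document"
  }
}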
Logical Query Operators. Luna uses an LLM for interpreting a natural language user query. We initially provided the LLM the complete list of physical operators as part of the prompt. However, in our experiments with several real-world datasets and query workloads, we found that this approach does not work well for complex and exploratory analysis queries like: "Analyze maintenance-related incidents by grouping those by aircraft type and maintenance interval to find patterns of recurring issues." In particular, we found it difficult to get the LLM to use grouping operations like reduceByKey effectively, and the plans generated would often run into context window size limitations.

Instead, we decided to differentiate between logical and physical operators with respect to query planning and execution. Luna provides a simpler set of high-level logical operators to the LLM for query planning purposes, and rewrites the resulting logical plan into physical operators before execution. This also makes it easier for the user to understand the plan and debug the execution.

Many simple logical operators map one-to-one to physical Sycamore operators, including single-pass per-document operations like map, filter, and llmExtract, but for operations that span multiple documents, we have found it often works better to have more specific operators rather than low-level primitives. For example, the following logical operators are exposed to the Luna planner:
• groupByAggregate: Performs a database-style group-by and aggregation.
• llmCluster: Clusters documents using k-means based on semantic similarity of one or more fields.
• llmGenerate: Summarizes one or more documents based on a prompt. This is analogous to the "G" in "RAG" and is often used at the end of a plan.

Each of these operators can be implemented in terms of the existing Sycamore physical operators. For instance, groupByAggregate and llmCluster can be implemented with a combination of map and reduce operations, but we see better results from the planner when we keep them as separate operators.
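As a toy illustration of this lowering, written in plain Python rather than against Sycamore's actual operator implementations, a groupByAggregate over an extracted property can be expressed as a map to key/value pairs followed by a keyed reduce:

from collections import defaultdict

# Toy sketch: lowering a logical groupByAggregate onto map/reduceByKey-style
# primitives (sum of injuries per state).
docs = [
    {"us_state": "WA", "injuries": 2},
    {"us_state": "WA", "injuries": 0},
    {"us_state": "CA", "injuries": 1},
]

# map: emit a (group key, value) pair per document
pairs = [(d["us_state"], d["injuries"]) for d in docs]

# reduceByKey: fold the values for each key (here, a sum aggregate)
totals = defaultdict(int)
for key, value in pairs:
    totals[key] += value

print(dict(totals))  # {'WA': 2, 'CA': 1}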
Query Planning. Luna uses an LLM to interpret a natural language query and decompose it to a DAG of logical query operators. After significant experimentation, we found that including the following information in the prompt helps provide the LLM with the right context:
• The schema for the input DocSet. For each schema field, we include a short description as well as a few example values drawn from the underlying data.
• A list of available logical operators and their syntax.
• A list of example queries and their associated query plans.

We instruct the LLM to generate the plan in JSON format, which we validate against a schema to ensure that it conforms to the expected syntax. In addition to confirming that the query plan is syntactically correct, we also check that it is semantically valid. For example, if a QueryDatabase operation performs field-based filtering, we check that the fields used in the filter are valid for the given DocSet.

Plan Rewriting and Optimization. Despite significant prompt engineering, the LLM may still produce a suboptimal or, in some cases, an incorrect or infeasible query plan. We use a combination of plan rewriting and rule-based optimization to address these issues. For example, if the plan has multiple llmExtract operators in sequence, these can be combined into a single operator.

Execution. After plan rewriting and optimization, the query plan is compiled into Sycamore code in Python. Execution on large datasets benefits from distributed processing, and using Sycamore's distributed execution mode allows us to scale out workloads with minimal overhead. The compiled query execution code in Sycamore is easy for a technically savvy user to understand and modify (in the UI itself).

Traceability and debugging. The ambiguous nature of some queries can result in Luna misinterpreting the user's intention. It is critical to allow the user to inspect the query execution trace and provide feedback to correct the system. With a combination of logging and APIs that allow the user to modify any stage of query execution, users have full control over how their query is answered.

6.2 User Interface and Verifiability
Luna's user interface, shown in Figure 6, is designed to make it easy for users to verify the results from the system. Luna achieves this by: (a) exposing the query plan, (b) allowing the user to inspect intermediate results, and (c) allowing the user to ask follow-up questions to guide the system.

Figure 6: The Luna user interface shows the query result visually, allows the user to inspect the generated query plan, and lets them drill down to individual documents if needed.

Luna exposes the plan generated from a user query as a simple JSON object. This allows a user to understand the exact operations that were performed to answer a query, how the dataset was transformed during each operation, and to modify any part of the plan to better align with their intention. Given the query "Get the latitude and longitude of all incidents in 2023 involving Cessna aircraft," we can inspect the plan that Luna generates and modify it if it does not match our intent.
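The generated plan itself is not reproduced here, but a hedged sketch of what such a JSON plan might look like follows. The operator names correspond to operators discussed in this paper, while the plan schema and field names are assumptions for illustration:

{
  "query": "Get the latitude and longitude of all incidents in 2023 involving Cessna aircraft",
  "nodes": [
    {
      "id": 0,
      "operator": "QueryDatabase",
      "index": "ntsb_reports",
      "filter": {"year": 2023}
    },
    {
      "id": 1,
      "operator": "LlmFilter",
      "inputs": [0],
      "question": "Does this incident involve a Cessna aircraft?"
    },
    {
      "id": 2,
      "operator": "LlmExtract",
      "inputs": [1],
      "fields": ["latitude", "longitude"]
    }
  ]
}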
7 EVALUATION
We present a preliminary evaluation of Luna's ability to answer complex analytical questions over a dataset of incident reports from the National Transportation Safety Board, which is the US-based agency responsible for investigating civil transportation accidents. Our test dataset consists of 100 PDF reports pulled from the NTSB CAROL database (https://carol.ntsb.gov/) covering aviation incidents between June and September 2024. Each file is between 4 and 7 pages of text, with sections covering a summary of the incident, probable cause and findings, factual information, and administrative information. Incident reports have multiple tables covering aspects such as the pilot's background, aircraft and operator details, meteorological information, wreckage, and injuries. Many of the documents contain photographs of the accident site or maps of the flight trajectory.
We processed the NTSB reports using a Sycamore pipeline. The pipeline starts by calling DocParse to parse each document as described in Section 4, and then uses the llmExtract transform to extract key data from each document. We load the resulting schema, shown in Table 3, into an OpenSearch index. We also chunk and embed the text content of the incident reports, and the resulting vectors are also stored in OpenSearch for use with vector search operations. Throughout this evaluation we used OpenAI's gpt-4o model for our LLM, all-MiniLM-L6-v2 for the embeddings, and OpenSearch 2.17.
7.1 Benchmark questions
There does not exist a standard benchmark for document analytics against this type of dataset. Through manual inspection, we derived a set of 30 questions that represent a broad range of query types and varying degrees of difficulty to answer. Some examples of the benchmark questions include:
• How many incidents were there by state?
• What fraction of incidents that resulted in substantial damage were due to engine problems?
• In incidents involving Piper aircraft, what was the most commonly damaged part of the aircraft?
• Which incidents occurred in July involving birds?

A few of the benchmark questions can be answered more or less directly by querying the extracted metadata shown in Table 3. However, in most cases, the benchmark questions refer to information not explicitly captured in the schema, such as whether an incident involved birds or engine problems. For these cases, Luna needs to use a combination of metadata lookup and LLM-based extraction or filtering based on the documents' textual content.

Many of our benchmark questions would be difficult, or impossible, for a RAG-based system to answer, given that the information required to answer the question is spread across multiple portions of each document, and a vector search would not be expected to return meaningful chunks of context for downstream analysis by the LLM.
7.2 Results
We ran Luna against each of our 30 benchmark questions and compared the result to ground truth answers determined through manual inspection. As a comparison point, we also used RAG to answer each question, using a standard RAG approach that first converts the question into a vector search against the embedded set of text chunks, retrieves the k nearest documents for each question, and provides those chunks as context to the LLM to answer the original question. For this test we set k = 100. The results are shown in Table 4.

            Luna       RAG
Correct     20 (67%)   2 (6.7%)
Incorrect   10 (33%)   20 (67%)
Refusal     0 (0%)     8 (26.7%)
Total       30         30

Table 4: Luna vs. RAG evaluation results on NTSB document questions.
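For concreteness, the following is a minimal sketch of this RAG baseline, assuming the chunk embeddings already live in an index; the vector_search helper is a placeholder of ours, not part of Aryn or OpenSearch.

from openai import OpenAI

def vector_search(question: str, k: int) -> list[dict]:
    """Placeholder for a k-NN query against the index of embedded chunks;
    the retrieval step is elided in this sketch."""
    raise NotImplementedError

def rag_answer(question: str, k: int = 100) -> str:
    # Retrieve the k nearest chunks and provide them as context to the LLM.
    chunks = vector_search(question, k=k)
    context = "\n\n".join(chunk["text"] for chunk in chunks)
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Answer the question using only the provided context."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content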
Luna answers 20 out of the 30 questions correctly, and 10 incorrectly. The incorrect answers fall into several categories:

Counting errors (6 cases). In several cases, there are off-by-one errors due to incidents being counted twice. For example, for the question "How many incidents were there, broken down by number of engines?", there is a single incident involving two aircraft, each with 1 engine. These are counted as two separate "incidents". Fixing this would require a deduplication step in the query plan, which can be achieved with better few-shot examples for the planner.

Filter errors (3 cases). The LLMFilter operation is occasionally too generous in its interpretation of whether a given document should pass the filter test. As an example, in the question "How many incidents were due to engine problems?" the LLM filter operation screens for "Does the document indicate engine problems?". Because portions of most NTSB reports mention engines in various contexts, the filter tends to pass through documents where an engine problem was not indicated. Better prompting for the filter conditions would help here.

Query interpretation (1 case). For the question "What was the breakdown of incident types by aircraft manufacturer?", the LLM interprets "aircraft manufacturer" to mean whether the aircraft was military, commercial, a helicopter, or some other type, rather than the name of the manufacturer (which is indeed present in the dataset). This would be fixable with some additional few-shotting, but points more broadly to the challenge of teaching the LLM about the semantic interpretation of the schema.

As we expected, RAG does poorly on most of these questions. The two cases in which RAG gets the correct answer are "How many incidents were there in Hawaii?" (for which the correct answer is zero), and "Which incidents occurred in July involving birds?" (two incidents). Both of these are answerable using the RAG approach when the number of records retrieved from the vector search is small enough to fit in the LLM's context window. RAG does not yield the correct answer in any case where the number of matching incidents exceeds a modest threshold, such as "How many incidents involved substantial damage?" (correct answer: 94, RAG answer: 10).

A substantial number of RAG queries resulted in a refusal by the LLM to answer the question at all. For example, on the question "How many incidents were due to engine problems?", the LLM responds with "The NTSB does not assign fault or blame for accidents or incidents, including those related to engine problems." This is caused by context poisoning during the RAG process. Each of the NTSB reports contains a boilerplate disclaimer that states,

"The NTSB does not assign fault or blame for an accident or incident; rather, as specified by NTSB regulation, 'accident/incident investigations are fact-finding proceedings with no formal issues and no adverse parties ... and are not conducted for the purpose of determining the rights or liabilities of any person' (Title 49 Code of Federal Regulations section 831.4)."

Whenever these text chunks are included in the vector search results fed as context to the LLM, the final response is effectively poisoned by the disclaimer. While this could be addressed through a range of prompting and sanitization techniques, we chose to highlight this as an interesting failure mode of the conventional RAG approach.
8 RELATED WORK
Machine learning has revolutionized many aspects of data management over the last decade. First, there is a long line of work on natural language to SQL [12, 27, 28]. While the early work focused on building specialized models for this purpose, LLM-based approaches have proven superior in recent years (see the leaderboard at https://yale-lily.github.io/spider). Several recent works have focused on generating queries that incorporate LLM calls [18, 20, 34]. Our Luna framework is differentiated by a broader set of LLM-based operations, a focus on hierarchical documents, and our emphasis on interactive interfaces.

There is also much work on using LLMs for specific ETL tasks such as entity resolution, information extraction, named entity recognition, and data cleaning [15, 23, 31]. In addition, there is work on detecting and extracting tables using modern transformer models [25, 30], OCR [9, 14], and segmentation and labeling [5, 26]. To date, ours is the only work that combines the best of these into a unified cloud service and is deeply integrated with a declarative document processing framework for ETL like Sycamore.

DocParse is based on a long line of work in document segmentation. Current approaches commonly use object detection models such as DETR [7]. DocParse follows this approach and leverages Deformable DETR [36]. An alternate line of work has led to multimodal models such as Donut [11] and LayoutLMv3 [8] that seek to directly solve document understanding tasks like visual question answering (VQA) without the need for explicit segmentation. Sycamore can eventually incorporate these models, but we continue to find segmentation valuable as we can index the segments to reduce work at query time.

There is less work on building end-to-end systems that encompass the entire spectrum of tasks from document parsing to ETL to querying for unstructured document analytics. Nonetheless, several similar efforts have started over the last year, including ZenDB [17], LOTUS [24], EVAPORATE [2], CHORUS [10], and Palimpzest [18]. TAG [6] is similar in spirit to Luna, but translates to SQL and does not include LLM-based operators post database query. Most recently, DocETL [29] proposes to use agent-based rewrites to automatically optimize document processing pipelines for improved accuracy. In contrast to these works, while we have incorporated similar pipelining and rewriting mechanisms to start, we do not believe it is possible to fully automate and optimize the entire pipeline in practice. As a result, we have designed Aryn to facilitate a human-in-the-loop paradigm.

9 CONCLUSIONS AND FUTURE WORK
We are building Aryn to make unstructured data as easy to query as structured data by leveraging the immense potential of LLMs to process multi-modal datasets. We take a database-inspired approach of decomposing analytics queries into semantic query plans, which not only improves answer accuracy but also provides explainability and an avenue for intervention and iteration. At the same time, given the limitations of current models, we are building Aryn to be a human-in-the-loop system; as the LLMs improve, the need for human interventions will diminish, but it is unlikely to completely vanish. Our experience across a variety of application domains suggests that our overall design, as well as Aryn's individual components, is promising. Nonetheless, many challenges still remain. We need to continue to improve accuracy and make it easier to adapt Aryn to new use cases. We need ways to correct and evolve the system and automatically learn from users as they exercise the system. We need to extend Aryn to support joins and allow queries to incorporate external sources like data warehouses. Finally, we've just started the journey on improving performance, cost, and scale.

ACKNOWLEDGMENTS
We thank Amol Deshpande for his insights, detailed advice, and contributions from the start and throughout our journey at Aryn. We also thank our newest members who relentlessly make the platform better: Akarsh Gupta, Soeb Hussain, Dhruv Kaliraman, Soham Kasar, Abijit Puhare, Karan Sampath, Aanya Pratapneni, and Ritam Saha. We are indebted to our customers whose partnership makes our contribution unique and differentiated. Finally, we thank our reviewers for their suggestions.
REFERENCES
[1] Amazon. 2024. Amazon Textract. https://aws.amazon.com/textract/
[2] Simran Arora, Brandon Yang, Sabri Eyuboglu, Avanika Narayan, Andrew Hojel, Immanuel Trummer, and Christopher Ré. 2023. Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes. Proceedings of the VLDB Endowment 17, 2 (2023), 92–105.
[3] Aryn. 2024. Aryn/deformable-detr-DocLayNet. https://huggingface.co/Aryn/deformable-detr-DocLayNet
[4] Aryn. 2024. Sycamore Repository. https://github.com/aryn-ai/sycamore
[5] Christoph Auer, Ahmed Nassar, Maksym Lysak, Michele Dolfi, Nikolaos Livathinos, and Peter Staar. 2023. ICDAR 2023 Competition on Robust Layout Segmentation in Corporate Documents. In Document Analysis and Recognition - ICDAR 2023. Springer Nature Switzerland, 471–482. https://doi.org/10.1007/978-3-031-41679-8_27
[6] Asim Biswal, Liana Patel, Siddarth Jha, Amog Kamsetty, Shu Liu, Joseph E. Gonzalez, Carlos Guestrin, and Matei Zaharia. 2024. Text2SQL is Not Enough: Unifying AI and Databases with TAG. arXiv:2408.14717 [cs.DB] https://arxiv.org/abs/2408.14717
[7] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. 2020. End-to-End Object Detection with Transformers. In ECCV.
[8] Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, and Furu Wei. 2022. LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking. In Proceedings of the 30th ACM International Conference on Multimedia. 4083–4091.
[9] JaidedAI. 2024. EasyOCR. https://github.com/JaidedAI/EasyOCR
[10] Moe Kayali, Anton Lykov, Ilias Fountalis, Nikolaos Vasiloglou, Dan Olteanu, and Dan Suciu. 2024. CHORUS: Foundation Models for Unified Data Discovery and Exploration. Proceedings of the VLDB Endowment 17, 8 (2024), 2104–2114.
[11] Geewook Kim, Teakgyu Hong, Moonbin Yim, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park. 2021. Donut: Document Understanding Transformer without OCR. arXiv preprint arXiv:2111.15664 (2021).
[12] Hyeonji Kim, Byeong-Hoon So, Wook-Shin Han, and Hongrae Lee. 2020. Natural language to SQL: Where are we today? PVLDB 13, 10 (2020).
[13] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2021. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv:2005.11401 https://arxiv.org/abs/2005.11401
[14] Chenxia Li, Weiwei Liu, Ruoyu Guo, Xiaoting Yin, Kaitao Jiang, Yongkun Du, Yuning Du, Lingfeng Zhu, Baohua Lai, Xiaoguang Hu, Dianhai Yu, and Yanjun Ma. 2022. PP-OCRv3: More Attempts for the Improvement of Ultra Lightweight OCR System. arXiv:2206.03001 [cs.CV] https://arxiv.org/abs/2206.03001
[15] Yuliang Li, Jinfeng Li, Yoshihiko Suhara, AnHai Doan, and Wang-Chiew Tan. 2020. Deep entity matching with pre-trained language models. Proceedings of the VLDB Endowment 14, 1 (2020), 50–60.
[16] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. In Computer Vision – ECCV 2014, David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars (Eds.). Springer International Publishing, Cham, 740–755.
[17] Yiming Lin, Madelon Hulsebos, Ruiying Ma, Shreya Shankar, Sepanta Zeigham, Aditya G. Parameswaran, and Eugene Wu. 2024. Towards Accurate and Efficient Document Analytics with Large Language Models. arXiv:2405.04674 [cs.DB] https://arxiv.org/abs/2405.04674
[18] Chunwei Liu, Matthew Russo, Michael Cafarella, Lei Cao, Peter Baille Chen, Zui Chen, Michael Franklin, Tim Kraska, Samuel Madden, and Gerardo Vitagliano. 2024. A Declarative System for Optimizing AI Workloads. arXiv preprint arXiv:2405.14696 (2024).
[19] Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2023. Lost in the Middle: How Language Models Use Long Contexts. arXiv:2307.03172 [cs.CL] https://arxiv.org/abs/2307.03172
[20] Shicheng Liu, Jialiang Xu, Wesley Tjangnaka, Sina Semnani, Chen Yu, and Monica Lam. 2024. SUQL: Conversational Search over Structured and Unstructured Data with Large Language Models. In NAACL. 4535–4555.
[21] Microsoft. 2024. Azure AI Document Intelligence. https://azure.microsoft.com/en-us/products/ai-services/ai-document-intelligence
[22] Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, and Ion Stoica. 2018. Ray: A Distributed Framework for Emerging AI Applications. In OSDI.
[23] Avanika Narayan, Ines Chami, Laurel Orr, and Christopher Ré. 2022. Can Foundation Models Wrangle Your Data? PVLDB (2022).
[24] Liana Patel, Siddharth Jha, Carlos Guestrin, and Matei Zaharia. 2024. LOTUS: Enabling Semantic Queries with LLMs Over Tables of Unstructured and Structured Data. arXiv preprint arXiv:2407.11418 (2024).
[25] ShengYun Peng, Aishwarya Chakravarthy, Seongmin Lee, Xiaojing Wang, Rajarajeswari Balasubramaniyan, and Duen Horng Chau. 2024. UniTable: Towards a Unified Framework for Table Recognition via Self-Supervised Pretraining. arXiv:2403.04822 [cs.CV] https://arxiv.org/abs/2403.04822
[26] Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S. Nassar, and Peter Staar. 2022. DocLayNet: A Large Human-Annotated Dataset for Document-Layout Segmentation. In KDD '22. ACM.
[27] Mohammadreza Pourreza, Hailong Li, Ruoxi Sun, Yeounoh Chung, Shayan Talaei, Gaurav Tarlok Kakkar, Yu Gan, Amin Saberi, Fatma Ozcan, and Sercan O. Arik. 2024. CHASE-SQL: Multi-Path Reasoning and Preference Optimized Candidate Selection in Text-to-SQL. arXiv:2410.01943 [cs.LG] https://arxiv.org/abs/2410.01943
[28] Mohammadreza Pourreza and Davood Rafiei. 2024. DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction. Advances in Neural Information Processing Systems 36 (2024).
[29] Shreya Shankar, Aditya G. Parameswaran, and Eugene Wu. 2024. DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing. arXiv:2410.12189 [cs.DB] https://arxiv.org/abs/2410.12189
[30] Brandon Smock, Rohith Pesala, and Robin Abraham. 2022. PubTables-1M: Towards Comprehensive Table Extraction From Unstructured Documents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 4634–4642.
[31] Yoshihiko Suhara, Jinfeng Li, Yuliang Li, Dan Zhang, Çağatay Demiralp, Chen Chen, and Wang-Chiew Tan. 2022. Annotating columns with pre-trained language models. In Proceedings of the 2022 International Conference on Management of Data. 1493–1503.
[32] The Aryn Team. 2024. Benchmarking PDF segmentation and parsing models.
[33] Unstructured. 2024. Unstructured Serverless API. https://unstructured.io/api-key-hosted
[34] Matthias Urban and Carsten Binnig. 2023. CAESURA: Language Models as Multi-Modal Query Planners. CIDR (2023).
[35] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauly, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. In NSDI.
[36] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. 2020. Deformable DETR: Deformable Transformers for End-to-End Object Detection. arXiv preprint arXiv:2010.04159 (2020).