
The Design of an LLM-powered Unstructured Analytics System

Eric Anderson, Jonathan Fritz, Austin Lee, Bohou Li, Mark Lindblad, Henry Lindeman, Alex Meyer,
Parthkumar Parmar, Tanvi Ranade, Mehul A. Shah, Benjamin Sowell, Dan Tecuci, Vinayak Thapliyal,
Matt Welsh
Aryn, Inc.
ABSTRACT

LLMs demonstrate an uncanny ability to process unstructured data, and as such, have the potential to go beyond search and run complex, semantic analyses at scale. We describe the design of an unstructured analytics system, Aryn, and the tenets and use cases that motivate its design. With Aryn, users specify queries in natural language and the system automatically determines a semantic plan and executes it to compute an answer from a large collection of unstructured documents. At the core of Aryn is Sycamore, a declarative document processing engine that provides a reliable distributed abstraction called DocSets. Sycamore allows users to analyze, enrich, and transform complex documents at scale. Aryn includes Luna, a query planner that translates natural language queries to Sycamore scripts, and DocParse, which takes raw PDFs and document images and converts them to DocSets for downstream processing. We show how these pieces come together to achieve better accuracy than RAG on analytics queries over real-world reports from the National Transportation Safety Board (NTSB). Also, given current limitations of LLMs, we argue that an analytics system must provide explainability to be practical, and show how Aryn's user interface does this to help build trust.

This paper is published under the Creative Commons Attribution 4.0 International (CC-BY 4.0) license. Authors reserve their rights to disseminate the work on their personal and corporate Web sites with the appropriate attribution, provided that you attribute the original work to the authors and CIDR 2025. 15th Annual Conference on Innovative Data Systems Research (CIDR '25). January 19-22, Amsterdam, The Netherlands.

1 INTRODUCTION

Large language models have inspired the imagination of industry, and companies are starting to use LLMs for product search, customer support chatbots, code co-pilots, and application assistants. In enterprise settings, accuracy is paramount. To limit hallucinations, most of these applications are backed by semantic search architectures that answer queries based on data retrieved from a knowledge base, using techniques such as retrieval-augmented generation (RAG) [13].

Still, enterprises want to go beyond RAG and run semantic analyses that require complex reasoning across large repositories of unstructured documents. For example, financial services companies want to analyze research reports, earnings calls, and presentations to understand market trends and discover investment opportunities. Consumer goods firms want to improve their marketing strategies by analyzing interview transcripts to understand sentiment towards brands. In legal firms, investigators want to analyze legal case summaries to discover precedents for rule infringement and the actions taken across a broad set of companies and cases.

In addition to simple "hunt and peck" queries, for which RAG is tailored, these analyses often require "sweep and harvest" patterns. An example is a query like "What is the yearly revenue growth and outlook of companies whose CEO recently changed?" For this, one needs to sweep through large document collections, perform a mix of natural-language semantic operations (e.g., filter, extract, or summarize information) and structured operations (e.g., select, project, or aggregate), and then synthesize an answer. Going a step further, we see "data integration" patterns where users want to combine information from multiple collections or sources. For example, "list the fastest growing companies in the BNPL market and their competitors," where the competitive information may involve a lookup in a database in addition to a sweep-and-harvest phase to gather the top companies. We also expect complex compositions of these patterns to become prevalent.

Aryn is an unstructured analytics platform, powered by LLMs, that is designed to answer these types of queries. We take inspiration from relational databases, from which we borrow the principles of declarative query processing. With Aryn, users specify what they want to ask in natural language, and the system automatically constructs a plan (the how) and executes it to compute the answer from unstructured data.

Aryn consists of several components (see Figure 1). The natural-language query planner, Luna, uses LLMs to translate queries to semantic query plans with a mix of structured and LLM-based semantic operators. Query plans are compiled to Sycamore, a document processing engine used both for ETL and query processing. Sycamore is built around DocSets, a reliable abstraction similar to Apache Spark DataFrames, but for hierarchical documents. Finally, DocParse uses vision models to convert complex documents with text, tables, and images into DocSets for downstream processing.

The main challenge for an analytics system built largely on AI is to give answers that are accurate and trustworthy. We use LLMs and vision models for different purposes throughout our stack and carefully compose them to provide answers to complex questions. Unfortunately, LLMs are inherently imprecise, making LLM output difficult to verify.

Aryn's database-inspired approach addresses this challenge in multiple ways. First, our experience working with customers has shown how essential ETL is for achieving good quality for both RAG and analytics use cases. By performing high-quality parsing and metadata extraction, we can provide the LLM with the context necessary to reduce the likelihood of hallucinations. Second, by dynamically breaking down complex questions into query plans composed of simple LLM-based operations, we can make the query plan as a whole more reliable than RAG-based approaches. Third, Aryn exposes the query plan and data lineage to users for better explainability. A key component of Aryn is its conversational user

interface. Users can inspect and debug the generated plans, analyze data traces from execution, and ask follow-up questions to dig in and iterate. This approach makes it easier for users to navigate their data and helps build trust in the answers.

In this paper, we describe the motivating use cases for Aryn, tenets driving its design, and its architecture. We discuss how each component of Aryn works and how they fit together, through an end-to-end use case analyzing NTSB¹ incident reports, which consist of a large collection of unstructured PDF documents containing text, images, figures, and tables. We also present the user interfaces to inspect, analyze, and debug the plans generated by Aryn, and highlight the simplicity of the Sycamore programming framework that makes it easy to analyze vast collections of hierarchical unstructured documents. Aryn is fully open source, Apache v2.0 licensed, and available at https://github.com/aryn-ai/sycamore.

¹National Transportation Safety Board (https://www.ntsb.gov/)

Figure 1: Aryn Architecture

2 USE CASES, CHALLENGES, AND TENETS

Enterprises often have document collections with a theme, such as interview transcripts, earnings reports, insurance claims, city budgets, or product manuals. Typically, these documents are a mix of unstructured text with multi-modal data in tables, graphs, and images. Users often want to ask complex questions or run analyses that span multiple documents, if not a broad subset of the collection.

In working with customers, we see two main classes of use cases:

Ad-hoc Question Answering: There's been a recent surge in AI assistants and chatbots for customer support, typically powered by technical documentation, ticketing systems like Jira, and internal messaging boards like Slack. Beyond search-based bots, companies are also building research and discovery platforms for ad-hoc, chat-driven analytics. For example, financial and legal firms use such platforms to aggregate internal research or case summaries and make them available for analysts to investigate and generate new insights and strategies.

Report Generation and Business Intelligence (BI): As LLMs become cheaper and faster, companies have started building LLM-powered document pipelines to generate reports. For example, these may be summaries of hours of user interviews, daily highlights extracted from medical notes of a patient, or legal claims derived from accident reports. Going a step further, we see customers extracting structured summary datasets from document collections to help in critical business decisions. For example, auto insurance firms want to extract damage and repair data from claim summaries to understand trends and spot anomalies as potential fraud.

Challenges: LLMs have inspired many new enterprise use cases because of their remarkable ability to process unstructured documents. While LLMs hold promise, they are not enough. LLMs inherently hallucinate, which is a liability in the use cases described above. In these settings, users need accurate and explainable answers.

RAG is a popular method to answer questions from documents, but it is fundamentally limited. RAG uses semantic search to retrieve relevant chunks of documents that are then supplied to an LLM as context to answer a question. While the RAG approach somewhat mitigates hallucination, LLM context windows are limited, and studies show that LLMs with extremely long contexts cannot "attend" to everything in the prompt [19]. RAG works for simple factual questions where an answer is contained in a small number of relevant chunks of text, but fails when the answer involves synthesizing information across a large document collection.

Another approach is to extract metadata from documents through an ETL process (perhaps using LLMs) and load it into a database. While this addresses scale concerns, it does not handle analyses that require semantic operations at query time.

As an example, consider the question, "What are the top three most common parts with substantial damage in accidents involving single engine aircraft in 2023?". In NTSB aviation incident reports (about 170K PDFs reporting on incidents since 1962), the parts damage details are in free-form text descriptions of the incidents. RAG fails for this because the relevant reports don't fit into the LLM context. Moreover, a pure database approach is unhelpful if the fields to be queried have not been extracted during the ETL phase. In contrast, Aryn generates a plan that quickly narrows to the relevant incidents in 2023 with a metadata search, and extracts the parts data at query time using LLM-based semantic operators.

Tenets: While our approach mitigates concerns of scale, hallucinations, and reliability, it does not completely eliminate them. We argue that all systems and approaches built on AI inherently cannot. To address these challenges, we adopted the following tenets in the design of Aryn.

• Use AI for solutions hard for humans to come by, but easy for humans to verify. The most successful applications of AI have been where AI is used to generate solutions and those solutions are verified independently. For example, GitHub Copilot generates code, and developers verify its correctness in their natural review and testing workflow. Similarly, Aryn uses LLMs to generate an initial query plan from natural

language, but a human is able to inspect and modify the plan if needed.

• Ensure explainability of results. Answers to analytics questions are hard to verify without manually repeating the work. We should make it easy for the user to understand the operation of the system and to audit the correctness of any result. For example, Aryn provides a detailed trace of how the answer was computed, including the provenance of intermediate results. In addition, users can ask follow-up questions to navigate the results and build trust.

• Compose narrow AI models and focused AI tasks into a larger whole. Instead of attempting to build the one true model in the vein of AGI, we have found it more practical to get reliability and better quality if we take a systems approach. In DocParse, we compose a variety of vision models for different tasks: segmentation, table extraction, and OCR. Similarly, instead of a single LLM invocation per query, we break down queries into narrower, more focused operators, potentially executed by different LLMs. This improves the reliability of each task, and thereby the reliability of the whole system.

3 ARCHITECTURE

Figure 1 shows the high-level architecture of Aryn. The first step in preparing unstructured data for analytics is to parse and label raw documents for further processing. The Aryn DocParse service uses modern vision models to decompose and extract structure from raw documents and transforms them into DocSets. We developed our own model based on the deformable DETR architecture [36] and trained on DocLayNet [26].

At the core of Aryn is Sycamore, a document processing engine that is built on DocSets. DocSets are reliable distributed collections, similar to Spark [35] DataFrames, but the elements are hierarchical documents represented with semantic trees and additional metadata. Sycamore includes transformations on these documents for both ETL purposes, e.g., flatten and embed, as well as for analytics, e.g., filter, summarize, and extract. We use LLMs to power many of these transformations, with lineage to help track and debug issues when they arise. Sycamore can read data from a data lake where unstructured data is kept, and can index processed data in a variety of databases, including keyword and vector stores, for use during query processing.

Our query service, Luna, includes the planner that translates natural language questions into semantic query plans, which are compiled to Sycamore scripts for execution. We use LLMs for generating query plans that users can inspect and validate. This provides explainability for answers and also allows for debugging and quick iteration. We also use LLMs for implementing semantic query operators like filtering, summarization, comparison, and information extraction.

The following sections describe each component in more detail.

4 DOCPARSE

One of the lessons we learned early on while building the Aryn system is that we have to treat data preparation as a key part of any unstructured analytics system rather than an add-on. Parsing complex documents is difficult, and it is not reasonable to expect users to be able to convert their data into a text-based format. To address this need, we built the DocParse service to parse documents and extract information like text, tables, and images. DocParse exposes a simple REST API that takes a document in a common format (PDF, DOCX, PPT, etc.) and returns a collection of labeled chunks that correspond to entities in the source document. For example, Figure 2 shows a visual representation of how DocParse parses an NTSB document. It identifies headers, text, and tables, and further breaks the structure of the table down to individual cells.

Figure 2: Output of Aryn DocParse (including table and cell identification) on a typical PDF NTSB accident report.
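To make the API shape concrete, the sketch below shows what a client call to a DocParse-style endpoint could look like in Python. The endpoint URL, authentication header, and response field names are illustrative assumptions, not the documented interface of the service.

# Sketch of calling a DocParse-style REST endpoint. The URL, auth
# header, and response fields are assumptions for illustration.
import requests

DOCPARSE_URL = "https://api.example.com/v1/document/partition"  # hypothetical

def parse_document(path: str, api_key: str) -> list[dict]:
    """Upload a document and return its labeled chunks."""
    with open(path, "rb") as f:
        resp = requests.post(
            DOCPARSE_URL,
            headers={"Authorization": f"Bearer {api_key}"},
            files={"file": f},
        )
    resp.raise_for_status()
    # Each chunk carries a label (e.g., title, table), a bounding box,
    # and the text extracted for that region of the page.
    return resp.json()["elements"]

for elem in parse_document("ntsb_report.pdf", api_key="..."):
    print(elem["type"], elem["bbox"], elem.get("text_representation", "")[:60])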
DocParse is a compound system composed of multiple stages, as illustrated in Figure 3. We first split each document into pages and convert each page to an image so that we can leverage vision models in later stages of the pipeline. This approach, which is commonly used in document processing, allows us to support a variety of formats in a consistent way and take advantage of semantic information in the rendered document that may be difficult to extract from the underlying file format, such as the relative size or position of objects on the page.

Figure 3: The DocParse Pipeline.

The next step in the pipeline is segmentation, which uses an object detection model to identify bounding boxes and label them as one of 11 categories, including titles, images, paragraphs, and tables. As we were developing DocParse, we found that many of the existing open source object detection models performed poorly on document segmentation, so we trained our own. We used the Deformable DEtection TRansformer (DETR) architecture [36] and

trained it on DocLayNet [26], an open source, human-annotated document layout segmentation dataset. We have made this model available for use with a permissive Apache v2 license on Hugging Face [3] and have continued to update the version used by DocParse by collecting and labeling customer documents. We evaluate the performance of this model in Section 4.1.
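Since the model is publicly available [3], a single page image can be segmented with a few lines of Hugging Face Transformers code. The following is a sketch under the assumption that the published checkpoint works with the standard Deformable DETR classes; the confidence threshold is an arbitrary choice.

# Sketch: run the public segmentation model [3] on one page image.
# Assumes the checkpoint loads with the standard Deformable DETR
# classes; the 0.5 threshold is arbitrary.
import torch
from PIL import Image
from transformers import AutoImageProcessor, DeformableDetrForObjectDetection

processor = AutoImageProcessor.from_pretrained("Aryn/deformable-detr-DocLayNet")
model = DeformableDetrForObjectDetection.from_pretrained(
    "Aryn/deformable-detr-DocLayNet")

page = Image.open("page1.png").convert("RGB")
inputs = processor(images=page, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits into labeled bounding boxes in page coordinates.
target_sizes = torch.tensor([page.size[::-1]])
detections = processor.post_process_object_detection(
    outputs, threshold=0.5, target_sizes=target_sizes)[0]
for label, box in zip(detections["labels"], detections["boxes"]):
    print(model.config.id2label[label.item()], [round(c, 1) for c in box.tolist()])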
The segmentation model outputs labeled bounding boxes, but it doesn't have any information about the text in the document. The next stage in the pipeline is to extract this text. Depending on the document format, this is done by reading text directly from the underlying file format with a tool like PDF Miner² or with an OCR tool like EasyOCR³ or PaddleOCR⁴. Once we have the text and the labeled bounding boxes, we can perform additional type-specific processing. For instance, for tables, we use a Table Transformer-based model [30] to identify the individual cells, while for images we can use a multi-modal LLM to compute a textual summary.

²https://pypi.org/project/pdfminer/
³https://github.com/JaidedAI/EasyOCR
⁴https://github.com/PaddlePaddle/PaddleOCR

As part of post-processing, DocParse combines the output from each page into a final result, either in JSON or a higher-level format like Markdown. Users can leverage Sycamore to import and manipulate the JSON directly and perform more complex data processing transformations.

4.1 Evaluation

In order to evaluate the performance of our segmentation model, we used the DocLayNet competition benchmark [5]. This benchmark was developed by the authors of the DocLayNet dataset and includes documents drawn from a variety of domains, including those not directly represented in the training dataset. The evaluation is done using the standard COCO framework [16], which measures mean average precision (mAP) and mean average recall (mAR) across the 11 DocLayNet object classes. Table 1 shows a comparison of DocParse against several other document processing services. In order to make the comparison as fair as possible, we standardized the set of labels across all four services, and removed results containing labels that were not present in one or more of the services. More information on our methodology can be found in the corresponding blog post [32]. Our results show that DocParse is between 1.5 and 2.4 times more accurate than competing services in terms of mAP, and between 1.5 and 1.6 times more accurate in terms of mAR. These results validate our approach and suggest that DocParse can serve as the first step towards ingesting documents into an unstructured analytics system.

Service                                   mAP    mAR
DocParse                                  0.640  0.747
Amazon Textract [1]                       0.423  0.507
Unstructured (REST API with YoloX) [33]   0.347  0.505
Azure AI Document Intelligence [21]       0.266  0.475

Table 1: Segmentation performance on the DocLayNet competition benchmark [32].

5 SYCAMORE

Sycamore is the open-source document processing engine at the center of the Aryn system [4]. We built Sycamore to support both data preparation and analytics over complex document sets. One of the primary motivations for Sycamore was the observation that the line between ETL and analytics gets blurred when dealing with unstructured data. In particular, we need the flexibility to run certain document processing operations either at ETL time or at query time. For example, the cost of an expensive LLM-based processing step can be amortized over many queries by running it once during ETL, but because the space of potential queries is very large, not all operations can be performed in advance.

To accommodate these challenges, we built Sycamore as a dataflow system inspired by Apache Spark [35], with extensions to integrate with LLMs and support unstructured documents.

5.1 Data Model

Documents in Sycamore are hierarchical and multi-modal. A long document may have chapters that are broken into sections, which in turn contain individual chunks of text, or entities like tables and images. The latter data types are particularly important for many analytics queries and need special treatment. More precisely, a document in Sycamore is a tree, where each node contains some content, which may be text or binary, an ordered list of child nodes, and a set of JSON-like key-value properties. We refer to leaf-level nodes in the tree as elements. Each element corresponds to a concrete chunk of the document and is identified as one of 11 types, such as a text, image, or table. Each element may have special reserved properties based on its type. For example, a TableElement has properties containing rows and columns, while an ImageElement has information about format and resolution. DocSets are flexible enough to represent documents at different stages of processing. For example, when first reading a PDF, it may be represented as a single-node document with the raw PDF binary as the content. After parsing, each section is an internal node and tables and text are identified as leaf-level elements.
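The tree structure just described can be pictured with a small sketch. The class and field names below are illustrative stand-ins, not Sycamore's actual Python types.

# Illustrative sketch of the hierarchical document model; the real
# Sycamore classes and field names may differ.
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Optional, Union

@dataclass
class Node:
    content: Optional[Union[str, bytes]] = None           # text or binary
    children: list["Node"] = field(default_factory=list)  # ordered child nodes
    properties: dict[str, Any] = field(default_factory=dict)  # JSON-like metadata

@dataclass
class TableElement(Node):
    rows: list[list[str]] = field(default_factory=list)   # reserved, type-specific

# A freshly read PDF: a single node holding the raw binary.
doc = Node(content=Path("report.pdf").read_bytes(),
           properties={"path": "report.pdf"})

# After parsing: sections become internal nodes, and tables and text
# chunks become leaf-level elements.
section = Node(properties={"title": "Factual Information"})
section.children.append(TableElement(rows=[["Injuries", "3 Serious"]]))
doc.children.append(section)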
5.2 Programming Model and Operators

Programmers interact with Sycamore in Python using a Spark-like model of functional transformations on DocSets. Table 2 shows several of Sycamore's operators. We classify these operators as either structured or semantic. Structured operators correspond to standard dataflow-style operations. These include functional operators like map and filter that take in arbitrary Python functions, as well as transformations like partition and explode that modify the structure of documents by creating or unnesting elements, respectively. The reduceByKey operation makes it possible to support map-reduce style operations and implement aggregation by document properties. These transforms accommodate the fact that some documents may be missing certain fields. Sycamore does not yet support full joins.

Semantic operators leverage LLMs to perform transformations based on the content or meaning of documents. These operators are often driven by natural language prompts and are typically used to enrich document metadata. Many of the semantic operators, like llmFilter, can be implemented in terms of the structured

operators. We still prefer to separate them out because they can behave very differently in practice, as LLMs are inherently non-deterministic and users often want to manually inspect the results of semantic operations. Sycamore supports a variety of LLMs, including those from OpenAI and Anthropic, and open source models like Llama.

queryDatabase: Scans documents from an index based on keyword search over the element content or filters over the properties.
map, filter, flatMap: Transforms documents using standard functional operators.
partition: Parses a document using DocParse.
explode: Unnests each element and makes it a top-level document.
reduceByKey: Standard reduce operation that can be used for a variety of groupings and aggregations on properties of the documents.
write: Writes a DocSet to a database.

(a) Structured operators in Sycamore

queryVectorDatabase: Performs semantic search over a collection of indexed documents, returning a DocSet with the matches.
llmFilter: Uses an LLM prompt to drop or retain documents in a DocSet.
llmExtract: Extracts one or more fields from each document using an LLM, saving the results as document properties.
llmReduceByKey: Similar to reduceByKey, but uses an LLM to combine multiple documents.
embed: Computes embeddings for each document.

(b) Semantic operators in Sycamore

Table 2: Example Sycamore Operators

The code⁵ in Figure 4 is an example of processing NTSB incident report documents using Sycamore. The code partitions documents using DocParse, described in Section 4. It then executes the llmExtract transform, which takes a JSON schema and attempts to extract those fields from each document using an LLM. As shown in Figure 5, this approach correctly extracts the state abbreviation and other fields from the document. Next, we use explode to break each document into a collection of document chunks, and then we generate an embedding vector for each chunk. At this point the DocSet is ready to be loaded into a database like OpenSearch for later querying (using write).

⁵We have elided a few configuration parameters to enhance readability.

schema = {
    "us_state": "string",
    "probable_cause": "string",
    "weather_related": "bool"
}

ds = context.read.binary("/path/to/ntsb_data")
    .partition(DocParse())
    .llmExtract(
        OpenAIPropertyExtractor("gpt-4o", schema=schema))
    .explode()
    .embed(OpenAIEmbedder("text-embedding-3-small"))

Figure 4: Sample Sycamore script.

{
    "us_state_abbrev": "AK",
    "probable_cause": "The pilot's failure to remove all water from
        the fuel tank, which resulted in fuel contamination and a
        subsequent partial loss of engine power.",
    "weather_related": true
}

Figure 5: Output of the llmExtract transform.

Finally, queryDatabase and queryVectorDatabase support reading previously loaded DocSets from a data store. The queryDatabase operator is analogous to a standard database scan operator, and supports filters on the metadata as well as keyword search (depending on the capabilities of the data store). The queryVectorDatabase operator, in addition to those, also supports semantic search (i.e., vector similarity search) over the chunks. While indexing is done on chunks, Sycamore reassembles these chunks into documents before passing them to downstream operators.
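Putting these pieces together, a query-time pipeline can combine scans, semantic filters, and aggregation. The sketch below follows the operator names in Table 2, but the reader and the exact Python signatures are assumptions for illustration.

# Sketch of a query-time DocSet pipeline using Table 2 operators;
# the reader and argument names are assumptions, not Sycamore's API.
ds = context.read.opensearch(index="ntsb")  # hypothetical reader

damaged_parts = (
    ds.queryDatabase(filter={"dateAndTime": {"gte": "2023-01-01",
                                             "lt": "2024-01-01"}})
      .llmFilter(prompt="Did the incident result in substantial damage?")
      .llmExtract(schema={"damaged_part": "string"})
      .reduceByKey(key="damaged_part", agg="count")  # count incidents per part
)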
5.3 Execution

Sycamore adopts a Spark-like execution model where operations are pipelined and executed only when materialization is required. To assist with debugging and avoid redundant execution, Sycamore also supports a flexible materialize operation that can save the output of intermediate transformations to memory, disk, or cloud storage. Sycamore is built on top of the Ray compute framework [22], which provides primitives for running distributed Python-based dataflow workloads. We chose Ray because it is based on Python, which has become the language of choice for machine learning applications, and because it is well-integrated with existing ML libraries.
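For instance, the expensive LLM-based extraction step from Figure 4 can be checkpointed so that later runs and debugging sessions do not repeat the LLM calls. A minimal sketch, with the argument name assumed:

# Sketch: checkpoint the expensive extraction step; the path argument
# and its name are assumptions for illustration.
ds = (
    context.read.binary("/path/to/ntsb_data")
        .partition(DocParse())
        .llmExtract(OpenAIPropertyExtractor("gpt-4o", schema=schema))
        .materialize(path="s3://bucket/checkpoints/ntsb_extracted")
)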
6 LUNA

A hallmark of relational databases is declarative query processing, which hides the low-level details of how queries are executed and makes it easier for application developers to adapt to changing workloads and scale. LLMs make it possible to leverage declarative query processing for natural language queries over complex, unstructured data. We call this LLM-powered unstructured analytics, or Luna for short.

More specifically, Luna converts a natural language query into a query plan that runs over DocSets and returns either raw tabular results or natural language answers. Query plans are executed using Sycamore's DocSet operators. To aid explainability, Luna exposes the logical query plan, data lineage, and execution history, and allows users to modify any part of the plan to better align with their intention. The remainder of this section describes the system in detail.

6.1 Luna Architecture

Luna consists of a number of pieces that work together to provide an end-to-end natural-language query processing system over complex, unstructured data.

Data Inputs and Schema. Luna shares the Sycamore data model and executes queries against one or more DocSets that have been indexed in a database. During query planning, we provide the planner with the schema of each DocSet, which consists of the properties contained in the documents, along with their data types and sample values, as well as a special "text-representation" field representing the entire contents of each Document. The schema of DocSets can evolve over time, based on new semantic relationships discovered in the data, potentially driven by the query workload.

While Sycamore represents documents hierarchically with elements corresponding to document chunks, we found it more effective to hide this from the planner and always provide a schema for complete documents. The Sycamore engine handles splitting documents into chunks that fit into the context window of the LLM used for embedding and reconstructing the full document during queries.

In our implementation, we primarily use OpenSearch for storing and querying DocSets, though other data management systems can be used as long as they support both "keyword" and "semantic search" (i.e., vector similarity queries) and basic filtering by properties.

Logical Query Operators. Luna uses an LLM for interpreting a natural language user query. We initially provided the LLM the complete list of physical operators as part of the prompt. However, in our experiments with several real-world datasets and query workloads, we found that this approach does not work well for complex and exploratory analysis queries like: "Analyze maintenance-related incidents by grouping those by aircraft type and maintenance interval to find patterns of recurring issues." In particular, we found it difficult to get the LLM to use grouping operations like reduceByKey effectively, and the plans generated would often run into context window size limitations.

Instead, we decided to differentiate between logical and physical operators with respect to query planning and execution. Luna provides a simpler set of high-level logical operators to the LLM for query planning purposes, and rewrites the resulting logical plan into physical operators before execution. This also makes it easier for the user to understand the plan and debug the execution.

Many simple logical operators map one-to-one to physical Sycamore operators, including single-pass per-document operations like map, filter, and llmExtract, but for operations that span multiple documents, we have found it often works better to have more specific operators rather than low-level primitives. For example, the following logical operators are exposed to the Luna planner:

• groupByAggregate: Performs a database-style group-by and aggregation.
• llmCluster: Clusters documents using k-means based on semantic similarity of one or more fields.
• llmGenerate: Summarizes one or more documents based on a prompt. This is analogous to the "G" in "RAG" and is often used at the end of a plan.

Each of these operators can be implemented in terms of the existing Sycamore physical operators. For instance, groupByAggregate and llmCluster can be implemented with a combination of map and reduce operations, but we see better results from the planner when we keep them as separate operators.

Query Planning. Luna uses an LLM to interpret a natural language query and decompose it to a DAG of logical query operators. After significant experimentation, we found that including the following information in the prompt helps provide the LLM with the right context:

• The schema for the input DocSet. For each schema field, we include a short description as well as a few example values drawn from the underlying data.
• A list of available logical operators and their syntax.
• A list of example queries and their associated query plans.

We instruct the LLM to generate the plan in JSON format, which we validate against a schema to ensure that it conforms to the expected syntax. In addition to confirming that the query plan is syntactically correct, we also check that it is semantically valid. For example, if a QueryDatabase operation performs field-based filtering, we check that the fields used in the filter are valid for the given DocSet.
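A minimal sketch of these two checks using the jsonschema package appears below; the plan schema shown is a simplified stand-in for Luna's real one.

# Sketch: validate an LLM-generated plan before execution. The schema
# here is a simplified stand-in for Luna's actual plan schema.
import json
from jsonschema import ValidationError, validate

PLAN_SCHEMA = {
    "type": "object",
    "required": ["nodes"],
    "properties": {
        "nodes": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["operator"],
                "properties": {
                    "operator": {"enum": ["queryDatabase", "llmFilter",
                                          "llmExtract", "groupByAggregate",
                                          "llmCluster", "llmGenerate"]},
                    "inputs": {"type": "array", "items": {"type": "integer"}},
                    "filter": {"type": "object"},
                },
            },
        },
    },
}

def check_plan(llm_output: str, valid_fields: set[str]) -> dict:
    plan = json.loads(llm_output)
    validate(plan, PLAN_SCHEMA)            # syntactic check
    for node in plan["nodes"]:             # semantic check: filter fields exist
        for field_name in node.get("filter", {}):
            if field_name not in valid_fields:
                raise ValidationError(f"unknown field: {field_name}")
    return plan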
Plan Rewriting and Optimization. Despite significant prompt engineering, the LLM may still produce a suboptimal or, in some cases, an incorrect or infeasible query plan. We use a combination of plan rewriting and rule-based optimization to address these issues. For example, if the plan has multiple llmExtract operators in sequence, these can be combined into a single operator.

Execution. After plan rewriting and optimization, the query plan is compiled into Sycamore code in Python. Execution on large datasets benefits from distributed processing, and using Sycamore's distributed execution mode allows us to scale out workloads with minimal overhead. The compiled query execution code in Sycamore is easy for a technically savvy user to understand and modify (in the UI itself).

Traceability and debugging. The ambiguous nature of some queries can result in Luna misinterpreting the user's intention. It is critical to allow the user to inspect the query execution trace and provide feedback so the system can correct itself. With a combination of logging and exposing APIs that allow the user to modify any stage of query execution, users have full control over how their query is answered.

6.2 User Interface and Verifiability

Luna's user interface, shown in Figure 6, is designed to make it easy for users to verify the results from the system. Luna achieves this by: (a) exposing the query plan, (b) allowing the user to inspect intermediate results, and (c) allowing the user to ask follow-up questions to guide the system.

Luna exposes the plan generated from a user query as a simple JSON object. This allows a user to understand the exact operations that were performed to answer a query and how the dataset was transformed during each operation, and to modify any part of the plan to better align with their intention. Given the query "Get the latitude and longitude of all incidents in 2023 involving Cessna aircraft," we

can see the resulting plan as a queryDatabase operation followed by an llmExtract operation. The Luna UI also shows the user the Sycamore code that was generated for the query, which they can edit and re-run.
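For this query, the exposed plan might look like the following JSON. The operator names follow Section 6.1, but the field names and overall shape are illustrative rather than Luna's exact plan format.

{
  "query": "Get the latitude and longitude of all incidents in 2023 involving Cessna aircraft",
  "nodes": [
    {
      "id": 0,
      "operator": "queryDatabase",
      "index": "ntsb",
      "filter": {"aircraft": "Cessna", "year": 2023}
    },
    {
      "id": 1,
      "operator": "llmExtract",
      "inputs": [0],
      "schema": {"latitude": "float", "longitude": "float"}
    }
  ]
}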
While inspecting the query plan is often enough to convince oneself that the data generated by the query is likely to be correct, further validation is possible by inspecting the data flowing out of each of the operators. The Luna UI allows the user to explore the raw data at each stage of the query plan, drilling down to individual records and linking back to the original source documents.

Finally, we find that supporting an iterative, exploratory mode of interaction with the system is essential. Users can test hypotheses and explore different aspects of the data by asking follow-up questions, such as "what about incidents without substantial damage" or "show only results in California." The conversational history with the system allows a user to refer to previous queries or results implicitly, making this interaction much more natural, much like asking questions of a human analyst.

Figure 6: The Luna user interface shows the query result visually, allows the user to inspect the generated query plan, and lets them drill down to individual documents if needed.

7 EVALUATION

We present a preliminary evaluation of Luna's ability to answer complex analytical questions over a dataset of incident reports from the National Transportation Safety Board, which is the US-based agency responsible for investigating civil transportation accidents. Our test dataset consists of 100 PDF reports pulled from the NTSB CAROL database⁶ covering aviation incidents between June and September 2024. Each file is between 4 and 7 pages of text, with sections covering a summary of the incident, probable cause and findings, factual information, and administrative information. Incident reports have multiple tables covering aspects such as the pilot's background, aircraft and operator details, meteorological information, wreckage, and injuries. Many of the documents contain photographs of the accident site or maps of the flight trajectory.

⁶https://carol.ntsb.gov/

Field                  Example value
accidentNumber         CEN23FA095
aircraft               Piper PA-38-112
aircraftDamage         Destroyed
conditionOfLight       Dusk
conditions             Visual (VMC)
dateAndTime            June 28, 2024 19:02:00
departureAirport       Winchester, Virginia (OKV)
destinationAirport     Yelm, Washington
flightConductedUnder   Part 137: Agricultural
injuries               3 Serious
location               Gilbertsville, Kentucky
lowestCeiling          Broken / 5500 ft AGL
lowestCloudCondition   Scattered / 12000 ft AGL
operator               Anderson Aviation LLC
registration           N220SW
temperature            15.8C
visibility             7 miles
windDirection          190°
windSpeed              19 knots gusting to 22 knots

Table 3: Schema extracted from NTSB incident reports.

We processed the NTSB reports using a Sycamore pipeline. The pipeline starts by calling DocParse to parse each document as described in Section 4, and then uses the llmExtract transform to extract key data from each document. We load the resulting schema, shown in Table 3, into an OpenSearch index. We also chunk and embed the text content of the incident reports, and the resulting vectors are also stored in OpenSearch for use with vector search operations. Throughout this evaluation we used OpenAI's gpt-4o model for our LLM, all-MiniLM-L6-v2 for the embeddings, and OpenSearch 2.17.

7.1 Benchmark questions

There does not exist a standard benchmark for document analytics against this type of dataset. Through manual inspection, we derived a set of 30 questions that represent a broad range of query types and varying degrees of difficulty to answer. Some examples of the benchmark questions include:

• How many incidents were there by state?
• What fraction of incidents that resulted in substantial damage were due to engine problems?
• In incidents involving Piper aircraft, what was the most commonly damaged part of the aircraft?
• Which incidents occurred in July involving birds?

A few of the benchmark questions can be answered more or less directly by querying the extracted metadata shown in Table 3. However, in most cases, the benchmark questions refer to information not explicitly captured in the schema, such as whether an incident involved birds or engine problems. For these cases, Luna needs to use a combination of metadata lookup and LLM-based extraction or filtering based on the documents' textual content.

Many of our benchmark questions would be difficult, or impossible, for a RAG-based system to answer, given that the information required to answer the question is spread across multiple portions of each document, and a vector search would not be expected to return meaningful chunks of context for downstream analysis by the LLM.

7.2 Results

We ran Luna against each of our 30 benchmark questions and compared the result to ground truth answers determined through manual inspection. As a comparison point, we also used RAG to answer each question, using a standard RAG approach that first converts the question into a vector search against the embedded set of text chunks, retrieves the k nearest documents for each question, and provides those chunks as context to the LLM to answer the original question. For this test we set k = 100. The results are shown in Table 4.
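A sketch of this RAG baseline appears below. The OpenSearch query body follows the standard k-NN plugin syntax, but the index name, field names, and the embed helper are our own assumptions.

# Sketch of the RAG baseline: embed the question, retrieve k=100
# chunks via OpenSearch k-NN search, and ask the LLM. Index, field,
# and helper names are assumptions.
from openai import OpenAI
from opensearchpy import OpenSearch

search_client = OpenSearch(hosts=["localhost:9200"])
llm = OpenAI()

def rag_answer(question: str, k: int = 100) -> str:
    q_vec = embed(question)  # hypothetical helper wrapping all-MiniLM-L6-v2
    hits = search_client.search(index="ntsb_chunks", body={
        "size": k,
        "query": {"knn": {"embedding": {"vector": q_vec, "k": k}}},
    })["hits"]["hits"]
    context = "\n\n".join(h["_source"]["text"] for h in hits)
    resp = llm.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Answer the question using only the provided context."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ])
    return resp.choices[0].message.content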
           Luna       RAG
Correct    20 (67%)   2 (6.7%)
Incorrect  10 (33%)   20 (67%)
Refusal    0 (0%)     8 (26.7%)
Total      30         30

Table 4: Luna vs. RAG evaluation results on NTSB document questions.

Luna answers 20 out of the 30 questions correctly, and 10 incorrectly. The incorrect answers fall into several categories:

Counting errors (6 cases). In several cases, there are off-by-one errors due to incidents being counted twice. For example, for the question "How many incidents were there, broken down by number of engines?", there is a single incident involving two aircraft, each with 1 engine. These are counted as two separate "incidents". Fixing this would require a deduplication step in the query plan, which can be achieved with better few-shot examples for the planner.

Filter errors (3 cases). The llmFilter operation is occasionally too generous in its interpretation of whether a given document should pass the filter test. As an example, in the question "How many incidents were due to engine problems?" the LLM filter operation screens for "Does the document indicate engine problems?". Because portions of most NTSB reports mention engines in various contexts, the filter tends to pass through documents where an engine problem was not indicated. Better prompting for the filter conditions would help here.

Query interpretation (1 case). For the question "What was the breakdown of incident types by aircraft manufacturer?", the LLM interprets "aircraft manufacturer" to mean whether the aircraft was military, commercial, a helicopter, or some other type, rather than the name of the manufacturer (which is indeed present in the dataset). This would be fixable with some additional few-shotting, but points more broadly to the challenge of teaching the LLM about the semantic interpretation of the schema.

As we expected, RAG does poorly on most of these questions. The two cases in which RAG gets the correct answer are "How many incidents were there in Hawaii?" (for which the correct answer is zero), and "Which incidents occurred in July involving birds?" (two incidents). Both of these are answerable using the RAG approach when the number of records retrieved from the vector search is small enough to fit in the LLM's context window. RAG does not yield the correct answer in any case where the number of matching incidents exceeds a modest threshold, such as "How many incidents involved substantial damage?" (correct answer: 94, RAG answer: 10).

A substantial number of RAG queries resulted in a refusal by the LLM to answer the question at all. For example, on the question "How many incidents were due to engine problems?", the LLM responds with "The NTSB does not assign fault or blame for accidents or incidents, including those related to engine problems." This is caused by context poisoning during the RAG process. Each of the NTSB reports contains a boilerplate disclaimer that states,

"The NTSB does not assign fault or blame for an accident or incident; rather, as specified by NTSB regulation, 'accident/incident investigations are fact-finding proceedings with no formal issues and no adverse parties ... and are not conducted for the purpose of determining the rights or liabilities of any person' (Title 49 Code of Federal Regulations section 831.4)."

Whenever these disclaimer chunks are included in the vector search results fed as context to the LLM, the final response is effectively poisoned by the disclaimer. While this could be addressed through a range of prompting and sanitization techniques, we chose to highlight this as an interesting failure mode of the conventional RAG approach.

8 RELATED WORK

Machine learning has revolutionized many aspects of data management over the last decade. First, there is a long line of work on natural language to SQL [12, 27, 28]. While the early work focused on building specialized models for this purpose, LLM-based approaches have proven superior in recent years⁷. Several recent works have focused on generating queries that incorporate LLM calls [18, 20, 34]. Our Luna framework is differentiated by a broader set of LLM-based operations, a focus on hierarchical documents, and our emphasis on interactive interfaces.

⁷See the leaderboard at https://yale-lily.github.io/spider.

There is also much work on using LLMs for specific ETL tasks such as entity resolution, information extraction, named entity recognition, and data cleaning [15, 23, 31]. In addition, there's also work in detecting and extracting tables using modern transformer models [25, 30], OCR [9, 14], and segmentation and labeling [5, 26]. To date, ours is the only work that combines the best of these into a unified cloud service and is deeply integrated with a declarative document processing framework for ETL like Sycamore.

DocParse is based on a long line of work in document segmentation. Current approaches commonly use object detection models such as DETR [7]. DocParse follows this approach and leverages Deformable DETR [36]. An alternate line of work has led to multi-modal models such as Donut [11] and LayoutLMv3 [8] that seek to directly solve document understanding tasks like visual question answering (VQA) without the need for explicit segmentation. Sycamore can eventually incorporate these models, but we continue to find segmentation valuable as we can index the segments to reduce work at query time.

There is less work on building end-to-end systems that encompass the entire spectrum of tasks from document parsing to ETL to querying for unstructured document analytics. Nonetheless, several similar efforts have started over the last year, including ZenDB [17], LOTUS [24], EVAPORATE [2], CHORUS [10], and Palimpzest [18]. TAG [6] is similar in spirit to Luna, but translates to SQL and does not include LLM-based operators after the database query. Most recently, DocETL [29] proposes to use agent-based rewrites to automatically optimize document processing pipelines for improved accuracy. In contrast to these works, while we have incorporated similar pipelining and rewriting mechanisms to start, we do not believe it is possible to fully automate and optimize the entire pipeline in practice. As a result, we have designed Aryn to facilitate a human-in-the-loop paradigm.

9 CONCLUSIONS AND FUTURE WORK

We are building Aryn to make unstructured data as easy to query as structured data by leveraging the immense potential of LLMs to process multi-modal datasets. We take a database-inspired approach of decomposing analytics queries into semantic query plans, which not only improves answer accuracy but also provides explainability and an avenue for intervention and iteration. At the same time, given the limitations of current models, we are building Aryn to be a human-in-the-loop system; as the LLMs improve, the need for human interventions will diminish, but it is unlikely to completely vanish. Our experience across a variety of application domains suggests that our overall design as well as Aryn's individual components are promising. Nonetheless, many challenges still remain. We need to continue to improve accuracy and make it easier to adapt Aryn to new use cases. We need ways to correct and evolve the system and automatically learn from users as they exercise the system. We need to extend Aryn to support joins and allow queries to incorporate external sources like data warehouses. Finally, we've just started the journey of improving performance, cost, and scale.

ACKNOWLEDGMENTS

We thank Amol Deshpande for his insights, detailed advice, and contributions from the start and throughout our journey at Aryn. We also thank our newest members who relentlessly make the platform better: Akarsh Gupta, Soeb Hussain, Dhruv Kaliraman, Soham Kasar, Abijit Puhare, Karan Sampath, Aanya Pratapneni, and Ritam Saha. We are indebted to our customers whose partnership makes our contribution unique and differentiated. Finally, we thank our reviewers for their suggestions.

REFERENCES

[1] Amazon. 2024. Amazon Textract. https://aws.amazon.com/textract/
[2] Simran Arora, Brandon Yang, Sabri Eyuboglu, Avanika Narayan, Andrew Hojel, Immanuel Trummer, and Christopher Ré. 2023. Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes. Proceedings of the VLDB Endowment 17, 2 (2023), 92–105.
[3] Aryn. 2024. Aryn/deformable-detr-DocLayNet. https://huggingface.co/Aryn/deformable-detr-DocLayNet
[4] Aryn. 2024. Sycamore Repository. https://github.com/aryn-ai/sycamore
[5] Christoph Auer, Ahmed Nassar, Maksym Lysak, Michele Dolfi, Nikolaos Livathinos, and Peter Staar. 2023. ICDAR 2023 Competition on Robust Layout Segmentation in Corporate Documents. In Document Analysis and Recognition – ICDAR 2023. Springer Nature Switzerland, 471–482. https://doi.org/10.1007/978-3-031-41679-8_27
[6] Asim Biswal, Liana Patel, Siddarth Jha, Amog Kamsetty, Shu Liu, Joseph E. Gonzalez, Carlos Guestrin, and Matei Zaharia. 2024. Text2SQL is Not Enough: Unifying AI and Databases with TAG. arXiv:2408.14717 [cs.DB] https://arxiv.org/abs/2408.14717
[7] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. 2020. End-to-End Object Detection with Transformers. In ECCV.
[8] Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, and Furu Wei. 2022. LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking. In Proceedings of the 30th ACM International Conference on Multimedia. 4083–4091.
[9] JaidedAI. 2024. EasyOCR. https://github.com/JaidedAI/EasyOCR
[10] Moe Kayali, Anton Lykov, Ilias Fountalis, Nikolaos Vasiloglou, Dan Olteanu, and Dan Suciu. 2024. CHORUS: Foundation Models for Unified Data Discovery and Exploration. Proceedings of the VLDB Endowment 17, 8 (2024), 2104–2114.
[11] Geewook Kim, Teakgyu Hong, Moonbin Yim, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park. 2021. Donut: Document Understanding Transformer without OCR. arXiv preprint arXiv:2111.15664 7, 15 (2021), 2.
[12] Hyeonji Kim, Byeong-Hoon So, Wook-Shin Han, and Hongrae Lee. 2020. Natural language to SQL: Where are we today? PVLDB 13, 10 (2020).
[13] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2021. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv:2005.11401 https://arxiv.org/abs/2005.11401
[14] Chenxia Li, Weiwei Liu, Ruoyu Guo, Xiaoting Yin, Kaitao Jiang, Yongkun Du, Yuning Du, Lingfeng Zhu, Baohua Lai, Xiaoguang Hu, Dianhai Yu, and Yanjun Ma. 2022. PP-OCRv3: More Attempts for the Improvement of Ultra Lightweight OCR System. arXiv:2206.03001 [cs.CV] https://arxiv.org/abs/2206.03001
[15] Yuliang Li, Jinfeng Li, Yoshihiko Suhara, AnHai Doan, and Wang-Chiew Tan. 2020. Deep entity matching with pre-trained language models. Proceedings of the VLDB Endowment 14, 1 (2020), 50–60.
[16] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. In Computer Vision – ECCV 2014, David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars (Eds.). Springer International Publishing, Cham, 740–755.
[17] Yiming Lin, Madelon Hulsebos, Ruiying Ma, Shreya Shankar, Sepanta Zeigham, Aditya G. Parameswaran, and Eugene Wu. 2024. Towards Accurate and Efficient Document Analytics with Large Language Models. arXiv:2405.04674 [cs.DB] https://arxiv.org/abs/2405.04674
[18] Chunwei Liu, Matthew Russo, Michael Cafarella, Lei Cao, Peter Baille Chen, Zui Chen, Michael Franklin, Tim Kraska, Samuel Madden, and Gerardo Vitagliano. 2024. A Declarative System for Optimizing AI Workloads. arXiv preprint arXiv:2405.14696 (2024).
[19] Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2023. Lost in the Middle: How Language Models Use Long Contexts. arXiv:2307.03172 [cs.CL] https://arxiv.org/abs/2307.03172
[20] Shicheng Liu, Jialiang Xu, Wesley Tjangnaka, Sina Semnani, Chen Yu, and Monica Lam. 2024. SUQL: Conversational Search over Structured and Unstructured Data with Large Language Models. In NAACL. 4535–4555.
[21] Microsoft. 2024. Azure AI Document Intelligence. https://azure.microsoft.com/en-us/products/ai-services/ai-document-intelligence
[22] Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, and Ion Stoica. 2018. Ray: A Distributed Framework for Emerging AI Applications. In OSDI.
[23] Avanika Narayan, Ines Chami, Laurel Orr, and Christopher Ré. 2022. Can Foundation Models Wrangle Your Data? PVLDB (2022).
[24] Liana Patel, Siddharth Jha, Carlos Guestrin, and Matei Zaharia. 2024. LOTUS: Enabling Semantic Queries with LLMs Over Tables of Unstructured and Structured Data. arXiv preprint arXiv:2407.11418 (2024).
[25] ShengYun Peng, Aishwarya Chakravarthy, Seongmin Lee, Xiaojing Wang, Rajarajeswari Balasubramaniyan, and Duen Horng Chau. 2024. UniTable: Towards a Unified Framework for Table Recognition via Self-Supervised Pretraining. arXiv:2403.04822 [cs.CV] https://arxiv.org/abs/2403.04822
[26] Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S Nassar, and Peter Staar. 2022. DocLayNet: A Large Human-Annotated Dataset for Document-Layout Segmentation. In KDD (KDD '22). ACM.
[27] Mohammadreza Pourreza, Hailong Li, Ruoxi Sun, Yeounoh Chung, Shayan Talaei, Gaurav Tarlok Kakkar, Yu Gan, Amin Saberi, Fatma Ozcan, and Sercan O. Arik. 2024. CHASE-SQL: Multi-Path Reasoning and Preference Optimized Candidate Selection in Text-to-SQL. arXiv:2410.01943 [cs.LG] https://arxiv.org/abs/2410.01943
[28] Mohammadreza Pourreza and Davood Rafiei. 2024. DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction. Advances in Neural Information Processing Systems 36 (2024).
[29] Shreya Shankar, Aditya G. Parameswaran, and Eugene Wu. 2024. DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing. arXiv:2410.12189 [cs.DB] https://arxiv.org/abs/2410.12189
[30] Brandon Smock, Rohith Pesala, and Robin Abraham. 2022. PubTables-1M: Towards Comprehensive Table Extraction From Unstructured Documents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 4634–4642.
[31] Yoshihiko Suhara, Jinfeng Li, Yuliang Li, Dan Zhang, Çağatay Demiralp, Chen Chen, and Wang-Chiew Tan. 2022. Annotating Columns with Pre-trained Language Models. In Proceedings of the 2022 International Conference on Management of Data. 1493–1503.
[32] The Aryn Team. 2024. Benchmarking PDF Segmentation and Parsing Models.
[33] Unstructured. 2024. Unstructured Serverless API. https://unstructured.io/api-key-hosted
[34] Matthias Urban and Carsten Binnig. 2023. CAESURA: Language Models as Multi-Modal Query Planners. CIDR (2023).
[35] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauly, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. In NSDI.
[36] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. 2020. Deformable DETR: Deformable Transformers for End-to-End Object Detection. arXiv preprint arXiv:2010.04159 (2020).
