Semantic Operators: A Declarative Model for Rich, AI-based Data Processing

Patel, Liana; Jha, Siddharth; Pan, Melissa; Gupta, Harshit; Asawa, Parth; Guestrin, Carlos; Zaharia, Matei

Abstract:The semantic capabilities of large language models (LLMs) have the potential to enable rich analytics and reasoning over vast knowledge corpora. Unfortunately, existing systems either empirically optimize expensive LLM-powered operations with no performance guarantees, or serve a limited set of row-wise LLM operations, providing limited robustness, expressiveness and usability. We introduce semantic operators, the first formalism for declarative and general-purpose AI-based transformations based on natural language specifications (e.g., filtering, sorting, joining or aggregating records using natural language criteria). Each operator opens a rich space for execution plans, similar to relational operators. Our model specifies the expected behavior of each operator with a high-quality gold algorithm, and we develop an optimization framework that reduces cost, while providing accuracy guarantees with respect to a gold algorithm. Using this approach, we propose several novel optimizations to accelerate semantic filtering, joining, group-by and top-k operations by up to $1,000\times$. We implement semantic operators in the LOTUS system and demonstrate LOTUS' effectiveness on real, bulk-semantic processing applications, including fact-checking, biomedical multi-label classification, search, and topic analysis. We show that the semantic operator model is expressive, capturing state-of-the-art AI pipelines in a few operator calls, and making it easy to express new pipelines that match or exceed quality of recent LLM-based analytic systems by up to $170\%$, while offering accuracy guarantees. Overall, LOTUS programs match or exceed the accuracy of state-of-the-art AI pipelines for each task while running up to $3.6\times$ faster than the highest-quality baselines. LOTUS is publicly available at this https URL.

Subjects:	Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2407.11418 [cs.DB]
	(or arXiv:2407.11418v3 [cs.DB] for this version)
	https://doi.org/10.48550/arXiv.2407.11418

Computer Science > Databases

Title:Semantic Operators: A Declarative Model for Rich, AI-based Data Processing

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators