
LLM As DBA

Xuanhe Zhou, Guoliang Li, Zhiyuan Liu
Tsinghua University, Beijing, China

arXiv:2308.05481v1 [cs.DB] 10 Aug 2023

ABSTRACT

Database administrators (DBAs) play a crucial role in managing, maintaining and optimizing a database system to ensure data availability, performance, and reliability. However, it is hard and tedious for DBAs to manage a large number of database instances (e.g., millions of instances on cloud databases). Recently, large language models (LLMs) have shown great potential to understand valuable documents and accordingly generate reasonable answers. Thus, we propose D-Bot, an LLM-based database administrator that can continuously acquire database maintenance experience from textual sources, and provide reasonable, well-founded, in-time diagnosis and optimization advice for target databases. This paper presents a revolutionary LLM-centric framework for database maintenance, including (i) database maintenance knowledge detection from documents and tools, (ii) tree of thought reasoning for root cause analysis, and (iii) collaborative diagnosis among multiple LLMs. Our preliminary experimental results show that D-Bot can efficiently and effectively diagnose the root causes, and our code is available at github.com/TsinghuaDatabaseGroup/DB-GPT.

Figure 1: LLM As DBA. (D-Bot learns from documents and the system configuration, reasons over anomalies — e.g., Thought: high memory usage seems to be caused by poor join performance and much inactive memory; Reasoning: poor joins can be solved by plan optimization; Action: optimize_query_plan — and then rewrites logical queries, optimizes query plans, or adds lacking indexes via the Query Rewriter, Query Planner, and Query Executor.)
1 INTRODUCTION

Limitations of DBAs. Currently, most companies still rely on DBAs for database maintenance (DM, e.g., tuning, configuring, diagnosing, optimizing) to ensure high performance, availability, and reliability of the databases. However, there is a significant gap between DBAs and DM tasks. First, it takes a long time to train a DBA: the numerous relevant documents (e.g., administrator guides) can span over 10,000 pages for just one database product, and it takes DBAs several years to even partially grasp the skills by applying them in real practice. Second, it is hard to obtain enough DBAs to manage a large number of database instances, e.g., millions of instances on cloud databases. Third, a DBA may not provide in-time responses in emergent cases (especially for correlated issues across multiple database modules), causing great financial losses.

Limitations of Database Tools. Many database products are equipped with semi-automatic maintenance tools to relieve the pressure on human DBAs [5, 6, 10–12]. However, they have several limitations. First, they are built on empirical rules [4, 24] or small-scale ML models (e.g., classifiers [13]), which have poor text-processing capability and cannot utilize available documents to answer basic questions. Second, they cannot flexibly generalize to scenario changes: for empirical methods, it is tedious to manually update rules against the newest versions of documents, while learned methods require costly model retraining and are not suitable for online maintenance. Third, they cannot reason about the root cause of an anomaly the way DBAs do, such as looking up more system views based on the initial analysis results. This capability is vital to detect useful information in complex cases.

Our Vision: A Human-Beyond Database Administrator. To this end, we aim to build a human-beyond "DBA" that can tirelessly learn from documents (see Figure 1), which, given a set of documents, automatically (1) learns experience from the documents, (2) obtains status metrics by interacting with the database, (3) reasons about possible root causes with the abnormal metrics, and (4) accordingly gives optimization advice by calling proper tools.

Challenges. Recent advances in Large Language Models (LLMs) have demonstrated superiority in understanding natural language, generating basic code, and using external tools. However, leveraging LLMs to design a "human-beyond DBA" is still challenging.

(1) Experience learning from documents. Just like human learners taking notes in class, although LLMs have undergone training on vast corpora, important knowledge points (e.g., diagnosis experience) cannot be easily utilized without careful attention. Moreover, most texts are long documents (with varying input lengths and section correlations), and different formats of the extracted experience can greatly affect the utilization capability of the LLM.

(2) Reasoning by interacting with the database. With the extracted experience, we need to inspire the LLM to reason about the given anomalies. Different from basic prompt design in machine learning, database diagnosis is an interactive procedure with the database (e.g., looking up system views or metrics). However, LLM responses are often untrustworthy (the "hallucination" problem), and it is critical to design strategies that guide the LLM to utilize the proper interfaces of the database and derive reasonable analysis.

(3) Mechanism for communication across multiple LLMs. Similar to human beings, a single LLM may be stuck in sub-optimal solutions, and it is vital to derive a framework in which multiple LLMs collaborate to tackle complex database problems. By pooling their collective intelligence, these LLMs can provide comprehensive and smart solutions that a single LLM, or even a skilled human DBA, would struggle to think out.
Idea of LLM as DBA. Based on the above observations, we introduce D-Bot, an LLM-based database administrator. First, D-Bot transforms documents into experiential knowledge by dividing them into manageable chunks and summarizing them for further extraction of maintenance insights with the LLM. Second, it iteratively generates and assesses different formats of task descriptions to assist the LLM in better understanding the maintenance tasks. Third, D-Bot utilizes external tools by employing matching algorithms to select appropriate tools and by providing the LLM with instructions on how to use the APIs of the selected tools. Once equipped with the experience, tools, and input prompt, the LLM can detect anomalies, analyze root causes, and provide suggestions, following a tree of thought strategy to revert to previous steps if a failure occurs. Moreover, D-Bot promotes collaborative diagnosis by allowing multiple LLMs to communicate based on predefined environmental settings, inspiring more robust solutions via debate-like communications.

Contributions. We make the following contributions.
(1) We design an LLM-centric database maintenance framework, and explore its potential to overcome the limitations of traditional strategies.
(2) We propose an effective data collection mechanism by (i) detecting experiential knowledge from documents and (ii) leveraging external tools with matching algorithms.
(3) We propose a root cause analysis method that utilizes an LLM and a tree search algorithm for accurate diagnosis.
(4) We propose an innovative concept of collaborative diagnosis among LLMs, thereby offering more comprehensive and robust solutions to complex database problems.
(5) Our preliminary experimental results show that D-Bot can efficiently and effectively diagnose the root causes.

2 PRELIMINARIES

Database Anomalies. In databases, there are five common problems that can negatively affect the normal execution status. (1) Running Slow: the database exhibits a longer response time than expected, leading to bad execution performance. (2) Full Disk Capacity: the database's disk space is exhausted, preventing it from storing new data. (3) Execution Errors: the database experiences errors, potentially due to improper error handling in the application (e.g., leaking sensitive data or system details) or issues within the database (e.g., improper data types). (4) Hanging: the database becomes unresponsive, which is usually caused by long-running queries, deadlocks, or resource contention. (5) Crashing: the database unexpectedly shuts down, making data inaccessible. For a mature database product, each anomaly type is explained in the documentation and is thus suitable to be learned by LLMs.

Observation Tools for Anomaly Detection. "Observability of the database" is vital to detect the above anomalies, and covers logs, metrics, and traces. (1) Logs are records of database events. For example, PostgreSQL supports slow query logs (with error messages that can help debug and solve execution issues), but these logs may record a large amount of data and are generally not enabled in the online stage. (2) Metrics capture aggregated database and system statistics. For example, views like pg_stat_statements record the templates and statistics of slow queries; tools like Prometheus [20] provide numerous monitoring metrics, making it possible to capture the real-time system status. (3) Traces provide visibility into how requests behave during execution in the database. Different from logs, which help to identify the database problem, traces help to locate the specific abnormal workload or application.

Optimization Tools for Anomaly Solving. Users are mainly concerned with how to restore normal status after an anomaly occurs. Here we showcase some optimization tools. (1) For slow queries, since most open-source databases are weak in logical transformation, there are external engines (e.g., Calcite with ~120 query rewrite rules) and tuning guides (e.g., Oracle with over 34 transformation suggestions) that help to optimize slow queries. (2) For knob tuning, many failures (e.g., max_connections in Postgres) or bad performance (e.g., memory management knobs) are correlated with database knobs (e.g., for a slow workload, increase innodb_buffer_pool_size in MySQL by 5% if the memory usage is lower than 60%; see the sketch after this paragraph). Similarly, there are index tuning rules that generate potentially useful indexes (e.g., taking columns within the same predicate as a composite index). Besides, we can utilize more advanced methods, such as selecting among heuristic methods [3, 21, 22] and learned methods [7–9, 15, 23, 25, 26] for problems like lacking indexes, which is not within the scope of this paper.
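To make the flavor of such empirical rules concrete, the following is a minimal sketch of the knob-tuning rule above; the function name and the metric dictionary are hypothetical, and a real tool would read live metrics from the monitoring stack.

    # Minimal sketch of an empirical knob-tuning rule (hypothetical
    # helper names; thresholds taken from the rule described above).

    def suggest_knob_changes(metrics: dict, knobs: dict) -> dict:
        """Return suggested knob updates for a slow workload."""
        suggestions = {}
        # Rule: if memory usage is below 60%, grow the buffer pool by 5%.
        if metrics.get("memory_usage", 1.0) < 0.60:
            current = knobs["innodb_buffer_pool_size"]
            suggestions["innodb_buffer_pool_size"] = int(current * 1.05)
        return suggestions

    # Example usage with made-up monitoring values.
    metrics = {"memory_usage": 0.45}
    knobs = {"innodb_buffer_pool_size": 8 * 1024**3}  # 8 GB
    print(suggest_knob_changes(metrics, knobs))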
We aim to design D-Bot, an LLM-based DBA, for automatically diagnosing database anomalies, using the LLM to directly (or, by calling appropriate tools, indirectly) provide the root causes.

3 THE VISION OF D-BOT

Existing LLMs are criticized for problems like "Brain in a Vat" [14]. Thus, it is essential to establish close connections between LLMs and the target database, allowing us to guide LLMs in effectively maintaining the database's health and functionality. Hence, we propose D-Bot, which is composed of two stages.

First, in the preparation stage, D-Bot generates experience (from documents) and a prompt template (from diagnosis samples), both of which are vital to guide online maintenance.

• Documents → Experience. Given a large volume of diverse, long, unstructured database documents (e.g., database manuals, white papers, blogs), we first split each document into chunks that can be processed by the LLM. To aggregate correlated chunks together (e.g., chunk v_i that explains the meaning of "bloat-table" and chunk v_j that utilizes "bloat-table" in root cause analysis), we generate a summary for each chunk based on both its content and its subsections. Finally, we utilize the LLM to extract maintenance experience from chunks with similar summaries (Section 4).

• Prompt Template Generation. To help the LLM better understand the DM tasks, we iteratively generate and score different formats of task descriptions using DM samples (i.e., given the anomaly and solutions, ask the LLM to describe the task), and adopt the task description that both scores high performance and is sensible to human DBAs (in case of learning bias) for LLM diagnosis (Section 5).

Figure 2: Overview of D-Bot

Second, in the maintenance stage, given an anomaly, D-Bot iteratively reasons about the possible root causes by taking advantage of external tools and multi-LLM communications.

• External Tool Learning. For a given anomaly, D-Bot first matches relevant tools using algorithms like Dense Retrieval. Next, D-Bot provides the tool APIs together with their descriptions to the LLM (e.g., function calls in GPT-4). After that, the LLM can utilize these APIs to obtain metric values or optimization solutions. For example, in PostgreSQL, the LLM can acquire the templates of the slowest queries from the pg_activity view. If these queries consume much CPU resource (e.g., over 80%), they could be root causes and be optimized with the rewriting tool (Section 6).

• LLM Diagnosis. Although the LLM can understand the functions of tool APIs, it may still generate incorrect API requests, leading to diagnosis failures. To solve this problem, we employ the tree of thought strategy, where the LLM can go back to previous steps if the current step fails. This significantly increases the likelihood of the LLM arriving at reasonable diagnosis results (Section 7).

• Collaborative Diagnosis. A single LLM may execute only the initial diagnosis steps and end up early, leaving the problem inadequately resolved. To address this limitation, we propose the use of multiple LLMs working collaboratively. Each LLM plays a specific role and communicates according to the environment settings (e.g., priorities, speaking orders). In this way, we can enable LLMs to engage in debates and inspire more robust solutions (Section 8).

4 EXPERIENCE DETECTION FROM DOCUMENTS

Document learning aims to extract experience segments from textual sources, where the extracted segments are potentially useful in different DM cases. For instance, when analyzing the root causes of performance degradation, the LLM utilizes the "many_dead_tuples" experience to decide whether dead tuples have negatively affected the efficiency of index lookups and scans.

Desired Experience Format. To ensure the LLM can efficiently utilize the experience, each experience fragment should include four fields, as shown in the following example. "name" helps the LLM to understand the overall function; "content" explains how the root cause can affect the database performance (e.g., the performance hazards of many dead tuples); "metrics" provides hints for matching with this experience segment, i.e., the LLM will utilize this experience if the abnormal metrics exist in the "metrics" field; and "steps" provides the detailed procedure of checking whether the root cause exists by interacting with the database (e.g., obtaining the ratio of dead tuples and live tuples from the table statistics views).

    "name": "many_dead_tuples",
    "content": "If the accessed table has too many dead tuples, it can cause bloat-table and degrade performance",
    "metrics": ["live_tuples", "dead_tuples", "table_size", "dead_rate"],
    "steps": "For each accessed table, if the total number of live tuples and dead tuples is within an acceptable limit (1000), and table size is not too big (50MB), it is not a root cause. Otherwise, if the dead rate also exceeds the threshold (0.02), it is considered a root cause. And we suggest to clean up dead tuples in time."
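As an illustration of how the "metrics" field can gate experience usage, here is a minimal sketch that selects the experience segments relevant to a set of abnormal metrics; the segment follows the four-field format above, while the function name is our own.

    # Minimal sketch: select experience segments whose "metrics" field
    # overlaps the currently abnormal metrics (four-field format above).

    EXPERIENCE = [{
        "name": "many_dead_tuples",
        "content": "If the accessed table has too many dead tuples, "
                   "it can cause bloat-table and degrade performance",
        "metrics": ["live_tuples", "dead_tuples", "table_size", "dead_rate"],
        "steps": "For each accessed table, check tuple counts, table size, "
                 "and dead rate against the thresholds ...",
    }]

    def match_experience(abnormal_metrics: set[str]) -> list[dict]:
        """Return segments whose hint metrics intersect the abnormal ones."""
        return [seg for seg in EXPERIENCE
                if abnormal_metrics & set(seg["metrics"])]

    # Example: dead_rate was flagged abnormal by the monitoring tools.
    for seg in match_experience({"dead_rate", "cpu_usage"}):
        print(seg["name"])  # -> many_dead_tuples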
LLM for Experience Detection. This component aims to detect experience segments that follow the above format. Since different paragraphs within a long document may be correlated with each other (e.g., the concept of "bloat-table" appearing in "many_dead_tuples" is introduced in another section), we explain how to extract experience segments without losing the technical details.

Step 1: Segmentation. Instead of partitioning documents into fixed-length segments, we divide them based on the section structure and content. Initially, the document is divided into chunks using the section separators. If a chunk exceeds the maximum chunk size (e.g., 1k tokens), we further divide it recursively into smaller chunks.

Step 2: Chunk Summary. Next, for each chunk denoted as x, a summary x.summary is created by feeding the content of x into the LLM with a summarization prompt p_summarize:

    p_summarize = Summarize the provided chunk briefly ... Your summary will serve as an index for others to find technical details related to database maintenance ... Pay attention to examples even if the chunk covers other topics.

The generated x.summary acts as a textual index of x, enabling the matching of chunks containing similar content. A minimal sketch of these two steps follows.
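This sketch assumes markdown-style section headers as separators and an llm(prompt) text-completion callable; neither assumption is from the paper.

    # Minimal sketch of Step 1 (recursive, section-aware splitting) and
    # Step 2 (LLM-generated summaries used as a textual index).
    import re

    MAX_TOKENS = 1000  # maximum chunk size (e.g., 1k tokens)

    def n_tokens(text: str) -> int:
        return len(text.split())  # crude token estimate for illustration

    def split_chunks(doc: str) -> list[str]:
        chunks = re.split(r"\n(?=#+ )", doc)  # split at section separators
        out = []
        for c in chunks:
            if n_tokens(c) <= MAX_TOKENS:
                out.append(c)
            else:  # recursively divide oversized chunks (naive halving here)
                mid = len(c) // 2
                out += split_chunks(c[:mid]) + split_chunks(c[mid:])
        return out

    def summarize_chunks(chunks: list[str], llm) -> dict[str, str]:
        """Step 2: build the summary index {summary: chunk}."""
        p_summarize = ("Summarize the provided chunk briefly ... Your summary "
                       "will serve as an index for others to find technical "
                       "details related to database maintenance ...")
        return {llm(p_summarize + "\n\n" + c): c for c in chunks}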
Step 3: Experience Extraction. Once the summaries of the chunks are generated, the LLM parses the content of each chunk and compares it with the summaries of other chunks having similar content, guided by the extraction prompt p_extract. This way, experience segments that correlate with the key points of the summaries are detected.

    p_extract = Given a chunk summary, extract diagnosis experience from the chunk. If uncertain, explore diagnosis experience in chunks with similar summaries.

In our implementation, given a document, we use the LLM to extract experience segments into the above four-field format.
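A sketch of Step 3 under the same assumptions (an llm callable plus an embed function for summary similarity, both hypothetical):

    # Minimal sketch of Step 3: group chunks whose summaries are similar,
    # then prompt the LLM to extract four-field experience segments.
    import json

    P_EXTRACT = ("Given a chunk summary, extract diagnosis experience from "
                 "the chunk. If uncertain, explore diagnosis experience in "
                 "chunks with similar summaries. Answer in JSON with the "
                 "fields name/content/metrics/steps.")

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = (sum(x * x for x in a) * sum(y * y for y in b)) ** 0.5
        return dot / norm if norm else 0.0

    def extract_experience(index: dict[str, str], llm, embed, sim=0.8):
        segments = []
        summaries = list(index)
        vecs = {s: embed(s) for s in summaries}
        for s in summaries:
            # attach chunks whose summaries resemble the current one
            related = [index[t] for t in summaries
                       if t != s and cosine(vecs[s], vecs[t]) >= sim]
            answer = llm("\n\n".join([P_EXTRACT, index[s], *related]))
            segments.append(json.loads(answer))
        return segments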
Detected Maintenance Experience. In Figure 3, we showcase the simplified diagnosis procedure together with some necessary details, coming from chunks that originally appeared in different sections of the given documents (e.g., a maintenance guide of over 100 pages).

1. Background Understanding. It is crucial to grasp the context of system performance, such as recent changes in customer expectations, workload type, or even system settings.

2. Database Pressure Checking. This step identifies database bottlenecks, such as tracking CPU usage and active sessions, and monitoring system views (e.g., pg_stat_activity and pgxc_stat_activity) to focus on non-idle sessions.

3. Application Pressure Checking. If there is no apparent pressure on the database or the resource consumption is very low (e.g., CPU usage below 10% and only a few active sessions), it is suggested to investigate the application side, such as exhausted application server resources, high network latency, or slow processing of queries by the application servers.

4. System Pressure Checking. The focus shifts to examining the system resources where the database is located, including CPU usage, IO status, and memory consumption.

5. Database Usage Checking. Lastly, we can investigate suboptimal database usage behaviors, such as (1) addressing concurrency issues caused by locking waits, (2) examining database configurations, (3) identifying abnormal wait events (e.g., io_event), (4) tackling long/short-term performance declines, and (5) optimizing poorly performing queries that may be causing bottlenecks. A sketch of this checklist as an ordered pipeline appears after this list.
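To illustrate how such an extracted procedure could drive an automated agent, here is a minimal sketch that encodes the five checks as an ordered pipeline; the check implementations are hypothetical placeholders, not from the paper.

    # Minimal sketch: the extracted five-step procedure as an ordered
    # pipeline. Each check inspects the diagnosis context and returns
    # findings (or None); the check bodies are placeholders.

    def background_understanding(ctx): ...
    def database_pressure(ctx): ...      # e.g., CPU usage, non-idle sessions
    def application_pressure(ctx): ...   # e.g., network latency, app servers
    def system_pressure(ctx): ...        # e.g., host CPU / IO / memory
    def database_usage(ctx): ...         # e.g., lock waits, bad configurations

    CHECKLIST = [background_understanding, database_pressure,
                 application_pressure, system_pressure, database_usage]

    def run_diagnosis(ctx: dict) -> list:
        findings = []
        for check in CHECKLIST:
            result = check(ctx)
            if result is not None:
                findings.append((check.__name__, result))
        return findings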
5 DIAGNOSIS PROMPT GENERATION

Instead of directly mapping the extracted experience to new cases, we next explore how to teach LLMs to (1) understand the database maintenance tasks and (2) reason over the root causes by themselves.

Input Enrichment. With a database anomaly x as input, we can enrich x with additional descriptive information, yielding the so-called input prompt x'. On one hand, x' helps the LLM better understand the task intent. On the other hand, since database diagnosis is generally a complex task that involves multiple steps, x' preliminarily implies how to divide the complex task into sub-tasks in a proper order, thereby further enhancing the reasoning of the LLM.

From our observation, the quality of x' can greatly impact the performance of the LLM on maintenance tasks [27] (Figure 2). Thus, we first utilize the LLM to suggest candidate prompts based on a small set of input-output pairs (e.g., 5 pairs per prompt). Second, we rank these generated prompts based on a customized scoring function (e.g., the ratio of detected root causes) and reserve the best prompts (e.g., top-10) as candidates. Finally, we select the best one to serve as the input prompt template for the incoming maintenance tasks.
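A minimal sketch of this generate-score-select loop, assuming an llm callable and labeled diagnosis samples; all names here are ours, not the paper's.

    # Minimal sketch: generate candidate task descriptions, score them on
    # labeled (anomaly, root_causes) samples, keep the top-10, pick the best.

    def score_prompt(template: str, samples: list[dict], llm) -> float:
        """Ratio of detected root causes over the diagnosis samples."""
        hits = 0
        for s in samples:
            answer = llm(template.format(anomaly=s["anomaly"]))
            hits += all(cause in answer for cause in s["root_causes"])
        return hits / len(samples)

    def select_template(samples, llm, n_candidates=50):
        seed = "\n".join(f"IN: {s['anomaly']}\nOUT: {s['root_causes']}"
                         for s in samples[:5])  # e.g., 5 pairs per prompt
        candidates = [llm("Describe the task solved by these pairs as an "
                          "instruction with an {anomaly} placeholder:\n" + seed)
                      for _ in range(n_candidates)]
        ranked = sorted(candidates,
                        key=lambda t: score_prompt(t, samples, llm),
                        reverse=True)
        top10 = ranked[:10]  # reserve the best prompts as candidates
        return top10[0]      # use the best one as the input prompt template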
6 EXTERNAL TOOL LEARNING

As we know, the efficient use of tools is a hallmark of human cognitive capabilities [17, 18]. When human beings encounter a new tool, they start to understand it and explore how it works, i.e., taking it as something with particular functions and trying to understand what those functions are used for. Likewise, we aim to inspire a similar ability within the LLM.

Tool Retrieval. We first retrieve the appropriate tools for the diagnosis task at hand, represented as D_t. Several retrieval methods can be used, such as BM25, LLM Embeddings, and Dense Retrieval.

(1) BM25 [19] is a common probabilistic retrieval method that ranks the tool descriptions D_t based on their relevance to the given anomaly Q.

(2) LLM Embeddings convert the tool descriptions D_t into embeddings E_t using the LLM, i.e., E_t = LLM_E(D_t). These embeddings capture semantic meanings in a multi-dimensional space, helping to find related tools even in the absence of keyword overlap.

(3) Dense Retrieval uses neural networks N to generate dense representations of both the anomaly Q and the tool descriptions D_t, denoted Dense_Q and Dense_D respectively. To retrieve the relevant tools, we calculate the similarity between Dense_Q and Dense_D, and rank the tools by these similarity scores.

The proper method for tool retrieval depends on the specific scenario. BM25 is efficient for quick results over large volumes of API descriptions with clear anomaly characteristics. LLM Embeddings excel at capturing semantic and syntactic relationships, which is especially useful when relevance is not obvious from keywords (e.g., different metrics with similar functions). Dense Retrieval is ideal for vague anomalies, since it captures context and semantic meaning, but it is more computationally costly.
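A minimal sketch of embedding-based tool retrieval (cosine similarity over an assumed embed helper; apart from optimize_query_plan, which appears in Figure 1, the tool names and descriptions are illustrative):

    # Minimal sketch: rank tool descriptions by embedding similarity to the
    # anomaly description. `embed(text) -> list[float]` is an assumed helper.

    TOOLS = {
        "optimize_query_plan": "Re-plan slow queries with poor join performance",
        "add_lacking_indexes": "Create indexes for columns in slow predicates",
        "tune_memory_knobs":   "Adjust memory-related knobs such as work_mem",
    }

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = (sum(x * x for x in a) * sum(y * y for y in b)) ** 0.5
        return dot / norm if norm else 0.0

    def retrieve_tools(anomaly: str, embed, k: int = 2) -> list[str]:
        q = embed(anomaly)  # Dense_Q
        scored = [(cosine(q, embed(desc)), name)  # similarity to Dense_D
                  for name, desc in TOOLS.items()]
        return [name for _, name in sorted(scored, reverse=True)[:k]]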
7 LLM DIAGNOSIS

Tree Search Algorithm using LLM. To avoid diagnosis failures caused by incorrect actions (e.g., a non-existent API name) derived by the LLM, we propose to utilize the tree of thought strategy, which guides the LLM to go back to previous actions if the current action fails.

Step 1: Tree Structure Initialization. We initialize a tree structure whose root node is the diagnosis request (Figure 4). Utility methods are used to manipulate the tree structure, and the UCT score of a node v is computed based on the modifications during planning, i.e.,

    UCT(v) = w(v)/n(v) + C * sqrt(ln(N)/n(v)),

where n(v) denotes the selection frequency and w(v) denotes the success ratio of detecting root causes. Note that if the action of v fails to call a tool API, w(v) is set to -1.

Step 2: Simulate Execution. This step kicks off the execution of simulations starting from the root node of the tree. It involves selecting nodes based on a specific standard (e.g., detected abnormal metrics). If the criteria for selecting a new node are met, a new node is chosen; otherwise, the node with the highest UCT value is selected.
Figure 3: The outline of diagnosis experience extracted from documents.

Figure 4: Example LLM diagnosis by tree of thought
Step 3: Existing Node Reflection. For each node on the path from the root node to the selected node, reflections are generated based on the decisions made at previous nodes. For example, we rely on the LLM to rethink the benefit of analyzing non-resource-relevant metrics. If the LLM decides the action cannot find any useful information, the UCT value is reduced and set to that of its parent node. In this way, we can enhance the diagnosis efficiency.

Step 4: Terminal Condition. If the LLM cannot find any more root causes (corresponding to a leaf node) within a threshold number of attempts (e.g., five), the algorithm ends and the LLM outputs the final analysis based on the detected root causes.
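A minimal sketch of the four steps follows, assuming hypothetical helpers: expand proposes candidate LLM actions for a node, execute runs one action against the database, and reflect asks the LLM whether a step was useful.

    # Minimal sketch of the UCT-guided tree search (Steps 1-4). The helpers
    # `expand`, `execute`, and `reflect` are assumed, not from the paper.
    import math

    class Node:
        def __init__(self, action=None, parent=None):
            self.action, self.parent, self.children = action, parent, []
            self.n = 0    # selection frequency n(v)
            self.w = 0.0  # success ratio of detecting root causes w(v)

    def uct(v: Node, total: int, c: float = 1.4) -> float:
        if v.n == 0:
            return float("inf")  # always try unvisited nodes first
        return v.w / v.n + c * math.sqrt(math.log(total) / v.n)

    def diagnose(root: Node, expand, execute, reflect, budget=20, patience=5):
        total, fruitless, causes = 0, 0, []
        while total < budget and fruitless < patience:  # Step 4: terminal check
            node = root
            while node.children:                        # Step 2: select by UCT
                node = max(node.children, key=lambda v: uct(v, total))
            actions = expand(node)                      # propose next actions
            if not actions:
                break
            node.children = [Node(a, node) for a in actions]
            leaf = node.children[0]
            ok, found = execute(leaf.action)            # call a tool API
            leaf.n += 1
            leaf.w = -1 if not ok else leaf.w + bool(found)  # failed API call
            if found:
                causes.append(found); fruitless = 0
            else:
                fruitless += 1
                if reflect(leaf):                       # Step 3: reflection
                    leaf.w = leaf.parent.w  # demote toward the parent's value
            total += 1
        return causes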
8 COLLABORATIVE DIAGNOSIS FOR COMPLEX CASES

A single LLM may be limited in its ability to fully resolve a problem (e.g., getting stuck in the initial steps). Collaborative diagnosis instead utilizes multiple LLMs to collectively address complex cases by leveraging their distinct role capabilities. This section introduces the communicative framework for database diagnosis [1, 16].

• Agents. In the communicative framework, agents can be undertaken by human beings or LLMs. Humans can provide LLM agents with scenario requirements (e.g., business changes over the incoming period) and prior knowledge (e.g., historical anomalies). On the other hand, each LLM agent is dedicated to a distinct domain of functions. For example, we include three LLM agents in the initial implementation: (1) the Chief DBA is responsible for collaboratively diagnosing and detecting root causes with the other agents; (2) the CPU Agent is specialized in CPU usage analysis and diagnosis; and (3) the Memory Agent focuses on memory usage analysis and diagnosis. Each LLM agent can automatically invoke tool APIs to retrieve database statistics, extract external knowledge, and conduct optimizations. For instance, the CPU Agent utilizes the monitoring tool Prometheus to check CPU usage metrics within specific time periods, and determines the root causes of high CPU usage by matching against the extracted experience (Section 4). Note that if the CPU/Memory agents cannot report useful analysis, the Chief DBA is responsible for detecting other potential problems, such as those on the application side.

• Environment Settings. We need to set a series of principles for the agents to communicate efficiently, such as: (1) Chat Order: to avoid mutual negative influence, we only allow one LLM agent to "speak" (i.e., append its analysis results to the chat records so the other agents can see them) at a time. To ensure flexible chat (e.g., an agent that cannot detect anything useful should not speak), we rely on the Chief DBA to decide which agent speaks in each iteration (diagnosis scheduling). (2) Visibility: by default, we assume the analysis results of the agents can be seen by each other, i.e., they share the same chat records. In the future, we can split agents into different groups, where each group is in charge of different database clusters/instances and the groups do not share chat records. (3) Selector: the selector is vital to filter out invalid analysis that may mislead the diagnosis directions. (4) Updater: the updater works to update agent memory based on the historical records.

• Chat Summary. For a complex database problem, agents may require dozens of iterations to give an in-depth analysis, leading to extremely long chat records. Thus, it is vital to effectively summarize the critical information from the chat records without exceeding the maximal prompt length of the LLM. To this end, we progressively summarize the lines of a record used with tools, including the inputs for certain tools and the results returned by these tools. Based on the current summary, the model extracts the goals intended to be solved with each call to the tool and forms a new summary, e.g.,
    [Current summary]
    - I know the start and end time of the anomaly.

    [New Record]
    Thought: Now that I have the start and end time of the anomaly, I need to diagnose the causes of the anomaly.
    Action: is_abnormal_metric
    Action Input: {"start_time": 1684600070, "end_time": 1684600074, "metric_name": "cpu_usage"}
    Observation: "The metric is abnormal"

    [New summary]
    - I know the start and end time of the anomaly.
    - I searched for is_abnormal_metric, and I now know that the CPU usage is abnormal.
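A minimal sketch of this progressive summarization, assuming an llm callable:

    # Minimal sketch: fold each new tool-use record into a running summary
    # so the chat history stays within the LLM's prompt limit.

    P_FOLD = ("Given the current summary and a new record of a tool call "
              "(thought, action, input, observation), extract the goal of "
              "the call and produce an updated summary:\n\n"
              "[Current summary]\n{summary}\n\n[New Record]\n{record}\n\n"
              "[New summary]\n")

    def progressive_summary(records: list[str], llm, summary: str = "") -> str:
        for record in records:
            summary = llm(P_FOLD.format(summary=summary, record=record))
        return summary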

With this communicative framework and well-defined communication principles, the collaborative diagnosis process among human and LLM agents becomes more efficient (e.g., parallel diagnosis) and more effective (e.g., chat records can trigger in-depth metric observation and root cause analysis).

Figure 5: A basic demonstration of D-Bot.
9 PRELIMINARY EXPERIMENT RESULTS

Demonstration. As illustrated in Figure 5, the Chief DBA monitors the status of the database to detect anomalies. Upon recognizing a new anomaly, the Chief DBA notifies both the Memory Agent and the CPU Agent. These agents independently assess the potential root causes and communicate their findings (the root causes and recommended solutions) to the Chief DBA. Subsequently, the Chief DBA consolidates the diagnostic results for the user's convenience. In the initial iterations, these agents generally gather limited information, so they continue for multiple iterations until the conclusion of the Chief DBA is nearly certain or no further valuable information can be obtained. Additionally, during the diagnosis, users have the option to participate by offering instructions and feedback, such as verifying the effectiveness of a proposed optimization solution.

Diagnosis Performance Comparison. We compare the performance of D-Bot against a baseline, namely LLM+Metrics. Both methods are deployed with the OpenAI model GPT-4 [2], alongside metrics and views from PostgreSQL and Prometheus. The evaluation focuses on basic single-cause problems, as detailed in Table 1. Besides, we also offer a multi-cause diagnosis example in Appendix B.

Table 1: Diagnosis performance of single root causes for LLM+Metrics and D-Bot (legal vs. accurate diagnosis results).

Type                   | Root Cause             | Description
Data Insert            | INSERT_LARGE_DATA      | Long execution time for large data insertions
Slow Query             | FETCH_LARGE_DATA       | Fetching of large data volumes
                       | REDUNDANT_INDEX        | Unnecessary and redundant indexes in tables
                       | LACK_STATISTIC_INFO    | Outdated statistical info affecting the execution plan
                       | MISSING_INDEXES        | Missing indexes causing performance issues
                       | POOR_JOIN_PERFORMANCE  | Poor performance of join operators
                       | CORRELATED_SUBQUERY    | Non-promotable subqueries in SQL
Concurrent Transaction | LOCK_CONTENTION        | Lock contention issues
                       | WORKLOAD_CONTENTION    | Workload concentration affecting SQL execution
                       | CPU_CONTENTION         | Severe external CPU resource contention
                       | IO_CONTENTION          | IO resource contention affecting SQL performance

Preliminary results indicate that both LLM+Metrics and D-Bot can achieve a high legality rate (producing valid responses to specific database issues). However, this is "dangerous behavior" for LLM+Metrics, which actually has a very low success rate (infrequent provision of the correct causes). In contrast, D-Bot achieves both a high legality rate and a high success rate. The reasons are three-fold.
First, LLM+Metrics conducts very basic reasoning and often misses key causes. For example, in the INSERT_LARGE_DATA case, LLM+Metrics only finds a "high number of running processes" via the node_procs_running metric and stops early. In contrast, D-Bot not only finds the high-concurrency problem, but also analyzes the operation statistics in the database process and identifies "high memory usage due to heavy use of UPDATE and INSERT operations on xxx tables" by looking up the pg_stat_statements view.

Second, LLM+Metrics often "makes up" reasons without substantial knowledge evidence. For example, in the CORRELATED_SUBQUERY case, LLM+Metrics observes SORT operations in the logged queries and incorrectly attributes the cause to "frequent reading and sorting of large amounts of data", thereby ending the diagnostic process. Instead, D-Bot cross-references the query optimization knowledge and finds that the correlated-subquery structure might be the performance bottleneck, together with additional extracted information such as estimated operation costs.

Third, LLM+Metrics has trouble deriving appropriate solutions. It often gives very generic optimization advice (e.g., "resolve resource contention issues"), which is useless in practice. Instead, leveraging its tool retrieval component, D-Bot learns to give specific optimization advice (e.g., invoking query transformation rules, adjusting the work_mem parameter) or to gather more insightful information (e.g., "calculate the total cost of the plan and check whether the cost rate of the sort or hash operators exceeds the cost rate threshold").

This evaluation reveals the potential of D-Bot to go beyond mere anomaly detection to root cause analysis and the provision of actionable suggestions. Despite these advancements, the basic deployment of D-Bot still faces some unresolved challenges. First, it is tricky to share maintenance experience (e.g., varying metric and view names) across different database products. Second, it is labor-intensive to prepare an adequately extensive set of anomaly-diagnosis data, which is essential to fine-tune and direct less-capable LLMs (e.g., those smaller than 10B) to understand complex database knowledge and apply it in maintenance.

10 CONCLUSION

In this paper, we propose a vision of D-Bot, an LLM-based database administrator that can continuously acquire database maintenance experience from textual sources and provide reasonable, well-founded, in-time diagnosis and optimization advice for target databases. We will continue to complete and improve this work with our collaborators.
REFERENCES
[1] [n.d.]. https://github.com/OpenBMB/AgentVerse. Last accessed on 2023-8.
[2] [n.d.]. https://openai.com/. Last accessed on 2023-8.
[3] Surajit Chaudhuri and Vivek R. Narasayya. 1997. An Efficient Cost-Driven Index Selection Tool for Microsoft SQL Server. In VLDB. 146-155.
[4] Karl Dias, Mark Ramacher, Uri Shaft, Venkateshwaran Venkataramani, and Graham Wood. 2005. Automatic Performance Diagnosis and Tuning in Oracle. In CIDR 2005. 84-94. http://cidrdb.org/cidr2005/papers/P07.pdf
[5] Shiyue Huang, Ziwei Wang, Xinyi Zhang, Yaofeng Tu, Zhongliang Li, and Bin Cui. 2023. DBPA: A Benchmark for Transactional Database Performance Anomalies. Proc. ACM Manag. Data 1, 1 (2023), 72:1-72:26. https://doi.org/10.1145/3588926
[6] Prajakta Kalmegh, Shivnath Babu, and Sudeepa Roy. 2019. iQCAR: inter-Query Contention Analyzer for Data Analytics Frameworks. In SIGMOD 2019. ACM, 918-935. https://doi.org/10.1145/3299869.3319904
[7] Jan Kossmann, Alexander Kastius, and Rainer Schlosser. 2022. SWIRL: Selection of Workload-aware Indexes using Reinforcement Learning. In EDBT. 2:155-2:168.
[8] Hai Lan, Zhifeng Bao, and Yuwei Peng. 2020. An Index Advisor Using Deep Reinforcement Learning. In CIKM. 2105-2108.
[9] Gabriel Paludo Licks, Júlia Mara Colleoni Couto, Priscilla de Fátima Miehe, Renata De Paris, Duncan Dubugras A. Ruiz, and Felipe Meneguzzi. 2020. SmartIX: A database indexing agent based on reinforcement learning. Appl. Intell. 50, 8 (2020), 2575-2588.
[10] Ping Liu, Shenglin Zhang, Yongqian Sun, Yuan Meng, Jiahai Yang, and Dan Pei. 2020. FluxInfer: Automatic Diagnosis of Performance Anomaly for Online Database System. In IPCCC 2020. IEEE, 1-8. https://doi.org/10.1109/IPCCC50635.2020.9391550
[11] Xiaoze Liu, Zheng Yin, Chao Zhao, Congcong Ge, Lu Chen, Yunjun Gao, Dimeng Li, Ziting Wang, Gaozhong Liang, Jian Tan, and Feifei Li. 2022. PinSQL: Pinpoint Root Cause SQLs to Resolve Performance Issues in Cloud Databases. In ICDE 2022. IEEE, 2549-2561. https://doi.org/10.1109/ICDE53745.2022.00236
[12] Xianglin Lu, Zhe Xie, Zeyan Li, Mingjie Li, Xiaohui Nie, Nengwen Zhao, Qingyang Yu, Shenglin Zhang, Kaixin Sui, Lin Zhu, and Dan Pei. 2022. Generic and Robust Performance Diagnosis via Causal Inference for OLTP Database Systems. In CCGrid 2022. IEEE, 655-664. https://doi.org/10.1109/CCGrid54584.2022.00075
[13] Minghua Ma, Zheng Yin, Shenglin Zhang, et al. 2020. Diagnosing Root Causes of Intermittent Slow Queries in Large-Scale Cloud Databases. Proc. VLDB Endow. 13, 8 (2020), 1176-1189. https://doi.org/10.14778/3389133.3389136
[14] Yuxi Ma, Chi Zhang, and Song-Chun Zhu. 2023. Brain in a Vat: On Missing Pieces Towards Artificial General Intelligence in Large Language Models. CoRR abs/2307.03762 (2023). https://doi.org/10.48550/arXiv.2307.03762
[15] R. Malinga Perera, Bastian Oetomo, Benjamin I. P. Rubinstein, and Renata Borovica-Gajic. 2021. DBA bandits: Self-driving index tuning under ad-hoc, analytical workloads with safety guarantees. In ICDE. 600-611.
[16] Chen Qian, Xin Cong, Cheng Yang, Weize Chen, Yusheng Su, et al. 2023. Communicative Agents for Software Development. arXiv preprint arXiv:2307.07924 (2023).
[17] Yujia Qin, Shengding Hu, Yankai Lin, et al. 2023. Tool learning with foundation models. arXiv preprint arXiv:2304.08354 (2023).
[18] Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. 2023. ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs. arXiv:2307.16789.
[19] Stephen Robertson, Hugo Zaragoza, et al. 2009. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval 3, 4 (2009), 333-389.
[20] James Turnbull. 2018. Monitoring with Prometheus. Turnbull Press.
[21] Gary Valentin, Michael Zuliani, Daniel C. Zilio, Guy M. Lohman, and Alan Skelley. 2000. DB2 Advisor: An Optimizer Smart Enough to Recommend Its Own Indexes. In ICDE. 101-110.
[22] Kyu-Young Whang. 1987. Index Selection in Relational Databases. Foundations of Data Organization (1987), 487-500.
[23] Wentao Wu, Chi Wang, Tarique Siddiqui, Junxiong Wang, Vivek R. Narasayya, Surajit Chaudhuri, and Philip A. Bernstein. 2022. Budget-aware Index Tuning with Reinforcement Learning. In SIGMOD Conference. 1528-1541.
[24] Dong Young Yoon, Ning Niu, and Barzan Mozafari. 2016. DBSherlock: A Performance Diagnostic Tool for Transactional Databases. In SIGMOD 2016. ACM, 1599-1614. https://doi.org/10.1145/2882903.2915218
[25] Xuanhe Zhou, Chengliang Chai, Guoliang Li, and Ji Sun. 2020. Database meets artificial intelligence: A survey. IEEE Transactions on Knowledge and Data Engineering 34, 3 (2020), 1096-1116.
[26] Xuanhe Zhou, Luyang Liu, Wenbo Li, Lianyuan Jin, Shifu Li, Tianqing Wang, and Jianhua Feng. 2022. AutoIndex: An Incremental Index Management System for Dynamic Workloads. In ICDE. 2196-2208.
[27] Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. 2022. Large Language Models Are Human-Level Prompt Engineers. arXiv:2211.01910.
A APPENDIX - PROMPTS

B APPENDIX - TEST CASES

