LLM As DBA

Xuanhe Zhou, Guoliang Li, Zhiyuan Liu
ABSTRACT
Database administrators (DBAs) play a crucial role in managing, maintaining and optimizing a database system to ensure data availability, performance, and reliability. However, it is hard and tedious …

[Figure excerpt] Thought: High memory usage seems to be caused by poor join performance and much inactive memory. Reasoning: Poor joins can be solved by plan optimization. Action: optimize_query_plan
Second, in the maintenance stage, given an anomaly, D-Bot iteratively reasons about the possible root causes by taking advantage of external tools and multi-LLM communications.

• External Tool Learning. For a given anomaly, D-Bot first matches relevant tools using algorithms like Dense Retrieval. Next, D-Bot provides the tool APIs together with their descriptions to the LLM (e.g., function calls in GPT-4). After that, the LLM can utilize these APIs to obtain metric values or optimization solutions. For example, in PostgreSQL, the LLM can acquire the templates of the slowest queries from the pg_activity view. If these queries consume much CPU resource (e.g., over 80%), they could be root causes and can be optimized with the rewriting tool (Section 6).

• LLM Diagnosis. Although the LLM can understand the functions of tool APIs, it may still generate incorrect API requests, leading to diagnosis failures. To solve this problem, we employ the tree of thought strategy, where the LLM can go back to previous steps if the current step fails. This significantly increases the likelihood of the LLM arriving at reasonable diagnosis results (Section 7).

• Collaborative Diagnosis. A single LLM may execute only the initial diagnosis steps and stop early, leaving the problem inadequately resolved. To address this limitation, we propose the use of multiple LLMs working collaboratively. Each LLM plays a specific role and communicates according to the environment settings (e.g., priorities, speaking orders). In this way, we can enable LLMs to engage in debates and inspire more robust solutions (Section 8).

4 EXPERIENCE DETECTION FROM DOCUMENTS

Document learning aims to extract experience segments from textual sources, where the extracted segments are potentially useful in different database maintenance (DM) cases. For instance, when analyzing the root causes of performance degradation, the LLM utilizes the “many_dead_tuples” experience to decide whether dead tuples have negatively affected the efficiency of index lookups and scans.

Desired Experience Format. To ensure the LLM can efficiently utilize the experience, each experience fragment should include four fields. As shown in the following example, “name” helps the LLM to understand the overall function; “content” explains how the root cause can affect the database performance (e.g., the performance hazards of many dead tuples); “metrics” provides hints for matching with this experience segment, i.e., the LLM will utilize this experience if the abnormal metrics exist in the “metrics” field; and “steps” provides the detailed procedure of checking whether the root cause exists by interacting with the database (e.g., obtaining the ratio of dead tuples and live tuples from table statistics views).

"name": "many_dead_tuples",
"content": "If the accessed table has too many dead tuples, it can cause bloat-table and degrade performance",
"metrics": ["live_tuples", "dead_tuples", "table_size", "dead_rate"],
"steps": "For each accessed table, if the total number of live tuples and dead tuples is within an acceptable limit (1000), and the table size is not too big (50MB), it is not a root cause. Otherwise, if the dead rate also exceeds the threshold (0.02), it is considered a root cause. And we suggest to clean up dead tuples in time."

LLM for Experience Detection. This step aims to detect experience segments that follow the above format. Since different paragraphs within a long document may be correlated with each other (e.g., the concept of “bloat-table” appearing in “many_dead_tuples” is introduced in another section), we explain how to extract experience segments without losing the technical details.

Step 1: Segmentation. Instead of partitioning documents into fixed-length segments, we divide them based on the section structure and content. Initially, the document is divided into chunks using the section separators. If a chunk exceeds the maximum chunk size (e.g., 1k tokens), we further divide it recursively into smaller chunks.

Step 2: Chunk Summary. Next, for each chunk denoted as x, a summary x.summary is created by feeding the content of x into the LLM with a summarization prompt p_summarize:

p_summarize = Summarize the provided chunk briefly … Your summary will serve as an index for others to find technical details related to database maintenance … Pay attention to examples even if the chunk covers other topics.

The generated x.summary acts as a textual index of x, enabling the matching of chunks containing similar content.
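To make Steps 1 and 2 concrete, the following is a minimal Python sketch of structure-aware chunking and summary-as-index construction. The llm() callable, the section-heading regex, and the whitespace token counter are illustrative assumptions, not details fixed by the paper.

# A minimal sketch of Step 1 (structure-aware chunking) and Step 2 (summary-as-index).
import re

MAX_CHUNK_TOKENS = 1000                                   # e.g., a 1k-token budget per chunk
SECTION_SEP = re.compile(r"\n(?=\d+(?:\.\d+)*\s+[A-Z])")  # split before numbered section headings

def num_tokens(text: str) -> int:
    return len(text.split())          # crude proxy; a real tokenizer would be used in practice

def split_document(doc: str) -> list[str]:
    """Step 1: split by section structure, then recursively divide oversized chunks."""
    chunks = []
    for section in SECTION_SEP.split(doc):
        if not section.strip():
            continue
        stack = [section]
        while stack:
            chunk = stack.pop()
            if num_tokens(chunk) <= MAX_CHUNK_TOKENS:
                chunks.append(chunk)
            else:                      # still too large: divide it recursively
                mid = len(chunk) // 2
                stack.extend([chunk[:mid], chunk[mid:]])
    return chunks

def summarize_chunks(chunks: list[str], llm) -> list[dict]:
    """Step 2: attach an LLM-generated summary to each chunk as its textual index."""
    p_summarize = ("Summarize the provided chunk briefly. Your summary will serve as an index "
                   "for others to find technical details related to database maintenance.")
    return [{"chunk": c, "summary": llm(p_summarize + "\n\n" + c)} for c in chunks]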
Step 3: Experience Extraction. Once the summaries of the chunks are generated, the LLM parses the content of each chunk and compares it with the summaries of other chunks having similar content, guided by the extraction prompt p_extract. This way, experience segments that correlate with the key points from the summaries are detected.

p_extract = Given a chunk summary, extract diagnosis experience from the chunk. If uncertain, explore diagnosis experience in chunks with similar summaries.

In our implementation, given a document, we use the LLM to extract experience segments into the above 4-field format.
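Under the same assumptions as the previous sketch, Step 3 could be realized as follows: for each chunk, related chunks are located via summary similarity, and the LLM is prompted with p_extract to emit the 4-field experience format. The embed() helper, the similarity threshold, and the JSON-reply convention are hypothetical.

# A minimal sketch of Step 3: summary-guided extraction into the 4-field format.
import json

P_EXTRACT = ("Given a chunk summary, extract diagnosis experience from the chunk. "
             "If uncertain, explore diagnosis experience in chunks with similar summaries. "
             "Answer in JSON with the fields name, content, metrics, steps.")

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm if norm else 0.0

def extract_experience(chunks: list[dict], llm, embed, sim_threshold: float = 0.8) -> list[dict]:
    """chunks are {"chunk": ..., "summary": ...} records as produced in the previous sketch."""
    vectors = [embed(c["summary"]) for c in chunks]
    experiences = []
    for i, c in enumerate(chunks):
        # related chunks may define concepts (e.g., "bloat-table") used but not explained here
        related = [chunks[j]["chunk"] for j in range(len(chunks))
                   if j != i and cosine(vectors[i], vectors[j]) >= sim_threshold]
        prompt = "\n\n".join([P_EXTRACT, "Summary: " + c["summary"],
                              "Chunk: " + c["chunk"], "Related chunks:\n" + "\n".join(related)])
        try:
            experiences.append(json.loads(llm(prompt)))    # expect the 4-field format
        except json.JSONDecodeError:
            continue                                       # no extractable experience in this chunk
    return experiences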
Detected Maintenance Experience. In Figure 3, we showcase the simplified diagnosis procedure together with some necessary details, coming from chunks originally located in different sections of the given documents (e.g., a maintenance guide with over 100 pages).

1. Background Understanding. It is crucial to grasp the context of system performance, such as recent changes in customer expectations, workload type, or even system settings.

2. Database Pressure Checking. This step identifies database bottlenecks, e.g., by tracking CPU usage and active sessions, and by monitoring system views (e.g., pg_stat_activity and pgxc_stat_activity) to focus on non-idle sessions (see the query sketch after this outline).

3. Application Pressure Checking. If there is no apparent pressure on the database or the resource consumption is very low (e.g., CPU usage below 10% and only a few active sessions), it is suggested to investigate the application side, such as exhausted application server resources, high network latency, or slow processing of queries by application servers.

4. System Pressure Checking. The focus shifts to examining the system resources where the database is located, including CPU usage, IO status, and memory consumption.

5. Database Usage Checking. Lastly, we can investigate suboptimal database usage behaviors, such as (1) addressing concurrency issues caused by locking waits, (2) examining database configurations, (3) identifying abnormal wait events (e.g., io_event), (4) tackling long/short-term performance declines, and (5) optimizing poorly performing queries that may be causing bottlenecks.
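As an illustration of step 2, a minimal monitoring sketch over pg_stat_activity is shown below; the psycopg2 client and the connection string are assumptions for illustration, and pgxc_stat_activity would be the analogous view on distributed deployments.

# A minimal sketch of inspecting non-idle sessions on a PostgreSQL instance.
import psycopg2

SQL_NON_IDLE = """
    SELECT pid, usename, state, wait_event_type, wait_event, query
    FROM pg_stat_activity
    WHERE state <> 'idle'              -- focus on sessions that are doing work
      AND pid <> pg_backend_pid();     -- exclude this monitoring session
"""

with psycopg2.connect("dbname=postgres") as conn, conn.cursor() as cur:
    cur.execute(SQL_NON_IDLE)
    for pid, user, state, wtype, wevent, query in cur.fetchall():
        print(pid, user, state, wtype, wevent, (query or "")[:80])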
5 DIAGNOSIS PROMPT GENERATION

Instead of directly mapping extracted experience to new cases, we next explore how to teach LLMs to (1) understand the database maintenance tasks and (2) reason over the root causes by themselves.

Input Enrichment. With a database anomaly x as input, we can enrich x with additional description information into a so-called input prompt x′. On one hand, x′ helps the LLM to better understand the task intent. On the other hand, since database diagnosis is generally a complex task that involves multiple steps, x′ preliminarily implies how to divide the complex task into sub-tasks in a proper order, thereby further enhancing the reasoning of the LLM.

From our observation, the quality of x′ can greatly impact the performance of the LLM on maintenance tasks [27] (Figure 2). Thus, we first utilize the LLM to suggest candidate prompts based on a small set of input-output pairs (e.g., 5 pairs per prompt). Second, we rank these generated prompts using a customized scoring function (e.g., the ratio of detected root causes), and reserve the best prompts (e.g., the top-10) as candidates. Finally, we select the best one to serve as the input prompt template for the incoming maintenance tasks.
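The propose-score-select loop above can be sketched as follows; llm(), diagnose(), the validation set, and the candidate count are illustrative placeholders rather than components specified in the paper (nonzero sampling temperature is assumed so that candidates differ).

# A minimal sketch of prompt proposal, scoring, and selection.
def propose_prompts(pairs: list[tuple[str, str]], llm, n_candidates: int = 20) -> list[str]:
    examples = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in pairs)   # e.g., 5 pairs
    ask = ("Given these diagnosis input-output pairs, write an instruction prompt that "
           "would lead a model from such inputs to such outputs:\n" + examples)
    return [llm(ask) for _ in range(n_candidates)]

def score_prompt(prompt: str, validation: list[tuple[str, set]], diagnose) -> float:
    """Customized score: ratio of validation anomalies whose root causes are all detected."""
    hits = sum(1 for anomaly, true_causes in validation
               if true_causes <= set(diagnose(prompt, anomaly)))
    return hits / len(validation)

def select_prompt_template(pairs, validation, llm, diagnose, top_k: int = 10) -> str:
    candidates = propose_prompts(pairs, llm)
    ranked = sorted(candidates, key=lambda p: score_prompt(p, validation, diagnose), reverse=True)
    shortlist = ranked[:top_k]        # reserve the best prompts (e.g., top-10) as candidates
    return shortlist[0]               # the best one becomes the input prompt template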
6 EXTERNAL TOOL LEARNING

As we know, the efficient use of tools is a hallmark of human cognitive capabilities [17, 18]. When human beings encounter a new tool, they start to understand the tool and explore how it works, i.e., taking it as something with particular functions and trying to understand what those functions are used for. Likewise, we aim to inspire a similar ability within the LLM.

Tool Retrieval. We first retrieve the appropriate tools for the diagnosis task at hand, represented as D_t. Several methods can be used, such as BM25, LLM Embeddings, and Dense Retrieval.

(1) BM25, a common probabilistic retrieval method, ranks the tool descriptions D_t based on their relevance to the given anomaly Q [19].

(2) LLM Embeddings convert the tool descriptions D_t into embeddings E_t using the LLM L, i.e., E_t = L(D_t). These embeddings capture the semantic meanings in a multi-dimensional space, helping to find related tools even in the absence of keyword overlap.

(3) Dense Retrieval uses neural networks N to generate dense representations of both the anomaly Q and the tool descriptions D_t, denoted as Dense_Q and Dense_D respectively. To retrieve the relevant tools, we calculate the similarity between Dense_Q and Dense_D and rank the tools based on these similarity scores.

The proper method for tool retrieval depends on the specific scenario. BM25 is efficient for quick results over large volumes of tool API descriptions with clear anomaly characteristics. LLM Embeddings excel at capturing semantic and syntactic relationships, which is especially useful when relevance is not obvious from keywords (e.g., different metrics with similar functions). Dense Retrieval is ideal for vague anomalies, as it captures context and semantic meaning, but it is more computationally costly.
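A minimal sketch of the embedding-based variants: encode the anomaly and each tool description, then rank tools by cosine similarity. The embed() callable stands in for either an LLM embedding endpoint or a dense dual encoder; it is an assumption, not an API defined in the paper.

# A minimal sketch of embedding-based tool retrieval.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm if norm else 0.0

def retrieve_tools(anomaly: str, tools: dict[str, str], embed, k: int = 5) -> list[str]:
    """tools maps a tool API name to its natural-language description."""
    dense_q = embed(anomaly)                               # Dense_Q
    scored = [(name, cosine(dense_q, embed(desc)))         # Dense_D, one per tool
              for name, desc in tools.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in scored[:k]]                # top-k candidate tools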
7 LLM DIAGNOSIS

Tree Search Algorithm using LLM. To avoid diagnosis failures caused by incorrect actions (e.g., a non-existent API name) derived by the LLM, we propose to utilize the tree of thought strategy, which can guide the LLM to go back to previous actions if the current action fails.

Step 1: Tree Structure Initialization. We initialize a tree structure whose root node is the diagnosis request (Figure 4). Utility methods are utilized to manipulate the tree structure, and the UCT score for a node v is computed based on the modifications during planning, i.e., UCT(v) = w(v)/n(v) + C · √(ln(N)/n(v)), where n(v) denotes the selection frequency and w(v) denotes the success ratio of detecting root causes. Note that if the action of node v fails to call its tool API, w(v) equals -1.

Step 2: Simulate Execution. This step kicks off the execution of simulations starting from the root node of the tree. It involves selecting nodes based on specific criteria (e.g., detected abnormal metrics). If the criteria for selecting a new node are met, a new node is chosen; otherwise, the node with the highest UCT value is selected.
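A minimal sketch of the UCT bookkeeping is given below; the Node layout and the exploration constant C are illustrative assumptions, while the formula and the w(v) = -1 rule for failed tool calls follow the text above.

# A minimal sketch of UCT scoring for the planning tree.
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    action: str                        # e.g., a tool-API call proposed by the LLM
    n: int = 0                         # n(v): how often this node has been selected
    w: float = 0.0                     # w(v): accumulated success of detecting root causes;
                                       # set to -1 if the action fails to call its tool API
    children: list["Node"] = field(default_factory=list)

def uct(node: Node, total_n: int, c: float = 1.4) -> float:
    if node.n == 0:
        return float("inf")            # always explore unvisited actions first
    return node.w / node.n + c * math.sqrt(math.log(max(total_n, 1)) / node.n)

def select_child(parent: Node, total_n: int) -> Node:
    # otherwise, the node with the highest UCT value is selected
    return max(parent.children, key=lambda child: uct(child, total_n))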
Figure 3: The outline of diagnosis experience extracted from documents.
[Current summary]
- I know the start and end time of the anomaly.
[New Record]
Thought: Now that I have the start and end time of the
anomaly, I need to diagnose the causes of the anomaly
Action: is_abnormal_metric
Action Input: {"start_time": 1684600070, "end_time": 1684600074, "metric_name": "cpu_usage"}
Observation: "The metric is abnormal"
[New summary]
- I know the start and end time of the anomaly.
- I searched for is_abnormal_metric, and I now know that the
CPU usage is abnormal.
B APPENDIX - TEST CASES