HLSPilot - LLM-based High-Level Synthesis
Abstract—Large language models (LLMs) have catalyzed an upsurge in automatic code generation, garnering significant attention for register transfer level (RTL) code generation. Despite the potential of RTL code generation from natural language, it remains error-prone and limited to relatively small modules because of the substantial semantic gap between natural language expressions and hardware design intent. In response to these limitations, we propose a methodology that reduces the semantic gap by utilizing C/C++ to generate hardware designs via High-Level Synthesis (HLS) tools. Basically, we build a set of C-to-HLS optimization strategies catering to various code patterns, such as nested loops and local arrays. Then, we apply these strategies to sequential C/C++ code through in-context learning, which provides the LLMs with exemplary C/C++-to-HLS prompts. With this approach, HLS designs can be generated effectively. Since LLMs still have difficulty determining optimized pragma parameters precisely, we integrate a design space exploration (DSE) tool for pragma parameter tuning. Furthermore, we also employ profiling tools to pinpoint the performance bottlenecks within a program and selectively convert the bottleneck components to HLS code for hardware acceleration. By combining LLM-based profiling, C/C++-to-HLS translation, and DSE, we have established HLSPilot, the first LLM-enabled high-level synthesis framework, which can fully automate high-level application acceleration on hybrid CPU-FPGA architectures. According to our experiments on real-world application benchmarks, HLSPilot achieves comparable performance in general and can even outperform manually crafted counterparts, thereby underscoring the substantial promise of LLM-assisted hardware design.

Index Terms—large language model, high-level synthesis, C-to-HLS, code generation.

∗ Corresponding author. This work is supported by the National Key R&D Program of China under Grant 2022YFB4500405 and the National Natural Science Foundation of China under Grant 62174162.

I. INTRODUCTION

Hardware design is a demanding task requiring a high level of expertise. Traditional hardware design involves coding in a register transfer level (RTL) language. However, as the complexity of hardware increases continuously with the computing requirements of applications, RTL coding becomes exceedingly time-consuming and labor-intensive. The emergence of High-Level Synthesis (HLS) enables hardware design at higher abstraction levels [1]. HLS typically employs high-level languages like C/C++ for hardware description, allowing software engineers to also engage in hardware development, which significantly lowers the expertise barrier of hardware design. Designers can focus more on the applications and algorithms rather than the details of low-level hardware implementations. HLS tools automate design tasks such as concurrency analysis of algorithms, interface design, logic unit mapping, and data management, thereby substantially shortening the hardware design cycle.

While HLS offers numerous advantages such as higher development efficiency and lower design barriers [1] [2], there are still some issues in the real-world HLS-based hardware acceleration workflow [3]. Firstly, the overall analysis of the program is of great importance, and determining the performance bottlenecks of the program and the co-design between CPU and FPGA remains a challenging issue. Besides, designs based on HLS still encounter a few major performance issues [4] [5]. Foremost, it still requires substantial optimization experience to craft high-quality HLS code and achieve the desired performance in practical development processes [6] [7]. In addition, HLS code often struggles to reach optimality due to the large design space of the various pragma parameters. Some design space exploration (DSE) tools have been proposed [8] [9] [10] [11] to automate the parameter tuning, but these tools do not fundamentally optimize the hardware design. High-quality HLS design thus remains the major performance challenge from the perspective of general software designers. Some researchers have attempted to address this challenge by using pre-built templates for specific domain applications [12] [13] [14]. For example, ThunderGP [13] has designed a set of HLS-based templates for optimized graph processing accelerator generation, allowing designers to implement various graph algorithms by filling in the templates. However, it demands a comprehensive understanding of both the domain knowledge and HLS development experience from designers, and there is still no well-established universal solution for obtaining optimized HLS code. Bridging the gap between C/C++ and HLS remains a formidable challenge requiring further efforts.

Large Language Models (LLMs) have recently exhibited remarkable capabilities in various generative tasks, including text generation, machine translation, and code generation, underscoring their advanced learning and imitation skills. These advancements have opened up possibilities for addressing hardware design challenges. Researchers have begun applying LLMs to various hardware design tasks, including general-purpose processor designs, domain-specific accelerator designs, and arbitrary RTL code generation. Among these applications, it can be observed that neural network accelerator generation utilizing a predefined template, as reported in [15], reaches an almost 100% success rate. In contrast, generating register transfer level (RTL) code from natural language
descriptions, such as design specifications, experiences a considerably higher failure rate [16] [17]. This disparity is largely due to the semantic gap between the inputs and the anticipated outputs. Despite the imperfections, these works have demonstrated the great potential of exploring LLMs for hardware design.

Inspired by prior works, we introduce HLSPilot, an automated framework that utilizes LLMs to generate and optimize HLS code from sequential C/C++ code. Instead of generating RTL code from natural language directly, HLSPilot mainly leverages LLMs to generate C-like HLS code from C/C++, which has a much narrower semantic gap, and eventually outputs RTL code using established HLS tools. Essentially, HLSPilot accomplishes RTL code generation from C/C++ without imposing hardware design tasks with a broad semantic gap on LLMs. Specifically, HLSPilot initiates the process with runtime profiling to pinpoint the code segments that are the performance bottleneck and require optimization. Subsequently, HLSPilot extracts the kernel code segments and applies appropriate HLS optimization strategies to the computing kernels to generate optimized HLS code. Then, HLSPilot employs a design space exploration (DSE) tool to fine-tune the parameters of the generated HLS design. Finally, HLSPilot leverages Xilinx OpenCL APIs to offload the compute kernels to the FPGA, facilitating the deployment of the entire algorithm on a hybrid CPU-FPGA architecture. In summary, LLMs are utilized throughout the entire hardware acceleration workflow, ranging from profiling, HW/SW partitioning, HLS code generation, and HLS code optimization to tool usage, thereby achieving a high degree of design automation.

The major contributions of this work are summarized as follows:
• We propose HLSPilot, the first automatic HLS code generation and optimization framework from sequential C/C++ code using LLMs. The framework investigates the use of LLMs for HLS design strategy learning and tool learning, and builds a complete hardware acceleration workflow ranging from runtime profiling, kernel identification, automatic HLS code generation, and design space exploration to HW/SW co-design on a hybrid CPU-FPGA computing architecture. The framework is open-sourced on GitHub (https://github.com/xcw-1010/HLSPilot).
• We propose a retrieval-based approach to learn HLS optimization techniques and examples from the Xilinx user manual, and utilize an in-context learning approach to apply the learned HLS optimizations to serial C/C++ code and generate optimized HLS code with LLMs for various computing kernels.
• According to our experiments on an HLS benchmark, HLSPilot can generate optimized HLS code from sequential C/C++ code, and the resulting designs can outperform manual optimizations with the assistance of DSE tools in most cases. In addition, we also demonstrate the successful use of HLSPilot as a complete hardware acceleration workflow on a hybrid CPU-FPGA architecture with a case study.

II. RELATED WORK

A. LLM for Hardware Design

Recent works have begun to utilize LLMs to assist hardware design from different angles [15], [16], [18]–[25]. Generating RTL code from natural language is a typical approach of hardware design with LLMs. For instance, VGen [18] leverages an open-source LLM, CodeGen [26], fine-tuned with a Verilog code corpus to generate Verilog code. Similarly, VerilogEval [19] enhances the LLM's capability to generate Verilog by constructing a supervised fine-tuning dataset, and it also establishes a benchmark for evaluating LLM performance. ChipChat [24] achieves an 8-bit accumulator-based microprocessor design through multi-round natural language conversation. ChipGPT [16] proposes a four-stage zero-code logic design framework based on GPT for hardware design. These studies have successfully applied LLMs to practical hardware design. However, these methods are mostly limited to small functional modules, and the success rate drops substantially when the hardware design gets larger. GPT4AIGchip proposed in [15] can also leverage LLMs to generate efficient AI accelerators based on a hardware template, but it relies on a pre-built hardware library that requires an intensive understanding of both the domain knowledge and the hardware design techniques, which can hinder its use by software developers. Recently, a domain-specific LLM for chip design, ChipNeMo [17], was proposed. ChipNeMo employs a series of domain-adaptive techniques to train an LLM capable of generating RTL code, writing EDA tool scripts, and summarizing bugs. While powerful, domain-specific LLMs face challenges such as high training costs and difficulties in data collection.

B. LLM for Code Generation

Code generation is one of the key applications of LLMs. A number of domain-specific LLMs such as CodeGen [26], CodeX [27], and CodeT5 [28] have been proposed to address the programming of popular languages such as C/C++, Python, and Java, which have a large corpus available for pre-training and fine-tuning. In contrast, it can be challenging to collect a sufficient corpus for less popular languages. VGen [18] collected and filtered a Verilog corpus from GitHub and textbooks, obtaining only hundreds of MB of corpus. Hence, prompt engineering in combination with in-context learning provides an attractive approach to leverage LLMs to generate code for domain-specific languages. For instance, the authors in [29] augment code generation by providing the language's Backus–Naur form (BNF) grammar within the prompts.

III. HLSPILOT FRAMEWORK

The remarkable achievements of LLMs across a wide domain of applications inspire us to create an LLM-driven automatic hardware acceleration design framework tailored for a hybrid CPU-FPGA architecture.
[Fig. 1: Overview of the HLSPilot framework, showing profiling of the software source code, extraction of the kernels to be optimized, automated optimization strategy learning (3-1), and strategy retrieval and applying (3-2), where each strategy's introduction, applicable scenes, parameter descriptions, and demos are assembled into the system prompt (e.g., "You are an expert in FPGA...").]
Unlike previous efforts that primarily focused on code generation, our objective is to harness the potential of LLMs to emulate the role of an expert engineer in hardware acceleration. Given that hardware acceleration on a hybrid CPU-FPGA architecture demands a set of different design tasks such as runtime profiling, compute kernel identification, compute kernel acceleration, design space exploration, and CPU-FPGA co-design, LLMs must understand the design guidelines and manipulate the relevant design tools to achieve the desired design objectives, akin to an engineer. Fortunately, LLMs have exhibited powerful capabilities in document comprehension, in-context learning, tool learning, and code generation, all of which align well with the hardware acceleration design requirements. The intended design framework eventually provides end-to-end high-level synthesis of sequential C/C++ code on a hybrid CPU-FPGA architecture and is thus named HLSPilot; it will be elaborated in the rest of this section.

A. HLSPilot Overview

HLSPilot, as presented in Fig. 1, takes sequential C/C++ code as design input and mainly includes five major processing stages to generate an optimized hardware acceleration solution on a hybrid CPU-FPGA architecture.

Firstly, HLSPilot conducts runtime profiling on the high-level application code to identify the most time-consuming computing kernels, which will be the focus of subsequent optimization. In this work, we profile the target algorithm and analyze the execution time with gprof on a CPU system. Then, a detailed performance report is generated as needed. With the report, we can conveniently obtain performance information such as the execution time distribution across the algorithm and the number of function calls. Since LLMs are capable of understanding and summarizing such textual reports, the time-consuming functions can be identified conveniently. HLSPilot extracts the computing kernels to be optimized in the next stage based on this profiling information.
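For illustration only, the following minimal, stand-alone sketch shows the kind of hot-spot ranking this step relies on: it reads a gprof flat profile and lists the entries with the largest self time. The parsing heuristics and the top-5 cut-off are assumptions made for this sketch; in HLSPilot itself the textual report is summarized by the LLM rather than by a hand-written parser.

// Illustrative hot-spot lister (not part of HLSPilot): pipe in a gprof flat
// profile, e.g. `gprof -b -p ./app gmon.out | ./top_kernels`, and print the
// functions with the largest self time as candidate compute kernels.
#include <algorithm>
#include <functional>
#include <iostream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

int main() {
    std::vector<std::pair<double, std::string>> entries;   // (self seconds, function name)
    std::string line;
    while (std::getline(std::cin, line)) {
        // Flat-profile rows look like:
        // " 62.50      0.05     0.05      100     0.50     0.50  merge_sort"
        std::istringstream iss(line);
        double pct, cumulative, self;
        if (!(iss >> pct >> cumulative >> self)) continue;  // skip header lines
        std::string token, name;
        while (iss >> token) name = token;                  // symbol name is the last token
        if (!name.empty()) entries.emplace_back(self, name);
    }
    std::sort(entries.begin(), entries.end(), std::greater<>());  // rank by self time
    const std::size_t top_n = 5;                            // assumed kernel-candidate cut-off
    for (std::size_t i = 0; i < entries.size() && i < top_n; ++i)
        std::cout << entries[i].second << ": " << entries[i].first << " s self time\n";
    return 0;
}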
Secondly, the computing kernels are organized as dependent tasks and pipelined accordingly. The dependent tasks can be implemented efficiently with the dataflow mechanism supported by Xilinx HLS. While the compute kernels can be irregular, we propose a program-tree-based strategy to refactor the program structure of the compute kernels and generate an optimized task flow graph while ensuring equivalent code functionality. Details of the automatic task pipelining are illustrated in Section III-B.
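As a concrete reference for the dataflow form targeted by this stage, the following is a minimal sketch assuming a toy load-compute-store kernel; the function names, stream depths, and the scaling operation are illustrative and not taken from the paper.

// Minimal dataflow-style sketch (illustrative, not HLSPilot output): three
// dependent tasks communicate through hls::stream channels and run as a
// task-level pipeline under the DATAFLOW pragma.
#include "hls_stream.h"

static void load(const int *in, hls::stream<int> &s, int n) {
    for (int i = 0; i < n; i++) s << in[i];                 // producer task
}
static void scale(hls::stream<int> &s_in, hls::stream<int> &s_out, int n) {
    for (int i = 0; i < n; i++) s_out << s_in.read() * 2;   // middle task
}
static void store(hls::stream<int> &s, int *out, int n) {
    for (int i = 0; i < n; i++) out[i] = s.read();          // consumer task
}

void kernel_top(const int *in, int *out, int n) {
#pragma HLS DATAFLOW
    hls::stream<int> s0("s0"), s1("s1");
#pragma HLS STREAM variable=s0 depth=32
#pragma HLS STREAM variable=s1 depth=32
    load(in, s0, n);
    scale(s0, s1, n);
    store(s1, out, n);
}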
Thirdly, we optimize each task with HLS independently. While there are many distinct HLS optimization strategies applicable to different high-level code patterns, we create a set of HLS optimization strategies based on the Xilinx HLS user guide and leverage LLMs to select and apply the appropriate optimization strategies automatically based on the code patterns in each task. Details of the LLM-based automatic HLS optimization are presented in Section III-C.
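The selected strategies eventually materialize as HLS pragmas inserted into the task code. The snippet below is a minimal hand-written sketch of such an annotated task, assuming a simple buffered accumulation loop; the pragma values are placeholders rather than tuned results.

// Illustrative HLS-annotated task (not generated by HLSPilot): common
// pragmas for loop pipelining, partial unrolling, and array partitioning.
#define N 1024

void accumulate(const int in[N], int out[N]) {
    int buf[N];
#pragma HLS ARRAY_PARTITION variable=buf cyclic factor=4
    load_loop:
    for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1
        buf[i] = in[i];
    }
    acc_loop:
    for (int i = 0; i < N; i++) {
#pragma HLS UNROLL factor=4
        out[i] = buf[i] + (i > 0 ? buf[i - 1] : 0);
    }
}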
Fourthly, after the code refactoring and the application of various HLS pragmas, the HLS code can be obtained, but parameters such as the initiation interval (II) for pipelining, the factors of loop unrolling, and the sizes for array partitioning still need to be tuned to produce accelerators with higher performance.
However, it remains rather challenging for LLMs to decide the design parameters of a complex design precisely. To address this issue, HLSPilot utilizes external tools to conduct the design space exploration and decide the optimized solution automatically. According to recent research [30], LLMs are capable of learning and utilizing external APIs and tools efficiently. Hence, HLSPilot leverages LLMs to extract the parameters from the HLS code and invoke the DSE tool proposed in [31] by generating the corresponding execution scripts.
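Since the DSE tool in [31] is external, the sketch below only illustrates the preceding extraction step in a simplified, non-LLM form: it scans an HLS source file for tunable pragma parameters that a subsequent exploration run could sweep. The regular expression and the set of recognized pragmas are assumptions of this sketch, not the interface of the actual tool.

// Illustrative knob extractor (assumed helper, not HLSPilot's implementation):
// list the pragma parameters of a kernel file that a DSE run could tune.
#include <fstream>
#include <iostream>
#include <regex>
#include <string>

int main(int argc, char **argv) {
    if (argc < 2) { std::cerr << "usage: extract_knobs <kernel.cpp>\n"; return 1; }
    std::ifstream src(argv[1]);
    // Matches e.g. "#pragma HLS PIPELINE II=1" or "#pragma HLS UNROLL factor=4".
    std::regex knob(R"(#pragma\s+HLS\s+(PIPELINE|UNROLL|ARRAY_PARTITION)\b.*?(II|factor)\s*=\s*(\d+))",
                    std::regex::icase);
    std::string line;
    int line_no = 0;
    while (std::getline(src, line)) {
        ++line_no;
        std::smatch m;
        if (std::regex_search(line, m, knob))
            std::cout << "line " << line_no << ": " << m[1] << " " << m[2] << "=" << m[3] << "\n";
    }
    return 0;
}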
Finally, when the compute kernels are optimized with HLS, they can be compiled and deployed on FPGAs for hardware acceleration. Nonetheless, these accelerators must be integrated with a host processor to provide a holistic hardware acceleration solution. The acceleration system has both host code and device code, which are executed on the CPU side and FPGA side, respectively. HLSPilot leverages LLMs to learn the APIs provided by the Xilinx runtime (XRT) to manage the FPGA-based accelerators and perform the data transfer between host memory and FPGA device memory. It then generates the host code mostly based on the original algorithm code and replaces the compute kernels with the compute APIs that invoke the FPGA accelerators and the data movement APIs. The device code is mainly the HLS code generated in the prior steps. With both the host code and device code, the entire algorithm can be deployed on the hybrid CPU-FPGA architecture.
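For reference, a minimal host-side sketch using the XRT native C++ API is shown below. It only illustrates the typical open-device, allocate-buffer, launch-kernel, read-back sequence that such host code follows; the xclbin file name, kernel name, and buffer sizes are placeholders rather than artifacts produced by HLSPilot.

// Minimal XRT host-side sketch (illustrative; file names and kernel name are
// placeholders): load the FPGA binary, move data, launch the kernel, read back.
#include <vector>
#include "xrt/xrt_device.h"
#include "xrt/xrt_kernel.h"
#include "xrt/xrt_bo.h"

int main() {
    const int n = 1024;
    std::vector<int> in(n, 1), out(n, 0);

    xrt::device device(0);                                   // open the first FPGA card
    auto uuid = device.load_xclbin("kernel_top.xclbin");     // program the device
    xrt::kernel krnl(device, uuid, "kernel_top");            // look up the HLS kernel

    // Device buffers bound to the kernel's memory banks.
    xrt::bo bo_in(device, n * sizeof(int), krnl.group_id(0));
    xrt::bo bo_out(device, n * sizeof(int), krnl.group_id(1));

    bo_in.write(in.data());
    bo_in.sync(XCL_BO_SYNC_BO_TO_DEVICE);                    // host -> device

    auto run = krnl(bo_in, bo_out, n);                       // launch and wait
    run.wait();

    bo_out.sync(XCL_BO_SYNC_BO_FROM_DEVICE);                 // device -> host
    bo_out.read(out.data());
    return 0;
}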
B. Program-Tree-based Task Pipelining

While the compute kernel can be quite complex, it needs to be split into multiple tasks for the sake of potential pipelining or parallel processing, which is critical to the performance of the generated accelerator. However, it is difficult to split the compute kernel appropriately, because inappropriate splitting may lead to imbalanced pipelining and low performance. In addition, the splitting usually requires code refactoring, which may produce code with inconsistent functionality and further complicate the problem. To address this problem, we propose a program-tree-based strategy to guide the LLM to produce fine-grained task splitting and pipelining.

The proposed program-tree-based task pipelining strategy is detailed in Algorithm 1. According to the strategy, the LLM iteratively decomposes the compute kernel into smaller tasks and eventually forms a tree structure. An input compute kernel C is denoted as the root node of the tree; hence, the initial node set of the tree is T = {C}. Then, the LLM decides whether each task in T can be further decomposed based on the complexity of the task code. If a decomposition is confirmed for task_i, the LLM performs the code decomposition. The decompositions for non-loop tasks and loop tasks are different and will be detailed later in this subsection. If the task cannot be further decomposed, task_i is added to T_new directly.

Algorithm 1: Program-Tree-based Pipelining Strategy
Input: top-level function code C
Output: task collection T = {task_1, task_2, ..., task_n}
1:  T ← {C}
2:  while T has a task that can be further split do
3:      T_new ← {}
4:      for task_i ∈ T do
5:          if the LLM decides to further split task_i then
6:              1. For non-loop blocks: split the code based on the functionality of the statement execution
7:              2. For loop blocks: split the code based on the minimum parallelizable loop granularity
8:              Add the refactored code to T_new
9:          else
10:             Add task_i to T_new
11:         end if
12:     end for
13:     T ← T_new
14: end while

The major challenge of the program-tree-based task pipelining strategy is the task decomposition metric, which depends on the code structures and can vary substantially. As a result, the metric can be difficult to quantify. Instead of using a predetermined quantitative metric, we leverage LLMs to perform the task decomposition with natural language rules and typical decomposition examples. Specifically, for non-loop code, we have the LLM analyze the semantics of the code statements, recognize the purpose of these statements, and group statements performing the same function into a single task. For loop code, the decomposition is primarily based on the smallest loop granularity that can be executed in parallel. We take advantage of the in-context learning capabilities of LLMs and present a few representative decomposition examples to guide the task decomposition for general scenarios. These examples are detailed as follows.

1) Each iteration of the loop is considered as a task: In the original merge sort loop, each iteration processes all intervals of the same width. Therefore, each iteration can be regarded as a task. For example, task_i merges all intervals with a width equal to 2^i.

// before:
for (int width = 1; width < SIZE; width = 2 * width) {
    for (int i1 = 0; i1 < SIZE; i1 = i1 + 2 * width) {
        int i2 = i1 + width;
        int i3 = i1 + 2 * width;
        if (i2 >= SIZE) i2 = SIZE;
        if (i3 >= SIZE) i3 = SIZE;
        merge(A, i1, i2, i3, temp);
    }
}

// after:
for (int stage = 1; stage < STAGES - 1; stage++) {
    // merge all equally wide intervals
    merge_intervals(temp[stage - 1], width, temp[stage]);
    width *= 2;
}
2) The first and second halves of a loop's traversal are each considered as a task: In histogram statistics, since the first and second halves of the loop can be executed in parallel, they are considered as two tasks.

// before:
for (int i = 0; i < INPUT_SIZE; i++) {
    val = in[i];
    hist[val] = hist[val] + 1;
}

// after:
for (int i = 0; i < INPUT_SIZE / 2; i++) {
    val = in1[i];
    hist1[val] = hist1[val] + 1;
}
for (int i = 0; i < INPUT_SIZE / 2; i++) {
    val = in2[i];
    hist2[val] = hist2[val] + 1;
}
histogram_reduce(hist1, hist2, hist);

3) Each level of a loop is considered as a task: In the BFS algorithm, there are two loops: the first loop is used to find the frontier vertices and read the corresponding rpao data, and the second loop is used to traverse the neighbors of the frontier vertices; the kernel can therefore be divided into two tasks.

// before:
loop1: for (int i = 0; i < vertex_num; i++) {
    char d = depth[i];
    if (d == level) {
        start = rpao[i];
        end = rpao[i + 1];
        loop2: for (int j = start; j < end; j++) {
            ngb_vidx = ciao[j];
            ngb_depth = depth[ngb_vidx];
            if (ngb_depth == -1) {
                depth[ngb_vidx] = level_plus1;
            }
        }
    }
}

// after:
void read_frontier_vertex(int *depth, int vertex_num, int level, int *rpao, ...) {
    ...
    for (int i = 0; i < vertex_num; i++) {
        if (depth[i] == level) {
            int start = rpao[i];
            int end = rpao[i + 1];
            start_stream << start;
            end_stream << end;
        }
    }
}

void traverse(hls::stream<int>& start_stream, hls::stream<int>& end_stream, ...) {
    ...
    while (!start_stream.empty() && !end_stream.empty()) {
        int start = start_stream.read();
        int end = end_stream.read();
        for (int j = start; j < end; j++) {
            ngb_vidx = ciao[j];
            ngb_depth = depth[ngb_vidx];
            if (ngb_depth == -1) {
                depth[ngb_vidx] = level_plus1;
            }
        }
    }
}

4) Multiple levels of loops are considered as a task: In video frame image convolution, there are a total of four levels of loops, where loop1 and loop2 are considered as the tasks for reading the pixels, and loop3 and loop4 are the tasks for calculating the convolution.

// before:
loop1: for (int line = 0; line < img_h; ++line) {
    loop2: for (int pixel = 0; pixel < img_w; ++pixel) {
        float sum_r = 0, sum_g = 0, sum_b = 0;
        loop3: for (int m = 0; m < coeff_size; ++m) {
            loop4: for (int n = 0; n < coeff_size; ++n) {
                int ii = line + m - center;
                int jj = pixel + n - center;
                if (ii >= 0 && ii < img_h && jj >= 0 && jj < img_w) {
                    sum_r += in[(ii * img_w) + jj].r * coeff[(m * coeff_size) + n];
                    sum_g += in[(ii * img_w) + jj].g * coeff[(m * coeff_size) + n];
                    sum_b += in[(ii * img_w) + jj].b * coeff[(m * coeff_size) + n];
                }
            }
        }
        ...
    }
}

// after:
void read_dataflow(hls::stream<RGBPixel>& read_stream, const RGBPixel *in, int img_w, int elements, int half) {
    int pixel = 0;
    while (elements--) {
        read_stream << in[pixel++];
    }
    ...
}

void compute_dataflow(hls::stream<RGBPixel>& write_stream, hls::stream<RGBPixel>& read_stream, const float* coefficient, int img_width, int elements, int center) {
    static RGBPixel window_mem[COEFFICIENT_SIZE][MAX_WIDTH];
    static fixed coef[COEFFICIENT_SIZE * COEFFICIENT_SIZE];
    for (int i = 0; i < COEFFICIENT_SIZE * COEFFICIENT_SIZE; i++) {
        coef[i] = coefficient[i];
    }
    ...
}

In order to demonstrate the proposed task decomposition strategy, we take BFS, with its relatively complex nested loop, as an example and present the generated program tree in Fig. 2. It shows that the nested loops in BFS are effectively identified and correctly extracted as dependent tasks.

When the tasks are decomposed, the corresponding code segments are packed into functions and the code needs to be refactored accordingly. Before proceeding to the HLS acceleration, HLSPilot needs to check the correctness of the refactored code. Specifically, we compare the refactored code to the original code by testing the execution results to ensure the computing results are consistent. We follow a bottom-up testing strategy and start from the leaf nodes of the program tree. If an error occurs, it can be traced back to the erroneous leaf node and checked from its parent node. If errors persist
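A minimal sketch of such a result-equivalence check is given below, reusing the histogram decomposition from example 2); the harness, input size, and random stimulus are assumptions for illustration and not HLSPilot's actual test infrastructure.

// Illustrative consistency check (assumed harness, not HLSPilot's own): the
// original histogram kernel and its two-task refactoring are run on the same
// input and their results are compared element by element.
#include <cstdlib>
#include <iostream>
#include <vector>

static const int INPUT_SIZE = 1024;
static const int BINS = 256;

void hist_original(const std::vector<int> &in, std::vector<int> &hist) {
    for (int i = 0; i < INPUT_SIZE; i++) hist[in[i]]++;
}

void hist_refactored(const std::vector<int> &in, std::vector<int> &hist) {
    std::vector<int> hist1(BINS, 0), hist2(BINS, 0);
    for (int i = 0; i < INPUT_SIZE / 2; i++) hist1[in[i]]++;            // first-half task
    for (int i = INPUT_SIZE / 2; i < INPUT_SIZE; i++) hist2[in[i]]++;   // second-half task
    for (int b = 0; b < BINS; b++) hist[b] = hist1[b] + hist2[b];       // histogram_reduce
}

int main() {
    std::srand(42);
    std::vector<int> in(INPUT_SIZE);
    for (int &v : in) v = std::rand() % BINS;                           // same stimulus for both
    std::vector<int> ref(BINS, 0), test(BINS, 0);
    hist_original(in, ref);
    hist_refactored(in, test);
    for (int b = 0; b < BINS; b++) {
        if (ref[b] != test[b]) {
            std::cout << "MISMATCH at bin " << b << "\n";
            return 1;
        }
    }
    std::cout << "refactored code matches the original\n";
    return 0;
}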
[Fig. 2: Program tree generated for the BFS kernel (bfs_kernel). The nested loop that traverses the nodes, finds the frontier, and processes the neighbors of the frontier is decomposed into dependent tasks, e.g., stage 1 (traverse nodes and find the frontier, read_frontier_vertex), which is further split into stage 1-1 (load node depth, load_depth) and stage 1-2 (load the frontier according to depth, load_frontier).]

Upon retrieving a suitable optimization strategy, the strategy's parameter description information and optimization example information are integrated into the prompt, utilizing the LLM's in-context learning capabilities to generate optimized code.

[Figure: prompt for selecting and applying strategies, including the instruction: "Please apply these strategies in appropriate places based on their descriptions and examples."]

IV. EXPERIMENT

A. Experiment Setting

In this section, we demonstrate the effectiveness of the HLSPilot framework for automatically generating and optimizing