
HLSPilot: LLM-based High-Level Synthesis

Chenwei Xiong¹٬², Cheng Liu¹٬²∗, Huawei Li¹٬², Xiaowei Li¹٬²

¹SKLP, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
²Dept. of Computer Science, University of Chinese Academy of Sciences, Beijing, China
{xiongchenwei22s, liucheng}@ict.ac.cn

arXiv:2408.06810v1 [cs.AR] 13 Aug 2024

Abstract—Large language models (LLMs) have catalyzed an upsurge in automatic code generation, garnering significant attention for register transfer level (RTL) code generation. Despite the potential of RTL code generation with natural language, it remains error-prone and limited to relatively small modules because of the substantial semantic gap between natural language expressions and hardware design intent. In response to these limitations, we propose a methodology that reduces the semantic gap by utilizing C/C++ for generating hardware designs via High-Level Synthesis (HLS) tools. Basically, we build a set of C-to-HLS optimization strategies catering to various code patterns, such as nested loops and local arrays. Then, we apply these strategies to sequential C/C++ code through in-context learning, which provides the LLMs with exemplary C/C++-to-HLS prompts. With this approach, HLS designs can be generated effectively. Since LLMs still face problems in determining the optimized pragma parameters precisely, we have a design space exploration (DSE) tool integrated for pragma parameter tuning. Furthermore, we also employ profiling tools to pinpoint the performance bottlenecks within a program and selectively convert the bottleneck components to HLS code for hardware acceleration. By combining the LLM-based profiling, C/C++-to-HLS translation, and DSE, we have established HLSPilot—the first LLM-enabled high-level synthesis framework, which can fully automate high-level application acceleration on hybrid CPU-FPGA architectures. According to our experiments on real-world application benchmarks, HLSPilot achieves comparable performance in general and can even outperform manually crafted counterparts, thereby underscoring the substantial promise of LLM-assisted hardware designs.

Index Terms—large language model, high-level synthesis, C-to-HLS, code generation.

I. INTRODUCTION

Hardware designing is a demanding task requiring a high level of expertise. Traditional hardware design involves coding with register transfer level (RTL) languages. However, as the complexity of hardware increases continuously with the computing requirements of applications, RTL coding becomes exceedingly time-consuming and labor-intensive. The emergence of High-Level Synthesis (HLS) enables hardware design at higher abstraction levels [1]. HLS typically employs high-level languages like C/C++ for hardware description, allowing software engineers to also engage in hardware development, which significantly lowers the expertise barrier in hardware design. Designers can focus more on the applications and algorithms rather than the details of low-level hardware implementations. HLS tools automate design tasks such as concurrency analysis of algorithms, interface design, logic unit mapping, and data management, thereby substantially shortening the hardware design cycle.

While HLS offers numerous advantages such as higher development efficiency and lower design barriers [1] [2], there are still some issues in the real-world HLS-based hardware acceleration workflow [3]. Firstly, an overall analysis of the program is of great importance: determining the performance bottlenecks of the program and handling the co-design between CPU and FPGA remain challenging issues. Besides, designs based on HLS still encounter a few major performance issues [4] [5]. Foremost, it still requires substantial optimization experience to craft high-quality HLS code and achieve the desired performance in practical development processes [6] [7]. In addition, HLS code often struggles to reach optimality due to the large design space of the various pragma parameters. Some design space exploration (DSE) tools have been proposed [8] [9] [10] [11] to automate the parameter tuning, but these tools do not fundamentally optimize the hardware design. High-quality HLS design turns out to be the major performance challenge from the perspective of general software designers. Some researchers have attempted to address this challenge by using pre-built templates for specific domain applications [12] [13] [14]. For example, ThunderGP [13] has designed a set of HLS-based templates for optimized graph processing accelerator generation, allowing designers to implement various graph algorithms by filling in the templates. However, it demands comprehensive understanding of both the domain knowledge and HLS development experience from designers, and there is still a lack of a well-established universal solution for obtaining optimized HLS code. Bridging the gap between C/C++ and HLS remains a formidable challenge requiring further efforts.

Large Language Models (LLMs) have recently exhibited remarkable capabilities in various generative tasks, including text generation, machine translation, and code generation, underscoring their advanced learning and imitation skills. These advancements have opened up possibilities for addressing hardware design challenges. Researchers have begun applying LLMs to various hardware design tasks, including general-purpose processor designs, domain-specific accelerator designs, and arbitrary RTL code generation. Among these applications, it can be observed that neural network accelerator generation utilizing a predefined template, as reported in [15], reaches an almost 100% success rate. In contrast, generating register transfer level (RTL) code from natural language descriptions, such as design specifications, experiences a considerably higher failure rate [16] [17]. This disparity is largely due to the semantic gap between the inputs and the anticipated outputs. Despite the imperfections, these works have demonstrated the great potential of exploring LLMs for hardware design.

Inspired by prior works, we introduce HLSPilot, an automated framework that utilizes LLMs to generate and optimize HLS code from sequential C/C++ code. Instead of generating RTL code from natural language directly, HLSPilot mainly leverages LLMs to generate C-like HLS code from C/C++, with a much narrower semantic gap, and eventually outputs RTL code using established HLS tools. Essentially, HLSPilot accomplishes RTL code generation from C/C++ without imposing hardware design tasks with a broad semantic gap on LLMs. Specifically, HLSPilot initiates the process with runtime profiling to pinpoint the code segments that are the performance bottleneck and require optimization. Subsequently, HLSPilot extracts the kernel code segments and applies appropriate HLS optimization strategies to the computing kernels to generate optimized HLS code. Then, HLSPilot employs a design space exploration (DSE) tool to fine-tune the parameters of the generated HLS design. Finally, HLSPilot leverages Xilinx OpenCL APIs to offload the compute kernels to the FPGA, facilitating the deployment of the entire algorithm on a hybrid CPU-FPGA architecture. In summary, LLMs are utilized throughout the entire hardware acceleration workflow, ranging from profiling, HW/SW partitioning, HLS code generation, and HLS code optimization to tool usage, thereby achieving a high degree of design automation.

The major contributions of this work are summarized as follows:

• We propose HLSPilot, the first automatic HLS code generation and optimization framework from sequential C/C++ code using LLMs. This framework investigates the use of LLMs for HLS design strategy learning and tool learning, and builds a complete hardware acceleration workflow ranging from runtime profiling, kernel identification, automatic HLS code generation, and design space exploration to HW/SW co-design on a hybrid CPU-FPGA computing architecture. The framework is open-sourced on Github¹.
• We propose a retrieval-based approach to learn HLS optimization techniques and examples from the Xilinx user manual, and utilize an in-context learning approach to apply the learned HLS optimizations to serial C/C++ code and generate optimized HLS code with LLMs for various computing kernels.
• According to our experiments on an HLS benchmark, HLSPilot can generate optimized HLS code from sequential C/C++ code, and the resulting designs can outperform manual optimizations with the assistance of DSE tools in most cases. In addition, we also demonstrate the successful use of HLSPilot as a complete hardware acceleration workflow on a hybrid CPU-FPGA architecture with a case study.

∗ Corresponding author.
This work is supported by the National Key R&D Program of China under Grant (2022YFB4500405), and the National Natural Science Foundation of China under Grant 62174162.
¹ https://github.com/xcw-1010/HLSPilot

II. RELATED WORK

A. LLM for Hardware Design

Recent works have begun to utilize LLMs to assist hardware design from different angles [15], [16], [18]–[25]. Generating RTL code from natural language is a typical approach to hardware design with LLMs. For instance, VGen [18] leverages an open-source LLM, CodeGen [26], fine-tuned with a Verilog code corpus, to generate Verilog code. Similarly, VerilogEval [19] enhances the LLM's capability to generate Verilog by constructing a supervised fine-tuning dataset; it also establishes a benchmark for evaluating LLM performance. ChipChat [24] achieves an 8-bit accumulator-based microprocessor design through multi-round natural language conversation. ChipGPT [16] proposes a four-stage zero-code logic design framework based on GPT for hardware design. These studies have successfully applied LLMs to practical hardware design. However, these methods are mostly limited to small functional modules, and the success rate drops substantially when the hardware design gets larger. GPT4AIGchip proposed in [15] can also leverage LLMs to generate efficient AI accelerators based on a hardware template, but it relies on a pre-built hardware library that requires intensive understanding of both the domain knowledge and the hardware design techniques, which can hinder its use by software developers. Recently, a domain-specific LLM for chip design, ChipNeMo [17], was proposed. ChipNeMo employs a series of domain-adaptive techniques to train an LLM capable of generating RTL code, writing EDA tool scripts, and summarizing bugs. While powerful, domain-specific LLMs face challenges such as high training costs and difficulties in data collection.

B. LLM for Code Generation

Code generation is one of the key applications of LLMs. A number of domain-specific LLMs such as CodeGen [26], CodeX [27], and CodeT5 [28] have been proposed to address the programming of popular languages such as C/C++, Python, and Java, which have a large corpus for pre-training and fine-tuning. In contrast, it can be challenging to collect a sufficient corpus for less popular languages. VGen [18] collected and filtered a Verilog corpus from Github and textbooks, obtaining only hundreds of MB of corpus. Hence, prompt engineering in combination with in-context learning provides an attractive approach to leverage LLMs to generate code for domain-specific languages. For instance, the authors in [29] augment code generation by providing the language's Backus–Naur form (BNF) grammar within prompts.

III. HLSPILOT FRAMEWORK

The remarkable achievements of LLMs across a wide domain of applications inspire us to create an LLM-driven automatic hardware acceleration design framework tailored for a hybrid CPU-FPGA architecture. Unlike previous efforts
[Fig. 1 diagram: (1) software code profiling and analysis, where profiling the software code yields a report identifying the kernel to be optimized; (2) program-tree-based task pipeline, where the kernel is refactored into pipelined stages; (3-1) automated optimization strategy learning, where strategies (introduction, application scenes, parameter descriptions, demos) are extracted from official documents into a knowledge base; (3-2) strategy retrieval and application, where retrieved strategies and their demos are assembled into a prompt (system prompt: "You are an expert in FPGA…", optimization strategies with descriptions and demos, and an optimize instruction) applied to each stage's code; (4) design space exploration tools tune the resulting HLS code; (5) hardware deployment on the CPU-FPGA platform, with host code using the Xilinx Runtime library and user APIs, and device code running as accelerators.]

Fig. 1. HLSPilot framework
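The task-pipeline stage of Fig. 1 ultimately yields kernels structured as stages connected by streams. As a rough software analogue (our illustration, not HLSPilot code: a std::queue stands in for hls::stream, and the names read_stage and compute_stage are ours):

```cpp
#include <cassert>
#include <queue>
#include <vector>

// Producer stage: pushes input elements into a FIFO channel,
// analogous to `ch << v` on an hls::stream in the generated HLS code.
void read_stage(const std::vector<int>& in, std::queue<int>& ch) {
    for (int v : in) ch.push(v);
}

// Consumer stage: drains the channel and accumulates,
// analogous to `ch.read()` in a dataflow region.
int compute_stage(std::queue<int>& ch) {
    int sum = 0;
    while (!ch.empty()) {
        sum += ch.front();
        ch.pop();
    }
    return sum;
}
```

In the actual HLS output, the two stage functions would execute concurrently inside a dataflow region rather than sequentially as in this sketch.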

that primarily focused on code generation, our objective is to harness the potential of LLMs to emulate the role of an expert engineer in hardware acceleration. Given that hardware acceleration on a hybrid CPU-FPGA architecture demands a set of different design tasks such as runtime profiling, compute kernel identification, compute kernel acceleration, design space exploration, and CPU-FPGA co-design, LLMs must understand the design guidelines and manipulate the relevant design tools to achieve the desired design objectives, akin to an engineer. Fortunately, LLMs have exhibited powerful capabilities in document comprehension, in-context learning, tool learning, and code generation, all of which align perfectly with the hardware acceleration design requirements. The intended design framework eventually provides an end-to-end high-level synthesis of sequential C/C++ code on a hybrid CPU-FPGA architecture, and is thus named HLSPilot; it will be elaborated in the rest of this section.

A. HLSPilot Overview

HLSPilot, as presented in Fig. 1, takes sequential C/C++ code as design input and mainly includes five major processing stages to generate an optimized hardware acceleration solution on a hybrid CPU-FPGA architecture.

Firstly, HLSPilot conducts runtime profiling on the high-level application code to identify the most time-consuming computing kernels, which will be the focus of subsequent optimization. In this work, we profile the target algorithm and analyze the execution time with gprof on a CPU system. Then, a detailed performance report will be generated as needed. With the report, we can conveniently obtain performance information such as the execution time distribution across the algorithm and the number of function calls. Since LLMs are capable of understanding and summarizing the textual reports, the time-consuming functions can be identified conveniently. HLSPilot extracts the computing kernels to be optimized in the next stage based on this profiling information.

Secondly, the computing kernels are organized as dependent tasks and pipelined accordingly. The dependent tasks can be implemented efficiently with the data flow mechanism supported by Xilinx HLS. Since the compute kernels can be irregular, we propose a program-tree-based strategy to refactor the program structure of the compute kernels and generate an optimized task flow graph while ensuring equivalent code functionality. Details of the automatic task pipelining will be illustrated in Section III-B.

Thirdly, we start to optimize each task with HLS independently. While there are many distinct HLS optimization strategies applicable to different high-level code patterns, we create a set of HLS optimization strategies based on the Xilinx HLS user guide and leverage LLMs to select and apply the appropriate optimization strategies automatically based on the code patterns in each task. Details of the LLM-based automatic HLS optimization will be presented in Section III-C.

Fourthly, after the code refactoring and the application of various HLS pragmas, the HLS code can be obtained, but parameters such as the initiation interval (II) for pipelining, the factors of loop unrolling, and the size of array partitioning still need to be tuned to produce accelerators with higher performance. However, it remains rather challenging for LLMs to decide the design parameters of a complex design precisely. To address this issue, HLSPilot utilizes external tools to conduct the design space exploration and decides the optimized solution automatically. According to recent research [30], LLMs are capable of learning and utilizing external APIs and tools efficiently. Hence, HLSPilot leverages LLMs to extract the parameters from the HLS code and invoke the DSE tool proposed in [31] by generating the corresponding execution scripts.

Finally, when the compute kernels are optimized with HLS, they can be compiled and deployed on FPGAs for hardware acceleration. Nonetheless, these accelerators must be integrated with a host processor to provide a holistic hardware acceleration solution. The acceleration system has both host code and device code, which will be executed on the CPU side and the FPGA side respectively. HLSPilot leverages LLMs to learn the APIs provided by the Xilinx runtime (XRT) to manage the FPGA-based accelerators and perform the data transfer between host memory and FPGA device memory. Then, it generates the host code mostly based on the original algorithm code and replaces the compute kernels with the compute APIs that invoke the FPGA accelerators and the data movement APIs. The device code is mainly the HLS code generated in the prior steps. With both the host code and the device code, the entire algorithm can be deployed on the hybrid CPU-FPGA architecture.

B. Program-Tree-based Task Pipelining

While the compute kernel can be quite complex, it needs to be split into multiple tasks for the sake of potential pipelining or parallel processing, which is critical to the performance of the generated accelerator. However, it is difficult to split the compute kernel appropriately, because inappropriate splitting may lead to imbalanced pipelining and low performance. In addition, the splitting usually requires code refactoring, which may produce code with inconsistent functionality and further complicate the problem. To address this problem, we propose a program-tree-based strategy to guide the LLM to produce fine-grained task splitting and pipelining.

The proposed program-tree-based task pipelining strategy is detailed in Algorithm 1. According to the strategy, the LLM iteratively decomposes the compute kernel into smaller tasks and eventually forms a tree structure. An input compute kernel C is denoted as the root node of the tree; hence, the initial node set of the tree is T = {C}. Then, the LLM decides whether each task in T can be further decomposed based on the complexity of the task code. If a decomposition is confirmed for taski, the LLM will perform the code decomposition. The decompositions for non-loop tasks and loop tasks are different, and they will be detailed later in this subsection. If a task cannot be further decomposed, taski is added to Tnew directly.

Algorithm 1: Program-Tree-based Pipelining Strategy
  Input: top-level function code C
  Output: task collection T = {task1, task2, ..., taskn}
   1: T ← {C}
   2: while T has a task that can be further split do
   3:     Tnew ← {}
   4:     for taski ∈ T do
   5:         if the LLM decides to further split taski then
   6:             1. for non-loop blocks: split the code based on the
                     functionality of the statement execution
   7:             2. for loop blocks: split the code based on the
                     minimum parallelizable loop granularity
   8:             add the refactored code to Tnew
   9:         else
  10:             add taski to Tnew
  11:         end
  12:     end
  13:     T ← Tnew
  14: end

The major challenge of the program-tree-based task pipelining strategy is the task decomposition metric, which depends on the code structures and can vary substantially. As a result, the metric can be difficult to quantify. Instead of using a fixed quantitative metric, we leverage LLMs to perform the task decomposition with natural language rules and typical decomposition examples. Specifically, for non-loop code, we have the LLM analyze the semantics of code statements, recognize the purpose of these statements, and group statements performing the same function into a single task. For loop code, the decomposition is primarily based on the smallest loop granularity that can be executed in parallel. We take advantage of the in-context learning capabilities of LLMs and present a few representative decomposition examples to guide the task decomposition for general scenarios. These examples are detailed as follows.

1) Each iteration of the loop is considered as a task: In the original merge sort loop, each iteration processes all intervals of the same width. Therefore, each iteration can be regarded as a task. For example, taski merges all intervals with a width equal to 2^i.

    // before:
    for (int width = 1; width < SIZE; width = 2 * width) {
        for (int i1 = 0; i1 < SIZE; i1 = i1 + 2 * width) {
            int i2 = i1 + width;
            int i3 = i1 + 2 * width;
            if (i2 >= SIZE) i2 = SIZE;
            if (i3 >= SIZE) i3 = SIZE;
            merge(A, i1, i2, i3, temp);
        }
    }

    // after:
    for (int stage = 1; stage < STAGES - 1; stage++) {
        // merge all equally wide intervals
        merge_intervals(temp[stage - 1], width, temp[stage]);
        width *= 2;
    }

2) The first and second halves of a loop's traversal are each considered as a task: In histogram statistics, since the first and second halves of the loop can be executed in parallel, they are considered as two tasks.

    // before:
    for (int i = 0; i < INPUT_SIZE; i++) {
        val = in[i];
        hist[val] = hist[val] + 1;
    }

    // after:
    for (int i = 0; i < INPUT_SIZE / 2; i++) {
        val = in1[i];
        hist1[val] = hist1[val] + 1;
    }
    for (int i = 0; i < INPUT_SIZE / 2; i++) {
        val = in2[i];
        hist2[val] = hist2[val] + 1;
    }
    histogram_reduce(hist1, hist2, hist);

3) Each level of a loop is considered as a task: In the BFS algorithm, there are two loops, with the first loop used to find the frontier vertices and read the corresponding rpao data, and the second loop used to traverse the neighbors of the frontier vertices; the kernel can be divided into two tasks based on this.

    // before:
    loop1: for (int i = 0; i < vertex_num; i++) {
        char d = depth[i];
        if (d == level) {
            start = rpao[i];
            end = rpao[i + 1];
            loop2: for (int j = start; j < end; j++) {
                ngb_vidx = ciao[j];
                ngb_depth = depth[ngb_vidx];
                if (ngb_depth == -1) {
                    depth[ngb_vidx] = level_plus1;
                }
            }
        }
    }

    // after:
    void read_frontier_vertex(int *depth, int vertex_num,
                              int level, int *rpao, ...) {
        ...
        for (int i = 0; i < vertex_num; i++) {
            if (depth[i] == level) {
                int start = rpao[i];
                int end = rpao[i + 1];
                start_stream << start;
                end_stream << end;
            }
        }
    }
    void traverse(hls::stream<int>& start_stream,
                  hls::stream<int>& end_stream, ...) {
        ...
        while (!start_stream.empty() && !end_stream.empty()) {
            int start = start_stream.read();
            int end = end_stream.read();
            for (int j = start; j < end; j++) {
                ngb_vidx = ciao[j];
                ngb_depth = depth[ngb_vidx];
                if (ngb_depth == -1) {
                    depth[ngb_vidx] = level_plus1;
                }
            }
        }
    }

4) Multiple levels of loops are considered as a task: In video frame image convolution, there are a total of 4 layers of loops, where loop1 and loop2 are considered the tasks for reading the pixels, and loop3 and loop4 are the tasks for calculating the convolution.

    // before:
    loop1: for (int line = 0; line < img_h; ++line) {
        loop2: for (int pixel = 0; pixel < img_w; ++pixel) {
            float sum_r = 0, sum_g = 0, sum_b = 0;
            loop3: for (int m = 0; m < coeff_size; ++m) {
                loop4: for (int n = 0; n < coeff_size; ++n) {
                    int ii = line + m - center;
                    int jj = pixel + n - center;
                    if (ii >= 0 && ii < img_h && jj >= 0 && jj < img_w) {
                        sum_r += in[(ii * img_w) + jj].r * coeff[(m * coeff_size) + n];
                        sum_g += in[(ii * img_w) + jj].g * coeff[(m * coeff_size) + n];
                        sum_b += in[(ii * img_w) + jj].b * coeff[(m * coeff_size) + n];
                    }
                    ...
                }

    // after:
    void read_dataflow(hls::stream<RGBPixel>& read_stream,
                       const RGBPixel *in, int img_w,
                       int elements, int half) {
        int pixel = 0;
        while (elements--) {
            read_stream << in[pixel++];
        }
        ...
    }

    void compute_dataflow(hls::stream<RGBPixel>& write_stream,
                          hls::stream<RGBPixel>& read_stream,
                          const float* coefficient, int img_width,
                          int elements, int center) {
        static RGBPixel window_mem[COEFFICIENT_SIZE][MAX_WIDTH];
        static fixed coef[COEFFICIENT_SIZE * COEFFICIENT_SIZE];
        for (int i = 0; i < COEFFICIENT_SIZE * COEFFICIENT_SIZE; i++) {
            coef[i] = coefficient[i];
        }
        ...
    }

In order to demonstrate the proposed task decomposition strategy, we take BFS, with its relatively complex nested loop, as an example and present the generated program tree in Fig. 2. It shows that the nested loops in BFS are effectively identified and extracted as dependent tasks correctly.

When the tasks are decomposed, the corresponding code segments are packed into functions and the code is refactored accordingly. Before proceeding to the HLS acceleration, HLSPilot needs to check the correctness of the refactored code. Specifically, we compare the refactored code to the original code by testing the execution results to ensure the computing results are consistent. We follow a bottom-up testing strategy and start from the leaf nodes of the program tree. If an error occurs, it can be traced back to the erroneous leaf node and checked from its parent node.
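The correctness check described above can be sketched as follows (a minimal illustration of ours, not HLSPilot's actual test harness, using the histogram kernel of example 2 as the leaf task under test):

```cpp
#include <cassert>
#include <cstdlib>
#include <vector>

// Original kernel: one histogram loop over the whole input.
void hist_orig(const std::vector<int>& in, std::vector<int>& hist) {
    for (int v : in) hist[v]++;
}

// Refactored kernel: two half-range tasks plus a reduction,
// mirroring the decomposition in example 2 above.
void hist_split(const std::vector<int>& in, std::vector<int>& hist) {
    std::vector<int> h1(hist.size(), 0), h2(hist.size(), 0);
    const std::size_t half = in.size() / 2;
    for (std::size_t i = 0; i < half; i++) h1[in[i]]++;
    for (std::size_t i = half; i < in.size(); i++) h2[in[i]]++;
    for (std::size_t b = 0; b < hist.size(); b++)
        hist[b] = h1[b] + h2[b];  // histogram_reduce step
}

// Bottom-up check for one leaf node: run both versions on random
// inputs and compare the outputs; a mismatch flags this node as
// erroneous so the check can move up to its parent.
bool equivalent(int trials) {
    for (int t = 0; t < trials; t++) {
        std::vector<int> in(256);
        for (int& v : in) v = std::rand() % 16;
        std::vector<int> a(16, 0), b(16, 0);
        hist_orig(in, a);
        hist_split(in, b);
        if (a != b) return false;
    }
    return true;
}
```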
If errors persist across multiple attempts, the program tree is backtracked and the parent node is considered as the final refactored result.

[Fig. 2 diagram: the program tree divides bfs_kernel, a nested loop that traverses nodes, finds the frontier, and processes each frontier vertex's neighbors, into pipelined stages: stage 1 (read_frontier_vertex: traverse nodes and find the frontier), refined into stage 1-1 (load_depth: load node depths) and stage 1-2 (load_frontier: load the frontier according to depth); stage 2 (read_rpao: read the neighbor information of the frontier); and stage 3 (traverse_neighbor: traverse the neighbors of the frontier), refined into stage 3-1 (read_ciao: load ciao according to rpao) and stage 3-2 (process_neighbor: update neighbor depths).]

Fig. 2. An example of program tree construction. LLM divides BFS with nested loops into multiple dependent tasks for pipelined execution.

C. LLM-based Automatic HLS Optimization

After the task pipelining, we continue to apply appropriate HLS optimization strategies to these tasks. The HLS optimization strategies are mainly extracted from the vendor's documentation [32] [33] [34] by the LLM. Since the optimizations are usually limited to specific scenarios or code patterns, there are a number of distinct strategies, but only a few of them may actually be utilized for a specific compute kernel in practice. To facilitate the automatic HLS optimization, we build an HLS optimization strategy knowledge base and propose a Retrieval-Augmented-Generation-like (RAG-like) strategy to select the most suitable optimization strategies from the knowledge base. The selected optimization strategies are applied to the target code through in-context learning, ensuring optimized HLS code generation.

The workflow of HLSPilot's RAG-like automatic optimization strategy learning is illustrated in Fig. 3. It uses the Xilinx HLS official guide documentation as input and extracts structured pragma optimization information from the documents. As shown in Fig. 4, the structured information consists of four parts: (1) a brief introduction to the optimization strategy; (2) applicable optimization scenarios; (3) parameter descriptions; (4) optimization examples. The introduction to the optimization strategy and the information on applicable scenarios are primarily used to assist in retrieving and matching the optimization strategy with the code, so this information is kept concise and general to enhance retrieval performance. Upon retrieving a suitable optimization strategy, the strategy's parameter description information and optimization example information are integrated into the prompt, utilizing the LLM's in-context learning capabilities to generate optimized code.

IV. EXPERIMENT

A. Experiment Setting

In this section, we demonstrate the effectiveness of the HLSPilot framework for automatically generating and optimizing HLS-based hardware accelerators. We utilize GPT-4 [35] as the default LLM to accomplish tasks such as HLS code analysis and optimization within the workflow. For accelerator deployment and evaluation, we adopt the Vitis HLS design flow, using the Xilinx Alveo U280 data center accelerator card. For design space exploration, we utilize GenHLSOptimizer [31] to tune the parameters.

B. Benchmark Introduction

Currently, most HLS benchmark suites [36]–[38] still face several limitations. Firstly, many benchmarks consist only of textbook-style function kernels, failing to capture the complexity of real-world applications, so evaluations on these benchmarks lack practical value. Secondly, most HLS benchmark suites only include optimized HLS designs, lacking the corresponding unoptimized versions, which is unfriendly for evaluating the effectiveness of HLS optimization strategies.

To address these issues and accurately evaluate the performance of the accelerators generated by our HLSPilot, we designed a benchmark suite that considers both the complexity of the designs and the convenience of comparing optimization effects. This benchmark suite consists of two parts: modified Rosetta benchmarks [38] and a set of manually collected benchmarks. The Rosetta benchmarks comprise a series of complex real-world applications such as 3D rendering, digit recognition, and spam filtering. Each application has both a software implementation and a corresponding HLS implementation. The original Rosetta benchmarks were implemented using SDSoC. We ported these designs to Vitis and propose corresponding unoptimized HLS designs, without any optimization strategies, based on the software implementations of the applications. Additionally, as a supplement, we collected and implemented several other classic algorithm applications. Similarly, these applications also include unoptimized versions.

C. Experiment Results and Analysis

Experiment results. Table I shows the runtime of the original unoptimized design, the manually optimized design, the HLSPilot-generated design, and the HLSPilot-generated design with DSE for each application in the benchmarks. The results indicate a significant improvement in performance over the unoptimized design when utilizing HLSPilot-generated designs. Overall, HLSPilot-generated designs achieve comparable performance to those manually optimized by human experts, while greatly reducing labor costs. With the utilization of DSE
[Fig. 3 depicts a two-step workflow. Step 1, "Build Optimization Strategy Knowledge Base": each strategy (function inline, loop flatten, loop pipeline, loop unroll, data pack, array partition, cache optimize, ...) is stored with a strategy introduction and application scenes, used for strategy retrieval, plus parameter descriptions and examples, used for in-context learning. Step 2, "Retrieve and Apply Strategies": strategies are retrieved based on the code content, e.g., [cache optimize, data pack, array partition], and a prompt is generated that combines a system prompt ("You are an expert in FPGA... Your goal is to optimize this HLS code to make it work more efficiently on FPGA."), the code content, the selected strategies with their parameter descriptions and examples, and the instruction "Please apply these strategies in appropriate places based on their descriptions and examples."]

Fig. 3. Automatic Optimization Strategies Learning and Application
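The retrieve-and-apply flow in Fig. 3 can be sketched as follows. This is a hypothetical illustration: the strategy records, the keyword-overlap scoring, and the prompt text are placeholders, not HLSPilot's actual implementation.

```python
# Hypothetical sketch of the Fig. 3 flow. Each record keeps a concise
# "scenes" field for retrieval and a "demo" field for in-context learning.
KNOWLEDGE_BASE = {
    "array partition": {
        "scenes": "array bram memory bank parallel access loop",
        "demo": "#pragma HLS array_partition variable=buf complete",
    },
    "loop pipeline": {
        "scenes": "loop iteration latency initiation interval",
        "demo": "#pragma HLS pipeline II=1",
    },
    "data pack": {
        "scenes": "struct field memory layout wide word",
        "demo": "#pragma HLS data_pack variable=AB",
    },
}

def retrieve_strategies(code, top_k=2):
    """Rank strategies by naive keyword overlap between code and scenes."""
    tokens = set(code.lower().replace("[", " ").replace("]", " ").split())
    scores = {
        name: len(tokens & set(rec["scenes"].split()))
        for name, rec in KNOWLEDGE_BASE.items()
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [n for n in ranked if scores[n] > 0][:top_k]

def build_prompt(code, strategies):
    """Assemble an optimization prompt from the selected strategy records."""
    demos = "\n".join(
        f"{i}. {name}: {KNOWLEDGE_BASE[name]['demo']}"
        for i, name in enumerate(strategies, 1)
    )
    return (
        "You are an expert in FPGA. Optimize this HLS code.\n"
        f"Code:\n{code}\n"
        f"Suitable strategies:\n{demos}\n"
        "Apply these strategies in appropriate places."
    )

code = "for (int i = 0; i < N; i++) acc += buf[i]; // loop over array"
picked = retrieve_strategies(code)
prompt = build_prompt(code, picked)
```

A production system would use embedding-based retrieval rather than keyword overlap, but the record structure mirrors the four-part format shown in Fig. 4.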

TABLE I
BENCHMARK RUNTIME (MS) ON XILINX ALVEO U280

Application       | original | handcrafted | HLSPilot | HLSPilot + DSE
Fir               | 0.413    | 0.279       | 0.245    | 0.227
Merge Sort        | 786.618  | 54.878      | 47.580   | 47.460
BFS               | 5018.551 | 3973.645    | 4184.273 | 3830.421
PageRank          | 1862.214 | 1254.833    | 1114.991 | 1050.617
3D Rendering      | 9.177    | 4.918       | 5.375    | 5.146
Digit Recognition | 9917.663 | 9.892       | 78.837   | 52.832
Face Detection    | 83.752   | 55.909      | 64.372   | 59.138
Optical Flow      | 101.313  | 54.084      | 71.932   | 63.184
Spam Filter       | 9278.917 | 37.346      | 8013.913 | 7519.317

[Fig. 4 shows the structured information extracted for the data_pack pragma:
Strategy Overview: #pragma HLS data_pack is a compilation directive to pack the data fields of a structure into a scalar with a wider bit width ...
Application Scenes:
- Optimize the memory layout of structures to reduce storage space requirements.
- Improve memory access efficiency, allowing simultaneous read and write access to all members of a structure.
Parameter Description:
- variable=<variable>: Specifies the structure variable to be packed.
- instance=<name>: Optional parameter that specifies the name of the resulting variable after packing.
- <byte_pad>: Optional parameter specifying whether to pack data on 8-bit boundaries. Supports two values:
  - struct_level: ...
Examples:
1. Pack a structure array AB[17] with three 8-bit fields (R, G, B) into a new 24-bit array with 17 elements.

    typedef struct {
        unsigned char R, G, B;
    } pixel;
    pixel AB[17];
    #pragma HLS data_pack variable=AB

2. Pack a structure pointer AB with three 8-bit fields ...]
Fig. 4. Structured information extracted by HLSPilot. The optimization strategy from the documents is summarized into four parts: (1) strategy overview and (2) applicable scenarios for strategy retrieval; (3) parameter description and (4) examples for generating the optimization prompt.

tools, some HLSPilot-generated designs can even outperform human designs.

Analysis of the results. Table II shows the major optimization strategies adopted by the human experts' designs and the HLSPilot-generated designs, respectively. It can be noted that HLSPilot selected appropriate optimization strategies for the different applications, largely covering the optimizations selected by the human experts. The performance gap between HLSPilot and the human experts mainly comes from the specific implementation of the optimizations. For example, for dataflow pipelining optimization, there are various ways to split the same kernel, and the rich experience of human experts may lead to more reasonable task partitioning. In addition, the LLM struggles to implement optimizations tailored to specific scenarios. For instance, in the spam filter application, achieving LUT optimization for the sigmoid function requires sampling the function and generating a specific lookup table, while also considering issues such as quantization precision, which is difficult for the LLM to implement.

D. Case Study

To further verify the practicality of HLSPilot in real-world applications, we selected the L-BFGS algorithm [39] and performed a complete hardware acceleration workflow for it using HLSPilot on the hybrid CPU-FPGA platform.
TABLE II
MAJOR OPTIMIZATION STRATEGIES USED IN HANDCRAFTED DESIGN AND HLSPILOT-GENERATED DESIGN

Application       | manual                                                                                 | HLSPilot
Fir               | Loop unrolling, Loop pipelining, Memory optimization                                   | Loop unrolling, Loop pipelining
Merge Sort        | Dataflow pipelining, Memory optimization, Loop unrolling                               | Dataflow pipelining, Memory optimization, Loop unrolling
BFS               | Dataflow pipelining, Memory optimization                                               | Dataflow pipelining, Memory optimization
PageRank          | Dataflow pipelining, Memory optimization                                               | Dataflow pipelining, Memory optimization
3D Rendering      | Dataflow pipelining, Communication optimization, Memory optimization                   | Dataflow pipelining, Communication optimization
Digit Recognition | Dataflow pipelining, Loop unrolling, Loop pipelining, Datatype optimization            | Loop unrolling, Loop pipelining
Face Detection    | Memory optimization, Datatype optimization                                             | Dataflow pipelining, Memory optimization
Optical Flow      | Dataflow pipelining, Memory optimization, Datatype optimization, Loop pipelining       | Dataflow pipelining, Memory optimization, Communication optimization
Spam Filter       | Dataflow pipelining, Memory optimization, Communication optimization, LUT optimization | Dataflow pipelining, Memory optimization

Introduction to the L-BFGS algorithm. The L-BFGS algorithm is one of the commonly used algorithms in machine learning for solving unconstrained optimization problems. When performing gradient descent, the L-BFGS algorithm approximates the inverse Hessian matrix using only a limited amount of past gradient information, greatly reducing the required storage. However, due to its large number of iterations, the algorithm performs poorly on the CPU, typically taking several hours for each search process.

Complete acceleration workflow of HLSPilot. In this case, we wrote C++ software code for the L-BFGS algorithm as the input to HLSPilot. HLSPilot first ran the sequential C++ code of the algorithm on the CPU and generated a profiling report using the gprof tool, which includes detailed function runtimes and call counts. According to HLSPilot's analysis, the cost calculation function in L-BFGS accounts for more than 99.1% of the total runtime of the algorithm and is the performance bottleneck of the program. Therefore, this part was extracted as the kernel for hardware acceleration. Next, HLSPilot performed task pipelining on the kernel code, partitioning the cost calculation process into three tasks: cost and convolution calculation, reconstruction error gradient calculation, and gradient check. Subsequently, HLSPilot applied appropriate optimization strategies to each task. The major optimization strategies employed in this stage included local buffer optimization, loop unrolling, array partitioning, and others. In particular, HLSPilot noticed that the cost calculation process involved a significant amount of floating-point computation, so it performed floating-point to fixed-point conversion on the code, further optimizing the computational performance of the kernel. Finally, HLSPilot determined the pragma parameters through DSE tools.

Acceleration result. We evaluated the cost calculation runtime and the algorithm's total runtime on both the CPU and the CPU-FPGA platform, as shown in Table III. L-BFGS-CPU represents the algorithm running on the CPU, while HLSPilot-FP and HLSPilot-FXP represent the floating-point and fixed-point designs generated by HLSPilot, respectively. Overall, HLSPilot's floating-point and fixed-point designs accelerate the end-to-end runtime by 7.78x and 11.93x, respectively. Notably, for the cost calculation, HLSPilot achieves an acceleration of more than 500x, which fully demonstrates the effectiveness of HLSPilot's acceleration.

TABLE III
ACCELERATION RESULT ON L-BFGS ALGORITHM

Design       | CostCalc. Runtime(s) | Total Runtime(s) | CostCalc. Speedup | End-to-end Speedup
CPU          | 18237                | 18390            | -                 | -
HLSPilot-FP  | 855                  | 2365             | 21.33x            | 7.78x
HLSPilot-FXP | 31                   | 1541             | 588.29x           | 11.93x

Table IV shows the resource overhead and runtime of the cost calculation kernel in L-BFGS. Runtime in Table IV represents the time taken to execute one instance of the cost calculation. It is evident that HLSPilot can effectively optimize the performance bottlenecks of the algorithm, significantly enhancing performance.

TABLE IV
COSTCALC. KERNEL RESOURCE OVERHEAD AND RUNTIME

Kernels    | #LUTs  | #FFs   | #BRAMs | #DSPs | Runtime(ms)
CPU        | -      | -      | -      | -     | 38529.08
kernel-FP  | 54970  | 66459  | 46     | 107   | 1680.84
kernel-FXP | 188294 | 245018 | 270    | 624   | 60.9811

V. CONCLUSION

In this paper, we have introduced HLSPilot, the first LLM-driven HLS framework to automate the generation of hardware accelerators on CPU-FPGA platforms. HLSPilot focuses on the transformation from sequential C/C++ code to optimized HLS code, which greatly reduces the semantic gap between design intent and hardware code. Additionally, the integration of profiling tools and DSE tools enables automatic hardware/software partitioning and pragma tuning. Through the combined efforts of the various LLM-driven modules, HLSPilot automatically generates high-performance hardware accelerators. The kernel optimization results on the benchmark fully demonstrate the potential of HLSPilot, showing its ability to achieve comparable, and in some cases superior, performance relative to manually designed FPGA kernels. In addition, we also performed a complete hardware acceleration workflow for a real-world algorithm, achieving an 11.93x speedup on the hybrid CPU-FPGA platform. These results highlight the significant potential of LLMs, suggesting a promising future for LLM-assisted methodologies in hardware design.
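As a quick arithmetic check, the speedups reported in Table III follow directly from the measured runtimes (values copied from Table III):

```python
# Recompute the Table III speedups from the reported runtimes (seconds).
cpu_cost, cpu_total = 18237.0, 18390.0
fp_cost, fp_total = 855.0, 2365.0
fxp_cost, fxp_total = 31.0, 1541.0

fp_cost_speedup = cpu_cost / fp_cost        # cost calc., floating-point design
fp_total_speedup = cpu_total / fp_total     # end-to-end, floating-point design
fxp_cost_speedup = cpu_cost / fxp_cost      # cost calc., fixed-point design
fxp_total_speedup = cpu_total / fxp_total   # end-to-end, fixed-point design
```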
REFERENCES

[1] G. Martin and G. Smith, "High-level synthesis: Past, present, and future," IEEE Design & Test of Computers, vol. 26, no. 4, pp. 18–25, 2009.
[2] C. Liu, X. Chen, B. He, X. Liao, Y. Wang, and L. Zhang, "Obfs: Opencl based bfs optimizations on software programmable fpgas," in 2019 International Conference on Field-Programmable Technology (ICFPT). IEEE, 2019, pp. 315–318.
[3] X. Zhang, Z. Feng, S. Liang, X. Chen, C. Liu, H. Li, and X. Li, "Graphitron: A domain specific language for fpga-based graph processing accelerator generation," arXiv preprint arXiv:2407.12575, 2024.
[4] S. Lahti, P. Sjövall, J. Vanne, and T. D. Hämäläinen, "Are we there yet? a study on the state of high-level synthesis," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 38, no. 5, pp. 898–911, 2018.
[5] B. C. Schafer and Z. Wang, "High-level synthesis design space exploration: Past, present, and future," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 39, no. 10, pp. 2628–2639, 2019.
[6] J. Zhao, L. Feng, S. Sinha, W. Zhang, Y. Liang, and B. He, "Performance modeling and directives optimization for high-level synthesis on fpga," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 39, no. 7, pp. 1428–1441, 2019.
[7] C. Liu, H.-C. Ng, and H. K.-H. So, "Quickdough: A rapid fpga loop accelerator design framework using soft cgra overlay," in 2015 International Conference on Field Programmable Technology (FPT). IEEE, 2015, pp. 56–63.
[8] A. Sohrabizadeh, C. H. Yu, M. Gao, and J. Cong, "Autodse: Enabling software programmers to design efficient fpga accelerators," ACM Transactions on Design Automation of Electronic Systems (TODAES), vol. 27, no. 4, pp. 1–27, 2022.
[9] Y.-k. Choi and J. Cong, "Hls-based optimization and design space exploration for applications with variable loop bounds," in 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). IEEE, 2018, pp. 1–8.
[10] G. Zhong, A. Prakash, S. Wang, Y. Liang, T. Mitra, and S. Niar, "Design space exploration of fpga-based accelerators with multi-level parallelism," in Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017. IEEE, 2017, pp. 1141–1146.
[11] L. Ferretti, A. Cini, G. Zacharopoulos, C. Alippi, and L. Pozzi, "Graph neural networks for high-level synthesis design space exploration," ACM Transactions on Design Automation of Electronic Systems, vol. 28, no. 2, pp. 1–20, 2022.
[12] E. Luo, H. Huang, C. Liu, G. Li, B. Yang, Y. Wang, H. Li, and X. Li, "Deepburning-mixq: An open source mixed-precision neural network accelerator design framework for fpgas," in 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD). IEEE, 2023, pp. 1–9.
[13] X. Chen, H. Tan, Y. Chen, B. He, W.-F. Wong, and D. Chen, "Thundergp: Hls-based graph processing framework on fpgas," in The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2021, pp. 69–80.
[14] S. Liang, C. Liu, Y. Wang, H. Li, and X. Li, "Deepburning-gl: an automated framework for generating graph neural network accelerators," in Proceedings of the 39th International Conference on Computer-Aided Design, 2020, pp. 1–9.
[15] Y. Fu, Y. Zhang, Z. Yu, S. Li, Z. Ye, C. Li, C. Wan, and Y. C. Lin, "Gpt4aigchip: Towards next-generation ai accelerator design automation via large language models," in 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD). IEEE, 2023, pp. 1–9.
[16] K. Chang, Y. Wang, H. Ren, M. Wang, S. Liang, Y. Han, H. Li, and X. Li, "Chipgpt: How far are we from natural language hardware design," arXiv preprint arXiv:2305.14019, 2023.
[17] M. Liu, T.-D. Ene, R. Kirby, C. Cheng, N. Pinckney, R. Liang, J. Alben, H. Anand, S. Banerjee, I. Bayraktaroglu et al., "Chipnemo: Domain-adapted llms for chip design," arXiv preprint arXiv:2311.00176, 2023.
[18] S. Thakur, B. Ahmad, Z. Fan, H. Pearce, B. Tan, R. Karri, B. Dolan-Gavitt, and S. Garg, "Benchmarking large language models for automated verilog rtl code generation," in 2023 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2023, pp. 1–6.
[19] M. Liu, N. Pinckney, B. Khailany, and H. Ren, "Verilogeval: Evaluating large language models for verilog code generation," in 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD). IEEE, 2023, pp. 1–8.
[20] Y. Tsai, M. Liu, and H. Ren, "Rtlfixer: Automatically fixing rtl syntax errors with large language models," arXiv preprint arXiv:2311.16543, 2023.
[21] S. Liu, W. Fang, Y. Lu, Q. Zhang, H. Zhang, and Z. Xie, "Rtlcoder: Outperforming gpt-3.5 in design rtl generation with our open-source dataset and lightweight solution," arXiv preprint arXiv:2312.08617, 2023.
[22] S. Thakur, J. Blocklove, H. Pearce, B. Tan, S. Garg, and R. Karri, "Autochip: Automating hdl generation using llm feedback," arXiv preprint arXiv:2311.04887, 2023.
[23] Y. Lu, S. Liu, Q. Zhang, and Z. Xie, "Rtllm: An open-source benchmark for design rtl generation with large language model," in 2024 29th Asia and South Pacific Design Automation Conference (ASP-DAC). IEEE, 2024, pp. 722–727.
[24] J. Blocklove, S. Garg, R. Karri, and H. Pearce, "Chip-chat: Challenges and opportunities in conversational hardware design," in 2023 ACM/IEEE 5th Workshop on Machine Learning for CAD (MLCAD). IEEE, 2023, pp. 1–6.
[25] Z. Jiang, Q. Zhang, C. Liu, H. Li, and X. Li, "Iicpilot: An intelligent integrated circuit backend design framework using open eda," arXiv preprint arXiv:2407.12576, 2024.
[26] E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang, Y. Zhou, S. Savarese, and C. Xiong, "Codegen: An open large language model for code with multi-turn program synthesis," arXiv preprint arXiv:2203.13474, 2022.
[27] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman et al., "Evaluating large language models trained on code," arXiv preprint arXiv:2107.03374, 2021.
[28] Y. Wang, W. Wang, S. Joty, and S. C. Hoi, "Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation," arXiv preprint arXiv:2109.00859, 2021.
[29] B. Wang, Z. Wang, X. Wang, Y. Cao, R. A. Saurous, and Y. Kim, "Grammar prompting for domain-specific language generation with large language models," Advances in Neural Information Processing Systems, vol. 36, 2024.
[30] T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, and T. Scialom, "Toolformer: Language models can teach themselves to use tools," Advances in Neural Information Processing Systems, vol. 36, 2024.
[31] aferikoglou, "Genhlsoptimizer," 2022. [Online]. Available: https://github.com/aferikoglou/GenHLSOptimizer
[32] Xilinx, "Vivado design suite user guide: High-level synthesis (ug902)," 2020. [Online]. Available: https://docs.amd.com/v/u/en-US/ug902-vivado-high-level-synthesis
[33] Xilinx, "Vivado hls optimization methodology guide (ug1270)," 2018. [Online]. Available: https://docs.amd.com/v/u/en-US/ug1270-vivado-hls-opt-methodology-guide
[34] Xilinx, "Vitis high-level synthesis user guide (ug1399)," 2023. [Online]. Available: https://docs.amd.com/r/en-US/ug1399-vitis-hls/Navigating-Content-by-Design-Process
[35] OpenAI, "Gpt-4," 2023. [Online]. Available: https://platform.openai.com/docs/models/gpt-4-and-gpt-4-turbo
[36] Y. Hara, H. Tomiyama, S. Honda, H. Takada, and K. Ishii, "Chstone: A benchmark program suite for practical c-based high-level synthesis," in 2008 IEEE International Symposium on Circuits and Systems (ISCAS). IEEE, 2008, pp. 1192–1195.
[37] B. C. Schafer and A. Mahapatra, "S2cbench: Synthesizable systemc benchmark suite for high-level synthesis," IEEE Embedded Systems Letters, vol. 6, no. 3, pp. 53–56, 2014.
[38] Y. Zhou, U. Gupta, S. Dai, R. Zhao, N. Srivastava, H. Jin, J. Featherston, Y.-H. Lai, G. Liu, G. A. Velasquez et al., "Rosetta: A realistic high-level synthesis benchmark suite for software programmable fpgas," in Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2018, pp. 269–278.
[39] D. C. Liu and J. Nocedal, "On the limited memory bfgs method for large scale optimization," Mathematical Programming, vol. 45, no. 1, pp. 503–528, 1989.
