Vortex: Efficient Sample-Free Dynamic Tensor Program Optimization via Hardware-aware
Strategy Space Hierarchization

Yangjie Zhou^* (Tencent), Honglin Zhu^* (Tencent), Qian Qiu (Tencent), Weihao Cui (Shanghai Jiao Tong University), Zihan Liu (Shanghai Jiao Tong University), Cong Guo (Shanghai Jiao Tong University), Siyuan Feng (Shanghai Jiao Tong University), Jintao Meng (Shenzhen Institute of Advanced Technology), Haidong Lan (Taichi Graphics), Jingwen Leng (Shanghai Jiao Tong University), Wenxi Zhu^† (Tencent), and Minwen Deng^† (Tencent)

Abstract.

Dynamic-shape deep neural networks (DNNs) are rapidly evolving, attracting attention for their ability to handle variable input sizes in real-time applications. However, existing compilation optimization methods for such networks often rely heavily on predefined samples to guide the compilation process, which restricts their adaptability and efficiency. These sample-driven methods struggle to efficiently manage the diverse and unpredictable shapes encountered in real-world scenarios, often resulting in suboptimal performance.

To tackle these issues, we introduce $Vortex$ , a hardware-driven and sample-free compiler tailored for dynamic-shape tensor programs. $Vortex$ capitalizes on detailed hardware information and hierarchizes the strategy space to facilitate high-performance code generation without relying on runtime shape samples. It features a unique bidirectional compilation workflow, combining top-down abstraction for aligning tensor program execution with hardware hierarchies and bottom-up kernel construction to narrow the search space, enabling $Vortex$ to achieve remarkable efficiency. Comprehensive evaluations confirm that $Vortex$ reduces compilation time by $176\times$ compared to the existing dynamic-shape compiler. Additionally, it substantially outperforms existing vendor-provided libraries and dynamic-shape compilers on both CPU and GPU platforms, delivering speedups of $2.53\times$ and $3.01\times$ , respectively.

^*Yangjie Zhou and Honglin Zhu contributed equally to this work.

^†Wenxi Zhu and Minwen Deng are the corresponding authors of this paper.

1. Introduction

Efficient optimization of tensor programs is crucial in accelerating Deep Neural Network (DNN) models (lecun2015deep, ) and large language models (LLMs) (llm_survey, ). Modern frameworks, e.g., PyTorch (pytorch, ) and TensorFlow (tensorflow, ), and compilers such as TVM (tvm, ) rely heavily on tensor programs to abstract DNN computational operators. Traditional compilers (tvm, ; halide, ; ansor, ; tensorir, ) have primarily focused on optimizing static-shape DNNs, where tensor computations involve fixed-shape inputs and outputs at runtime. In contrast, dynamic-shape DNNs, which handle variable input shapes at runtime, have become a significant area of interest (dynamic_survey, ; dietcode, ; bladedisc, ; nimble, ). For example, large language models (llm_survey, ) based on Transformers (attention, ) often adopt variable sequence lengths, necessitating dynamic-shape tensor computation. Effectively managing tensor computation with dynamic shapes is pivotal for optimizing neural network performance. However, the flexibility introduced by dynamic shapes poses challenges for optimizing tensor programs.

Figure 1. Comparison of $Vortex$ with existing methods.

As illustrated in Figure 1, two main types of solutions have evolved to address the complexities of optimizing dynamic-shape tensor program computation. The first solution is the vendor-provided library, exemplified by oneDNN (onednn, ) for Intel CPUs and cuBLAS (cublas, ) for Nvidia GPUs, offering handcrafted implementations of DNN operators. While effective in certain scenarios, these libraries face limitations due to their empirical programming strategy, which does not offer the necessary flexibility for broad adaptability (dietcode, ). Additionally, the high development cost required to create these handcrafted solutions further constrains their efficiency (ansor, ).

The second category encompasses existing sample-driven dynamic-shape compilers (dietcode, ; nimble, ), as shown in the middle of Figure 1. Similar in workflow to static-shape compilers (tvm, ; ansor, ), these dynamic compilers take the tensor program and tensor samples as inputs to generate executable kernels. Typically, tensor compilers construct a substantial search space to implement optimization strategies such as loop partitioning, fusion, and reordering (tvm, ; ansor, ). Existing dynamic-shape compilers (dietcode, ; nimble, ) usually adopt a shape-generic search space, using tensor samples to represent shape information and auto-tuning micro-kernels for each specific sample in the offline phase. At runtime, they use a selector to integrate the micro-kernels into the computation. However, their reliance on predefined shape samples limits their flexibility and effectiveness, particularly when tensor shapes fall outside the predefined range. This limitation can result in performance degradations of up to $4\times$ for unsampled shapes (§2.2). Furthermore, these approaches necessitate frequent re-tuning through profiling on actual hardware to accommodate sample variations, which incurs considerable overhead, often taking hours to days (dietcode, ; nimble, ).

Recognizing the limitations of existing methodologies, which predominantly rely on sample-based compilation (dietcode, ; nimble, ) or intensive manual implementation (cublas, ; cudnn, ), we underscore the need for an innovative approach that fully exploits the capabilities of hardware architecture. By adopting a hierarchical approach, we decouple the kernel into multiple levels: during the offline stage, we leverage detailed hardware information to construct hardware-friendly micro-kernels. Subsequently, during runtime, we utilize the shape information to select and configure micro-kernels dynamically, crafting shape-friendly implementations tailored to the needs of dynamic-shape tensor programs. This methodology not only enhances tensor program execution but also eliminates the reliance on predefined shape samples, marking a significant shift towards dynamic, real-time compilation strategies.

In this work, we propose $Vortex$ , a hardware-driven and sample-free compiler tailored for dynamic-shape tensor programs. To achieve these goals, we employ a novel bidirectional method to integrate hardware information into the compilation process. Specifically, this bidirectional approach combines a top-down alignment of the software tensor program with the hardware hierarchy and a bottom-up construction that dynamically adapts to changing tensor shapes, as demonstrated in Figure 1.

First, $Vortex$ employs a top-down abstraction strategy to recursively decouple the tensor program, ensuring alignment with the hardware’s hierarchical architecture. For instance, for a CUDA kernel on a GPU, this process decouples the tensor program and maps it to the Grid, Block, and Thread levels of the software programming model and to the corresponding hardware structures: device, streaming multiprocessors (SMs), and registers (a100_whitepaper, ). By mirroring the hardware’s hierarchical organization, $Vortex$ optimizes the execution implementation and resource allocation specific to each hardware level. This tailored approach ensures that the software exploits the full potential of the hardware architecture, leading to a significant boost in kernel performance.

Second, $Vortex$ generates the multi-level micro-kernels using a bottom-up construction strategy. $Vortex$ uses hardware parameter information to prune a large portion of the strategy space, leading to more efficient code generation. An iterative constructor, progressing from lower to higher levels, utilizes this pruned strategy space to guide the construction of the kernels. This approach reduces the search space and accelerates the development process, thereby enhancing efficiency. Furthermore, $Vortex$ incorporates a hybrid analyzer that combines analytical (chimera, ; micro23_path, ) and empirical approaches (tvm, ; ansor, ) to evaluate the performance of various strategies, thereby enabling high-performance code generation with minimal system overhead.

$Vortex$ stands out due to its superior runtime performance and significantly reduced compilation overhead. We conduct a comprehensive evaluation of $Vortex$ , including a detailed comparative analysis with state-of-the-art vendor-provided libraries (oneDNN (onednn, ), ONNX Runtime (onnxruntime, ), cuBLAS (cublas, ), cuDNN (cudnn, ) and CUTLASS (cutlass, )) and the dynamic-shape compiler DietCode (dietcode, ). These assessments are conducted across various hardware platforms, including Intel CPUs (intel_xeon_whitepaper, ) and Nvidia GPUs (a100_whitepaper, ), and at both operator and model levels. Notably, $Vortex$ is on average $\mathbf{2.53\times}$ and $\mathbf{3.01\times}$ faster than vendor-provided libraries and DietCode, respectively. Additionally, $Vortex$ significantly accelerates offline compilation, achieving a $\mathbf{176\times}$ improvement over DietCode. Overall, the results highlight $Vortex$ ’s ability to outperform existing solutions across both CPU and GPU platforms.

In general, our work makes the following contributions:

• We identify hardware-driven opportunities as pivotal factors for optimizing dynamic-shape compilation, highlighting the limitations of existing sample-driven solutions in terms of flexibility and functionality.

• We propose rKernel, a top-down unified abstraction to decouple tensor programs and align their strategy space with hardware hierarchies, ensuring that the software fully exploits the potential of the backend hardware.

• We introduce a bottom-up kernel construction approach, leveraging hardware information to prune the strategy space efficiently. This method facilitates the generation of kernels for dynamic-shape tensor programs without reliance on runtime shape samples.

• We implement $Vortex$ , a novel dynamic-shape compiler. Our comprehensive evaluation demonstrates $Vortex$ ’s superiority over existing dynamic-shape compilers and manual optimization techniques on CPU and GPU platforms.

2. Background and Motivation

In this section, we begin by presenting a concise background on dynamic-shape tensor programs, outlining their definition and real-world computational scenarios. We then move to delineate the inherent constraints of the existing sample-driven optimization approach. Finally, we explore the opportunities and challenges in hardware-driven solutions.

2.1. Dynamic-shape Tensor Program

Tensor programs act as an operator-level abstraction and are widely used across different tasks in neural network computation (pytorch, ; tensorflow, ; tvm, ; ansor, ). Conventional tensor programs inherently incorporate static shape information as an integral part of their input. In contrast, dynamic-shape tensor programs operate on tensors whose shapes are unknown until runtime (dynamic_survey, ). Dynamic-shape tensor programs have arisen in response to a twofold demand, driven by intrinsic data format dynamism and by system execution and scheduling strategies.

Intrinsic data format dynamism is one of the fundamental driving forces behind dynamic-shape tensor programs. For example, in natural language processing (NLP) (bert, ), the inherent variability in sequence lengths is a characteristic feature that traditional static-shape tensor programs find challenging to accommodate. Sentences of diverse lengths necessitate a flexible framework capable of dynamically adapting to varying input sizes. In computer vision (CV) tasks, conventional methods rely on fixed image sizes for tensor program input, which limits flexibility (mobilenet, ). Recent developments (fast-rcnn, ; scale_fact_dect, ; dynamic_stride_net, ; dynamic_survey, ) have introduced dynamic-shape tensors as input, enhancing support for more advanced detection and tracking tasks. Additionally, in the field of graph neural networks (GNNs) (gnn_survey, ; GNNAdvisor, ), the dynamic nature of graph structures, marked by varying numbers of vertices and edges, necessitates the use of dynamic-shape tensor programs. These examples highlight the essential need for systems that seamlessly manage the intrinsic variability in diverse data formats.

System execution and scheduling strategies further underscore the importance of dynamic-shape tensor programs. For example, dynamic adjustment of batch sizes in the execution of neural networks introduces variability that demands adaptability in the underlying tensor program (dvabatch, ; lazybatch, ; orca, ). The ability to efficiently handle varying batch sizes is crucial for optimizing resource utilization and achieving optimal performance in real-world applications.

2.2. Limitations of Sample-Driven Approach

Many optimizations have been proposed for dynamic-shape tensor programs, all following a similar workflow (nimble, ; dietcode, ). We term this methodology the sample-driven approach. In this subsection, we first demonstrate its workflow and then discuss the inherent limitations.

As depicted in Figure 2, the current approach typically employs a sample list to describe the dynamic-shape parameters, combining the known static shapes and the tensor program to staticize the dynamic-shape tensor program into a shape-generic search space. An auto-tuning module, adapted from static-shape compilers (tvm, ; ansor, ; tensorir, ), is subsequently utilized. It takes the tensor program with unspecified tile parameters and searches the comprehensive search space for high-performance tiling configurations. This process produces fine-tuned micro-kernels for each input sample. At runtime, a decision-tree-based selector is employed to choose the appropriate micro-kernel based on the runtime shape. The kernel constructor finalizes the process by setting kernel launch parameters and incorporating padding to extend the applicability of these micro-kernels to runtime shapes not included in the sample list, utilizing the pre-compiled micro-kernels for runtime execution.

However, this approach faces significant limitations due to its reliance on a predetermined sample list for dynamic-shape parameters. This limitation restricts the compiler’s flexibility and overall functionality. In situations where input shapes are not included in the sample list, this sample-driven dynamic compilation optimization does not consistently provide high-performance computing support. This is a critical shortfall, as the diversity and variability of real-world data often extend beyond the scope of the predefined sample list.

To empirically validate this limitation, we conduct an experiment targeting the first general matrix multiply (GEMM) operation of the BERT model (bert, ). This GEMM operation entails the multiplication of two matrices, $A$ and $B$ . In this context, $M$ denotes the number of rows in matrix $A$ , computed as the product of batch size and sequence length. $N$ represents the number of columns in matrix $B$ , fixed at 768, and $K$ corresponds to the number of columns in matrix $A$ (and rows in matrix $B$ ), fixed at 2304. Utilizing DietCode’s default sample configuration, tests are performed with a fixed batch size of 16 and a sequence length varying from 5 to 128 in increments of 19.

Figure 3 shows that DietCode exhibits a significant performance discrepancy for dynamic parameters not included in the sample list, as compared to the results achieved with the vendor’s library cuBLAS (cublas, ). This is due to the absence of specifically optimized micro-kernels for these shapes and the increased inefficiency from padding loss. Moreover, this method’s rigidity becomes evident when considering the need to modify the sample list to accommodate various computational scenarios. Such modifications necessitate re-tuning of the system, which worsens its inflexibility.
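To make the padding loss concrete, the following sketch computes $M$ for the sequence lengths used in the experiment and the fraction of rows that would be padding if $M$ were rounded up to a whole number of tiles. The tile height of 128 is an assumption chosen purely for illustration; it is not a DietCode setting, and the helper names are hypothetical.

```python
def padded_rows(m: int, tile_m: int) -> int:
    """Round m up to the next multiple of tile_m (integer ceiling division)."""
    return -(-m // tile_m) * tile_m


def padding_waste(m: int, tile_m: int) -> float:
    """Fraction of processed rows that are padding rather than real data."""
    padded = padded_rows(m, tile_m)
    return (padded - m) / padded


BATCH = 16
TILE_M = 128  # hypothetical tile height, for illustration only

# Sequence lengths from 5 to 128 in increments of 19, as in the experiment.
for seq_len in range(5, 129, 19):
    m = BATCH * seq_len
    print(seq_len, m, padded_rows(m, TILE_M), f"{padding_waste(m, TILE_M):.1%}")
```

Under this assumed tile height, shapes whose $M$ is not a multiple of the tile waste part of every kernel launch, which is one of the two sources of slowdown named above.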

2.3. Hardware-Driven Approach: Opportunities and Challenges

The mismatch between runtime-used and offline-sampled shapes diminishes the efficiency of traditional sample-driven methodology in dynamic-shape compilation. Such methodologies not only fall short in delivering high performance but also result in considerable tuning overhead. However, we observe that generating the sample list for tuning micro-kernels is not mandatory for supporting the execution of a dynamic tensor program. Sample-driven methodology treats the hardware platform as a black box, where micro-kernels are only tuned based on performance feedback of sampled shapes. This approach overlooks the rich vein of prior knowledge available within the hardware itself.

Figure 4 demonstrates the architecture of current mainstream hardware deployments, such as central processing units (CPUs) (intel_xeon_whitepaper, ) and graphics processing units (GPUs) (a100_whitepaper, ). There exist inherent similarities among these hardware, each of which has a hierarchical structure. This hierarchy is distinctly multi-level, wherein each level comprises a predetermined quantity of computational or storage units. For instance, each CPU core has its own L1 cache and ALUs, while all the CPU cores share the L2/L3 cache and DRAM. As for GPUs, each streaming multiprocessor (SM) also has its own computing units (CUDA cores, Tensor Cores) and L1 Cache, while all SMs share the L2 cache and DRAM. At each level, the resources available for executing a dynamic tensor program operator are constrained by the inherent limitations of the hardware design.

To investigate the impact of hardware limits on kernel performance, we generate configurations with different resource usages for the commonly used matrix multiplication operator. We collect these configurations during the tuning process of Ansor (ansor, ). Figure 5 shows the floating point operations per second (FLOPS) and the corresponding resource usage of the collected configurations. As observed, operator performance declines sharply once the resources used exceed the hardware’s upper limit. This implies that configurations with hardware-unfriendly parameters consistently underperform, allowing us to preemptively prune inefficient configurations from the large strategy space. Consequently, this approach maintains a streamlined strategy space for micro-kernel generation and eliminates the need to create a predefined sample list.
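A minimal sketch of this kind of pruning is given below. It enumerates GEMM tile configurations and discards those whose estimated shared-memory footprint or thread count exceeds a per-block limit. The 48 KB shared-memory budget, the footprint formula, and all names are assumptions for illustration; they are not taken from the $Vortex$ implementation.

```python
from itertools import product

# Hypothetical per-block limits; real values depend on the target GPU.
MAX_SHARED_BYTES = 48 * 1024
MAX_THREADS_PER_BLOCK = 1024
BYTES_PER_ELEM = 4  # float32


def shared_bytes(tm: int, tn: int, tk: int) -> int:
    """Bytes needed to stage one A tile (tm x tk) and one B tile (tk x tn)."""
    return (tm * tk + tk * tn) * BYTES_PER_ELEM


def fits_hardware(tm: int, tn: int, tk: int, threads: int) -> bool:
    """Keep a configuration only if it stays within both assumed limits."""
    return (shared_bytes(tm, tn, tk) <= MAX_SHARED_BYTES
            and threads <= MAX_THREADS_PER_BLOCK)


sizes = [16, 32, 64, 128, 256]
thread_counts = [128, 256, 512, 1024, 2048]
all_configs = list(product(sizes, sizes, sizes, thread_counts))
kept = [cfg for cfg in all_configs if fits_hardware(*cfg)]
print(f"{len(kept)} of {len(all_configs)} configurations survive pruning")
```

No kernel is compiled or profiled here: the filter relies only on static hardware parameters, which is what makes the pruning independent of shape samples.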

Moreover, the dependency on hardware hierarchy motivates us to generate the micro-kernels in a bottom-up manner. As the hardware hierarchy enables us to prune inefficient configurations based on the hardware limits level by level, only a limited number of micro-kernels is ultimately kept to support the dynamism of the tensor program. This approach enables comprehensive support for all potential input shapes in dynamic tensor programming.

Summary

By exploiting the inherent structure of the hardware, we can not only optimize but also fundamentally rethink our approach to compilation strategy. This observation motivates us to design a hardware-aware dynamic-shape compiler that delivers highly efficient computational performance while maintaining flexibility.

3. Overview of $Vortex$

In this section, we describe the workflow of $Vortex$ , highlighting its key idea and optimization flow. $Vortex$ is an advanced operator-level compiler explicitly designed for optimizing the computation of dynamic-shape tensor programs.

Key Idea.

The key idea of $Vortex$ is its strategic use of hardware features to develop a hierarchical optimization flow, seamlessly integrating both offline and runtime stages. Initially, $Vortex$ systematically decomposes dynamic-shape tensor programs into multi-level subtasks through a unified abstraction, rKernel. Each hierarchical level leverages hardware parameters to create micro-kernels tailored to specific hardware needs. The process begins with the initial construction of micro-kernels at the lowest level, progressing to the final kernel selection during the runtime stage when shape information becomes available. The hardware-aware approach of $Vortex$ facilitates a bottom-up, multi-level compilation workflow. This approach leads to efficient optimization with lower runtime overhead and improved performance, which is ideal for dynamic-shape tensor programs.

Optimization Flow.

Figure 6 details the methodological framework of $Vortex$ . In the offline stage, for a given tensor program, $Vortex$ employs a top-down abstraction to transform tensor programs into a structured, hierarchical format that mirrors the hardware’s structure, which aligns the strategy space with hardware architecture. For instance, the upper-right section of Figure 6 illustrates the tensor program abstraction of a GPU-based kernel. Here, we can seamlessly align the terms “Grid”, “Block”, and “Warp” with the device, SM, and Tensor Cores, respectively. Then $Vortex$ employs a hardware-aware multi-level kernel constructor to develop a suite of hardware-friendly micro-kernels from the bottom level to the top. At each level, $Vortex$ utilizes a micro-kernel generator to identify candidates, then employs empirical or analytical analyzers to fine-tune the micro-kernels’ implementation. This approach offers a well-balanced compromise between performance and overhead, tailored specifically to each hardware tier. Then $Vortex$ generates a set of micro-kernels that are highly compatible with the hardware in each level. During the runtime stage, $Vortex$ employs a streamlined analytical module. This module quickly selects the appropriate candidate for the runtime input shapes, facilitating efficient execution on the hardware.

The benefits of this hardware-aware optimization workflow are two-fold. Primarily, $Vortex$ can generate hardware-friendly kernels that are universally capable of delivering high-performance computational support. Furthermore, the offline compilation phase’s independence from runtime shape samples significantly broadens the scope of support $Vortex$ can offer, as it reduces constraints and enhances flexibility.

Summary.

In summary, $Vortex$ presents an effective and efficient solution to the challenges of dynamic-shape tensor program computation. Its hardware-aware and multi-level process, blending offline preparation with runtime efficiency, ensures the system’s adaptability to complex dynamic-shape computational scenarios, guaranteeing high-performance support while maintaining low runtime overhead.

4. Strategy Space Hierarchization in $Vortex$

In this section, we introduce the key to $Vortex$ ’s workflow: strategy space hierarchization via top-down recursive decomposition. We start with GEMM as an example to illustrate the recursive execution pattern inherent in tensor programs. This example lays the foundation for presenting our unified abstraction, rKernel, designed to accommodate the recursive nature prevalent in diverse tensor programs and across multiple hardware hierarchies.

4.1. Top-Down Recursive Notation

In this subsection, we use the GEMM tensor program on GPUs as a paradigmatic example of recursive execution patterns on hierarchical hardware platforms. GEMM, a classic operator in deep learning, is mathematically defined as $C=A\times B$ , where $A$ and $B$ are the input matrices and $C$ is the output matrix.

As shown in Figure 7, it is natural to use recursive for-loops to represent GEMM execution flows. This approach breaks down the GEMM tensor program into a series of recursive loop sets. These loop sets map the high-level GEMM tensor program onto specific hardware levels, ranging from off-chip to on-chip memory and registers. Specifically, the uppermost recursion, OffChip_Mem_GEMM, processes end-to-end matrix blocks in off-chip memory. At intermediate levels, OnChip_Mem_GEMM handles smaller matrix blocks in on-chip memory. The innermost recursion, Register_GEMM, focuses on individual elements in registers, utilizing specific calculation instructions, such as Fused-Multiply-Add (FMA) (a100_whitepaper, ) instructions on GPUs.

This top-down recursive approach is intuitive and crucial for optimizing tensor programs on modern hardware architectures. It gives us an opportunity to decouple optimization space between different layers from each other, ensuring that each recursive layer is finely tuned to the unique capabilities and limitations of the corresponding hardware level, whether it be off-chip memory, on-chip memory, or registers. Furthermore, this approach provides a clear structure for optimizing different aspects of the tensor program independently at each level.
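The three-level decomposition can be sketched in plain Python as follows. The function names mirror OffChip_Mem_GEMM, OnChip_Mem_GEMM, and Register_GEMM from the text, but the bodies, the single square tile size, and the edge clipping are assumptions made for illustration; a real kernel would map each level to a distinct memory space rather than to nested Python loops.

```python
def register_gemm(A, B, C, i, j, k):
    """Innermost level: one multiply-accumulate on scalar operands."""
    C[i][j] += A[i][k] * B[k][j]


def onchip_gemm(A, B, C, i0, j0, k0, tile, M, N, K):
    """Middle level: iterate over one tile, clipping at the matrix edges."""
    for i in range(i0, min(i0 + tile, M)):
        for j in range(j0, min(j0 + tile, N)):
            for k in range(k0, min(k0 + tile, K)):
                register_gemm(A, B, C, i, j, k)


def offchip_gemm(A, B, tile=2):
    """Outermost level: walk the full matrices tile by tile."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0] * N for _ in range(M)]
    for i0 in range(0, M, tile):
        for j0 in range(0, N, tile):
            for k0 in range(0, K, tile):
                onchip_gemm(A, B, C, i0, j0, k0, tile, M, N, K)
    return C


A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
print(offchip_gemm(A, B))
```

Each function only needs the tile bounds handed down from the level above, which is the decoupling between levels that the paragraph above describes.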

4.2. Unified Abstraction Design

Algorithm 1 Unified Recursive Abstraction

procedure rKernel(L, PL, TSL, TRL, LF, SF)
    // L: Current hierarchical layer
    // PL: Set of parallel loops
    // TSL: Set of temporal spatial loops
    // TRL: Set of temporal reduction loops
    for each parallel loop $p \in PL[L]$
        for each temporal spatial loop $ts \in TSL[L]$
            for each temporal reduction loop $tr \in TRL[L]$
                Load_Func(L, p, ts, tr)
                rKernel(L-1, PL, TSL, TRL, LF, SF)
            Store_Func(L, p, ts)
end procedure

Recognizing the hierarchical nature of tensor program execution, we also acknowledge the variations between different hardware and tensor programs. Specifically, CPUs and GPUs exhibit distinct computation modes and memory access controls, where CPUs are optimized for multi-threaded parallelism, and GPUs excel in Warp, CTA, and Grid-level parallelism. Moreover, different tensor programs demonstrate unique loop patterns; for instance, the loop characteristics of Convolution markedly differ from those in GEMM.

These variations inspire our design of a unified abstraction, which delineates a universal approach for representing tensor program executions across various hardware platforms. Algorithm 1 elaborates on this abstraction. It maintains the layer-wise recursive structure as demonstrated in GEMM (Figure 7) as the core, and enables custom loop mapping and execution stages for various tensor programs.

To achieve a universal and customizable representation, we classify loops within each hierarchical level into three distinct sets in Algorithm 1. The Parallel Loop Set is designed for parallel execution; the Temporal Spatial Loop Set manages temporal non-reduction operations; and the Temporal Reduction Loop Set focuses on temporal reduction operations. Each level, identified as level N, abstracts the execution into three stages: Load, rKernel(N-1), and Store. These stages serve as flexible interfaces, allowing for tailored execution based on the specific requirements of each hardware level.
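The Load, rKernel(N-1), and Store stages can be sketched as a runnable recursion. Two points below are assumptions rather than part of Algorithm 1: the recursion bottoms out at level -1 by calling a hypothetical compute_fn, and the store stage runs once the reduction loop for a given (p, ts) pair has completed, since Store_Func does not take the reduction index.

```python
def r_kernel(level, pl, tsl, trl, load_fn, store_fn, compute_fn):
    """Recursive sketch of rKernel; level -1 is an assumed base case."""
    if level < 0:
        compute_fn()
        return
    for p in pl[level]:
        for ts in tsl[level]:
            for tr in trl[level]:
                load_fn(level, p, ts, tr)
                r_kernel(level - 1, pl, tsl, trl,
                         load_fn, store_fn, compute_fn)
            # Assumed placement: store after the reduction loop finishes.
            store_fn(level, p, ts)


trace = []
r_kernel(
    level=1,
    pl=[[0], [0]],        # one parallel iteration per level
    tsl=[[0], [0]],       # one temporal spatial iteration per level
    trl=[[0, 1], [0]],    # two reduction steps at level 0, one at level 1
    load_fn=lambda lvl, p, ts, tr: trace.append(f"load L{lvl}"),
    store_fn=lambda lvl, p, ts: trace.append(f"store L{lvl}"),
    compute_fn=lambda: trace.append("compute"),
)
print(trace)
```

Because the load and store stages are passed in as callbacks, the same recursion can be reused for different hardware levels by swapping only those two functions.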

Table 1. Complete representation for different hardware and levels via the rKernel abstraction. ‘-’ refers to ‘No Parallel Binding’ in the ‘Parallel Binding’ column and ‘No Operation’ elsewhere.

| HW. | Level | Parallel Binding | Load | Lower Level rKernel | Store |
|---|---|---|---|---|---|
| CPU | L0 | - | CacheBuf $\rightarrow$ Reg or GlobalMem $\rightarrow$ Reg | ALU Calc. | Reg $\rightarrow$ CacheBuf or Reg $\rightarrow$ GlobalMem |
| CPU | L1 | Thread | GlobalMem $\rightarrow$ CacheBuf or - | L0 rKernel | CacheBuf $\rightarrow$ GlobalMem or - |
| CPU | L2 | Process | - | L1 rKernel | - |
| GPU | L0 | Warp | SharedMem $\rightarrow$ Reg or GlobalMem $\rightarrow$ Reg | Cuda/Tensor Core Calc. | Reg $\rightarrow$ SharedMem or Reg $\rightarrow$ GlobalMem |
| GPU | L1 | CTA | GlobalMem $\rightarrow$ SharedMem or - | L0 rKernel | SharedMem $\rightarrow$ GlobalMem or - |
| GPU | L2 | Grid | - | L1 rKernel | - |

Table 1 illustrates how the unified abstraction rKernel is implemented across different hardware configurations, highlighting its versatility. Our focus lies in scrutinizing recursive execution patterns among different hierarchies. For CPUs, at the lowest level (L0), the rKernel abstraction allows for direct data transfer from Global memory to Registers or from CacheBuffer to Registers, depending on the specific needs of the computation. A “CacheBuffer” is defined as a memory buffer, sized within the L2 cache limits, so that its contents remain cached in the L2 cache for efficient data access and processing (onednn, ). Store operations at this level reflect the same adaptability, offering the choice of transferring data back to Global memory or to CacheBuffer. At level L1, the abstraction provides options for either transferring data from Global memory to CacheBuffer or performing no operation. The highest level (L2) in CPUs focuses on the multi-thread mechanism at the process level, capitalizing on the CPU’s capabilities for multi-core parallel processing.

Similarly, for GPUs, including both CUDA Cores and Tensor Cores, rKernel adapts to different operational requirements. At L0, data can be loaded from either Global memory or Shared Memory to Registers and, similarly, stored back to either Global memory or Shared Memory. This flexibility is crucial for optimizing memory utilization in the highly parallel environments of GPUs. At the L1 level, similar to CPUs, the abstraction can facilitate data transfer from Global to Shared Memory or no operation, enabling efficient resource management. The L2 level focuses on Grid-level operations, enhancing scalability across the GPU’s multiple streaming multiprocessors (SMs).

rKernel achieves a hierarchical abstraction of execution patterns that applies universally to various tensor programs and hardware types. This approach ensures a tailored strategy space for each hardware hierarchy level and facilitates universal optimizations for dynamic-shape tensor programs.

5. Detailed Designs at Each Level

In this section, we thoroughly explore $Vortex$ ’s detailed designs at each hierarchical level. Initially, $Vortex$ utilizes an effective approach to generate hardware-aware candidates at each level. Following this, a hybrid analytical-empirical analyzer is deployed to discern the high-performance implementation for each candidate.

5.1. Bottom-up Hardware-aware Candidates Generator

This subsection introduces an innovative bottom-up method for generating candidates tailored to align with top-down recursive abstraction. This method becomes particularly crucial when execution parameter information is lacking, posing a significant challenge in every hierarchical layer for identifying suitable shape candidates for micro-kernel creation. Our approach centers on two key processes:

Firstly, we utilize the hardware’s parameter information to determine constraints for the candidate range. As highlighted in §2.3, our empirical studies have identified a notable decrease in hardware execution efficiency when the utilization at any hardware level is extremely low or high. This insight allows us to deduce a feasible range for candidate shapes, based on hardware utilization metrics.

Secondly, we follow a key design principle: ensuring the shape size of candidates in an upper layer is an integer multiple of the shape size in the lower layer. This approach aims to minimize padding loss during the construction of micro-kernels. As shown in Figure 8, if the sizes at one level are not multiples of those below, it results in more padding losses and inefficiencies at higher levels. Conversely, constructing candidates as integer multiples from one level to the next predominantly confines padding loss only to the outermost execution level, adhering to runtime requirements.
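The effect of the integer-multiple principle can be illustrated along a single dimension with two tiling levels. In the sketch below, the runtime extent is padded to whole upper-level tiles, and each upper-level tile is in turn padded to whole lower-level tiles. The sizes 190, 48, 40, and 16 are arbitrary values chosen for illustration, and the helper names are hypothetical.

```python
def round_up(x: int, unit: int) -> int:
    """Round x up to the next multiple of unit."""
    return -(-x // unit) * unit


def padded_extent(m: int, upper: int, lower: int) -> int:
    """Elements processed along one dimension under two tiling levels."""
    n_upper = round_up(m, upper) // upper       # padding at the outer level
    return n_upper * round_up(upper, lower)     # padding inside each tile


M = 190      # runtime extent (illustrative)
LOWER = 16   # lower-level tile size (illustrative)

aligned = padded_extent(M, 48, LOWER)     # 48 is an integer multiple of 16
misaligned = padded_extent(M, 40, LOWER)  # 40 is not a multiple of 16
print(aligned - M, misaligned - M)
```

With the aligned upper tile, the only padding comes from rounding the runtime extent at the outermost level. With the misaligned tile, every upper-level tile also carries padding of its own, so the loss accumulates.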

Algorithm 2 Candidates Generation Algorithm.

function GenerateCandidatesForLayer(L)
    hwInfo $\leftarrow$ GetHardwareInfo(L)
    // Determine by hardware resource limitation
    cands $\leftarrow$ InitCands(hwInfo)
    if L = 0 then
        cands $\leftarrow$ FilterByISA(cands)
    else
        prevCands $\leftarrow$ GetPrevLayerCands(L-1)
        cands $\leftarrow$ FilterByMultiples(cands, prevCands)
    return cands
end function

function FilterByISA(cands)
    filtered $\leftarrow \emptyset$
    for $cand \in cands$
        if IsCompatible(cand) then
            filtered.add(cand)
    return filtered
end function

function FilterByMultiples(cands, prevCands)
    filtered $\leftarrow \emptyset$
    map $\leftarrow$ an empty map
    for $prev \in prevCands$
        multiples $\leftarrow$ GenerateMultiples(prev, cands)
        for $multiple \in multiples$
            filtered.add(multiple)
            map[multiple].append(prev)
    return filtered, map
end function

The overall process is detailed in Algorithm 2. The core function, GenerateCandidatesForLayer, operates differently depending on the hierarchical layer it addresses. It begins by acquiring the hardware specifications for a given layer (denoted as L). This is achieved through the GetHardwareInfo and InitCands functions, which retrieve the hardware constraints that determine the parameter space. Candidate feasibility is validated by checking memory usage against layer-specific limits and by accounting for hardware constraints, such as a GPU’s maximum of 1024 threads per block. For the initial layer (L = 0), the function employs FilterByISA to refine the candidate set according to the hardware platform’s Instruction Set Architecture (ISA) compatibility. For instance, on Intel CPUs, the FilterByISA function considers the granularity constraints of AVX512 (intel_xeon_whitepaper, ). Similarly, on GPUs with Tensor Cores, the function assesses the constraints imposed by the Matrix Multiply-Accumulate (MMA) instruction (a100_whitepaper, ). These considerations ensure alignment with hardware capabilities.

Our algorithm utilizes FilterByMultiples, a method inspired by the classic sieve approach (sieve, ), to filter candidates. This function iterates over the previous layer’s candidates (prevCands) and uses GenerateMultiples to generate their multiples within the current layer’s candidate range. This approach ensures comprehensive exploration of viable parameter sets and maintains filtering efficiency. Additionally, we maintain a table that records the links between each candidate in the current layer and its matching candidates in the previous layer, which is crucial for the subsequent analysis module.
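The sieve-style filtering and the link table can be sketched as follows. This is a minimal one-dimensional illustration under stated assumptions, not the paper's implementation: candidates are plain integers rather than multi-dimensional tile shapes, the feasible range is passed in as a hypothetical `(lo, hi)` pair instead of being derived from hardware information, and the function names merely mirror those in Algorithm 2.

```python
from collections import defaultdict


def generate_multiples(prev: int, lo: int, hi: int) -> list[int]:
    """All integer multiples of `prev` inside the inclusive range [lo, hi]."""
    first = max(1, -(-lo // prev))  # smallest k with k * prev >= lo
    return [k * prev for k in range(first, hi // prev + 1)]


def filter_by_multiples(cand_range: tuple[int, int], prev_cands: list[int]):
    """Keep only sizes that are multiples of some lower-layer candidate.

    Returns the surviving candidates and a table linking each of them
    to the lower-layer candidates it can be built from.
    """
    lo, hi = cand_range
    filtered = set()
    links = defaultdict(list)
    for prev in prev_cands:
        for multiple in generate_multiples(prev, lo, hi):
            filtered.add(multiple)
            links[multiple].append(prev)
    return sorted(filtered), dict(links)


# Hypothetical lower-layer candidates and feasible range for this layer.
cands, links = filter_by_multiples((16, 64), [8, 16])
print(cands)      # sizes that survive the filter
print(links[32])  # lower-layer candidates that 32 can be built from
```

A candidate such as 32 is linked to both 8 and 16, which is exactly the situation the analyzer in §5.2 has to resolve: one upper-layer candidate, several possible lower-layer implementations.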

5.2. Hybrid Analytical-Empirical Analyzer

In this subsection, we present a novel analysis method that combines analytical and empirical methodologies. This approach is designed to balance efficiency and accuracy in strategy analysis.

Goal of the Analyzer.

Our analyzer aims to identify suitable candidates at each layer. Using Algorithm 2, we construct a map that describes the connections between candidates in connected layers. However, a single candidate can map to multiple lower-level candidates in this table. Each mapping corresponds to a unique implementation of the scheduling strategy, necessitating a thorough evaluation of the performance variability among these strategies. This task of identifying the optimal strategy $c^{*}$ from a set $S$ at a given hardware level $L$ can be defined as an optimization problem:

$$c^{*}=\underset{s\in S}{\arg\min}\,\operatorname{Cost}(s,L) \tag{1}$$

The analyzer’s primary objective is to identify a cost-effective and efficient strategy that aligns with the requirements at both offline and runtime stages. Concurrently, it is crucial to be aware of the time overhead associated with cost analysis, since excessive overhead, particularly during runtime, is clearly unacceptable. This necessitates a sophisticated design approach, blending empirical profiling on actual hardware with comprehensive theoretical analysis.

Analytical Cost Model.

We build an analytical cost model, a theoretical framework that predicts the costs of different candidate implementations. The model estimates execution time from algorithmic complexity and hardware specifications. As shown in Figure 9, the model centers on two determinants: $Spatial$ , representing the amplification factor due to parallel loops executed across different hardware units, and $Temporal$ , relating to the execution process of serially dependent loop operations.

The temporal execution cost, $T_{temporal}$ , is carefully crafted to encapsulate the intricacies of pipeline execution within serial loops. It is quantified as follows:

$$T_{temporal}=T_{Load}+\bigl(\text{sizeof}(\text{TemporalLoop})-1\bigr)\times\max\bigl(T_{Load},Cost_{L-1}\bigr)+Cost_{L-1}+T_{Store} \tag{2}$$

In this equation, $T_{Load}$ and $T_{Store}$ are the costs of loading and storing data, respectively. They are calculated as the amount of data moved at the current layer divided by the memory bandwidth at that layer. $Cost_{L-1}$ represents the cost of the micro-kernel computation at the lower level. The equation models a pipelined loop: the first load and the final computation and store are not overlapped, while each of the remaining $\text{sizeof}(\text{TemporalLoop})-1$ steps costs the larger of the load time and the lower-level kernel time, so the slower of the two becomes the pipeline bottleneck.

In parallel processing, the cost is modulated by the parallel loop’s scale relative to the hardware’s unit capacity:

$$F_{parallel}=\left\lceil\frac{\text{sizeof}(\text{ParallelLoop})}{|\text{HardwareUnit}|}\right\rceil \tag{3}$$

This factor is the number of rounds needed to execute the parallel loop on the available hardware units, so the execution cost scales with the parallel loop size. The overall strategy cost at layer L, $Cost_{L}$ , is the product of the parallel factor $F_{parallel}$ and the temporal execution cost $T_{temporal}$ , capturing the total time required to execute a computational task at that layer:

$$Cost_{L}=F_{parallel}\times T_{temporal} \tag{4}$$

Our analytical cost model is recursive: computing $Cost_{L}$ requires $Cost_{L-1}$ at each level L. Furthermore, hardware optimizations such as instruction pipelining and out-of-order execution can lead to substantial inaccuracies in the cost model, making it challenging to ensure the precision of the analysis module (micro23_path, ).
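Equations 2 to 4 and their recursion over levels can be written out as a short sketch. The class `LayerStrategy`, its field names, and all numeric values are hypothetical and chosen only for illustration; `base_cost` stands in for a lowest-level cost obtained elsewhere (for example by profiling), and time and data units are arbitrary.

```python
from dataclasses import dataclass


@dataclass
class LayerStrategy:
    temporal_iters: int   # sizeof(TemporalLoop)
    parallel_iters: int   # sizeof(ParallelLoop)
    hardware_units: int   # |HardwareUnit|
    load_bytes: float     # data loaded per step at this layer
    store_bytes: float    # data stored at this layer
    bandwidth: float      # bytes per time unit at this layer


def temporal_cost(s: LayerStrategy, lower_cost: float) -> float:
    """Equation 2: pipelined serial loop over the lower-level kernel."""
    t_load = s.load_bytes / s.bandwidth
    t_store = s.store_bytes / s.bandwidth
    return (t_load
            + (s.temporal_iters - 1) * max(t_load, lower_cost)
            + lower_cost
            + t_store)


def layer_cost(s: LayerStrategy, lower_cost: float) -> float:
    """Equations 3 and 4: parallel factor times temporal cost."""
    f_parallel = -(-s.parallel_iters // s.hardware_units)  # ceiling division
    return f_parallel * temporal_cost(s, lower_cost)


def total_cost(strategies: list[LayerStrategy], base_cost: float) -> float:
    """Apply the recursion from the lowest analytical level upward."""
    cost = base_cost
    for s in strategies:
        cost = layer_cost(s, cost)
    return cost


l1 = LayerStrategy(temporal_iters=4, parallel_iters=10, hardware_units=4,
                   load_bytes=64, store_bytes=32, bandwidth=32)
l2 = LayerStrategy(temporal_iters=2, parallel_iters=1, hardware_units=1,
                   load_bytes=32, store_bytes=32, bandwidth=32)
print(layer_cost(l1, 3.0))        # 3 rounds * (2 + 3*3 + 3 + 1) = 45.0
print(total_cost([l1, l2], 3.0))  # 1 + 1*45 + 45 + 1 = 92.0
```

Because each level consumes only the single number produced by the level below, evaluating a candidate is a handful of arithmetic operations, which is what makes the model cheap enough to run at runtime.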

Hybrid Analyzer Design.

Two key observations guide our design. Firstly, the bottom-up multi-level approach to kernel construction increases the number of candidates at each higher layer. Secondly, hardware scheduling effects that are hard to predict, such as out-of-order execution, predominantly affect the lower layers. These insights led to the development of our hybrid analytical-empirical analyzer.

The analyzer conducts empirical profiling on CPUs at level L0 and on GPUs at both L0 and L1 levels. For higher levels, it utilizes an analytical cost model. This hybrid system synergizes the efficiency of the analytical approach with the accuracy of empirical data, with the latter offering real-time performance insights to augment the analytical predictions. Importantly, all runtime analyses are conducted using the analytical model, ensuring a streamlined and low-overhead performance evaluation. The effectiveness of this hybrid methodology, in terms of performance and runtime overhead, is further investigated in §7.4. Overall, this hybrid approach is especially valuable in complex scenarios where theoretical models may be insufficient, ensuring both effectiveness and precision in the strategy analysis process.
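The division of labor between the two methods can be summarized in a few lines. This is only a sketch of the policy stated above (profiling at L0 on CPUs and at L0 and L1 on GPUs, analytical elsewhere); the names `EMPIRICAL_LEVELS`, `analyzer_for`, and `profile` are hypothetical, and the timing helper is a generic best-of-N wall-clock measurement rather than Vortex's profiler.

```python
import time

# Levels analyzed by profiling on real hardware, per platform.
EMPIRICAL_LEVELS = {"cpu": {0}, "gpu": {0, 1}}


def analyzer_for(platform: str, level: int) -> str:
    """Return which analysis method the hybrid policy assigns to a level."""
    return "empirical" if level in EMPIRICAL_LEVELS[platform] else "analytical"


def profile(kernel, repeats: int = 5) -> float:
    """Best-of-N wall-clock time of a callable, in seconds (offline use only)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        kernel()
        best = min(best, time.perf_counter() - start)
    return best


for platform in ("cpu", "gpu"):
    print(platform, [analyzer_for(platform, level) for level in range(3)])
```

Profiled costs are gathered once offline; at runtime only the analytical path runs, using those stored measurements as its lowest-level inputs.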

6. Implementation

In this section, we detail the implementation of $Vortex$ . We focus on demonstrating the code generation method, and the scheduling process during the runtime phase.

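enum ANALYZE_TYPE {empirical, analytical};
enum LOOP_TYPE {PL, TSL, TRL};
class axis;
class layer_meta_info {
  int layer_depth;                 // position of this layer in the hierarchy
  map<axis, LOOP_TYPE> loop_type;  // loop type assigned to each axis
  ANALYZE_TYPE analyzer;           // analysis method used for this layer
  func* load_func;                 // data load stage
  func* store_func;                // data store stage
  func* compute_func;              // compute stage
};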

Figure 10. The definition of rKernel.

6.1. Code Generation

Despite the contrasting architectures of GPUs and CPUs, $Vortex$ ’s abstraction consistently represents both, as exemplified in Table 1. One of the key aspects contributing to this universal representation is the definition of data structures.

The rKernel data structure, as illustrated in Figure 10, is a cornerstone of $Vortex$ , tailored to encapsulate and streamline the complex processes involved. At its core lies the layer_meta_info class, pivotal for orchestrating the optimization strategy of each hierarchical layer. Within this class, the layer_depth attribute determines the layer’s position within the hierarchical structure, which is a crucial factor for the recursive optimization process. Moreover, the map<axis, LOOP_TYPE> provides a strategic mapping of loops to their respective types, including Spatial (S), Temporal Parallel (TP), and Temporal Reduction (TR). This nuanced approach to loop optimization aligns seamlessly with the unique characteristics of each tensor program. The ANALYZE_TYPE enum, encompassing empirical and analytical options, facilitates the selection of the appropriate optimization analysis method. Finally, the function pointers (load_func, store_func, and compute_func) are critical in dynamically managing the various stages of computation. This data structure encapsulates the necessary elements to navigate dynamic-shape tensor program optimization challenges adeptly.

// Input IR, we omit block and grid for brevity
layer_meta_info gemm_tc_warp;
gemm_tc_warp.set(
  layer_depth = 0,
  loop_type = {"k0":TRL,"m0":TSL,"n0":TSL},
  analyzer = empirical,  // profiling-based analysis at this layer
  load_func = load(shared_to_reg),
  store_func = store(reg_to_shared),
  compute_func = asm("mma.sync.m16n8k16")
);

// Output Generated Kernel
dim3 grid(M/m_tile_grid, N/n_tile_grid);
gemm_tensor_core_grid<<<grid, thread>>>(...);

__global__ void gemm_tensor_core_grid(...) {
  __shared__ half A_buf[], B_buf[], C_buf[];
  for (k2 = 0; k2 < K; k2+=k_tile_grid)
    for (k1 = 0; k1 < k2; k1+=k_tile_block)
      asm("ld.global"); // Load A/B to A_buf/B_buf
      for (m0 = 0; m0 < m1; m0+=m_tile_warp)
        for (n0 = 0; n0 < n1; n0+=n_tile_warp)
          asm("ld.shared"); // Load A_buf/B_buf to A_reg/B_reg
          C_frag = 0;
          for (k0 = 0; k0 < k1; k0+=k_tile_warp)
            asm("mma.sync.m16n8k16");
            C_frag += ...
          asm("st.shared"); // Store C_reg to C_buf
  asm("st.global"); // Store C_buf to C
}

Figure 11. An illustration of GPU GEMM code generation.

rKernel serves as a recursive template for dynamic tensor programs. For a given hardware platform and operator, users need to initialize the layer_meta_info of each level. The development effort required for this task is small because, for both CPU and GPU, we set the number of hierarchy levels to three. Furthermore, the dynamic parameters of the template enable consistent computational support for various runtime shapes, so the development cost is amortized across computing scenarios. We built $Vortex$ on top of TVM (tvm, ). We use the ‘tensorize’ primitive to implement the load, store, and compute functions and rely on TVM’s code generation capabilities to target both CPU and GPU platforms. Figure 11 illustrates a representative GEMM kernel code generation for GPUs.

6.2. Integration of Offline and Runtime

The integration of offline and runtime components in $Vortex$ involves several key steps. During runtime, $Vortex$ employs the analytical cost model to estimate the execution cost of each candidate using the runtime shape information. Subsequently, $Vortex$ selects the micro-kernel candidates with the lowest estimated execution cost.

Notably, our selection process also adapts to the hardware units available on a platform. For instance, on GPUs, the coarse granularity of the MMA instruction (a100_whitepaper, ) can introduce substantial padding on Tensor Cores, so a single fixed backend is not always the best choice. We provide implementations for both CUDA Cores and Tensor Cores, allowing $Vortex$ to choose the appropriate backend adaptively based on the runtime input shapes, further optimizing execution efficiency. The performance benefits of this adaptive strategy are explored in §7.4. Additionally, by combining the computational shape of the selected micro-kernels with the runtime shape, we compute runtime-specific details such as grid configurations. This approach keeps the framework general while supporting efficient and adaptable runtime operation.
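The padding effect and the grid computation can be illustrated with a small sketch. It does not reproduce Vortex's backend selection, which relies on its cost model; it only shows, under assumed tile sizes, how much of the padded work is useful and how grid dimensions follow from the runtime shape. The function names and the tile sizes (taken from the m16n8k16 MMA shape for illustration) are assumptions.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return -(-a // b)


def grid_config(m: int, n: int, m_tile: int, n_tile: int) -> tuple[int, int]:
    """Number of tiles needed along M and N for a runtime shape."""
    return ceil_div(m, m_tile), ceil_div(n, n_tile)


def utilization(m: int, n: int, m_tile: int, n_tile: int) -> float:
    """Fraction of the padded output that holds real data."""
    grid_m, grid_n = grid_config(m, n, m_tile, n_tile)
    return (m * n) / (grid_m * m_tile * grid_n * n_tile)


# With a 16-row tile, a single-row input wastes 15 of every 16 rows.
print(grid_config(1, 1024, 16, 8), utilization(1, 1024, 16, 8))
print(grid_config(16, 1024, 16, 8), utilization(16, 1024, 16, 8))
```

For M = 1 only 1/16 of the padded tile carries useful work, which is the kind of case where a backend with finer granularity can be preferable despite a lower peak throughput.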

7. Evaluation

7.1. Experimental Setup

Platforms.

We evaluate $Vortex$ on two representative platforms: Intel 8255c CPU (intel_xeon_whitepaper, ) and Nvidia Ampere A100 GPU (a100_whitepaper, ). For GPUs, we conduct evaluations using two computational modes: Tensor Core Enabled mode with half-precision floating-point (FP16) data type, and Cuda Core Only mode with single-precision floating-point (FP32) data type. Table 2 details our experimental platforms.

Table 2. Hardware specifications.

Hardware	Nvidia GPU	Intel CPU
Version	Ampere A100 (108 SMs)	Xeon 8255c (48 Cores)
Storage	Global: 40G; L2 Cache: 40M; Shared Memory: 48K/SM; Reg: 256K/SM	Global: 250.53G; L3 Cache: 35.75M; L2 Cache: 1M/Core; L1 Cache: 32K/Core; Reg: 2K/Core
Peak Flops	CUDA Core: 19.5 TFlops; Tensor Core: 312 TFlops	7344 GFlops
Software	CentOS Linux 8.4.2105; Driver Version: 450.156.00; CUDA Version: 11.8; cuDNN: 8.9.7.29	CentOS Linux 8.6.2205; GCC: 10.2.0; LLVM: 15.0.3

Benchmarks.

Our evaluation encompasses two benchmark categories: operator-level and model-level. At the operator level, we gather 1197 different operator configurations from DeepBench (deepbench, ) and real-world models, covering various tasks like Transformer, CNN, and GNN (see Table 3 and Table 4 for details). These configurations vary in all dimensions and span a wide dynamic range, making them highly representative. At the model level, we assess the performance of three Transformer-based language models (BERT (bert, ), BERT-Large (bert, ), GPT-2 (gpt2, )) and three computer vision models (AlexNet (alexnet, ), ResNet (resnet, ), GoogleNet (googleNet, )) for end-to-end dynamic-shape neural network evaluation. To mirror real-world scenarios, we generate 17 sequence lengths ranging from 1 to 476 for language models. For CNN models, we use a batch size of 1 followed by batch sizes from 4 to 64 in steps of 4.

Table 3. Benchmarked GEMM with dynamic shapes.

Category	M	N	K	#Cases
DeepBench (deepbench, )	[35, 8448]	[1, 6000]	[128, 500000]	84
Transformer (bert, ; gpt2, ; huggingface, )	[1, 476]	[768, 4096]	[768, 4096]	192
CNN (alexnet, ; vgg, ; googleNet, ; resnet, )	[1, 128]	[80, 25088]	[10, 4096]	80
GNN (GCN, ; GAT, ; GNNAdvisor, ; pyg, )	[2708, 1888584]	[2, 121]	[8, 3703]	150

Table 4. Benchmarked Convolution with dynamic shapes.

Category	BS	Fmap	Filter	Cin	Cout	#Cases
DeepBench (deepbench, )	[1,16]	[7,700]	[1,20]	[1,2048]	[16,2048]	107
CNN (alexnet, ; resnet, ; vgg, ; googleNet, )	[1,64]	[4,768]	[1,11]	[3,832]	[16,512]	584

Baselines.

We select various SOTA baselines, divided into two principal categories. The first category encompasses vendor-provided libraries, which are specialized libraries frequently utilized in neural network frameworks (pytorch, ). For NVIDIA GPUs, our evaluation utilizes cuBLAS (cublas, ) for GEMM and cuDNN (cudnn, ) for convolution. We also evaluate CUTLASS (cutlass, ) for both tasks. For Intel CPUs, we compare GEMM and convolution performance with oneDNN (onednn, ) and ONNX Runtime (onnxruntime, ). The second category is dynamic-shape compilers, for which we select DietCode (dietcode, ), the existing leading dynamic-shape tensor program compiler.

Table 5. Summary of operator-level speedups for $Vortex$ compared to various baselines across different setups.

Hardware Config	Operator	Baseline	Cases with Speedup > 1 (%)	Average Speedup
CPU	GEMM	oneDNN	77.3%	$1.82\times$
CPU	GEMM	ONNX Runtime	91.5%	$4.38\times$
CPU	Conv.	oneDNN	85.8%	$2.09\times$
CPU	Conv.	ONNX Runtime	99.1%	$5.37\times$
GPU (Tensor Core Enabled)	GEMM	cuBLAS	83.7%	$1.43\times$
GPU (Tensor Core Enabled)	GEMM	CUTLASS	94.2%	$2.62\times$
GPU (Tensor Core Enabled)	Conv.	cuDNN	89.9%	$2.32\times$
GPU (Tensor Core Enabled)	Conv.	CUTLASS	80.5%	$1.70\times$
GPU (Cuda Core Only)	GEMM	cuBLAS	78.3%	$1.63\times$
GPU (Cuda Core Only)	GEMM	CUTLASS	99.8%	$7.65\times$
GPU (Cuda Core Only)	GEMM	DietCode	94.1%	$2.67\times$
GPU (Cuda Core Only)	Conv.	cuDNN	91.1%	$1.53\times$
GPU (Cuda Core Only)	Conv.	CUTLASS	87.8%	$2.88\times$
GPU (Cuda Core Only)	Conv.	DietCode	92.5%	$3.39\times$

7.2. Dynamic-Shape Tensor Program

In this subsection, we evaluate individual dynamic-shape tensor programs, specifically GEMM and Convolution operators on CPU and GPU platforms. Note that DietCode is limited to GPU CUDA Cores and requires dynamic shape samples to be determined in advance. We use the parameters from Table 3 and Table 4 as sample sets for DietCode’s offline compilation process. Importantly, the latency measurement for $Vortex$ encompasses both the operator execution time on the hardware platforms and the runtime overhead from $Vortex$’s cost model.

The evaluation results, shown in Figure 12, demonstrate $Vortex$’s performance across various configurations. The x-axis shows the number of floating-point operations (FLOPs) in the workloads, including all GEMM test cases from Table 3 and convolution test cases from Table 4, while the y-axis represents the speedups. $Vortex$ achieves speedups across different hardware setups, operators, and tensor shapes. To further quantify the effectiveness of $Vortex$ , we emphasize two metrics: the percentage of cases where $Vortex$ shows performance improvement (defined as cases where the speedup is greater than one) and the average speedup across all cases. Table 5 presents these detailed results. Overall, this evaluation confirms that $Vortex$ provides robust and efficient acceleration.
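The two metrics can be stated precisely in code. The sketch below assumes per-case latencies are available as two parallel lists and that the average is an arithmetic mean of per-case speedups; the text does not specify the averaging method, so that choice, the function name `summarize`, and the sample latencies are all assumptions.

```python
def summarize(baseline_latency: list[float], vortex_latency: list[float]):
    """Return (percentage of cases with speedup > 1, average speedup)."""
    speedups = [b / v for b, v in zip(baseline_latency, vortex_latency)]
    improved = sum(1 for s in speedups if s > 1.0)
    pct_improved = 100.0 * improved / len(speedups)
    avg_speedup = sum(speedups) / len(speedups)  # arithmetic mean (assumed)
    return pct_improved, avg_speedup


# Made-up latencies for four test cases, in arbitrary time units.
baseline = [2.0, 3.0, 1.0, 4.0]
vortex = [1.0, 1.5, 2.0, 2.0]
print(summarize(baseline, vortex))  # (75.0, 1.625)
```

Reporting both numbers matters because a high average can hide regressions on individual shapes, while the percentage alone says nothing about the size of the gains.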

Table 6. Speedups of $Vortex$ over DietCode for GEMM on GPU across different runtime ranges of M dimension.

96 Test Cases: M $\in$ [1,384], N = 768, K = 2304
Input Range for M	[0, 128)	[128, 256)	[256, 384)
Avg. Speedups	2.8x	1.4x	2.1x

Additionally, we examine how DietCode’s performance depends on its compilation samples. For the experiment in Table 6, we treat the M dimension as dynamic in DietCode and sample and compile it only within the range [128, 256). The average speedup of $Vortex$ over DietCode is 1.4x inside this sampled range but rises to 2.8x and 2.1x for the ranges below and above it. DietCode’s performance thus declines when runtime shapes deviate from its samples, highlighting its limited flexibility.

7.3. Dynamic-Shape Network

Figure 13 presents a comprehensive evaluation of the end-to-end performance of classic language models and CNN models. We compare $Vortex$ against existing state-of-the-art solutions, using oneDNN and cuBLAS/cuDNN as the normalized baselines for CPU and GPU evaluations, respectively.

$Vortex$ demonstrates significant performance improvements across various tasks. $Vortex$ achieves average speedups of $2.91\times$ for BERT, $2.63\times$ for BERT-Large, and $2.94\times$ for GPT-2, across different baselines and hardware configurations. For CNNs, $Vortex$ achieves average speedups of $2.01\times$ for AlexNet, $2.13\times$ for ResNet, and $3.24\times$ for GoogleNet. Furthermore, $Vortex$ improves on each of the existing solutions. Specifically, $Vortex$ achieves average speedups of $1.73\times$ , $4.26\times$ , $1.43\times$ , $1.71\times$ , $3.32\times$ , and $4.13\times$ over oneDNN, ONNX Runtime, cuBLAS, cuDNN, CUTLASS, and DietCode, respectively.

The evaluation results underscore that the performance of $Vortex$ varies with different hardware and models, reflecting the distinct execution characteristics inherent to each environment. Notably, $Vortex$ consistently outperforms the baseline across a wide range of input shapes and model types, demonstrating its exceptional adaptability and efficiency in real-world dynamic-shape DNN computation.

7.4. Additional Analysis

Offline Overhead Analysis.

We first analyze the offline overhead. For diverse tensor shapes, $Vortex$ requires only a single compilation process, substantially reducing compilation overhead. For instance, in the GEMM evaluation across three computation modes (CPU, GPU with Tensor Core Enabled mode, and GPU with Cuda Core Only mode), $Vortex$ employs the candidates generation algorithm (§5.1) to yield 17731, 392, and 2332 distinct candidates, respectively. The corresponding time overheads are 29.3s, 92.2s, and 529.6s. In contrast, DietCode incurs a tuning duration of 25 hours in the Cuda Core Only mode, using configurations in Table 3 as the sample set. $Vortex$ thus achieves a $174\times$ improvement in compilation efficiency compared to DietCode. This reduced overhead is attributable to the hardware-aware pruning of candidates and the use of analytical cost analysis, which together remove the need for extensive profiling.

Runtime Overhead Analysis.

The primary source of runtime overhead in $Vortex$ is the computation required by the cost model. Figure 14 breaks down $Vortex$’s execution on GPUs into the runtime scheduling cost and the execution time of the final tensor program for different shapes. The GEMM test is conducted with M/N/K values ranging from 64 to 4096. The runtime overhead is small across various hardware platforms, demonstrating the runtime efficiency of $Vortex$ .

Hierarchical Kernel Construction Evaluation.

To validate the effectiveness of $Vortex$’s dynamic and hierarchical kernel construction methodology, we compare its default configuration against three variants: Vortex-Oracle, which uses $Vortex$ as a static-shape compiler with a profiling-based analyzer for all layers in every test case from Table 3; Vortex-Static1, which maintains dynamic strategies at the L1 layer and adopts a static configuration for the L0 layer, selecting the most frequently optimal strategy; and Vortex-Static2, which disables dynamic strategy selection at both the L0 and L1 layers, applying the same fixed strategy as Vortex-Static1. As shown in Figure 15, $Vortex$ achieves an average of 94.7% of the performance of Vortex-Oracle. Meanwhile, Vortex-Static1 and Vortex-Static2 achieve 60.7% and 49.5% of Vortex-Oracle’s performance, respectively. These experimental results underscore the effectiveness of $Vortex$’s dynamic code generation and emphasize the importance of maintaining dynamic strategies across varying hardware hierarchies.

Table 7. Comparison of $Vortex$’s default configuration with the modified analyzer setup. 'E' denotes layers using the empirical method, while unlabeled layers use the analytical method.

HW.	Analyzer Config.	Offline Overhead	Execution Performance
CPU	Default (E: L0)	29.3 sec	$1\times$
CPU	Changed (E: L0, L1)	33.0 hour	$1.04\times$
GPU (Tensor Core Enabled)	Default (E: L0, L1)	92.2 sec	$1\times$
GPU (Tensor Core Enabled)	Changed (E: L0)	19.4 sec	$0.84\times$
GPU (Cuda Core Only)	Default (E: L0, L1)	529.6 sec	$1\times$
GPU (Cuda Core Only)	Changed (E: L0)	39.1 sec	$0.63\times$

Hybrid Analyzer Evaluation.

We conduct a study to assess the effectiveness of the hybrid analyzer in $Vortex$ . We compare the default $Vortex$ with a modified configuration, as detailed in Table 7, collecting the offline compilation overhead and the average runtime performance for all cases in Table 3. On the CPU, additionally profiling the L1 layer raises the offline overhead from 29.3 seconds to 33.0 hours while improving performance by only $1.04\times$ . Conversely, on the GPU, replacing L1 profiling with the analytical model reduces performance to $0.84\times$ (Tensor Core Enabled) and $0.63\times$ (Cuda Core Only) of the default. These findings support the default configuration as the preferred setup for $Vortex$ .

Dynamic Hardware Adaptation.

We investigate the dynamic hardware adaptability of $Vortex$ on GPUs using FP16 for the GEMM operator. We test N values of 1024, 2048, and 4096, with K fixed at 1024 and M dynamically adjusted from 1 to 16, across three settings: Cuda Core Only, Tensor Core Only, and the default Adaptive. The results, presented in Figure 16, reveal dynamic hardware utilization opportunities and clarify hardware selection criteria for specific scenarios. $Vortex$ effectively utilizes optimal hardware, achieving performance gains of up to 48% and 54% over fixed CUDA and Tensor Core settings, respectively, demonstrating the effectiveness of its hardware-adaptive scheduler.

8. Related Work

For dynamic-shape tensor programs, vendor-provided libraries, such as cuBLAS (cublas, ), cuDNN (cudnn, ), MKL-DNN (mkl, ) and CUTLASS (cutlass, ) are extensively utilized in prevalent frameworks, facilitating high-performance tensor operations across diverse hardware platforms. These libraries, tailored to specific target hardware, require substantial engineering efforts. $Vortex$ , by introducing a novel unified recursive abstraction, significantly reduces development costs and unifies optimization strategies across different hardware platforms.

Additionally, compilation optimization is a crucial solution for dynamic-shape tensor programs. Existing methods, such as DietCode (dietcode, ), Nimble (nimble, ), and DISC (bladedisc, ), predominantly rely on sample-based compilation approaches. However, these methods overlook hardware-aware optimization opportunities. Our work, distinguishing itself in this area, uses hardware information as a fundamental element to construct a novel sample-free compilation workflow, thus supporting diverse high-performance scenarios.

For optimizing tensor programs with static shapes, various tensor compilers such as AutoTVM (autotvm, ), FlexTensor (flextensor, ), Ansor (ansor, ), and TensorIR (tensorir, ) have been proposed. However, these methods are associated with significant compile time overheads. Although efforts like Roller (roller, ) have attempted to optimize the compilation time for static-shape compilers, the time required is still considerably longer than the execution overhead, making them impractical for the online demands of dynamic-shape tensor programs.

Graph-level optimization is another important component for end-to-end DNN optimizations. DNNFusion (DNNFusion, ), Rammer (Rammer, ), Chimera (chimera, ), and AStitch (astitch, ) focus on fusion optimizations in DNNs. TASO (taso, ), Unity (unity, ), XLA (XLA, ), JAX (jax, ), TorchDynamo (TorchDynamo, ), and TenSAT (tensat, ) explore graph rewriting opportunities. In this paper, $Vortex$ focuses on operator-level optimization of dynamic-shape tensor programs, which is orthogonal to these works. Meanwhile, $Vortex$ is designed without inherent limitations that hinder integration with current graph-level compilers. We look forward to exploring this area as a part of our future research efforts.

9. Conclusion

In this work, we propose $Vortex$ , a hardware-driven and sample-free dynamic-shape tensor program compiler. $Vortex$ leverages bidirectional compilation techniques to deliver universally high-performance support with minimal system overhead. Experimental results demonstrate that $Vortex$ achieves average speedups of $2.53\times$ and $3.01\times$ over vendor-provided libraries and the existing dynamic-shape compiler, respectively. These results highlight the effectiveness of $Vortex$ and its potential as a standard methodology for enhancing dynamic-shape tensor program optimization.

References

[1] oneAPI Deep Neural Network Library (oneDNN). https://github.com/oneapi-src/oneDNN, 2020.
[2] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for Large-Scale machine learning. In 12th USENIX symposium on operating systems design and implementation (OSDI 16), pages 265–283, 2016.
[3] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, et al. Jax: composable transformations of python+ numpy programs. Version 0.2, 5:14–24, 2018.
[4] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et al. TVM: An automated End-to-End optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 578–594, 2018.
[5] Tianqi Chen, Lianmin Zheng, Eddie Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. Learning to optimize tensor programs. Advances in Neural Information Processing Systems, 31, 2018.
[6] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. cudnn: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759, 2014.
[7] Yujeong Choi, Yunseong Kim, and Minsoo Rhu. Lazy batching: An sla-aware batching system for cloud machine learning inference. In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 493–506. IEEE, 2021.
[8] George Chrysos. Intel® xeon phi™ coprocessor-the architecture. Intel Whitepaper, 176(2014):43–50, 2014.
[9] Weihao Cui, Han Zhao, Quan Chen, Hao Wei, Zirui Li, Deze Zeng, Chao Li, and Minyi Guo. DVABatch: Diversity-aware Multi-Entry Multi-Exit batching for efficient processing of DNN services on GPUs. In 2022 USENIX Annual Technical Conference (USENIX ATC 22), pages 183–198, 2022.
[10] ONNX Runtime developers. Onnx runtime. https://onnxruntime.ai/, 2021. Version: x.y.z.
[11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[12] Siyuan Feng, Bohan Hou, Hongyi Jin, Wuwei Lin, Junru Shao, Ruihang Lai, Zihao Ye, Lianmin Zheng, Cody Hao Yu, Yong Yu, et al. Tensorir: An abstraction for automatic tensorized program optimization. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pages 804–817, 2023.
[13] Matthias Fey and Jan Eric Lenssen. Fast graph representation learning with pytorch geometric. arXiv preprint arXiv:1903.02428, 2019.
[14] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.
[15] Heine Halberstam and Hans Egon Richert. Sieve methods. Courier Corporation, 2013.
[16] Yizeng Han, Gao Huang, Shiji Song, Le Yang, Honghui Wang, and Yulin Wang. Dynamic neural networks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(11):7436–7456, 2021.
[17] Zekun Hao, Yu Liu, Hongwei Qin, Junjie Yan, Xiu Li, and Xiaolin Hu. Scale-aware face detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6186–6195, 2017.
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[19] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[20] Zhihao Jia, Oded Padon, James Thomas, Todd Warszawski, Matei Zaharia, and Alex Aiken. Taso: optimizing deep learning computation with automatic generation of graph substitutions. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pages 47–62, 2019.
[21] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
[22] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012.
[23] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436–444, 2015.
[24] Ying Li, Yifan Sun, and Adwait Jog. Path forward beyond simulators: Fast and accurate gpu execution time prediction for dnn workloads. pages 380–394, 2023.
[25] Lingxiao Ma, Zhiqiang Xie, Zhi Yang, Jilong Xue, Youshan Miao, Wei Cui, Wenxiang Hu, Fan Yang, Lintao Zhang, and Lidong Zhou. Rammer: Enabling holistic deep learning compiler optimizations with rTasks. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pages 881–897, 2020.
[26] S Narang and G Diamos. Deepbench: Benchmarking deep learning operations on different hardware, 2017.
[27] Wei Niu, Jiexiong Guan, Yanzhi Wang, Gagan Agrawal, and Bin Ren. Dnnfusion: accelerating deep neural networks execution with advanced operator fusion. In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, pages 883–898, 2021.
[28] NVIDIA. Nvidia a100 tensor core gpu architecture. https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf, 2021.
[29] NVIDIA Corporation. Nvidia cublas documentation.
[30] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.
[31] PyTorch Contributors. TorchDynamo. https://pytorch.org/docs/master/dynamo/, 2022.
[32] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
[33] Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. Acm Sigplan Notices, 48(6):519–530, 2013.
[34] Amit Sabne. XLA: Compiling machine learning for peak performance, 2020.
[35] Haichen Shen, Jared Roesch, Zhi Chen, Wei Chen, Yong Wu, Mu Li, Vin Sharma, Zachary Tatlock, and Yida Wang. Nimble: Efficiently compiling dynamic neural networks for model inference. Proceedings of Machine Learning and Systems, 3:208–222, 2021.
[36] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[37] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
[38] Vijay Thakkar, Pradeep Ramani, Cris Cecka, Aniket Shivam, Honghao Lu, Ethan Yan, Jack Kosaian, Mark Hoemmen, Haicheng Wu, Andrew Kerr, Matt Nicely, Duane Merrill, Dustyn Blasig, Fengqi Qiao, Piotr Majcher, Paul Springer, Markus Hohnerbach, Jin Wang, and Manish Gupta. CUTLASS, jan 2023.
[39] Colin Unger, Zhihao Jia, Wei Wu, Sina Lin, Mandeep Baines, Carlos Efrain Quintero Narvaez, Vinay Ramakrishnaiah, Nirmal Prajapati, Pat McCormick, Jamaludin Mohd-Yusof, et al. Unity: Accelerating DNN training through joint optimization of algebraic transformations and parallelization. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 267–284, 2022.
[40] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
[41] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
[42] Endong Wang, Qing Zhang, Bo Shen, Guangyong Zhang, Xiaowei Lu, Qing Wu, Yajuan Wang, Endong Wang, Qing Zhang, Bo Shen, et al. Intel math kernel library. High-Performance Computing on the Intel® Xeon Phi™: How to Fully Exploit MIC Architectures, pages 167–188, 2014.
[43] Yuke Wang, Boyuan Feng, Gushu Li, Shuangchen Li, Lei Deng, Yuan Xie, and Yufei Ding. GNNAdvisor: An adaptive and efficient runtime system for GNN acceleration on GPUs. In 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21), pages 515–531, 2021.
[44] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019.
[45] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S Yu Philip. A comprehensive survey on graph neural networks. IEEE transactions on neural networks and learning systems, 32(1):4–24, 2020.
[46] Yichen Yang, Phitchaya Phothilimthana, Yisu Wang, Max Willsey, Sudip Roy, and Jacques Pienaar. Equality saturation for tensor graph superoptimization. Proceedings of Machine Learning and Systems, 3:255–268, 2021.
[47] Zerui Yang, Yuhui Xu, Wenrui Dai, and Hongkai Xiong. Dynamic-stride-net: Deep convolutional neural network with dynamic stride. In Optoelectronic Imaging and Multimedia Technology VI, volume 11187, pages 42–53. SPIE, 2019.
[48] Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. Orca: A distributed serving system for Transformer-Based generative models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 521–538, 2022.
[49] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 2023.
[50] Bojian Zheng, Ziheng Jiang, Cody Hao Yu, Haichen Shen, Joshua Fromm, Yizhi Liu, Yida Wang, Luis Ceze, Tianqi Chen, and Gennady Pekhimenko. Dietcode: Automatic optimization for dynamic tensor programs. Proceedings of Machine Learning and Systems, 4:848–863, 2022.
[51] Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, Koushik Sen, et al. Ansor: Generating High-Performance tensor programs for deep learning. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pages 863–879, 2020.
[52] Size Zheng, Siyuan Chen, Peidi Song, Renze Chen, Xiuhong Li, Shengen Yan, Dahua Lin, Jingwen Leng, and Yun Liang. Chimera: An analytical optimizing framework for effective compute-intensive operators fusion. In 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 1113–1126. IEEE, 2023.
[53] Size Zheng, Yun Liang, Shuo Wang, Renze Chen, and Kaiwen Sheng. Flextensor: An automatic schedule exploration and optimization framework for tensor computation on heterogeneous system. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 859–873, 2020.
[54] Zhen Zheng, Zaifeng Pan, Dalin Wang, Kai Zhu, Wenyi Zhao, Tianyou Guo, Xiafei Qiu, Minmin Sun, Junjie Bai, Feng Zhang, et al. Bladedisc: Optimizing dynamic shape machine learning workloads via compiler approach. Proceedings of the ACM on Management of Data, 1(3):1–29, 2023.
[55] Zhen Zheng, Xuanda Yang, Pengzhan Zhao, Guoping Long, Kai Zhu, Feiwen Zhu, Wenyi Zhao, Xiaoyong Liu, Jun Yang, Jidong Zhai, et al. Astitch: enabling a new multi-dimensional optimization space for memory-intensive ml training and inference on modern simt architectures. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, pages 359–373, 2022.
[56] Hongyu Zhu, Ruofan Wu, Yijia Diao, Shanbin Ke, Haoyu Li, Chen Zhang, Jilong Xue, Lingxiao Ma, Yuqing Xia, Wei Cui, et al. ROLLER: Fast and efficient tensor compilation for deep learning. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 233–248, 2022.
