Vortex: Efficient Sample-Free Dynamic Tensor Program Optimization via Hardware-aware
Strategy Space Hierarchization

Yangjie Zhou* [email protected] Tencent Honglin Zhu* [email protected] Tencent Qian Qiu [email protected] Tencent Weihao Cui [email protected] Shanghai Jiao Tong University Zihan Liu [email protected] Shanghai Jiao Tong University Cong Guo [email protected] Shanghai Jiao Tong University Siyuan Feng [email protected] Shanghai Jiao Tong University Jintao Meng [email protected] Shenzhen Institute of Advanced Technology Haidong Lan [email protected] Taichi Graphics Jingwen Leng [email protected] Shanghai Jiao Tong University Wenxi Zhu [email protected] Tencent  and  Minwen Deng [email protected] Tencent
Abstract.

Dynamic-shape deep neural networks (DNNs) are rapidly evolving, attracting attention for their ability to handle variable input sizes in real-time applications. However, existing compilation optimization methods for such networks often rely heavily on predefined samples to guide the compilation process, which restricts their adaptability and efficiency. These sample-driven methods struggle to efficiently manage the diverse and unpredictable shapes encountered in real-world scenarios, often resulting in suboptimal performance.

To tackle these issues, we introduce Vortex, a hardware-driven and sample-free compiler tailored for dynamic-shape tensor programs. Vortex capitalizes on detailed hardware information and hierarchizes the strategy space to facilitate high-performance code generation without relying on runtime shape samples. It features a unique bidirectional compilation workflow, combining top-down abstraction for aligning tensor program execution with hardware hierarchies and bottom-up kernel construction to narrow the search space, enabling Vortex to achieve remarkable efficiency. Comprehensive evaluations confirm that Vortex reduces compilation time by 176× compared to the existing dynamic-shape compiler. Additionally, it substantially outperforms existing vendor-provided libraries and dynamic-shape compilers on both CPU and GPU platforms, delivering speedups of 2.53× and 3.01×, respectively.

*Yangjie Zhou and Honglin Zhu contributed equally to this work.
Wenxi Zhu and Minwen Deng are the corresponding authors of this paper.

1. Introduction

Efficient optimization of tensor programs is crucial in accelerating Deep Neural Network (DNN) models (lecun2015deep, ) and large language models (LLMs) (llm_survey, ). Modern frameworks, e.g., PyTorch (pytorch, ) and TensorFlow (tensorflow, ), and compilers such as TVM (tvm, ) significantly rely on tensor programs for DNN computational operator abstractions. Traditional compilers (tvm, ; halide, ; ansor, ; tensorir, ) have primarily focused on optimizing static-shape DNNs, where tensor computations involve fixed-shape inputs and outputs at runtime. In contrast, the emergence of dynamic-shape DNNs, capable of handling variable input shapes during runtime, has become a significant area of interest (dynamic_survey, ; dietcode, ; bladedisc, ; nimble, ). For example, large language models (llm_survey, ) based on Transformers (attention, ) often adopt variable sequence lengths, necessitating dynamic-shape tensor computation. Effectively managing tensor computation with dynamic shapes is pivotal for optimizing neural network performance. However, the flexibility introduced by dynamic shapes poses challenges for optimizing tensor programs.

Figure 1. Comparison of Vortex with existing methods.

As illustrated in Figure 1, two main types of solutions have evolved to address the complexities of optimizing dynamic-shape tensor program computation. The first solution is the vendor-provided library, exemplified by oneDNN (onednn, ) for Intel CPUs and cuBLAS (cublas, ) for Nvidia GPUs, offering handcrafted implementations of DNN operators. While effective in certain scenarios, these libraries face limitations due to their empirical programming strategy, which does not offer the necessary flexibility for broad adaptability (dietcode, ). Additionally, the high development cost required to create these handcrafted solutions further constrains their efficiency (ansor, ).

The second category encompasses existing sample-driven dynamic-shape compilers (dietcode, ; nimble, ), as shown in the middle of Figure 1. Similar in workflow to static-shape compilers (tvm, ; ansor, ), these dynamic compilers take the tensor program and tensor samples as inputs to generate executable kernels. Typically, tensor compilers construct a substantial search space to implement optimization strategies such as loop partitioning, fusion, and reordering (tvm, ; ansor, ). Existing dynamic-shape compilers (dietcode, ; nimble, ) usually adopt a shape-generic search space, using tensor samples to represent shape information and auto-tuning micro-kernels for each specific sample in the offline phase. At runtime, the dynamic compilers use a selector to integrate the micro-kernels into the computation process. However, their reliance on predefined shape samples limits their flexibility and effectiveness, particularly when tensor shapes fall outside the predefined range. This limitation can result in performance degradations of up to 4× for unsampled shapes (§2.2). Furthermore, these approaches necessitate frequent re-tuning through profiling on actual hardware to accommodate sample variations, which incurs considerable overhead, often taking hours to days (dietcode, ; nimble, ).

Recognizing the limitations of existing methodologies, which predominantly rely on sample-based compilation (dietcode, ; nimble, ) or intensive manual implementation (cublas, ; cudnn, ), we underscore the need for an innovative approach that fully exploits the capabilities of hardware architecture. By adopting a hierarchical approach, we decouple the kernel into multiple levels: during the offline stage, we leverage detailed hardware information to construct hardware-friendly micro-kernels. Subsequently, during runtime, we utilize the shape information to select and configure micro-kernels dynamically, crafting shape-friendly implementations tailored to the needs of dynamic-shape tensor programs. This methodology not only enhances tensor program execution but also eliminates the reliance on predefined shape samples, marking a significant shift towards dynamic, real-time compilation strategies.

In this work, we propose Vortex, a hardware-driven and sample-free compiler tailored for dynamic-shape tensor programs. To achieve these goals, we employ a novel bidirectional method to integrate hardware information into the compilation process. Specifically, this bidirectional approach combines a top-down alignment of the software tensor program with the hardware hierarchy and a bottom-up construction that dynamically adapts to changing tensor shapes, as demonstrated in Figure 1.

First, Vortex employs a top-down abstraction strategy to recursively decouple the tensor program, ensuring alignment with the hardware’s hierarchical architecture. For instance, for a CUDA kernel on a GPU, this process decouples the tensor program, mapping it to the software programming model’s Grid, Block, and Thread levels and the corresponding hardware structures: device, streaming multiprocessors (SMs), and registers (a100_whitepaper, ). By mirroring the hardware’s hierarchical organization, Vortex optimizes the execution implementation and resource allocation specific to each hardware level. This tailored approach ensures that the software exploits the full potential of the hardware architecture, leading to a significant boost in kernel performance.

Second, Vortex generates the multi-level micro-kernels using a bottom-up construction strategy. Vortex uses hardware parameter information to prune a large portion of the strategy space, leading to more efficient code generation. An iterative constructor, progressing from lower to higher levels, uses this pruned strategy space to guide the construction of the kernels. This approach reduces the search space and accelerates the development process, thereby enhancing efficiency. Furthermore, Vortex incorporates a hybrid analyzer that combines analytical (chimera, ; micro23_path, ) and empirical approaches (tvm, ; ansor, ) to evaluate the performance of various strategies, thereby enabling high-performance code generation with minimal system overhead.

Vortex stands out due to its superior runtime performance and significantly reduced compilation overhead. We conduct a comprehensive evaluation of Vortex, including a detailed comparative analysis with state-of-the-art vendor-provided libraries (oneDNN (onednn, ), ONNX Runtime (onnxruntime, ), cuBLAS (cublas, ), cuDNN (cudnn, ) and CUTLASS (cutlass, )) and the dynamic-shape compiler DietCode (dietcode, ). These assessments are conducted across various hardware platforms, including Intel CPUs (intel_xeon_whitepaper, ) and Nvidia GPUs (a100_whitepaper, ), and at both operator and model levels. Notably, Vortex is on average 2.53× and 3.01× faster than vendor-provided libraries and DietCode, respectively. Additionally, Vortex significantly accelerates offline compilation, achieving a 176× improvement over DietCode. Overall, the results highlight Vortex’s ability to outperform existing solutions across both CPU and GPU platforms.

In general, our work makes the following contributions:

  • We identify hardware-driven opportunities as pivotal factors for optimizing dynamic-shape compilation, highlighting the limitations of existing sample-driven solutions in terms of flexibility and functionality.

  • We propose rKernel, a top-down unified abstraction to decouple tensor programs and align their strategy space with hardware hierarchies, ensuring that the software fully exploits the potential of the backend hardware.

  • We introduce a bottom-up kernel construction approach, leveraging hardware information to prune the strategy space efficiently. This method facilitates the generation of kernels for dynamic-shape tensor programs without reliance on runtime shape samples.

  • We implement Vortex, a novel dynamic-shape compiler. Our comprehensive evaluation demonstrates Vortex’s superiority over existing dynamic-shape compilers and manual optimization techniques on CPU and GPU platforms.

2. Background and Motivation

In this section, we begin by presenting a concise background on dynamic-shape tensor programs, outlining their definition and real-world computational scenarios. We then move to delineate the inherent constraints of the existing sample-driven optimization approach. Finally, we explore the opportunities and challenges in hardware-driven solutions.

2.1. Dynamic-shape Tensor Program

Tensor programs act as an operator-level abstraction and are widely used across different tasks in neural network computation (pytorch, ; tensorflow, ; tvm, ; ansor, ). Conventional tensor programs inherently incorporate static shape information as an integral part of their input. In contrast, dynamic-shape tensor programs operate on tensors whose shapes are unknown until runtime (dynamic_survey, ). Dynamic-shape tensor programs have arisen in response to a twofold demand, driven by intrinsic data format dynamism and by system execution and scheduling strategies.

Figure 2. Existing sample-driven compilation workflow.

The intrinsic data format dynamism is one of the fundamental driving forces demanding the integration of dynamic-shape tensor programs. For example, in natural language processing (NLP) (bert, ), the inherent variability in sequence lengths is a characteristic feature that traditional static-shape tensor programs find challenging to accommodate. The diverse lengths of sentence tasks necessitate a flexible framework capable of dynamically adapting to varying input sizes. In computer vision (CV) tasks, conventional methods rely on fixed image sizes for tensor program input, which limits flexibility (mobilenet, ). Recent developments (fast-rcnn, ; scale_fact_dect, ; dynamic_stride_net, ; dynamic_survey, ) have introduced dynamic-shape tensors as input, enhancing support for more advanced detection and tracking tasks. Additionally, in the field of graph neural network (GNN) (gnn_survey, ; GNNAdvisor, ), the dynamic nature of graph structures, marked by varying numbers of vertices and edges, necessitates the use of dynamic-shape tensor programs. These examples highlight the essential need for systems that seamlessly manage the intrinsic variability in diverse data formats.

System execution and scheduling strategies further underscore the importance of dynamic-shape tensor programs. For example, dynamic adjustment of batch sizes in the execution of neural networks introduces variability that demands adaptability in the underlying tensor program (dvabatch, ; lazybatch, ; orca, ). The ability to efficiently handle varying batch sizes is crucial for optimizing resource utilization and achieving optimal performance in real-world applications.

2.2. Limitations of Sample-Driven Approach

Many optimizations have been proposed for dynamic-shape tensor programs, all following a similar workflow (nimble, ; dietcode, ). We term this methodology the sample-driven approach. In this subsection, we first demonstrate its workflow and then discuss the inherent limitations.

As depicted in Figure 2, the current approach typically employs a sample list to describe the dynamic-shape parameters, combining the known static shapes and the tensor program to staticize the dynamic-shape tensor program into a shape-generic search space. An auto-tuning module, adapted from static-shape compilers (tvm, ; ansor, ; tensorir, ), is subsequently utilized. The auto-tuning module takes the tensor program with unspecified tile parameters and searches for high-performance tiling configurations within the comprehensive search space. This process produces fine-tuned micro-kernels for each input sample. At runtime, a decision-tree-based selector is employed to choose the appropriate micro-kernel based on the runtime shape. The kernel constructor finalizes the process by setting kernel launch parameters and incorporating padding to extend the applicability of these micro-kernels to runtime shapes not included in the sample list, utilizing the pre-compiled micro-kernels for runtime execution.
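The runtime half of this workflow can be illustrated with a minimal sketch. The sample list, the tile sizes, and the nearest-sample rule below are hypothetical stand-ins (the systems described above use a decision-tree-based selector); the sketch only shows how padding a runtime shape up to a pre-tuned tile size introduces wasted computation.

```python
import math

# Hypothetical offline result: sampled M values mapped to the tile size
# (along M) of the micro-kernel tuned for that sample.
TUNED_TILE_M = {64: 16, 256: 64, 512: 64, 1024: 128}


def select_micro_kernel(m: int) -> int:
    """Return the tile size tuned for the nearest sampled shape.

    This nearest-sample rule is an assumption standing in for the
    decision-tree-based selector described in the text.
    """
    nearest_sample = min(TUNED_TILE_M, key=lambda s: abs(s - m))
    return TUNED_TILE_M[nearest_sample]


def padded_rows(m: int, tile_m: int) -> int:
    """Rows actually computed once m is padded up to a multiple of tile_m."""
    return math.ceil(m / tile_m) * tile_m


def padding_waste(m: int) -> float:
    """Fraction of computed rows that are padding for runtime size m."""
    tile_m = select_micro_kernel(m)
    return 1.0 - m / padded_rows(m, tile_m)


if __name__ == "__main__":
    for m in (256, 300, 1100):
        print(m, select_micro_kernel(m), round(padding_waste(m), 4))
```

A shape that appears in the sample list (256 here) incurs no padding, whereas an unsampled shape inherits a tile size tuned for a different shape and pays for the padded rows.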

Figure 3. Comparing DietCode and cuBLAS over various sequence lengths on A100 GPU. ‘DietCode-I’ and ‘DietCode-O’ represent DietCode’s dynamic input configurations inside and outside the tuning sample list, respectively.

However, this approach faces significant limitations due to its reliance on a predetermined sample list for dynamic-shape parameters. This limitation restricts the compiler’s flexibility and overall functionality. In situations where input shapes are not included in the sample list, this sample-driven dynamic compilation optimization does not consistently provide high-performance computing support. This is a critical shortfall, as the diversity and variability of real-world data often extend beyond the scope of the predefined sample list.

To empirically validate this limitation, we conduct an experiment targeting the first general matrix multiply (GEMM) operation of the Bert model (bert, ). This GEMM operation entails the multiplication of two matrices, A and B. In this context, M denotes the number of rows in matrix A, computed as the product of batch size and sequence length. N represents the number of columns in matrix B, fixed at 768, and K corresponds to the number of columns in matrix A (and rows in matrix B), fixed at 2304. Utilizing DietCode’s default sample configuration, tests are performed with a fixed batch size of 16 and sequence lengths varying from 5 to 128 in steps of 19.
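For concreteness, the tested shapes can be enumerated with a short script. The values of N, K, the batch size, and the sequence-length range are taken from the description above; the helper name `gemm_flops` and the usual 2·M·N·K operation count for GEMM are additions made for illustration.

```python
BATCH_SIZE = 16
N, K = 768, 2304  # fixed dimensions as stated in the text

# Sequence lengths from 5 up to (at most) 128 in steps of 19.
seq_lens = list(range(5, 129, 19))

# M is the product of batch size and sequence length.
shapes = [(BATCH_SIZE * s, N, K) for s in seq_lens]


def gemm_flops(m: int, n: int, k: int) -> int:
    """Floating-point operations of an m x k by k x n GEMM (one multiply and one add per term)."""
    return 2 * m * n * k


if __name__ == "__main__":
    for s, (m, n, k) in zip(seq_lens, shapes):
        print(f"seq_len={s:3d}  M={m:4d}  N={n}  K={k}  FLOPs={gemm_flops(m, n, k)}")
```

The range yields seven sequence lengths (5, 24, 43, 62, 81, 100, 119), so M spans 80 to 1904.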

Figure 3 shows that DietCode exhibits a significant performance discrepancy for dynamic parameters not included in the sample list, as compared to the results achieved with the vendor’s library cuBLAS (cublas, ). This is due to the absence of specifically optimized micro-kernels for these shapes and the increased inefficiency from padding loss. Moreover, this method’s rigidity becomes evident when considering the need to modify the sample list to accommodate various computational scenarios. Such modifications necessitate re-tuning of the system, which worsens its inflexibility.

2.3. Hardware-Driven Approach: Opportunities and Challenges

Figure 4. CPU/GPU Diagram.

The mismatch between runtime-used and offline-sampled shapes diminishes the efficiency of traditional sample-driven methodology in dynamic-shape compilation. Such methodologies not only fall short in delivering high performance but also result in considerable tuning overhead. However, we observe that generating the sample list for tuning micro-kernels is not mandatory for supporting the execution of a dynamic tensor program. Sample-driven methodology treats the hardware platform as a black box, where micro-kernels are only tuned based on performance feedback of sampled shapes. This approach overlooks the rich vein of prior knowledge available within the hardware itself.

Figure 4 illustrates the architecture of current mainstream hardware platforms, such as central processing units (CPUs) (intel_xeon_whitepaper, ) and graphics processing units (GPUs) (a100_whitepaper, ). These platforms share an inherent similarity: each has a hierarchical structure. This hierarchy is distinctly multi-level, wherein each level comprises a predetermined quantity of computational or storage units. For instance, each CPU core has its own L1 cache and ALUs, while all the CPU cores share the L2/L3 cache and DRAM. As for GPUs, each streaming multiprocessor (SM) also has its own computing units (CUDA cores, Tensor Cores) and L1 cache, while all SMs share the L2 cache and DRAM. At each level, the resources available for executing a dynamic tensor program operator are constrained by the inherent limitations of the hardware design.

Figure 5. GEMM performance across different hardware resource usages on 8255c CPU and A100 GPU. Legend indicates corresponding GEMM parameters M, N, and K.

To investigate the impact of hardware limits on kernel performance, we generate configurations with different resource usages for the commonly used matrix multiplication. We collect these configurations during the tuning process of Ansor (ansor, ). Figure 5 shows the floating point operations per second (FLOPS) and the corresponding resource usage of the collected configurations. As observed, operator performance declines sharply once the resources used exceed the hardware’s upper limit. This implies that configurations with hardware-unfriendly parameters consistently underperform, allowing us to preemptively prune inefficient configurations from the large strategy space. Consequently, this approach maintains a streamlined strategy space for micro-kernel generation and eliminates the need to create a predefined sample list.
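The pruning idea can be sketched as a simple filter over candidate tile configurations. Everything in this sketch is an illustrative assumption rather than the paper's implementation: the `TileConfig` and `HardwareLimits` types, the resource model (one tile of A and one tile of B staged in on-chip memory, 2-byte elements), and the limit values (48 KB of on-chip buffer and 1024 threads per block).

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class HardwareLimits:
    shared_mem_bytes: int        # on-chip buffer capacity available to one block
    max_threads_per_block: int   # upper bound on threads in one block


@dataclass(frozen=True)
class TileConfig:
    tile_m: int
    tile_n: int
    tile_k: int
    threads: int

    def shared_mem_bytes(self, dtype_bytes: int = 2) -> int:
        # Assumed resource model: one A tile (tile_m x tile_k) and one
        # B tile (tile_k x tile_n) are staged in on-chip memory.
        return (self.tile_m * self.tile_k + self.tile_k * self.tile_n) * dtype_bytes


def prune(configs, limits):
    """Keep only configurations whose resource usage fits the hardware limits."""
    return [
        c for c in configs
        if c.shared_mem_bytes() <= limits.shared_mem_bytes
        and c.threads <= limits.max_threads_per_block
    ]


# Illustrative limits, not measured values for any specific device.
LIMITS = HardwareLimits(shared_mem_bytes=48 * 1024, max_threads_per_block=1024)

candidates = [
    TileConfig(m, n, k, t)
    for m, n, k, t in product((64, 128, 256), (64, 128, 256), (32, 64), (256, 2048))
]
kept = prune(candidates, LIMITS)

if __name__ == "__main__":
    print(f"{len(kept)} of {len(candidates)} candidates fit the limits")
```

No shape sample appears anywhere in the filter: candidates are rejected purely from hardware parameters, which is what allows the pruning to happen offline.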

Moreover, the dependency on hardware hierarchy motivates us to generate the micro-kernels in a bottom-up manner. As the hardware hierarchy enables us to prune inefficient configurations based on the hardware limits level by level, only a limited number of micro-kernels is ultimately kept to support the dynamism of the tensor program. This approach enables comprehensive support for all potential input shapes in dynamic tensor programming.

Summary

By exploiting the inherent structure of the hardware, we can not only optimize but also fundamentally rethink our approach to compilation strategy. This observation motivates us to design a hardware-aware dynamic-shape compiler that delivers highly efficient computational performance while maintaining flexibility.

3. Overview of Vortex

In this section, we describe the workflow of Vortex, highlighting its key idea and optimization flow. Vortex is an advanced operator-level compiler explicitly designed for optimizing the computation of dynamic-shape tensor programs.

Key Idea.

The key idea of Vortex is its strategic use of hardware features to develop a hierarchical optimization flow, seamlessly integrating both offline and runtime stages. Initially, Vortex systematically decomposes dynamic-shape tensor programs into multi-level subtasks through a unified abstraction, rKernel. Each hierarchical level leverages hardware parameters to create micro-kernels tailored to specific hardware needs. The process begins with the initial construction of micro-kernels at the lowest level, progressing to the final kernel selection during the runtime stage when shape information becomes available. The hardware-aware approach of Vortex facilitates a bottom-up, multi-level compilation workflow. This approach leads to efficient optimization with lower runtime overhead and improved performance, which is ideal for dynamic-shape tensor programs.

Figure 6. Design overview of Vortex.

Optimization Flow.

Figure 6 details the methodological framework of Vortex. In the offline stage, for a given tensor program, Vortex employs a top-down abstraction to transform the tensor program into a structured, hierarchical format that mirrors the hardware’s structure, which aligns the strategy space with the hardware architecture. For instance, the upper-right section of Figure 6 illustrates the tensor program abstraction of a GPU-based kernel. Here, we can seamlessly align the terms “Grid”, “Block”, and “Warp” with the device, SM, and Tensor Cores, respectively. Vortex then employs a hardware-aware multi-level kernel constructor to develop a suite of hardware-friendly micro-kernels from the bottom level to the top. At each level, Vortex utilizes a micro-kernel generator to identify candidates, then employs empirical or analytical analyzers to fine-tune the micro-kernels’ implementation. This approach offers a well-balanced compromise between performance and overhead, tailored specifically to each hardware tier. The result is a set of micro-kernels that are highly compatible with the hardware at each level. During the runtime stage, Vortex employs a streamlined analytical module. This module quickly selects the appropriate candidate for the runtime input shapes, facilitating efficient execution on the hardware.
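As a rough illustration of shape-driven selection at runtime, the sketch below picks one micro-kernel from a small offline-built set using a toy analytical cost: the number of execution waves multiplied by a per-tile latency. This is not the analytical module of Vortex. The tile shapes, the latency table, the wave-based cost, and the assumption of 108 parallel units (the SM count of an A100) are all hypothetical.

```python
import math

# Hypothetical micro-kernels from the offline stage: (tile_m, tile_n) mapped
# to an assumed relative latency of processing one tile.
TILE_LATENCY = {
    (16, 64): 1.0,
    (32, 64): 1.6,
    (64, 64): 2.6,
    (128, 128): 8.0,
}
NUM_UNITS = 108  # assumed number of parallel units executing tiles concurrently


def estimated_cost(m: int, n: int, tile) -> float:
    """Toy cost: waves needed to cover the output times latency per tile."""
    tile_m, tile_n = tile
    num_tiles = math.ceil(m / tile_m) * math.ceil(n / tile_n)
    waves = math.ceil(num_tiles / NUM_UNITS)
    return waves * TILE_LATENCY[tile]


def select_kernel(m: int, n: int):
    """Choose the micro-kernel with the lowest estimated cost for shape (m, n)."""
    return min(TILE_LATENCY, key=lambda tile: estimated_cost(m, n, tile))


if __name__ == "__main__":
    for m in (80, 1024, 4096):
        print(m, select_kernel(m, 768))
```

Under these assumed numbers, small M favors the smallest tile, while larger M shifts the choice toward larger tiles that need fewer waves. The selection involves only arithmetic on the runtime shape, which is why such a module can stay cheap.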

The benefits of this hardware-aware optimization workflow are two-fold. Primarily, Vortex can generate hardware-friendly kernels that are universally capable of delivering high-performance computational support. Furthermore, the offline compilation phase’s independence from runtime shape samples significantly broadens the scope of support Vortex can offer, as it reduces constraints and enhances flexibility.

Summary.

In summary, Vortex presents an effective and efficient solution to the challenges of dynamic-shape tensor program computation. Its hardware-aware and multi-level process, blending offline preparation with runtime efficiency, ensures the system’s adaptability to complex dynamic-shape computational scenarios, guaranteeing high-performance support while maintaining low runtime overhead.

Figure 7. Recursive execution pattern of the GEMM operator across hardware hierarchy levels.

4. Strategy Space Hierarchization in Vortex

In this section, we introduce the key to Vortex’s workflow: strategy space hierarchization via top-down recursive decomposition. We start with GEMM as an example to illustrate the recursive execution pattern inherent in tensor programs. This example lays the foundation for presenting our unified abstraction, rKernel, designed to accommodate the recursive nature prevalent in diverse tensor programs and across multiple hardware hierarchies.

4.1. Top-Down Recursive Notation

In this subsection, we use the GEMM tensor program on GPUs as a paradigmatic example of recursive execution patterns on hierarchical hardware platforms. GEMM, as a classic operator in deep learning, is mathematically defined as C = A × B, where A and B are the input matrices and C is the output matrix.

As shown in Figure 7, it is natural to use recursive for-loops to represent GEMM execution flows. This approach breaks down the GEMM tensor program into a series of recursive loop sets. These recursive loop sets map the high-level GEMM tensor program onto specific hardware levels, ranging from off-chip to on-chip memory and registers. Specifically, the uppermost recursion, OffChip_Mem_GEMM, processes end-to-end matrix blocks in off-chip memory. At intermediate levels, OnChip_Mem_GEMM handles smaller matrix blocks in on-chip memory. The innermost recursion, Register_GEMM, focuses on individual elements in registers, utilizing specific calculation instructions, such as Fused-Multiply-Add (FMA) (a100_whitepaper, ) instructions on GPUs.
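The three-level recursion can be mimicked in plain Python. This is a functional sketch only: the function names follow the levels named above, a single tile size is assumed for all three dimensions, and nested lists stand in for off-chip memory, on-chip memory, and registers.

```python
def register_gemm(A, B, C, i, j, k):
    """Innermost level: one multiply-accumulate on scalar operands."""
    C[i][j] += A[i][k] * B[k][j]


def onchip_mem_gemm(A, B, C, m0, n0, k0, tile, M, N, K):
    """Intermediate level: iterate over the elements of one tile."""
    for i in range(m0, min(m0 + tile, M)):
        for j in range(n0, min(n0 + tile, N)):
            for k in range(k0, min(k0 + tile, K)):
                register_gemm(A, B, C, i, j, k)


def offchip_mem_gemm(A, B, tile=2):
    """Outermost level: walk the whole problem tile by tile."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0] * N for _ in range(M)]
    for m0 in range(0, M, tile):
        for n0 in range(0, N, tile):
            for k0 in range(0, K, tile):  # reduction dimension
                onchip_mem_gemm(A, B, C, m0, n0, k0, tile, M, N, K)
    return C


if __name__ == "__main__":
    A = [[1, 2, 3], [4, 5, 6]]
    B = [[7, 8], [9, 10], [11, 12]]
    print(offchip_mem_gemm(A, B))
```

Each level only needs to know the extent of its own loops, so the tile size at one level can be changed without touching the code of the levels above or below it.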

This top-down recursive approach is intuitive and crucial for optimizing tensor programs on modern hardware architectures. It gives us an opportunity to decouple optimization space between different layers from each other, ensuring that each recursive layer is finely tuned to the unique capabilities and limitations of the corresponding hardware level, whether it be off-chip memory, on-chip memory, or registers. Furthermore, this approach provides a clear structure for optimizing different aspects of the tensor program independently at each level.

4.2. Unified Abstraction Design

Algorithm 1 Unified Recursive Abstraction
procedure rKernel(L, PL, TSL, TRL, LF, SF)
    // L: Current hierarchical layer
    // PL: Set of parallel loops
    // TSL: Set of temporal spatial loops
    // TRL: Set of temporal reduction loops
    for each parallel loop p in PL[L] do
        for each temporal spatial loop ts in TSL[L] do
            for each temporal reduction loop tr in TRL[L] do
                Load_Func(L, p, ts, tr)
                rKernel(L − 1, PL, TSL, TRL, LF, SF)
            Store_Func(L, p, ts)
end procedure

Recognizing the hierarchical nature of tensor program execution, we also acknowledge the variations between different hardware and tensor programs. Specifically, CPUs and GPUs exhibit distinct computation modes and memory access controls, where CPUs are optimized for multi-threaded parallelism, and GPUs excel in Warp, CTA, and Grid-level parallelism. Moreover, different tensor programs demonstrate unique loop patterns; for instance, the loop characteristics of Convolution markedly differ from those in GEMM.

These variations inspire our design of a unified abstraction, which delineates a universal approach for representing tensor program executions across various hardware platforms. Algorithm 1 elaborates on this abstraction. It maintains the layer-wise recursive structure as demonstrated in GEMM (Figure 7) as the core, and enables custom loop mapping and execution stages for various tensor programs.

To achieve a universal and customizable representation, we classify loops within each hierarchical level into three distinct sets in Algorithm 1. The Parallel Loop Set is designed for parallel execution; the Temporal Spatial Loop Set manages temporal non-reduction operations; and the Temporal Reduction Loop Set focuses on temporal reduction operations. Each level, identified as level N, abstracts the execution into three stages: Load, rKernel(N-1), and Store. These stages serve as flexible interfaces, allowing for tailored execution based on the specific requirements of each hardware level.
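The recursion in Algorithm 1 can be exercised with a small executable sketch that records the order of the Load, lower-level rKernel, and Store stages. Several details are assumptions made for illustration: loop sets are reduced to integer extents, the load and store stages are passed as callbacks, and a `compute` callback is invoked once the recursion drops below level 0, a base case that Algorithm 1 leaves implicit.

```python
def r_kernel(level, loops, load, store, compute, path=()):
    """Recursive kernel skeleton following the structure of Algorithm 1.

    loops[level] holds the extents of the (parallel, temporal spatial,
    temporal reduction) loop sets at that level.
    """
    if level < 0:
        # Assumed base case: the computation below the lowest level.
        compute(path)
        return
    parallel, temporal_spatial, temporal_reduction = loops[level]
    for p in range(parallel):
        for ts in range(temporal_spatial):
            for tr in range(temporal_reduction):
                load(level, p, ts, tr)                      # Load stage
                r_kernel(level - 1, loops, load, store,     # lower-level rKernel
                         compute, path + ((p, ts, tr),))
            store(level, p, ts)                             # Store stage


# Two levels: level 1 has two parallel iterations, level 0 has a
# reduction loop of length two.
trace = []
loops = {1: (2, 1, 1), 0: (1, 1, 2)}
r_kernel(
    1,
    loops,
    load=lambda lvl, p, ts, tr: trace.append(("load", lvl)),
    store=lambda lvl, p, ts: trace.append(("store", lvl)),
    compute=lambda path: trace.append(("compute", len(path))),
)

if __name__ == "__main__":
    for event in trace:
        print(event)
```

The trace shows that the store at a given level runs once per temporal spatial iteration, after the whole reduction loop has finished, which matches the placement of Store_Func outside the reduction loop in Algorithm 1.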

Table 1. Complete representation for different hardware and levels via the rKernel abstraction. ‘-’ refers to ‘No Parallel Binding’ in the ‘Parallel Binding’ column and ‘No Operation’ elsewhere.

| HW. | Level | Parallel Binding | Load | Lower Level rKernel | Store |
|-----|-------|------------------|------|---------------------|-------|
| CPU | 0 | - | CacheBuf → Reg or GlobalMem → Reg | ALU Calc. | Reg → CacheBuf or Reg → GlobalMem |
| CPU | 1 | Thread | GlobalMem → CacheBuf or - | L1 rKernel | CacheBuf → GlobalMem or - |
| CPU | 2 | Process | - | L2 rKernel | - |
| GPU | 0 | Warp | SharedMem → Reg or GlobalMem → Reg | CUDA/Tensor Core Calc. | Reg → SharedMem or Reg → GlobalMem |
| GPU | 1 | CTA | GlobalMem → SharedMem or - | L0 rKernel | SharedMem → GlobalMem or - |
| GPU | 2 | Grid | - | L1 rKernel | - |

Table 1 illustrates how the unified abstraction rKernel is implemented across different hardware configurations, highlighting its versatility. Our focus lies in scrutinizing recursive execution patterns among different hierarchies. For CPUs, at the lowest level (L0), the rKernel abstraction allows for direct data transfer from Global memory to Registers or from CacheBuffer to Registers, depending on the specific needs of the computation. A “CacheBuffer” is defined as a memory buffer, sized within the L2 cache limits, to ensure consistent caching of its contents in the L2 cache for efficient data access and processing (onednn, ). Store operations at this level reflect the same adaptability, offering the choice of transferring data back to Global memory or to CacheBuffer. At level L1, the abstraction provides options for either transferring data from Global memory to CacheBuffer or performing no operation, signifying a versatile approach to data handling. The highest level (L2) in CPUs focuses on the multi-thread mechanism at the process level, capitalizing on the CPU’s capabilities for multi-core parallel processing.

Similarly, for GPUs, including both Cuda Cores and Tensor Cores, rKernel adapts to different operational requirements. At L0, there’s an option for loading data either from Global memory or Shared Memory to Registers, and similarly, storing data either back to Global memory or to Shared Memory. This flexibility is crucial for optimizing memory utilization in the highly parallel environments of GPUs. At the L1 level, similar to CPUs, the abstraction can facilitate data transfer from Global to Shared Memory or no operation, enabling efficient resource management. The L2 level focuses on Grid-level operations, enhancing the scalability across the GPU’s multiple streaming multiprocessor (SM) architecture.

rKernel achieves a hierarchical abstraction of execution patterns that applies universally to various tensor programs and hardware types. This approach provides a tailored strategy space for each hardware hierarchy level and enables universal optimizations for dynamic-shape tensor programs.
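The recursive Load, rKernel(N-1), Store pattern can be illustrated with a small Python sketch. This is a simplified illustration and not Vortex's implementation: the function name `run_rkernel`, the `iters` list, and the trace strings are hypothetical, only temporal loops are modeled, and issuing one store after the temporal loop is an assumption chosen to match the shape of the analytical cost model in §5.2.

```python
def run_rkernel(level, iters, trace):
    """Record the stage order of a recursive rKernel execution.

    iters[level] is the (hypothetical) temporal iteration count at that level.
    Below level 0 the recursion bottoms out in the raw compute instruction.
    """
    if level < 0:
        trace.append("compute")
        return
    for _ in range(iters[level]):
        trace.append(f"L{level}.load")       # Load stage of this level
        run_rkernel(level - 1, iters, trace)  # rKernel(N-1) stage
    trace.append(f"L{level}.store")          # Store stage of this level


trace = []
run_rkernel(1, [2, 1], trace)  # two iterations at L0, one at L1
print(trace)
```

Each level only decides what its own load and store do; everything in between is delegated to the level below, which is what lets the same template describe both the CPU and GPU rows of Table 1.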

5. Detailed Designs at Each Level

In this section, we thoroughly explore Vortex’s detailed designs at each hierarchical level. Initially, Vortex utilizes an effective approach to generate hardware-aware candidates at each level. Following this, a hybrid analytical-empirical analyzer is deployed to discern the high-performance implementation for each candidate.

5.1. Bottom-up Hardware-aware Candidates Generator

This subsection introduces an innovative bottom-up method for generating candidates tailored to align with top-down recursive abstraction. This method becomes particularly crucial when execution parameter information is lacking, posing a significant challenge in every hierarchical layer for identifying suitable shape candidates for micro-kernel creation. Our approach centers on two key processes:

Figure 8. Comparison of different padding patterns and their corresponding waste scenarios.

Firstly, we utilize the hardware’s parameter information to determine constraints for the candidate range. As highlighted in §2.3, our empirical studies have identified a notable decrease in hardware execution efficiency when the utilization at any hardware level is extremely low or high. This insight allows us to deduce a feasible range for candidate shapes, based on hardware utilization metrics.

Secondly, we follow a key design principle: ensuring the shape size of candidates in an upper layer is an integer multiple of the shape size in the lower layer. This approach aims to minimize padding loss during the construction of micro-kernels. As shown in Figure 8, if the sizes at one level are not multiples of those below, it results in more padding losses and inefficiencies at higher levels. Conversely, constructing candidates as integer multiples from one level to the next predominantly confines padding loss only to the outermost execution level, adhering to runtime requirements.
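The effect of the integer-multiple principle can be checked with a short calculation. The sketch below is a one-dimensional simplification, not code from Vortex; the function name `padded_work` and the tile sizes are hypothetical. It counts how many elements are actually processed when each tile is assembled from whole tiles of the level below.

```python
import math


def padded_work(n, tiles):
    """Elements processed along one dimension of runtime extent n.

    tiles lists the tile size per level, lowest level first, e.g. [16, 48].
    Each level-k tile is assembled from whole level-(k-1) tiles, so a size
    that is not a multiple of the level below is rounded up inside every tile.
    """
    effective = tiles[0]  # work done by one tile, including inner padding
    logical = tiles[0]    # extent the tile nominally covers
    for t in tiles[1:]:
        effective = math.ceil(t / logical) * effective
        logical = t
    return math.ceil(n / logical) * effective


aligned = padded_work(96, [16, 48])     # 48 is a multiple of 16
misaligned = padded_work(96, [16, 40])  # 40 is not a multiple of 16
print(aligned, misaligned)
```

With the aligned sizes, padding can only appear at the outermost boundary (for example when the runtime extent is 100 rather than 96), whereas the misaligned size pays inner padding in every upper-level tile regardless of the runtime shape.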

Algorithm 2 Candidates Generation Algorithm.
function GenerateCandidatesForLayer(L)
    hwInfo ← GetHardwareInfo(L)
    // Determine by hardware resource limitation
    cands ← InitCands(hwInfo)
    if L = 0 then
        cands ← FilterByISA(cands)
    else
        prevCands ← GetPrevLayerCands(L − 1)
        cands ← FilterByMultiples(cands, prevCands)
    return cands
end function
function FilterByISA(cands)
    filtered ← ∅
    for cand ∈ cands do
        if IsCompatible(cand) then
            filtered.add(cand)
    return filtered
end function
function FilterByMultiples(cands, prevCands)
    filtered ← ∅
    map ← an empty map
    for prev ∈ prevCands do
        multiples ← GenerateMultiples(prev, cands)
        for multiple ∈ multiples do
            filtered.add(multiple)
            map[multiple].append(prev)
    return filtered, map
end function

The overall process is detailed in Algorithm 2. The core function, GenerateCandidatesForLayer, operates differently depending on the hierarchical layer it addresses. It begins by acquiring the hardware specifications for a given layer (denoted as L) through the GetHardwareInfo and InitCands functions, which retrieve the hardware constraints that bound the parameter space. Candidate feasibility is validated by checking memory usage against layer-specific limits and accounting for hardware constraints, such as a GPU’s maximum of 1024 threads per block. For the initial layer (L = 0), the function employs FilterByISA to refine the candidate set according to the hardware platform’s Instruction Set Architecture (ISA) compatibility. For instance, on Intel CPUs, the FilterByISA function considers the granularity constraints of AVX512 (intel_xeon_whitepaper, ). Similarly, on GPUs with Tensor Cores, the function assesses the constraints imposed by the Matrix Multiply-Accumulate (MMA) instruction (a100_whitepaper, ). These considerations ensure alignment with hardware capabilities.

Our algorithm utilizes FilterByMultiples, a method inspired by the classic sieve approach (sieve, ), to filter candidates. This function iterates over the previous layer’s candidates (prevCands) and, via GenerateMultiples, generates their multiples that fall within the current layer’s candidate range. This approach ensures comprehensive exploration of viable parameter sets and maintains filtering efficiency. Additionally, we employ a mapping mechanism that uses a table to record the links between each candidate in the current layer and its possible matching candidates in the previous layer, which is crucial for the subsequent analysis module.
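The flow of Algorithm 2 can be sketched in Python for two-dimensional tile shapes. This is a minimal sketch under stated assumptions, not the paper's code: the budgets, the dimension choices, and the ISA granule of (16, 8) are hypothetical placeholders for what GetHardwareInfo would supply, and IsCompatible is reduced to a divisibility check.

```python
from itertools import product


def init_cands(dim_choices, max_elems):
    """All (m, n) shapes whose footprint fits the level's (assumed) budget."""
    return [(m, n) for m, n in product(dim_choices, repeat=2) if m * n <= max_elems]


def filter_by_isa(cands, granule):
    """Keep shapes whose dimensions are multiples of the ISA granule."""
    gm, gn = granule
    return [c for c in cands if c[0] % gm == 0 and c[1] % gn == 0]


def filter_by_multiples(cands, prev_cands):
    """Keep shapes that are integer multiples of a previous-layer shape.

    Also returns the table linking each kept shape to every previous-layer
    shape it can be built from, which the analyzer consumes later.
    """
    filtered, links = [], {}
    for prev in prev_cands:
        for c in cands:  # multiples of prev restricted to the current range
            if c[0] % prev[0] == 0 and c[1] % prev[1] == 0:
                if c not in links:
                    filtered.append(c)
                links.setdefault(c, []).append(prev)
    return filtered, links


# Level 0: hardware budget, then ISA filtering (hypothetical numbers).
l0 = filter_by_isa(init_cands([8, 16, 24, 32], 256), granule=(16, 8))
# Level 1: hardware budget, then multiples of level-0 candidates.
l1, links = filter_by_multiples(init_cands([16, 24, 32, 64], 2048), l0)
print(l0)
print(links[(32, 24)])
```

Shapes such as (24, 16) pass the level-1 budget check but are dropped because 24 is not a multiple of any level-0 row size, which is the pruning effect the sieve-style filter is meant to deliver.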

5.2. Hybrid Analytical-Empirical Analyzer

In this subsection, we present a novel analysis method that combines analytical and empirical methodologies. This approach is designed to balance the trade-off between efficiency and accuracy in strategy analysis.

Goal of the Analyzer.

Our analyzer aims to identify suitable candidates at each layer. Using Algorithm 2, we construct a map that describes the connections between candidates in connected layers. However, a single candidate can map to multiple lower-level candidates in this table. Each mapping corresponds to a unique implementation of the scheduling strategy, necessitating a thorough evaluation of the performance variability among these strategies. This task of identifying the optimal strategy $c^{*}$ from a set $S$ at a given hardware level $L$ can be defined as an optimization problem:

$$c^{*} = \underset{s \in S}{\arg\min}\, \operatorname{Cost}(s, L) \tag{1}$$

The analyzer’s primary objective is to identify a cost-effective and efficient strategy that aligns with the requirements at both offline and runtime stages. Concurrently, it is crucial to be aware of the time overhead associated with cost analysis, since excessive overhead, particularly during runtime, is clearly unacceptable. This necessitates a sophisticated design approach, blending empirical profiling on actual hardware with comprehensive theoretical analysis.

Figure 9. An Illustration of execution abstraction and associated analytical model.

Analytical Cost Model.

We build an analytical cost model, a theoretical framework to predict the costs of different candidate implementations. The analytical cost model estimates the execution time based on algorithmic complexity and hardware specifications. As shown in Figure 9, two determinants are central to this model: $Spatial$, representing the amplification factor due to parallel loops executed across different hardware units, and $Temporal$, relating to the execution process of serially dependent loop operations.

The temporal execution cost, $T_{temporal}$, is carefully crafted to encapsulate the intricacies of pipeline execution within serial loops. It is quantified as follows:

$$T_{temporal} = T_{Load} + \bigl(\text{sizeof}(\text{TemporalLoop}) - 1\bigr) \times \max\bigl(T_{Load}, Cost_{L-1}\bigr) + Cost_{L-1} + T_{Store} \tag{2}$$

In this equation, $T_{Load}$ and $T_{Store}$ are the costs of loading and storing data, respectively. Each is calculated as the amount of data moved at the current layer divided by the memory bandwidth at that layer. $Cost_{L-1}$ represents the cost of the micro-kernel computation at the lower level. The equation accounts for the loop’s iteration count and compares the data load time against the execution time of the lower-level kernel, thereby modeling the pipeline’s potential latency bottlenecks.

In parallel processing, the cost is modulated by the parallel loop’s scale relative to the hardware’s unit capacity:

$$F_{parallel} = \left\lceil \frac{\text{sizeof}(\text{ParallelLoop})}{|\text{HardwareUnit}|} \right\rceil \tag{3}$$

This quantifies the hardware’s parallel processing ability, scaling execution cost with the parallel loop size. The overall strategy cost at layer L, $Cost_{L}$, is the product of the parallel execution cost and the temporal execution cost, capturing the total time required for the execution of a computational task across various layers:

$$Cost_{L} = F_{parallel} \times T_{temporal} \tag{4}$$

Our analytical cost model faces a recursive complexity, needing $Cost_{L-1}$ for each level L. Furthermore, hardware optimizations such as instruction pipelining and out-of-order execution can lead to substantial inaccuracies in the cost model, posing a challenge in ensuring the precision of the analysis module (micro23_path, ).
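Equations (2) to (4) can be evaluated bottom-up, as the following Python sketch shows. It is a minimal illustration rather than Vortex's analyzer: the function names, the dictionary keys, and all numeric values are hypothetical, and `base_cost` stands in for the cost of the lowest-level compute, which the paper obtains by profiling.

```python
import math


def temporal_cost(t_load, t_store, cost_lower, n_iters):
    """Eq. (2): first load, (n - 1) overlapped steps, last kernel, final store."""
    return t_load + (n_iters - 1) * max(t_load, cost_lower) + cost_lower + t_store


def parallel_factor(parallel_size, hw_units):
    """Eq. (3): number of rounds needed to cover the parallel loop."""
    return math.ceil(parallel_size / hw_units)


def level_cost(levels, base_cost):
    """Fold Eq. (4) over the levels, lowest level first."""
    cost = base_cost
    for lv in levels:
        t = temporal_cost(lv["t_load"], lv["t_store"], cost, lv["n_iters"])
        cost = parallel_factor(lv["parallel_size"], lv["hw_units"]) * t
    return cost


levels = [
    {"t_load": 2, "t_store": 1, "n_iters": 4, "parallel_size": 8, "hw_units": 4},
    {"t_load": 10, "t_store": 5, "n_iters": 2, "parallel_size": 216, "hw_units": 108},
]
print(level_cost(levels, base_cost=3))
```

In the first level the lower-level cost (3) exceeds the load time (2), so the overlapped steps are compute-bound; swapping those two values would make the same level load-bound, which is the bottleneck switch that the max term in Eq. (2) captures.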

Hybrid Analyzer Design.

Two key observations guide our design. Firstly, the bottom-up multi-level approach to kernel construction increases the number of candidates at each higher layer. Secondly, unpredictable hardware-related scheduling, such as out-of-order execution, is predominantly concentrated at the lower layers. These insights led to the development of our hybrid analytical-empirical analyzer.

The analyzer conducts empirical profiling on CPUs at level L0 and on GPUs at both L0 and L1 levels. For higher levels, it utilizes an analytical cost model. This hybrid system synergizes the efficiency of the analytical approach with the accuracy of empirical data, with the latter offering real-time performance insights to augment the analytical predictions. Importantly, all runtime analyses are conducted using the analytical model, ensuring a streamlined and low-overhead performance evaluation. The effectiveness of this hybrid methodology, in terms of performance and runtime overhead, is further investigated in §7.4. Overall, this hybrid approach is especially valuable in complex scenarios where theoretical models may be insufficient, ensuring both effectiveness and precision in the strategy analysis process.
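The split between the two analyzers can be expressed as a small dispatch routine. The sketch below is illustrative only: the names `analyzer_for` and `candidate_cost`, the `profiled` lookup table, and the callback signature are assumptions. It encodes the level assignment stated above, with profiled costs measured offline and merely looked up afterwards.

```python
# Levels that are profiled on real hardware, per the description above.
EMPIRICAL_LEVELS = {"cpu": {0}, "gpu": {0, 1}}


def analyzer_for(hardware, level):
    """Return which analyzer handles a given hardware level."""
    return "empirical" if level in EMPIRICAL_LEVELS[hardware] else "analytical"


def candidate_cost(hardware, level, cand, profiled, analytical_fn):
    """Look up an offline-profiled cost, or fall back to the analytical model."""
    if analyzer_for(hardware, level) == "empirical":
        return profiled[(level, cand)]
    return analytical_fn(level, cand)


profiled = {(0, (16, 8)): 5.0}  # hypothetical offline measurement
print(candidate_cost("gpu", 0, (16, 8), profiled, lambda lv, c: 0.0))
print(analyzer_for("cpu", 1))
```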

6. Implementation

In this section, we detail the implementation of Vortex. We focus on demonstrating the code generation method and the scheduling process during the runtime phase.

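    enum ANALYZE_TYPE {empirical, analytical};
    // PL: parallel loop, TSL: temporal spatial loop, TRL: temporal reduction loop
    enum LOOP_TYPE {PL, TSL, TRL};
    class axis;
    class layer_meta_info {
      int layer_depth;                 // position in the hierarchy (0 = lowest)
      map<axis, LOOP_TYPE> loop_type;  // loop set assigned to each axis
      ANALYZE_TYPE analyzer;           // profiling-based or analytical analysis

      func* load_func;
      func* store_func;
      func* compute_func;
    };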
Figure 10. The definition of rKernel.

6.1. Code Generation

Despite the contrasting architectures of GPUs and CPUs, Vortex’s abstraction consistently represents both, as exemplified in Table 1. One of the key aspects contributing to this universal representation is the definition of data structures.

The rKernel data structure, as illustrated in Figure 10, is a cornerstone of Vortex. At its core lies the layer_meta_info class, which describes the optimization strategy of each hierarchical layer. Within this class, the layer_depth attribute determines the layer’s position within the hierarchical structure, which is a crucial factor for the recursive optimization process. The map<axis, LOOP_TYPE> assigns each loop to its type: Parallel (PL), Temporal Spatial (TSL), or Temporal Reduction (TRL). This classification lets the loop structure follow the characteristics of each tensor program. The ANALYZE_TYPE enum, encompassing empirical and analytical options, selects the analysis method used for the layer. Finally, the function pointers (load_func, store_func, and compute_func) implement the three stages of computation. Together, these fields capture the elements needed to optimize dynamic-shape tensor programs.

    // Input IR, we omit block and grid for brevity
    layer_meta_info gemm_tc_warp;
    gemm_tc_warp.set(
      layer_depth = 0,
      loop_type = {"k0":TRL,"m0":TSL,"n0":TSL},
      analyzer = empirical,  // profiling-based analysis at L0
      load_func = load(shared_to_reg),
      store_func = store(reg_to_shared),
      compute_func = asm("mma.sync.m16n8k16")
    );

    // Output Generated Kernel
    dim3 grid(M/m_tile_grid, N/n_tile_grid);
    gemm_tensor_core_grid<<<grid, thread>>>(...);

    __global__ void gemm_tensor_core_grid(...) {
      __shared__ half A_buf[], B_buf[], C_buf[];
      for (k2 = 0; k2 < K; k2+=k_tile_grid)
        for (k1 = 0; k1 < k2; k1+=k_tile_block)
          asm("ld.global"); // Load A/B to A_buf/B_buf
          for (m0 = 0; m0 < m1; m0+=m_tile_warp)
            for (n0 = 0; n0 < n1; n0+=n_tile_warp)
              asm("ld.shared"); // Load A_buf/B_buf to A_reg/B_reg
              C_frag = 0;
              for (k0 = 0; k0 < k1; k0+=k_tile_warp)
                asm("mma.sync.m16n8k16");
                C_frag += ...
              asm("st.shared"); // Store C_reg to C_buf
      asm("st.global"); // Store C_buf to C
    }
Figure 11. An illustration of GPU GEMM code generation.

rKernel serves as a recursive template for dynamic tensor programs. For a fixed hardware target and a particular operator, users need to initialize the layer_meta_info of each level. The development effort required for this task is minimal because, for both CPU and GPU, we set the hierarchy depth to three levels. Furthermore, the dynamic parameters of the template enable consistent computational support for various runtime shapes. Owing to this versatility, development costs are amortized across various computing scenarios. We built Vortex on top of TVM (tvm, ). We use the ‘tensorize’ primitive to implement the load, store, and compute functions, and rely on TVM’s code generation capabilities to target both CPU and GPU platforms. Figure 11 illustrates a representative GEMM kernel code generation for GPUs.

6.2. Integration of Offline and Runtime

The integration of offline and runtime components in Vortex involves several key steps. During runtime, Vortex employs analytical cost models to estimate the execution costs of different candidate solutions given the runtime shape information. Subsequently, Vortex selects the micro-kernel candidates with the lowest estimated execution cost.

Notably, our selection process also adapts to the hardware backend. For instance, on GPUs, the larger granularity of the MMA instruction (a100_whitepaper, ) can introduce substantial padding on Tensor Cores for some shapes. We therefore provide implementations for both CUDA cores and Tensor cores and choose the appropriate backend adaptively based on the runtime input shapes, further optimizing execution efficiency. The performance benefits of this adaptive strategy are explored in §7.4. Additionally, by combining the computational shape of the selected micro-kernels with the runtime shape, we compute runtime-specific details such as grid configurations. This approach keeps the framework general while supporting efficient and adaptable runtime operations.
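The runtime step described above, picking the cheapest micro-kernel for a concrete shape and deriving the grid configuration, can be sketched as follows. This is a simplified illustration, not Vortex's scheduler: the function names, the per-tile costs, and the unit count are hypothetical, and the estimate applies only the parallel-round factor of Eq. (3) to a fixed per-tile cost.

```python
import math


def grid_dims(M, N, tile):
    """Number of tiles along each dimension, rounding up to cover the shape."""
    tm, tn = tile
    return (math.ceil(M / tm), math.ceil(N / tn))


def select_micro_kernel(M, N, tile_costs, num_units):
    """Pick the tile with the lowest estimated cost for a runtime (M, N).

    tile_costs maps a tile shape to its per-tile cost (hypothetical values).
    """
    def estimated(tile):
        gm, gn = grid_dims(M, N, tile)
        rounds = math.ceil(gm * gn / num_units)  # parallel rounds, as in Eq. (3)
        return rounds * tile_costs[tile]

    best = min(tile_costs, key=estimated)
    return best, grid_dims(M, N, best)


tile_costs = {(64, 64): 10, (128, 128): 30}
print(select_micro_kernel(100, 256, tile_costs, num_units=4))
print(select_micro_kernel(512, 512, tile_costs, num_units=4))
```

The two calls select different tiles: the small shape favors the 64x64 tile because the larger one leaves most of its padded area unused, while the large shape favors 128x128 because it needs fewer rounds.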

Figure 12. Performance results at operator-level. CPU results are normalized to oneDNN; GPU results are normalized to cuBLAS for GEMM and cuDNN for convolution, respectively.

7. Evaluation

7.1. Experimental Setup

Platforms.

We evaluate Vortex on two representative platforms: Intel 8255c CPU (intel_xeon_whitepaper, ) and Nvidia Ampere A100 GPU (a100_whitepaper, ). For GPUs, we conduct evaluations using two computational modes: Tensor Core Enabled mode with half-precision floating-point (FP16) data type, and Cuda Core Only mode with single-precision floating-point (FP32) data type. Table 2 details our experimental platforms.

Table 2. Hardware specifications.

| Hardware | Nvidia GPU | Intel CPU |
|----------|------------|-----------|
| Version | Ampere A100 (108 SMs) | Xeon 8255c (48 Cores) |
| Storage | Global: 40G; L2 Cache: 40M; Shared Memory: 48K/SM; Reg: 256K/SM | Global: 250.53G; L3 Cache: 35.75M; L2 Cache: 1M/Core; L1 Cache: 32K/Core; Reg: 2K/Core |
| Peak Flops | CUDA Core: 19.5 TFlops; Tensor Core: 312 TFlops | 7344 GFlops |
| OS | CentOS Linux 8.4.2105 | CentOS Linux 8.6.2205 |
| Software | Driver Version: 450.156.00; CUDA Version: 11.8; cuDNN: 8.9.7.29 | GCC: 10.2.0; LLVM: 15.0.3 |

Benchmarks.

Our evaluation encompasses two benchmark categories: operator-level and model-level. At the operator level, we gather 1197 different operator configurations from DeepBench (deepbench, ) and real-world models, covering various tasks like Transformer, CNN, and GNN (see Table 3 and Table 4 for details). These configurations demonstrate variability in all dimensions and possess a wide dynamic range, making them highly representative. At the model level, we assess the performance of three Transformer-based language models (Bert (bert, ), Bert-large (bert, ), GPT2 (gpt2, )) and three computer vision models (AlexNet (alexnet, ), ResNet (resnet, ), GoogleNet (googleNet, )) for end-to-end dynamic-shape neural network evaluation. To mirror real-world scenarios, we generate 17 sequence lengths ranging from 1 to 476 for language models. For CNN models, we configure batch sizes beginning with 1, and then incrementally step from 4 to 64 in multiples of 4.
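As a quick consistency check on the benchmark description, the following Python lines enumerate the CNN batch sizes as described and sum the per-category case counts listed in Table 3 and Table 4. The variable names are illustrative; the sequence lengths are not reproduced because their spacing is not stated.

```python
# Batch sizes: 1, then 4 to 64 in steps of 4.
batch_sizes = [1] + list(range(4, 65, 4))

# Case counts per category, copied from Table 3 (GEMM) and Table 4 (convolution).
gemm_cases = {"DeepBench": 84, "Transformer": 192, "CNN": 80, "GNN": 150}
conv_cases = {"DeepBench": 107, "CNN": 584}

total_cases = sum(gemm_cases.values()) + sum(conv_cases.values())
print(len(batch_sizes), total_cases)
```

The total matches the 1197 operator configurations stated above.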

Table 3. Benchmarked GEMM with dynamic shapes.

| Category | M | N | K | #Cases |
|----------|---|---|---|--------|
| DeepBench (deepbench, ) | [35, 8448] | [1, 6000] | [128, 500000] | 84 |
| Transformer (bert, ; gpt2, ; huggingface, ) | [1, 476] | [768, 4096] | [768, 4096] | 192 |
| CNN (alexnet, ; vgg, ; googleNet, ; resnet, ) | [1, 128] | [80, 25088] | [10, 4096] | 80 |
| GNN (GCN, ; GAT, ; GNNAdvisor, ; pyg, ) | [2708, 1888584] | [2, 121] | [8, 3703] | 150 |

Table 4. Benchmarked Convolution with dynamic shapes.

| Category | BS | Fmap | Filter | Cin | Cout | #Cases |
|----------|----|------|--------|-----|------|--------|
| DeepBench (deepbench, ) | [1, 16] | [7, 700] | [1, 20] | [1, 2048] | [16, 2048] | 107 |
| CNN (alexnet, ; resnet, ; vgg, ; googleNet, ) | [1, 64] | [4, 768] | [1, 11] | [3, 832] | [16, 512] | 584 |

Baselines.

We select various SOTA baselines, divided into two principal categories. The first category encompasses vendor-provided libraries, which are specialized libraries frequently utilized in neural network frameworks (pytorch, ). For NVIDIA GPUs, our evaluation utilizes cuBLAS (cublas, ) for GEMM and cuDNN (cudnn, ) for convolution. We also evaluate CUTLASS (cutlass, ) for both tasks. For Intel CPUs, we compare GEMM and convolution performance with oneDNN (onednn, ) and ONNX Runtime (onnxruntime, ). The second category is dynamic-shape compilers, for which we select DietCode (dietcode, ), the existing leading dynamic-shape tensor program compiler.

Table 5. Summary of operator-level speedups for Vortex compared to various baselines across different setups.

| Hardware Config | Operator | Baseline | Cases with Speedup > 1 (%) | Average Speedup |
|-----------------|----------|----------|----------------------------|-----------------|
| CPU | GEMM | oneDNN | 77.3% | 1.82× |
| CPU | GEMM | ONNX Runtime | 91.5% | 4.38× |
| CPU | Conv. | oneDNN | 85.8% | 2.09× |
| CPU | Conv. | ONNX Runtime | 99.1% | 5.37× |
| GPU (Tensor Core Enabled) | GEMM | cuBLAS | 83.7% | 1.43× |
| GPU (Tensor Core Enabled) | GEMM | CUTLASS | 94.2% | 2.62× |
| GPU (Tensor Core Enabled) | Conv. | cuDNN | 89.9% | 2.32× |
| GPU (Tensor Core Enabled) | Conv. | CUTLASS | 80.5% | 1.70× |
| GPU (Cuda Core Only) | GEMM | cuBLAS | 78.3% | 1.63× |
| GPU (Cuda Core Only) | GEMM | CUTLASS | 99.8% | 7.65× |
| GPU (Cuda Core Only) | GEMM | DietCode | 94.1% | 2.67× |
| GPU (Cuda Core Only) | Conv. | cuDNN | 91.1% | 1.53× |
| GPU (Cuda Core Only) | Conv. | CUTLASS | 87.8% | 2.88× |
| GPU (Cuda Core Only) | Conv. | DietCode | 92.5% | 3.39× |

7.2. Dynamic-Shape Tensor Program

In this subsection, we evaluate single dynamic-shape tensor programs, specifically GEMM and Convolution operators on CPU and GPU platforms. Note that DietCode is limited to GPU CUDA Cores and requires dynamic shape samples to be determined in advance. We use the parameters from Table 3 and Table 4 as sample sets for DietCode’s offline compilation process. Importantly, the latency measurement for Vortex encompasses both the operator execution time on the hardware platforms and the runtime overhead from Vortex’s cost model.

The evaluation results, shown in Figure 12, demonstrate Vortex’s performance across various configurations. The x-axis outlines the number of floating-point operations (FLOPs) in the workloads, including all GEMM test cases from Table 3 and convolution from Table 4, while the y-axis represents the speedups. Vortex consistently achieves speedups across different hardware setups, operators, and tensor shapes. To further quantify the effectiveness of Vortex, we emphasize two metrics: the percentage of cases where Vortex shows performance improvement (defined as cases where the speedup is greater than one) and the average speedup across all cases. Table 5 presents these detailed results. Overall, this comprehensive evaluation confirms that Vortex provides robust and efficient acceleration results.

Table 6. Speedups of Vortex over DietCode for GEMM on GPU across different runtime ranges of the M dimension.
96 Test Cases: M ∈ [1, 384], N = 768, K = 2304

| Input Range for M | [0, 128) | [128, 256) | [256, 384) |
|-------------------|----------|------------|------------|
| Avg. Speedups | 2.8x | 1.4x | 2.1x |

Additionally, we examine how DietCode’s reliance on shape samples affects its performance. As shown in Table 6, we treat the M dimension as dynamic in DietCode, sampling and compiling it only within the range [128, 256). DietCode’s performance declines when the runtime shape falls outside this range, highlighting its limited flexibility.

7.3. Dynamic-Shape Network

Figure 13. Performance results at model-level. The x-axis represents sequence length for LLM and batch size for CNN, and the y-axis quantifies the speedups achieved relative to the baselines.

Figure 13 presents a comprehensive evaluation of the end-to-end performance of classic language models and CNN models. We compare Vortex against existing state-of-the-art solutions, using oneDNN and cuBLAS/cuDNN as the normalized baselines for CPU and GPU evaluations, respectively.

Vortex demonstrates significant performance improvements across various tasks. Vortex achieves notable average speedups of 2.91× for BERT, 2.63× for BERT-Large, and 2.94× for GPT-2, across different baselines and hardware configurations. For CNNs, Vortex achieves average speedups of 2.01× for AlexNet, 2.13× for ResNet, and 3.24× for GoogleNet. Furthermore, Vortex demonstrates considerable performance improvements over multiple existing solutions. Specifically, Vortex achieves average speedups of 1.73×, 4.26×, 1.43×, 1.71×, 3.32×, and 4.13× over oneDNN, ONNX Runtime, cuBLAS, cuDNN, CUTLASS, and DietCode, respectively.

The evaluation results underscore that the performance of Vortex varies with different hardware and models, reflecting the distinct execution characteristics inherent to each environment. Notably, Vortex consistently outperforms the baseline across a wide range of input shapes and model types, demonstrating its adaptability and efficiency in real-world dynamic-shape DNN computation.

7.4. Additional Analysis

Offline Overhead Analysis.

We first analyze the offline period’s overhead. For diverse tensor shapes, Vortex requires only a single compilation process, substantially reducing compilation overhead. For instance, in the GEMM evaluation across three computation modes (CPU, GPU with Tensor Core Enabled mode, and GPU with Cuda Core Only), Vortex employs the candidates generation algorithm (§5.1) to yield 17731, 392, and 2332 distinct candidates, respectively. The respective time overheads are 29.3s, 92.2s, and 529.6s. In contrast, DietCode incurs a tuning duration of 25 hours in the Cuda Core Only mode, using configurations in Table 3 as the sample set. Vortex thus achieves a 174× enhancement in compilation efficiency compared to DietCode. This reduced overhead is attributable to the efficient hardware-aware pruning of candidates and the utilization of analytical cost analysis, thereby obviating the need for extensive traditional profiling.

Runtime Overhead Analysis.

The primary source of runtime overhead in Vortex is the computation required by the cost model. Figure 14 breaks down the execution of Vortex on GPUs into the runtime scheduling cost and the execution time of the final tensor programs for different shapes. The GEMM test is conducted with M/N/K values ranging from 64 to 4096. Notably, the impact of this runtime overhead is slight across hardware platforms, demonstrating the runtime efficiency of Vortex.
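The runtime scheduling step described above can be pictured as a cost-model lookup over precompiled candidates. The following is a minimal sketch under stated assumptions, not the cost model Vortex actually uses: the `Candidate` type, the `estimated_cost` formula, the memory weight of 32, and the 108 parallel units are all hypothetical values chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A precompiled GEMM micro-kernel identified by its tile sizes (hypothetical)."""
    tile_m: int
    tile_n: int
    tile_k: int


def estimated_cost(c, m, n, k, num_units=108, mem_weight=32):
    """Toy analytical cost (an assumption, not the paper's model).

    Cost = number of waves * (padded compute + weighted memory traffic) per tile.
    """
    tiles = math.ceil(m / c.tile_m) * math.ceil(n / c.tile_n)
    waves = math.ceil(tiles / num_units)
    k_padded = math.ceil(k / c.tile_k) * c.tile_k
    compute = c.tile_m * c.tile_n * k_padded
    memory = mem_weight * (c.tile_m + c.tile_n) * k_padded
    return waves * (compute + memory)


def select_kernel(candidates, m, n, k):
    """Pick the candidate with the lowest estimated cost for this shape."""
    return min(candidates, key=lambda c: estimated_cost(c, m, n, k))


candidates = [Candidate(16, 16, 32), Candidate(128, 128, 32)]
small = select_kernel(candidates, 64, 64, 64)
large = select_kernel(candidates, 4096, 4096, 4096)
print(small, large)
```

Because the selection only evaluates a closed-form expression per candidate, its cost is independent of the GEMM size, which is consistent with the scheduling cost staying small next to kernel execution time in Figure 14.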

Figure 14. Runtime overhead breakdown of Vortex in GEMM. The x-axis represents various M/N/K parameters, and the y-axis represents execution time.

Figure 15. Performance comparison in GPU Tensor Core Enabled mode. The x-axis represents the FLOPS of workloads, and the y-axis shows performance normalized to Vortex-Oracle. Dashed lines show the normalized average performance across test cases.

Hierarchical Kernel Construction Evaluation.

To validate the effectiveness of Vortex's dynamic and hierarchical kernel construction methodology, we compare its default configuration against three variants: Vortex-Oracle, which uses Vortex as a static-shape compiler with a profiling-based analyzer for all layers in every test case from Table 3; Vortex-Static1, which maintains dynamic strategies at the L1 layer and adopts a static configuration for the L0 layer, selecting the most frequently optimal strategy; and Vortex-Static2, which disables dynamic strategy selection at both the L0 and L1 layers, applying the same fixed strategy as Vortex-Static1. As shown in Figure 15, Vortex achieves an average of 94.7% of the performance of Vortex-Oracle, whereas Vortex-Static1 and Vortex-Static2 achieve 60.7% and 49.5%, respectively. These results underscore the effectiveness of Vortex's dynamic code generation and the importance of maintaining dynamic strategies across the levels of the hardware hierarchy.

Table 7. Comparison of Vortex's default configuration with the modified analyzer setup. 'E' denotes layers using the empirical method, while unlabeled layers use the analytical method.

| HW. | Analyzer Config. | Offline Overhead | Execution Performance |
| CPU | Default (E: L0) | 29.3 sec | 1× |
| CPU | Changed (E: L0, L1) | 33.0 hour | 1.04× |
| GPU (Tensor Core Enabled) | Default (E: L0, L1) | 92.2 sec | 1× |
| GPU (Tensor Core Enabled) | Changed (E: L0) | 19.4 sec | 0.84× |
| GPU (Cuda Core Only) | Default (E: L0, L1) | 529.6 sec | 1× |
| GPU (Cuda Core Only) | Changed (E: L0) | 39.1 sec | 0.63× |

Hybrid Analyzer Evaluation.

We conduct a study to assess the effectiveness of the hybrid analyzer in Vortex. We compare the default Vortex with a modified configuration, as detailed in Table 7, and collect the offline compilation overhead and the average runtime performance for all cases in Table 3. On the CPU, the modified configuration raises the offline overhead sharply, from 29.3 seconds to 33.0 hours, for only a marginal performance gain (1.04×). On the GPU, the modification shortens compilation but significantly reduces performance, to 0.84× with Tensor Cores enabled and 0.63× in Cuda Core Only mode. These findings support the choice of the default configuration for Vortex.
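To make the trade-off in Table 7 concrete, the short calculation below derives, for each hardware target, how the modified configuration changes offline overhead and execution performance relative to the default. The numbers are copied from Table 7; the dictionary layout and the `tradeoff` helper are illustrative names introduced here.

```python
# (offline overhead in seconds, execution performance relative to default),
# values taken from Table 7; 33.0 hours is converted to seconds.
table7 = {
    "CPU": {"default": (29.3, 1.00), "changed": (33.0 * 3600, 1.04)},
    "GPU (Tensor Core Enabled)": {"default": (92.2, 1.00), "changed": (19.4, 0.84)},
    "GPU (Cuda Core Only)": {"default": (529.6, 1.00), "changed": (39.1, 0.63)},
}


def tradeoff(hw):
    """Return (overhead ratio, performance ratio) of changed vs. default."""
    default_overhead, default_perf = table7[hw]["default"]
    changed_overhead, changed_perf = table7[hw]["changed"]
    return changed_overhead / default_overhead, changed_perf / default_perf


for hw in table7:
    overhead_ratio, perf_ratio = tradeoff(hw)
    print(f"{hw}: overhead x{overhead_ratio:.2f}, performance x{perf_ratio:.2f}")
```

On the CPU, the changed configuration costs roughly four thousand times more offline time for a 4% gain, while on the GPU the default spends several times more offline time than the changed configuration but avoids a 16% to 37% performance loss.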

Dynamic Hardware Adaptation.

We investigate the dynamic hardware adaptability of Vortex on GPUs using FP16 for the GEMM operator. We test N values of 1024, 2048, and 4096, with K fixed at 1024 and M varied from 1 to 16, across three settings: Cuda Core Only, Tensor Core Only, and the default Adaptive. The results, presented in Figure 16, reveal opportunities for dynamic hardware utilization and clarify hardware selection criteria for specific scenarios. Vortex selects the better-performing hardware unit for each shape, achieving performance gains of up to 48% and 54% over the fixed CUDA Core and Tensor Core settings, respectively, which demonstrates the effectiveness of its hardware-adaptive scheduler.
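A hardware-adaptive choice of this kind can be sketched as picking, per shape, the backend with the lower predicted time. The sketch below is an illustration only: the two time estimators, the 8× throughput ratio, and the fragment height of 16 are assumptions made here, not measurements or the scheduler described in the paper. It models one plausible reason a small M can favor CUDA cores, namely that a fixed-size Tensor Core fragment forces padding along M.

```python
import math

FRAGMENT_M = 16  # assumed Tensor Core fragment height for FP16 (illustrative)


def cuda_core_time(m, n, k, throughput=1.0):
    """Toy estimate: work proportional to the exact M*N*K (assumption)."""
    return m * n * k / throughput


def tensor_core_time(m, n, k, throughput=8.0):
    """Toy estimate: M is padded up to a multiple of the fragment height."""
    padded_m = math.ceil(m / FRAGMENT_M) * FRAGMENT_M
    return padded_m * n * k / throughput


def choose_backend(m, n, k):
    """Return the backend name with the lowest predicted time."""
    times = {
        "cuda_core": cuda_core_time(m, n, k),
        "tensor_core": tensor_core_time(m, n, k),
    }
    return min(times, key=times.get)


print(choose_backend(1, 1024, 1024))
print(choose_backend(16, 1024, 1024))
```

Under these toy numbers the crossover lies at very small M; the real crossover depends on the hardware and on the cost model, which is why a fixed setting can lose to an adaptive one at either end of the M range.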

8. Related Work

For dynamic-shape tensor programs, vendor-provided libraries such as cuBLAS (cublas, ), cuDNN (cudnn, ), MKL-DNN (mkl, ), and CUTLASS (cutlass, ) are extensively used in prevalent frameworks, facilitating high-performance tensor operations across diverse hardware platforms. These libraries are tailored to specific target hardware and require substantial engineering effort. Vortex, by introducing a unified recursive abstraction, significantly reduces development costs and unifies optimization strategies across different hardware platforms.

Compilation optimization is another crucial approach for dynamic-shape tensor programs. Existing methods, such as DietCode (dietcode, ), Nimble (nimble, ), and DISC (bladedisc, ), predominantly rely on sample-based compilation and overlook hardware-aware optimization opportunities. In contrast, our work uses hardware information as the fundamental element of a sample-free compilation workflow, thus supporting diverse high-performance scenarios.

Figure 16. Performance comparison on GPU across Cuda Core Only, Tensor Core Only, and Adaptive modes for GEMM. The x-axis represents the M value, and the y-axis shows the execution time normalized to the CUDA Core Only mode.

For optimizing tensor programs with static shapes, various tensor compilers such as AutoTVM (autotvm, ), FlexTensor (flextensor, ), Ansor (ansor, ), and TensorIR (tensorir, ) have been proposed. However, these methods are associated with significant compile time overheads. Although efforts like Roller (roller, ) have attempted to optimize the compilation time for static-shape compilers, the time required is still considerably longer than the execution overhead, making them impractical for the online demands of dynamic-shape tensor programs.

Graph-level optimization is another important component of end-to-end DNN optimization. DNNFusion (DNNFusion, ), Rammer (Rammer, ), Chimera (chimera, ), and AStitch (astitch, ) focus on fusion optimizations in DNNs. TASO (taso, ), Unity (unity, ), XLA (XLA, ), JAX (jax, ), TorchDynamo (TorchDynamo, ), and TenSAT (tensat, ) explore graph rewriting opportunities. Vortex focuses on operator-level optimization for dynamic-shape tensor programs, which is orthogonal to these works. Moreover, Vortex has no inherent limitations that hinder integration with current graph-level compilers. We look forward to exploring this area as part of our future research.

9. Conclusion

In this work, we propose Vortex, a hardware-driven and sample-free dynamic-shape tensor program compiler. Vortex leverages bidirectional compilation techniques to deliver universally high-performance support with minimal system overhead. Experimental results demonstrate that Vortex achieves average speedups of 2.53× and 3.01× over vendor-provided libraries and existing dynamic-shape compilers, respectively. These results highlight the effectiveness of Vortex and its potential as a standard methodology for dynamic-shape tensor program optimization.

References

  • [1] oneAPI Deep Neural Network Library (oneDNN). https://github.com/oneapi-src/oneDNN, 2020.
  • [2] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for Large-Scale machine learning. In 12th USENIX symposium on operating systems design and implementation (OSDI 16), pages 265–283, 2016.
  • [3] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, et al. Jax: composable transformations of python+ numpy programs. Version 0.2, 5:14–24, 2018.
  • [4] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et al. TVM: An automated End-to-End optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 578–594, 2018.
  • [5] Tianqi Chen, Lianmin Zheng, Eddie Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. Learning to optimize tensor programs. Advances in Neural Information Processing Systems, 31, 2018.
  • [6] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. cudnn: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759, 2014.
  • [7] Yujeong Choi, Yunseong Kim, and Minsoo Rhu. Lazy batching: An sla-aware batching system for cloud machine learning inference. In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 493–506. IEEE, 2021.
  • [8] George Chrysos. Intel® xeon phi™ coprocessor-the architecture. Intel Whitepaper, 176(2014):43–50, 2014.
  • [9] Weihao Cui, Han Zhao, Quan Chen, Hao Wei, Zirui Li, Deze Zeng, Chao Li, and Minyi Guo. DVABatch: Diversity-aware Multi-Entry Multi-Exit batching for efficient processing of DNN services on GPUs. In 2022 USENIX Annual Technical Conference (USENIX ATC 22), pages 183–198, 2022.
  • [10] ONNX Runtime developers. Onnx runtime. https://onnxruntime.ai/, 2021. Version: x.y.z.
  • [11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • [12] Siyuan Feng, Bohan Hou, Hongyi Jin, Wuwei Lin, Junru Shao, Ruihang Lai, Zihao Ye, Lianmin Zheng, Cody Hao Yu, Yong Yu, et al. Tensorir: An abstraction for automatic tensorized program optimization. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pages 804–817, 2023.
  • [13] Matthias Fey and Jan Eric Lenssen. Fast graph representation learning with pytorch geometric. arXiv preprint arXiv:1903.02428, 2019.
  • [14] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.
  • [15] Heine Halberstam and Hans Egon Richert. Sieve methods. Courier Corporation, 2013.
  • [16] Yizeng Han, Gao Huang, Shiji Song, Le Yang, Honghui Wang, and Yulin Wang. Dynamic neural networks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(11):7436–7456, 2021.
  • [17] Zekun Hao, Yu Liu, Hongwei Qin, Junjie Yan, Xiu Li, and Xiaolin Hu. Scale-aware face detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6186–6195, 2017.
  • [18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [19] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
  • [20] Zhihao Jia, Oded Padon, James Thomas, Todd Warszawski, Matei Zaharia, and Alex Aiken. Taso: optimizing deep learning computation with automatic generation of graph substitutions. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pages 47–62, 2019.
  • [21] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
  • [22] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012.
  • [23] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436–444, 2015.
  • [24] Ying Li, Yifan Sun, and Adwait Jog. Path forward beyond simulators: Fast and accurate gpu execution time prediction for dnn workloads. pages 380–394, 2023.
  • [25] Lingxiao Ma, Zhiqiang Xie, Zhi Yang, Jilong Xue, Youshan Miao, Wei Cui, Wenxiang Hu, Fan Yang, Lintao Zhang, and Lidong Zhou. Rammer: Enabling holistic deep learning compiler optimizations with rTasks. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pages 881–897, 2020.
  • [26] S Narang and G Diamos. Deepbench: Benchmarking deep learning operations on different hardware, 2017.
  • [27] Wei Niu, Jiexiong Guan, Yanzhi Wang, Gagan Agrawal, and Bin Ren. Dnnfusion: accelerating deep neural networks execution with advanced operator fusion. In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, pages 883–898, 2021.
  • [28] NVIDIA. Nvidia a100 tensor core gpu architecture. https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf, 2021.
  • [29] NVIDIA Corporation. Nvidia cublas documentation.
  • [30] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.
  • [31] PyTorch Contributors. TorchDynamo. https://pytorch.org/docs/master/dynamo/, 2022.
  • [32] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  • [33] Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. Acm Sigplan Notices, 48(6):519–530, 2013.
  • [34] Amit Sabne. Xla : Compiling machine learning for peak performance, 2020.
  • [35] Haichen Shen, Jared Roesch, Zhi Chen, Wei Chen, Yong Wu, Mu Li, Vin Sharma, Zachary Tatlock, and Yida Wang. Nimble: Efficiently compiling dynamic neural networks for model inference. Proceedings of Machine Learning and Systems, 3:208–222, 2021.
  • [36] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [37] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
  • [38] Vijay Thakkar, Pradeep Ramani, Cris Cecka, Aniket Shivam, Honghao Lu, Ethan Yan, Jack Kosaian, Mark Hoemmen, Haicheng Wu, Andrew Kerr, Matt Nicely, Duane Merrill, Dustyn Blasig, Fengqi Qiao, Piotr Majcher, Paul Springer, Markus Hohnerbach, Jin Wang, and Manish Gupta. CUTLASS, jan 2023.
  • [39] Colin Unger, Zhihao Jia, Wei Wu, Sina Lin, Mandeep Baines, Carlos Efrain Quintero Narvaez, Vinay Ramakrishnaiah, Nirmal Prajapati, Pat McCormick, Jamaludin Mohd-Yusof, et al. Unity: Accelerating DNN training through joint optimization of algebraic transformations and parallelization. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 267–284, 2022.
  • [40] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • [41] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
  • [42] Endong Wang, Qing Zhang, Bo Shen, Guangyong Zhang, Xiaowei Lu, Qing Wu, Yajuan Wang, Endong Wang, Qing Zhang, Bo Shen, et al. Intel math kernel library. High-Performance Computing on the Intel® Xeon Phi™: How to Fully Exploit MIC Architectures, pages 167–188, 2014.
  • [43] Yuke Wang, Boyuan Feng, Gushu Li, Shuangchen Li, Lei Deng, Yuan Xie, and Yufei Ding. GNNAdvisor: An adaptive and efficient runtime system for GNN acceleration on GPUs. In 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21), pages 515–531, 2021.
  • [44] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019.
  • [45] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S Yu Philip. A comprehensive survey on graph neural networks. IEEE transactions on neural networks and learning systems, 32(1):4–24, 2020.
  • [46] Yichen Yang, Phitchaya Phothilimthana, Yisu Wang, Max Willsey, Sudip Roy, and Jacques Pienaar. Equality saturation for tensor graph superoptimization. Proceedings of Machine Learning and Systems, 3:255–268, 2021.
  • [47] Zerui Yang, Yuhui Xu, Wenrui Dai, and Hongkai Xiong. Dynamic-stride-net: Deep convolutional neural network with dynamic stride. In Optoelectronic Imaging and Multimedia Technology VI, volume 11187, pages 42–53. SPIE, 2019.
  • [48] Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. Orca: A distributed serving system for Transformer-Based generative models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 521–538, 2022.
  • [49] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 2023.
  • [50] Bojian Zheng, Ziheng Jiang, Cody Hao Yu, Haichen Shen, Joshua Fromm, Yizhi Liu, Yida Wang, Luis Ceze, Tianqi Chen, and Gennady Pekhimenko. Dietcode: Automatic optimization for dynamic tensor programs. Proceedings of Machine Learning and Systems, 4:848–863, 2022.
  • [51] Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, Koushik Sen, et al. Ansor: Generating High-Performance tensor programs for deep learning. In 14th USENIX symposium on operating systems design and implementation (OSDI 20), pages 863–879, 2020.
  • [52] Size Zheng, Siyuan Chen, Peidi Song, Renze Chen, Xiuhong Li, Shengen Yan, Dahua Lin, Jingwen Leng, and Yun Liang. Chimera: An analytical optimizing framework for effective compute-intensive operators fusion. In 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 1113–1126. IEEE, 2023.
  • [53] Size Zheng, Yun Liang, Shuo Wang, Renze Chen, and Kaiwen Sheng. Flextensor: An automatic schedule exploration and optimization framework for tensor computation on heterogeneous system. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 859–873, 2020.
  • [54] Zhen Zheng, Zaifeng Pan, Dalin Wang, Kai Zhu, Wenyi Zhao, Tianyou Guo, Xiafei Qiu, Minmin Sun, Junjie Bai, Feng Zhang, et al. Bladedisc: Optimizing dynamic shape machine learning workloads via compiler approach. Proceedings of the ACM on Management of Data, 1(3):1–29, 2023.
  • [55] Zhen Zheng, Xuanda Yang, Pengzhan Zhao, Guoping Long, Kai Zhu, Feiwen Zhu, Wenyi Zhao, Xiaoyong Liu, Jun Yang, Jidong Zhai, et al. Astitch: enabling a new multi-dimensional optimization space for memory-intensive ml training and inference on modern simt architectures. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, pages 359–373, 2022.
  • [56] Hongyu Zhu, Ruofan Wu, Yijia Diao, Shanbin Ke, Haoyu Li, Chen Zhang, Jilong Xue, Lingxiao Ma, Yuqing Xia, Wei Cui, et al. ROLLER: Fast and efficient tensor compilation for deep learning. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 233–248, 2022.