
[Discussion] Refactor the dispatcher design for incoming different backends #81

@LeiWang1999

As discussed in issue #64, our tile-based kernel optimization has limitations for quantized kernels with the small shapes typical of LLM workloads. Since we are a library that aims to provide different backends for different scenarios, PR #80 introduces a CUDA implementation for efficient small-batch quantized matrix multiplication. Looking ahead, we are also contemplating implementing quantized flash attention with our TL backend. BitBLAS therefore needs to determine when and how to dispatch operator configurations to the different backends, which calls for a thoughtful design of this new component.
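As a rough illustration of what such a dispatcher could look like, here is a minimal Python sketch. The backend names (`"cuda"`, `"tl"`), the `OperatorConfig` fields, and the batch-size threshold are all hypothetical placeholders, not identifiers from the BitBLAS codebase; the point is only to show a shape/dtype-based routing rule as a single decision point.

```python
# Minimal dispatcher sketch. All names here (OperatorConfig, select_backend,
# SMALL_BATCH_THRESHOLD, backend strings) are illustrative assumptions,
# not the actual BitBLAS API.
from dataclasses import dataclass


@dataclass
class OperatorConfig:
    """Simplified stand-in for an operator description (hypothetical)."""
    op_type: str       # e.g. "matmul", "flash_attention"
    M: int             # batch/row dimension of the GEMM
    weight_dtype: str  # e.g. "int4", "int2", "float16"


SMALL_BATCH_THRESHOLD = 16  # illustrative cutoff, not a tuned value


def select_backend(config: OperatorConfig) -> str:
    """Route an operator config to a backend.

    Small-batch quantized matmuls go to the specialized CUDA kernels
    (in the spirit of PR #80); everything else falls back to the
    tile-based TL backend.
    """
    is_quantized = config.weight_dtype.startswith("int")
    if (config.op_type == "matmul"
            and is_quantized
            and config.M <= SMALL_BATCH_THRESHOLD):
        return "cuda"  # specialized small-batch quantized GEMM path
    return "tl"        # default tile-based backend


# Example: a batch-1 int4 matmul is routed to the CUDA backend,
# while a large-batch float16 matmul stays on the TL backend.
assert select_backend(OperatorConfig("matmul", 1, "int4")) == "cuda"
assert select_backend(OperatorConfig("matmul", 4096, "float16")) == "tl"
```

One design question this sketch surfaces is whether the routing rule should be a hard-coded heuristic like the threshold above, or a pluggable policy that each backend registers so new backends can declare the configurations they handle best.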

