
T-MAC

BitNet on T-MAC (LUT-based) vs llama.cpp (dequantization-based)

Introduction

T-MAC is a kernel library that directly supports mixed-precision matrix multiplication (int1/2/3/4 x int8/fp16/fp32) without the need for dequantization, by utilizing lookup tables. T-MAC aims to boost low-bit LLM inference on CPUs. It already supports various low-bit models, including W4A16 from GPTQ/gguf, W2A16 from BitDistiller, and W1(.58)A8 from BitNet, on OSX/Linux/Windows with ARM/Intel CPUs.

T-MAC achieves a token generation throughput of 22 tokens/sec with a single core and 54 tokens/sec with four cores on M2-Ultra for 3B BitNet, a 3x speedup over the SOTA CPU low-bit framework (llama.cpp). T-MAC can even reach 11 tokens/sec on lower-end devices like Raspberry Pi 5.

End-2-End Speedup

We evaluate the token generation performance of different models on four different devices: Apple M2-Ultra, Jetson AGX Orin, Raspberry Pi 5 and Surface Book 3. Check datasheet for more details.

We evaluate BitNet-3B and Llama-2-7B (W2) with T-MAC 2-bit and llama.cpp Q2_K, and evaluate Llama-2-7B (W4) with T-MAC 4-bit and llama.cpp Q4_0.

In addition to providing a significant speedup, T-MAC can also deliver the same performance with fewer CPU cores. For instance, to reach 40 tokens/sec, a throughput that greatly surpasses human reading speed, T-MAC only requires 2 cores, while llama.cpp requires 8 cores. On Jetson AGX Orin, to achieve 10 tokens/sec, a throughput that already meets human reading speed, T-MAC only requires 2 cores, while llama.cpp uses all 12 cores. T-MAC can thus meet real-time requirements on less powerful devices with fewer CPU cores, such as Raspberry Pi 5. By using fewer cores, T-MAC reserves computational resources for other applications and significantly reduces power and energy consumption, both of which are crucial for edge devices.

T-MAC achieves a significant single-threaded speedup and requires far fewer CPU cores to reach the same throughput

The throughputs of T-MAC are obtained without fast aggregation. Users can enable fast aggregation with -fa for an additional 10%~20% speedup.

Kernel-level Speedup

Our GEMM kernels demonstrate superior performance over SOTA low-bit GEMM on CPU. The following figure shows the speedup compared to llama.cpp for llama-7b kernels during token generation (NUM_THREADS=1):

llama.cpp doesn't provide a 1-bit kernel implementation, but we can deduce its performance from the 2-bit kernel, as 1-bit brings llama.cpp no additional speedup according to the 2/3/4-bit results.

Although we haven't integrated multi-batch (N>1) GEMM into llama.cpp, T-MAC can achieve significant speedup due to its reduced computational cost, which ensures superior performance on prompt evaluation and multi-batch token generation. The following figure shows the speedup compared to llama.cpp using the OpenBLAS backend (NUM_THREADS=1):

M2-Ultra is an exception as it is equipped with a specially designed AMX coprocessor to accelerate multi-batch GEMM. However, T-MAC can still achieve comparable performance at 2-bit.

Energy and Power Saving

By replacing heavy fused-multiply-add instructions with table lookup instructions, T-MAC significantly reduces power consumption. Combined with the speedup, T-MAC ultimately results in a substantial decrease in total energy consumption.

Multi-threading power/energy consumption on M2-Ultra for three models, M1: Llama-2-7B (W4), M2: Llama-2-7B (W2) and M3: BitNet-3B

Data sampled with powermetrics.

Compared to CUDA GPU

T-MAC achieves comparable 2-bit mpGEMM performance compared to CUDA GPU on Jetson AGX Orin. While the CUDA GPU outperforms the CPU in executing kernels other than mpGEMM, making the end-to-end performance of T-MAC (CPU) slightly slower, T-MAC can deliver considerable savings in power and energy consumption.

Framework         Throughput (tokens/sec)   Power (W)   Energy (J/token)
llama.cpp (CPU)   7.08                       15.0        2.12
llama.cpp (GPU)   20.03                      30.8        1.54
T-MAC (CPU)       15.62                      10.4        0.66

Throughput/power/energy comparison for Llama-2-7B (W2) on NVIDIA Jetson AGX Orin (NUM_THREADS=12 for CPU)

Data sampled with jetson-stats under power mode MAXN.
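Note that energy per token is simply power divided by throughput; for example, T-MAC's 10.4 W at 15.62 tokens/sec works out to roughly 0.67 J/token.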

Installation

Requirements

  • Python (3.8 recommended)
  • virtualenv
  • cmake>=3.22

OSX (Apple Silicon)

First, install cmake, zstd (dependency of llvm) and libomp (dependency of tvm). Homebrew is recommended:

brew install cmake zstd libomp

If zstd is installed through homebrew, then cmake should also be installed through homebrew to ensure that zstd can be found by cmake.

Install t_mac from the source (please run in a virtualenv):

git clone --recursive https://github.com/microsoft/T-MAC.git
# in virtualenv
pip install . -v  # or pip install -e . -v
source build/t-mac-envs.sh

The command will download clang+llvm and build tvm from source, so it might take a while.

Ubuntu (aarch64/x86_64)

Install cmake>=3.22 from Official Page.

Then install TVM build dependencies:

sudo apt install build-essential libtinfo-dev zlib1g-dev libzstd-dev libxml2-dev

Install t_mac from the source (please run in a virtualenv):

git clone --recursive https://github.com/microsoft/T-MAC.git
# in virtualenv
pip install . -v  # or pip install -e . -v
source build/t-mac-envs.sh

The command will download clang+llvm and build tvm from source, so it might take a while.

Windows (x86_64)

Due to the lack of a stable clang+llvm prebuilt for Windows, Conda + Visual Studio is recommended for installing dependencies.

First, install Visual Studio 2019 and toggle on Desktop development with C++ and C++ Clang tools for Windows. Then, create the conda environment within Developer PowerShell for VS 2019:

git clone --recursive https://github.com/microsoft/T-MAC.git
cd T-MAC
conda env create --file conda\tvm-build-environment.yaml
conda activate tvm-build

If you are using Visual Studio 2022, replace llvmdev =14.0.6 with llvmdev =17.0.6 in the yaml file.

After that, build TVM with:

cd 3rdparty\tvm
mkdir build
cp cmake\config.cmake build

Append set(USE_LLVM llvm-config) to build\config.cmake.

cd build
cmake ..
cmake --build . --config Release -- /m

Install t_mac from the source:

cd ..\..\..\  # back to project root directory
$env:MANUAL_BUILD = "1"
$env:PYTHONPATH = "$pwd\3rdparty\tvm\python"
pip install . -v  # or pip install -e . -v

Verification

After that, you can verify the installation through:

python -c "import t_mac; print(t_mac.__version__); from tvm.contrib.clang import find_clang; print(find_clang())"

Usage

Currently, we support end-to-end inference through the llama.cpp integration.

We have provided an all-in-one script. Invoke it with:

pip install 3rdparty/llama.cpp/gguf-py
huggingface-cli download 1bitLLM/bitnet_b1_58-3B --local-dir ${model_dir}
python tools/run_pipeline.py -o ${model_dir}

An example output:

Running STEP.0: Compile kernels
  Running command in /Users/user/jianyu/T-MAC/deploy:
    python compile.py -o tuned -da -nt 4 -tb -gc -gs 128 -ags 64 -t -m hf-bitnet-3b -r
Running STEP.1: Build T-MAC C++ CMakeFiles
  Running command in /Users/user/jianyu/T-MAC/build:
    cmake -DCMAKE_INSTALL_PREFIX=/Users/user/jianyu/T-MAC/install ..
Running STEP.2: Install T-MAC C++
  Running command in /Users/user/jianyu/T-MAC/build:
    cmake --build . --target install --config Release
Running STEP.3: Convert HF to GGUF
  Running command in /Users/user/jianyu/T-MAC/3rdparty/llama.cpp:
    python convert-hf-to-gguf-t-mac.py /Users/user/Downloads/test_models/hf-bitnet-3B --outtype i2 --outfile /Users/user/Downloads/test_models/hf-bitnet-3B/ggml-model.i2.gguf --kcfg /Users/user/jianyu/T-MAC/install/lib/kcfg.ini
Running STEP.4: Build llama.cpp CMakeFiles
  Running command in /Users/user/jianyu/T-MAC/3rdparty/llama.cpp/build:
    cmake .. -DLLAMA_TMAC=ON -DCMAKE_PREFIX_PATH=/Users/user/jianyu/T-MAC/install/lib/cmake/t-mac -DCMAKE_BUILD_TYPE=Release -DLLAMA_LLAMAFILE_DEFAULT=OFF -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++
Running STEP.5: Build llama.cpp
  Running command in /Users/user/jianyu/T-MAC/3rdparty/llama.cpp/build:
    cmake --build . --target main --config Release
Running STEP.6: Run inference
  Running command in /Users/user/jianyu/T-MAC/3rdparty/llama.cpp/build:
    /Users/user/jianyu/T-MAC/3rdparty/llama.cpp/build/bin/main -m /Users/user/Downloads/test_models/hf-bitnet-3B/ggml-model.i2.gguf -n 128 -t 4 -p Microsoft Corporation is an American multinational corporation and technology company headquartered in Redmond, Washington. -b 1 -ngl 0 -c 2048
Check logs/2024-07-15-17-10-11.log for inference output

Check e2e.md for detailed and advanced usage.

Techniques

LLM inference incurs significant computational cost. Low-bit quantization, a widely adopted technique, introduces the challenge of mixed-precision GEMM (mpGEMM), which is not directly supported by hardware and requires convert/dequant operations.

We propose the use of a lookup table (LUT) to support mpGEMM. Our method involves the following key techniques (a toy sketch follows the list):

  1. Given the low precision of weights, we group one-bit weights (e.g., into groups of 4), precompute all possible partial sums, and then use a LUT to store them.
  2. We employ shift and accumulate operations to support scalable bits from 1 to 4.
  3. On a CPU, we utilize tbl/pshuf instructions for fast table lookup.
  4. We reduce the table size from $2^n$ to $2^{n-1}$, incorporating a sign bit to accelerate LUT precomputation.
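To make technique 1 concrete, below is a minimal NumPy sketch of LUT-based matrix-vector multiplication for 1-bit weights grouped by 4. The function name and pure-Python style are illustrative only; the real T-MAC kernels operate on packed low-bit data with SIMD tbl/pshuf and add the shift-accumulate (technique 2) and mirrored-table (technique 4) optimizations, which are omitted here.

import numpy as np

def lut_matvec_1bit(W_bits, x, g=4):
    """Toy LUT-based mat-vec for 1-bit weights (bit 0 -> -1, bit 1 -> +1).

    W_bits: (M, K) array of {0, 1}; x: (K,) activations; returns W @ x.
    """
    M, K = W_bits.shape
    assert K % g == 0
    num_groups = K // g

    # Enumerate all 2^g sign patterns once: patterns[i, b] is the sign of bit b
    # when the group's g-bit index is i.
    patterns = np.array([[2 * ((i >> b) & 1) - 1 for b in range(g)]
                         for i in range(2 ** g)], dtype=np.float32)   # (2^g, g)

    # Precompute the LUT: for every activation group, the partial sum under
    # every possible sign pattern.  Done once per activation vector.
    xg = x.reshape(num_groups, g).astype(np.float32)                  # (num_groups, g)
    lut = xg @ patterns.T                                             # (num_groups, 2^g)

    # Each weight group collapses to a g-bit index; the inner product becomes
    # one table lookup per group instead of g multiply-adds.
    idx = (W_bits.reshape(M, num_groups, g) << np.arange(g)).sum(axis=-1)
    return lut[np.arange(num_groups), idx].sum(axis=-1)               # (M,)

# Quick check against the dense reference.
rng = np.random.default_rng(0)
W = rng.integers(0, 2, size=(8, 16))
x = rng.standard_normal(16).astype(np.float32)
assert np.allclose(lut_matvec_1bit(W, x), (2 * W - 1) @ x, atol=1e-4)

The LUT is built once per activation vector and reused across all output rows, which is what amortizes the precomputation and lets each group of g weights cost a single lookup and add.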

Our method exhibits several notable characteristics:

  1. T-MAC shows a linear scaling ratio of FLOPs and inference latency relative to the number of bits. This contrasts with traditional convert-based methods, which fail to achieve additional speedup when reducing from 4 bits to lower bits.
  2. T-MAC inherently supports bit-wise computation for int1/2/3/4, eliminating the need for dequantization. Furthermore, it accommodates all types of activations (e.g., fp8, fp16, int8) using fast table lookup and add instructions, bypassing the need for poorly supported fused-multiply-add instructions.
  3. T-MAC holds the potential to realize performance gains across all processing units (PUs).

Cite

If you find this repository useful, please use the following BibTeX entry for citation.

@misc{wei2024tmaccpurenaissancetable,
      title={T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge}, 
      author={Jianyu Wei and Shijie Cao and Ting Cao and Lingxiao Ma and Lei Wang and Yanyong Zhang and Mao Yang},
      year={2024},
      eprint={2407.00088},
      archivePrefix={arXiv},
      primaryClass={cs.DC},
      url={https://arxiv.org/abs/2407.00088}, 
}
