T-MAC is a kernel library to directly support mixed-precision matrix multiplication (int1/2/3/4 x int8/fp16/fp32) without the need for dequantization by utilizing lookup tables. T-MAC aims to boost low-bit LLM inference on CPUs. T-MAC already offers support for various low-bit models, including W4A16 from GPTQ/gguf, W2A16 from BitDistiller and W1(.58)A8 from BitNet on OSX/Linux/Windows equipped with ARM/Intel CPUs.
T-MAC achieves a token generation throughput of 22 tokens/sec with a single core and 54 tokens/sec with four cores on M2-Ultra for 3B BitNet, a 3x speedup over the SOTA CPU low-bit framework (llama.cpp). T-MAC can even reach 11 tokens/sec on lower-end devices like Raspberry Pi 5.
We evaluate the token generation performance of different models on four different devices: Apple M2-Ultra, Jetson AGX Orin, Raspberry Pi 5 and Surface Book 3. Check datasheet for more details.
We evaluate BitNet-3B and Llama-2-7B (W2) with T-MAC 2-bit and llama.cpp Q2_K, and evaluate Llama-2-7B (W4) with T-MAC 4-bit and llama.cpp Q4_0.
In addition to providing a significant speedup, T-MAC can reach the same throughput with fewer CPU cores. For instance, to reach 40 tokens/sec, a throughput that greatly surpasses human reading speed, T-MAC requires only 2 cores, while llama.cpp requires 8. On Jetson AGX Orin, to achieve 10 tokens/sec, a throughput that already meets human reading speed, T-MAC requires only 2 cores, while llama.cpp uses all 12. T-MAC can also meet real-time requirements on less powerful devices with fewer CPU cores, such as Raspberry Pi 5. By using fewer cores, T-MAC leaves computational resources for other applications and significantly reduces power and energy consumption, both of which are crucial for edge devices.
T-MAC achieves a significant speedup with a single thread and needs far fewer CPU cores to reach the same throughput
The throughputs of T-MAC are obtained without fast-aggregation. Users can toggle on fast-aggregation through `-fa` to achieve an additional speedup of 10%~20%.
Our GEMM kernels demonstrate superior performance over SOTA low-bit GEMM on CPU. The following figure shows the speedup compared to llama.cpp for llama-7b kernels during token generation (NUM_THREADS=1):
llama.cpp doesn't provide a 1-bit kernel implementation, but we can estimate its performance from the 2-bit kernel, since the 2/3/4-bit results show that lowering the bit width brings no additional speedup.
Although we haven't integrated multi-batch (N>1) GEMM into llama.cpp, T-MAC can achieve a significant speedup due to its reduced computational cost, which ensures superior performance on prompt evaluation and multi-batch token generation. The following figures show the speedup compared to llama.cpp using the OpenBLAS backend (NUM_THREADS=1):
M2-Ultra is an exception as it is equipped with a specially designed AMX coprocessor to accelerate multi-batch GEMM. However, T-MAC can still achieve comparable performance at 2-bit.
By replacing heavy fused-multiply-add instructions with table lookup instructions, T-MAC significantly reduces power consumption. Combined with the speedup, T-MAC ultimately results in a substantial decrease in total energy consumption.
Multi-threading power/energy consumption on M2-Ultra for three models, M1: Llama-2-7B (W4), M2: Llama-2-7B (W2) and M3: BitNet-3B
Data sampled with powermetrics.
T-MAC achieves 2-bit mpGEMM performance comparable to the CUDA GPU on Jetson AGX Orin. Because the CUDA GPU outperforms the CPU on kernels other than mpGEMM, the end-to-end performance of T-MAC (CPU) is slightly lower, but T-MAC delivers considerable savings in power and energy consumption.
Framework | Throughput (tokens/sec) | Power (W) | Energy (J/token) |
---|---|---|---|
llama.cpp (CPU) | 7.08 | 15.0 | 2.12 |
llama.cpp (GPU) | 20.03 | 30.8 | 1.54 |
T-MAC (CPU) | 15.62 | 10.4 | 0.66 |
Throughput/power/energy comparison for Llama-2-7B (W2) on NVIDIA Jetson AGX Orin (NUM_THREADS=12 for CPU)
Data sampled with jetson-stats under power mode MAXN.
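The energy column follows from the other two: energy per token is power divided by throughput. The short check below recomputes it from the table's rounded values. The function name `joules_per_token` is ours, chosen for illustration, and the results agree with the reported column only to within rounding of the inputs.

```python
# Rows copied from the table above: (tokens/sec, watts, reported J/token).
rows = {
    "llama.cpp (CPU)": (7.08, 15.0, 2.12),
    "llama.cpp (GPU)": (20.03, 30.8, 1.54),
    "T-MAC (CPU)": (15.62, 10.4, 0.66),
}


def joules_per_token(tokens_per_sec, watts):
    """Energy per token: watts are joules/sec, so divide by tokens/sec."""
    return watts / tokens_per_sec


for name, (tps, watts, reported) in rows.items():
    computed = joules_per_token(tps, watts)
    print(f"{name}: {computed:.3f} J/token (table: {reported})")
```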
- Python (3.8 recommended)
- virtualenv
- cmake>=3.22
First, install `cmake`, `zstd` (dependency of llvm) and `libomp` (dependency of tvm). Homebrew is recommended:
brew install cmake zlib libomp
If `zstd` is installed through homebrew, then `cmake` should also be installed through homebrew to ensure that `zstd` can be found by `cmake`.
Install `t_mac` from the source (please run in a `virtualenv`):
git clone --recursive https://github.com/microsoft/T-MAC.git
cd T-MAC
# in virtualenv
pip install . -v # or pip install -e . -v
source build/t-mac-envs.sh
The command will download clang+llvm and build tvm from source, so it might take a bit of time.
Install cmake>=3.22 from the official page.
Then install TVM build dependencies:
sudo apt install build-essential libtinfo-dev zlib1g-dev libzstd-dev libxml2-dev
Install `t_mac` from the source (please run in a `virtualenv`):
git clone --recursive https://github.com/microsoft/T-MAC.git
cd T-MAC
# in virtualenv
pip install . -v # or pip install -e . -v
source build/t-mac-envs.sh
The command will download clang+llvm and build tvm from source, so it might take a bit of time.
Due to the lack of a stable prebuilt clang+llvm on Windows, Conda + Visual Studio is recommended to install dependencies.
First, install Visual Studio 2019 and toggle on `Desktop development with C++` and `C++ Clang tools for Windows`. Then, create the conda environment within `Developer PowerShell for VS 2019`:
git clone --recursive https://github.com/microsoft/T-MAC.git
cd T-MAC
conda env create --file conda\tvm-build-environment.yaml
conda activate tvm-build
If you are using Visual Studio 2022, replace `llvmdev =14.0.6` with `llvmdev =17.0.6` in the yaml file.
After that, build TVM with:
cd 3rdparty\tvm
mkdir build
cp cmake\config.cmake build
Append `set(USE_LLVM llvm-config)` to `build\config.cmake`.
cd build
cmake ..
cmake --build . --config Release -- /m
Install `t_mac` from the source:
cd ..\..\..\ # back to project root directory
$env:MANUAL_BUILD = "1"
$env:PYTHONPATH = "$pwd\3rdparty\tvm\python"
pip install . -v # or pip install -e . -v
After that, you can verify the installation through: `python -c "import t_mac; print(t_mac.__version__); from tvm.contrib.clang import find_clang; print(find_clang())"`.
Currently, we support end-to-end inference through llama.cpp integration.
We have provided an all-in-one script. Invoke it with:
pip install 3rdparty/llama.cpp/gguf-py
huggingface-cli download 1bitLLM/bitnet_b1_58-3B --local-dir ${model_dir}
python tools/run_pipeline.py -o ${model_dir}
An example output:
Running STEP.0: Compile kernels
Running command in /Users/user/jianyu/T-MAC/deploy:
python compile.py -o tuned -da -nt 4 -tb -gc -gs 128 -ags 64 -t -m hf-bitnet-3b -r
Running STEP.1: Build T-MAC C++ CMakeFiles
Running command in /Users/user/jianyu/T-MAC/build:
cmake -DCMAKE_INSTALL_PREFIX=/Users/user/jianyu/T-MAC/install ..
Running STEP.2: Install T-MAC C++
Running command in /Users/user/jianyu/T-MAC/build:
cmake --build . --target install --config Release
Running STEP.3: Convert HF to GGUF
Running command in /Users/user/jianyu/T-MAC/3rdparty/llama.cpp:
python convert-hf-to-gguf-t-mac.py /Users/user/Downloads/test_models/hf-bitnet-3B --outtype i2 --outfile /Users/user/Downloads/test_models/hf-bitnet-3B/ggml-model.i2.gguf --kcfg /Users/user/jianyu/T-MAC/install/lib/kcfg.ini
Running STEP.4: Build llama.cpp CMakeFiles
Running command in /Users/user/jianyu/T-MAC/3rdparty/llama.cpp/build:
cmake .. -DLLAMA_TMAC=ON -DCMAKE_PREFIX_PATH=/Users/user/jianyu/T-MAC/install/lib/cmake/t-mac -DCMAKE_BUILD_TYPE=Release -DLLAMA_LLAMAFILE_DEFAULT=OFF -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++
Running STEP.5: Build llama.cpp
Running command in /Users/user/jianyu/T-MAC/3rdparty/llama.cpp/build:
cmake --build . --target main --config Release
Running STEP.6: Run inference
Running command in /Users/user/jianyu/T-MAC/3rdparty/llama.cpp/build:
/Users/user/jianyu/T-MAC/3rdparty/llama.cpp/build/bin/main -m /Users/user/Downloads/test_models/hf-bitnet-3B/ggml-model.i2.gguf -n 128 -t 4 -p Microsoft Corporation is an American multinational corporation and technology company headquartered in Redmond, Washington. -b 1 -ngl 0 -c 2048
Check logs/2024-07-15-17-10-11.log for inference output
Check e2e.md for detailed and advanced usage.
LLM inference incurs significant computational cost. Low-bit quantization, a widely adopted technique, introduces the challenge of mixed-precision GEMM (mpGEMM), which is not directly supported by hardware and requires convert/dequant operations.
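For contrast, the convert-based path that mpGEMM normally requires can be sketched as below. This is a minimal illustration only: the nibble layout, the single `scale` and `zero_point` per block, and the function names are assumptions made for the example and do not reproduce the storage format of llama.cpp or any other framework.

```python
def unpack_int4(packed):
    """Split each byte into two unsigned 4-bit values (low nibble first)."""
    out = []
    for byte in packed:
        out.append(byte & 0x0F)
        out.append(byte >> 4)
    return out


def dequant_dot(packed, scale, zero_point, acts):
    """Convert-based dot product: dequantize the weights, then multiply-add."""
    q = unpack_int4(packed)
    # Convert/dequant step: every low-bit weight becomes a float first.
    weights = [scale * (v - zero_point) for v in q]
    # Only now can the ordinary multiply-add run on matching precisions.
    return sum(w * a for w, a in zip(weights, acts))


packed = bytes([0x21, 0x43])  # holds the 4-bit values 1, 2, 3, 4
print(dequant_dot(packed, scale=0.5, zero_point=2, acts=[2.0, 2.0, 2.0, 2.0]))
```

The dequantization cost is paid for every weight regardless of its bit width, which is the overhead a lookup-table approach aims to avoid.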
We propose the use of a lookup table (LUT) to support mpGEMM. Our method involves the following key techniques:
- Given the low precision of weights, we group one-bit weights (e.g., into groups of 4), precompute all possible partial sums, and then use a LUT to store them.
- We employ shift and accumulate operations to support scalable bits from 1 to 4.
- On a CPU, we utilize tbl/pshuf instructions for fast table lookup.
- We reduce the table size from $2^n$ to $2^{n-1}$, incorporating a sign bit to accelerate LUT precomputation.
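The techniques above can be illustrated with a scalar Python sketch. It is not the T-MAC kernel: real kernels are vectorized with tbl/pshuf and reuse each table across many weight rows. The group size `G = 4`, the unsigned weight encoding in `lut_dot`, the ±1 bit encoding in `build_half_lut`, and all function names are assumptions made for illustration.

```python
G = 4  # group size, following the "groups of 4" example above


def build_lut(acts):
    """Precompute all 2**G partial sums for one group of G activations.

    Entry idx is the sum of the activations whose bit is set in idx.
    """
    return [sum(a for j, a in enumerate(acts) if (idx >> j) & 1)
            for idx in range(1 << G)]


def lut_dot(weights, acts, bits):
    """Dot product of unsigned `bits`-bit integer weights with activations."""
    assert len(weights) == len(acts) and len(acts) % G == 0
    total = 0.0
    for start in range(0, len(acts), G):
        lut = build_lut(acts[start:start + G])
        group = weights[start:start + G]
        for b in range(bits):
            # Pack bit b of the G weights into one table index.
            idx = 0
            for j, w in enumerate(group):
                idx |= ((w >> b) & 1) << j
            # Shift and accumulate: bit plane b carries weight 2**b.
            total += lut[idx] * (1 << b)
    return total


def build_half_lut(acts):
    """Signed table (bit 1 -> +a, bit 0 -> -a) for indices with top bit 0."""
    return [sum(a if (idx >> j) & 1 else -a for j, a in enumerate(acts))
            for idx in range(1 << (G - 1))]


def signed_lookup(half_lut, idx):
    """Recover any of the 2**G signed entries from the 2**(G-1) stored ones."""
    mask = (1 << G) - 1
    if idx >> (G - 1):  # top bit set: negate the bitwise-complement entry
        return -half_lut[idx ^ mask]
    return half_lut[idx]


weights = [3, 0, 1, 2, 1, 1, 0, 3]  # 2-bit weights
acts = [1.0, 2.0, -1.0, 0.5, 4.0, -2.0, 3.0, 1.0]
print(lut_dot(weights, acts, bits=2))
print(sum(w * a for w, a in zip(weights, acts)))  # plain reference
```

In `lut_dot` no weight is ever multiplied by an activation; each bit plane costs one lookup per group plus a shift and an add. The half table works because, under the ±1 encoding, complementing every index bit negates the stored sum, so only half of the entries need to be precomputed.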
Our method exhibits several notable characteristics:
- T-MAC shows a linear scaling ratio of FLOPs and inference latency relative to the number of bits. This contrasts with traditional convert-based methods, which fail to achieve additional speedup when reducing from 4 bits to lower bits.
- T-MAC inherently supports bit-wise computation for int1/2/3/4, eliminating the need for dequantization. Furthermore, it accommodates all types of activations (e.g., fp8, fp16, int8) using fast table lookup and add instructions, bypassing the need for poorly supported fused-multiply-add instructions.
- T-MAC holds the potential to realize performance gains across all processing units (PUs).
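The linear scaling with bit width can be seen from a back-of-the-envelope operation count. The sketch below assumes one table lookup per output row, per activation group, per weight bit, with a group size of 4; the function name `lut_lookups` and the matrix shape are illustrative, and the count ignores tiling, table construction, and memory effects that influence real latency.

```python
def lut_lookups(m, k, bits, group=4):
    """Table lookups for an (m x k) weight matrix times one activation vector.

    Assumes one lookup per output row, per group of activations, per bit.
    """
    return m * (k // group) * bits


for bits in (1, 2, 3, 4):
    print(bits, lut_lookups(4096, 4096, bits))
```

Under these assumptions the count at 2 bits is exactly half the count at 4 bits, whereas a convert-based kernel performs the same number of multiply-adds at every bit width.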
If you find this repository useful, please use the following BibTeX entry for citation.
@misc{wei2024tmaccpurenaissancetable,
title={T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge},
author={Jianyu Wei and Shijie Cao and Ting Cao and Lingxiao Ma and Lei Wang and Yanyong Zhang and Mao Yang},
year={2024},
eprint={2407.00088},
archivePrefix={arXiv},
primaryClass={cs.DC},
url={https://arxiv.org/abs/2407.00088},
}