Insights: pytorch/ao

Overview
1 Release published by 1 person
- v0.11.0 (published May 9, 2025)
82 Pull requests merged by 29 people
- Remove Constraint for sm89 hardware (#2281, merged Jun 2, 2025)
- Fix benchmark_low_bit_adam.py reference (#2287, merged Jun 1, 2025)
- Fix Bug in MX Builds (#2284, merged May 31, 2025)
- Add back AOPerModuleConfig for BC (#2282, merged May 31, 2025)
- Patch the _is_conv_node function (#2257, merged May 31, 2025)
- Fixes MX formats build for blackwell (#2278, merged May 30, 2025)
- Update CMake to enable building ops on iOS (#2274, merged May 30, 2025)
- Resolve logger warnings (#2250, merged May 30, 2025)
- Add Integration Tests to H100 CI (#2268, merged May 30, 2025)
- Make optim lazily initialize global state (#2277, merged May 30, 2025)
- Fix generate.py for fbgemm int4 integration (#2273, merged May 29, 2025)
- Mark QAT range learning as prototype for now (#2272, merged May 29, 2025)
- Enable range learning for QAT (#2033, merged May 29, 2025)
- Fix torchao generate script for cpu device (#2267, merged May 29, 2025)
- Enable fp16+int4 mixed precision path for int4 xpu path with int zero point (#2240, merged May 29, 2025)
- integration-vllm-test (#2258, merged May 28, 2025)
- Add support for fbgemm int4 mm kernel (#2255, merged May 28, 2025)
- [reland2][ROCm] preshuffled weight mm (#2207, merged May 28, 2025)
- Support INT8 SDPA template for CPU (#2148, merged May 28, 2025)
- Fix Per Row scaling for inference (#2253, merged May 27, 2025)
- Revert "Try fixing CI by pinning pytest (#2238)" (#2263, merged May 27, 2025)
- Rename AOPerModuleConfig to ModuleFqnToConfig (#2243, merged May 24, 2025; see the usage sketch after this list)
- Add backward compatible types to pt2e prepare (#2244, merged May 23, 2025)
- Relax int4wo device mismatch error (#2254, merged May 23, 2025)
- Revert "Patch the _is_conv_node function" (#2247, merged May 23, 2025)
- Patch the _is_conv_node function (#2223, merged May 22, 2025)
- Update Readme (#1526, merged May 22, 2025)
- [sparse] Add fp8 sparse gemm with rowwise scaling for activation sparsity (#2242, merged May 22, 2025)
- Try fixing CI by pinning pytest (#2238, merged May 22, 2025)
- Relax MOE constraints and add test for torch.mm computation (#2227, merged May 22, 2025)
- clean up prototype folder (#2232, merged May 21, 2025)
- remove benchmarks from top level repo (#2233, merged May 21, 2025)
- Update GemLite to support vLLM V1 (#2199, merged May 21, 2025)
- Remove preserve_zero and zero_point_domain from choose_qparams_affine (#2149, merged May 21, 2025)
- use correct fp8 quantization dtype for AMD GPU (#2225, merged May 21, 2025)
- Re-land the PR of "Add INT8 SDPA path for CPU" (#2215, merged May 21, 2025)
- Update config.py (#2224, merged May 20, 2025)
- Make torchao pt2e prepare/convert functions compatible with quantizers in torch.ao (#2221, merged May 19, 2025)
- Enable {conv3d, conv_transpose3d} + bn fusion in pt2e (#2212, merged May 15, 2025)
- Add CI for Arm Linux (#2211, merged May 15, 2025)
- ROCm mxfp4 Skips (#2209, merged May 14, 2025)
- Add support for KleidiAI int4 kernels on aarch64 Linux (#2169, merged May 14, 2025)
- unbreak CI by fixing MX tests (#2208, merged May 14, 2025)
- Update __init__.py (#2206, merged May 14, 2025)
- Add mx_fp4 path (#2201, merged May 13, 2025)
- Arm_inductor_quantizer for Pt2e quantization (#2139, merged May 13, 2025)
- [float] document e2e training -> inference flow (#2190, merged May 13, 2025)
- Remove sparsity/prototype/blocksparse (#2205, merged May 13, 2025)
- Skips for ROCm (X86 Inductor Tests) (#2202, merged May 13, 2025)
- Add blockwise fp8 gemm benchmarks to README (#2203, merged May 12, 2025)
- Feat: Implementation of the DeepSeek blockwise quantization for fp8 tensors (#1763, merged May 12, 2025)
- Add noindex to 0.10 and 0.9 docs (#2194, merged May 12, 2025)
- Add subclass based method for inference w/ MXFP8 (#2132, merged May 12, 2025)
- unpin torch to unbreak mac tests (#2198, merged May 12, 2025)
- 2:4 activation sparsity packing kernels (#2012, merged May 12, 2025)
- Forward fix lint (#2197, merged May 12, 2025)
- Skip ROCm MoE Quantization (#2191, merged May 12, 2025)
- [optim] Fix low-bit optim when used with FSDP2+CPUOffload (#2195, merged May 10, 2025)
- [PT2E][X86] Migrate fusion passes in Inductor to torchao (#2140, merged May 10, 2025)
- Uses torch.version.cuda to compile CUDA extensions (#2193, merged May 9, 2025)
- Move moe quant to better prototype dir (#2192, merged May 9, 2025)
- Set eps in end-to-end QAT flow (#2180, merged May 9, 2025)
- metal lowbit kernels: qmv_fast optimization (#2167, merged May 9, 2025)
- [testing] Triaging ROCm wheel build (#2161, merged May 9, 2025)
- Add a triton kernel for swizzling (#2168, merged May 9, 2025)
- Enabling MOE Quantization using linear decomposition (#2043, merged May 8, 2025)
- Remove broken test (#2188, merged May 8, 2025)
- Add serialization support for AOPerModuleConfig (#2186, merged May 8, 2025)
- Generate speedup for inference (#2151, merged May 7, 2025)
- Fix cuda compile error with bf16 (#2122, merged May 7, 2025)
- [BE] Fix MPS experimental workflow (#2181, merged May 7, 2025)
- Bump version to 0.12.0 (#2178, merged May 6, 2025)
- Fix linux cpu builds; resolves "nightly build for mac stops on 0422" (#2170, merged May 6, 2025)
- [reland] Fixing aliasing behavior for slice in AQT int4wo layout (#2176, merged May 6, 2025)
- Revert "Fixing aliasing behavior for slice in AQT TensorCoreTiledLayout" (#2175, merged May 6, 2025)
- Fixing aliasing behavior for slice in AQT TensorCoreTiledLayout (#2174, merged May 6, 2025)
- Update ruff version in dev-requirements to match CI (#2172, merged May 5, 2025)
- Remove fix not needed anymore after moving CUTLASS pin to v3.9.0 (#2160, merged May 3, 2025)
- Update QAT README.md (#2162, merged May 2, 2025)
- Removes pinned version for pytest (#2158, merged May 2, 2025)
- [MX] Remove mxfp8 kernel and rely on cublas (#2130, merged May 2, 2025)
- Uses torch.version.cuda to compile CUDA extensions (#2163, merged May 2, 2025)
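Several merged PRs above touch the per-module quantization config: #2243 renamed AOPerModuleConfig to ModuleFqnToConfig, #2282 added the old name back for backward compatibility, and #2186 added serialization support. A minimal usage sketch follows, assuming ModuleFqnToConfig wraps a dict from module fully-qualified names to quantization configs and the import path shown; the exact signatures are not confirmed by this digest.

```python
import torch
from torch import nn
from torchao.quantization import quantize_, Int4WeightOnlyConfig, Int8WeightOnlyConfig
from torchao.quantization import ModuleFqnToConfig  # assumed import path

model = nn.Sequential(
    nn.Linear(1024, 1024),  # FQN "0"
    nn.Linear(1024, 1024),  # FQN "1"
)

# Map module FQNs to per-module quantization configs (assumed constructor;
# the dict-of-configs shape is inferred from the PR titles above).
config = ModuleFqnToConfig({
    "0": Int4WeightOnlyConfig(group_size=128),
    "1": Int8WeightOnlyConfig(),
})

quantize_(model, config)
```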
30 Pull requests opened by 19 people
- Update utils_parallel_dequant.cuh (#2164, opened May 2, 2025)
- tensor scaling added (#2171, opened May 5, 2025)
- [PT2E] Fix per-tensor observer issue with varying shape & rank (#2177, opened May 6, 2025)
- Eval hf models using lm_eval (#2179, opened May 6, 2025)
- [Do not Land] Re-land "Add INT8 SDPA path for CPU" (#2093) (#2183, opened May 7, 2025)
- [Not for land] remove workaround for slow rowwise cutlass gemm (#2185, opened May 8, 2025)
- Enable Int4WeightOnlyGPTQQuantizer on Intel GPU (#2200, opened May 12, 2025)
- primitive scale fix (#2210, opened May 14, 2025)
- Add activation sparsity (24 + fp8 dynamic quant) subclass (#2213, opened May 15, 2025)
- Fixes MX formats build for blackwell (#2214, opened May 15, 2025)
- Convert Pytest to Unittest for tests under test/dtypes/ (#2216, opened May 16, 2025)
- Update temp_build.py (#2218, opened May 17, 2025)
- Manually specify flags if no arch set (#2219, opened May 19, 2025)
- Fix failing tests on h100 (#2231, opened May 21, 2025)
- GPTQ updates (#2235, opened May 21, 2025)
- Test older almalinux image (#2236, opened May 21, 2025)
- [draft] Update regression_test.yml (#2237, opened May 22, 2025)
- fix _replace_with_custom_fn_if_matches_filter in quant_api.py (#2252, opened May 23, 2025)
- Add a way to do power of 2 scaling (#2256, opened May 23, 2025; see the sketch after this list)
- Add benchmark numbers to dashboard (#2260, opened May 24, 2025)
- Convert test_affine_quantized_float.py from pytest to unittest (#2261, opened May 25, 2025)
- Test d script (#2264, opened May 27, 2025)
- Update QAT docs, highlight axolotl integration (#2266, opened May 28, 2025)
- [float8 training] remove duplicate override for view (#2269, opened May 29, 2025)
- float8 moe training conversion API prototype (#2275, opened May 30, 2025)
- [WIP] Add support for fbgemm fp8 kernels (#2276, opened May 30, 2025)
- Fix QAT range learning, ensure scales get gradients (#2280, opened May 30, 2025)
- [do not land] testing if moving this breaks my PRs (#2283, opened May 30, 2025)
- Build mxfp4 kernel for sm120a (#2285, opened May 31, 2025)
- [optim] Fix bug when default dtype is BF16 (#2286, opened May 31, 2025)
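Two of the items in this digest (#2256 above and issue #2182 further down) concern power-of-2 scaling for float8. The digest does not show the implementation; the core idea can be sketched as rounding each scale down to the nearest power of two, so that multiplying and dividing by the scale only touches exponent bits and is exact in binary floating point. The helper name below is illustrative, not the code from those PRs.

```python
import torch

def round_scale_down_to_power_of_2(scale: torch.Tensor) -> torch.Tensor:
    # Round positive scales down to the nearest power of two.
    # Illustrative sketch of the idea behind #2256/#2182, not their code.
    return torch.exp2(torch.floor(torch.log2(scale)))

scales = torch.tensor([0.75, 1.0, 3.2, 100.0])
print(round_scale_down_to_power_of_2(scales))
# tensor([ 0.5000,  1.0000,  2.0000, 64.0000])
```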
9 Issues closed by 6 people
- cannot save fp8-wo model (#2230, closed May 21, 2025)
- KleidiAI int4 kernels not loading properly on aarch64 Linux (#2143, closed May 16, 2025)
- New test files will likely fail on ROCM (#2204, closed May 13, 2025)
- FSDP2 + CPU Offload + AdamW8bit issue (#1931, closed May 10, 2025; see the optimizer sketch after this list)
- nightly build for mac stops on 0422 (#2157, closed May 6, 2025)
- Torchao's CPU overhead counteracts the performance benefit of using quantization kernel (#1930, closed May 6, 2025)
- QAT docs (#2155, closed May 2, 2025)
- [Doc] gemlite version (#1653, closed May 2, 2025)
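Issue #1931 above (FSDP2 + CPU Offload + AdamW8bit) was closed by merged PR #2195. For context, basic usage of the low-bit optimizer involved looks roughly like this, assuming it is exposed as torchao.optim.AdamW8bit (earlier releases kept it under torchao.prototype.low_bit_optim):

```python
import torch
from torch import nn
from torchao.optim import AdamW8bit  # assumed path; formerly torchao.prototype.low_bit_optim

model = nn.Linear(4096, 4096, device="cuda", dtype=torch.bfloat16)
# Drop-in replacement for torch.optim.AdamW that keeps optimizer state
# in 8 bits; #2195 fixed its interaction with FSDP2 + CPU offload.
optim = AdamW8bit(model.parameters(), lr=1e-4)

loss = model(torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16)).sum()
loss.backward()
optim.step()
optim.zero_grad()
```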
18 Issues opened by 12 people
- QAT range learning tracker (#2271, opened May 29, 2025)
- [pt2e] QAT training and FSDP support (#2265, opened May 27, 2025)
- convert_to_float8_training and torch.compile make model slow (#2262, opened May 26, 2025; see the sketch after this list)
- torch.ao.quantization deprecation tracker (#2259, opened May 24, 2025)
- We should deprecate Float8LinearConfig.force_recompute_fp8_weight_in_bwd (#2251, opened May 23, 2025)
- int4_weight_only get plain weight are padded (#2249, opened May 23, 2025)
- `quantize_(nn.Linear)` doesn't work with module swaps (#2241, opened May 22, 2025)
- BatchNorm + Convolution fusion in `prepare_pt2e` removal (#2245, opened May 22, 2025)
- Tensor Subclass + VLLM Compile (#2239, opened May 22, 2025)
- MXFP Inference Tracking Doc (#2229, opened May 21, 2025)
- [Quant] Can quant not be decomposed on inductor? (#2228, opened May 20, 2025)
- newer torchao breaks sglang? (#2226, opened May 19, 2025)
- TorchAO needs to update its build system (#2222, opened May 19, 2025)
- Ship all CUDA kernels in a single .so (#2220, opened May 19, 2025)
- Add MXFP casting kernels from triton Repro (#2217, opened May 16, 2025)
- [QAT] Linear layer's weight quantization granularity can only be per_group (#2189, opened May 9, 2025)
- [float8] Support power of 2 scales with PerRow scales for inference (#2182, opened May 7, 2025)
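Issue #2262 above reports a slowdown when combining convert_to_float8_training with torch.compile. For reference, the conversion flow under discussion looks roughly like the sketch below; convert_to_float8_training is the documented torchao.float8 entry point, while the filter policy shown is illustrative.

```python
import torch
from torch import nn
from torchao.float8 import convert_to_float8_training

model = nn.Sequential(
    nn.Linear(4096, 8192),
    nn.GELU(),
    nn.Linear(8192, 4096),
).to("cuda", torch.bfloat16)

def module_filter_fn(mod: nn.Module, fqn: str) -> bool:
    # Illustrative policy: only convert linears with fp8-friendly shapes.
    if isinstance(mod, nn.Linear):
        return mod.in_features % 16 == 0 and mod.out_features % 16 == 0
    return True

# Swap eligible nn.Linear modules for their float8 training equivalents.
convert_to_float8_training(model, module_filter_fn=module_filter_fn)

# Issue #2262 concerns this combination:
model = torch.compile(model)
```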
19 Unresolved conversations
Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.
- [CPU] Enable DA8W4 on CPU (#2128, commented on May 27, 2025; 5 new comments)
- Support microbenchmarking for low precision training (#2101, commented on May 8, 2025; 5 new comments)
- Enhance test_autoquant_compile to support ROCm (#2100, commented on May 14, 2025; 2 new comments)
- Implement dtensor.shard_dim_alltoall, aten.contiguous, aten.chunk (#2154, commented on May 20, 2025; 0 new comments)
- [WIP] all-gather fp8 for rowwise (#2145, commented on May 23, 2025; 0 new comments)
- ROCm mx-fp8 Gemm (#2066, commented on May 6, 2025; 0 new comments)
- [sparsity] Add PartialLinear module for structured sparsity (#1982, commented on May 15, 2025; 0 new comments)
- Fix wrong scale eps applied (#1770, commented on May 19, 2025; 0 new comments)
- [draft] add all_gather_into_tensor (#1737, commented on May 16, 2025; 0 new comments)
- Sam2 video (#1564, commented on Jun 1, 2025; 0 new comments)
- [roadmap/tracker] Low precision training for MoEs (#2147, commented on May 27, 2025; 0 new comments)
- MX single node performance tracker (#1768, commented on May 22, 2025; 0 new comments)
- [feature request] np.packbits / np.unpackbits, general BitTensors (maybe can be just tensors with dtype torch.bits8 or have a new dtype torch.bits introduced) and bit packed tensors utilities for saving memory / accesses, support for BitTensors wherever BoolTensors are used (#292, commented on May 15, 2025; 0 new comments; see the packing sketch after this list)
- How does this work with ONNX export and quantization? (#777, commented on May 14, 2025; 0 new comments)
- [float8] Add support for blockwise fp8 quantization scheme used in DeepSeek v3 (#1594, commented on May 13, 2025; 0 new comments)
- Dynamo error with large mesh + AdamWFp8 + bf16 stochastic rounding (#2074, commented on May 12, 2025; 0 new comments)
- Can FP8 GEMM be enabled via module hooks instead of module swapping? (#1887, commented on May 12, 2025; 0 new comments)
- [PT2E] observers do not handle inputs with different shapes correctly (#2112, commented on May 8, 2025; 0 new comments)
- QAT model drops accuracy after converting with torch.ao.quantization.convert (#2138, commented on May 5, 2025; 0 new comments)
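The long-standing feature request #292 above asks for np.packbits / np.unpackbits equivalents and bit-packed bool tensors. Pending a dedicated dtype, the idea can be sketched in plain PyTorch; packbits and unpackbits below are illustrative helpers, not a torchao API.

```python
import torch
import torch.nn.functional as F

def packbits(bits: torch.Tensor) -> torch.Tensor:
    # Pack a 1-D bool tensor into uint8, 8 bits per byte (MSB first),
    # mirroring np.packbits. Illustrative sketch, not a torchao API.
    n = bits.numel()
    padded = F.pad(bits.to(torch.uint8), (0, -n % 8))  # pad to a multiple of 8
    shifts = torch.tensor([7, 6, 5, 4, 3, 2, 1, 0], dtype=torch.uint8)
    return (padded.view(-1, 8) << shifts).sum(dim=1, dtype=torch.uint8)

def unpackbits(packed: torch.Tensor, n: int) -> torch.Tensor:
    # Inverse of packbits; n is the original number of bits.
    shifts = torch.tensor([7, 6, 5, 4, 3, 2, 1, 0], dtype=torch.uint8)
    bits = (packed.unsqueeze(-1) >> shifts) & 1
    return bits.flatten()[:n].bool()

b = torch.rand(20) > 0.5
assert torch.equal(unpackbits(packbits(b), b.numel()), b)
```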