int4_weight_only get plain weight are padded #2249
Comments
Hi @jainapurva @HDCharles. Could you help take a look? I see you upstreamed this code.
The motivation is that I have a CUDA-quantized model and want to load it on CPU or XPU. The data layout differs across devices, so I was wondering if we could have a common layout implementation that can be applied across different devices.
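For example, today the packed layout is fixed per device at quantization time. A rough sketch (assuming the current `int4_weight_only` config and `Int4CPULayout` from `torchao.dtypes`; exact names may differ by torchao version):

```python
import torch
from torchao.quantization import quantize_, int4_weight_only
from torchao.dtypes import Int4CPULayout

# On CUDA, the default int4 weight-only layout is the tinygemm
# tensor-core-tiled packing.
cuda_model = torch.nn.Sequential(torch.nn.Linear(768, 768)).to(torch.bfloat16).to("cuda")
quantize_(cuda_model, int4_weight_only())

# On CPU, a different, CPU-specific packed layout has to be requested explicitly.
cpu_model = torch.nn.Sequential(torch.nn.Linear(768, 768)).to(torch.bfloat16)
quantize_(cpu_model, int4_weight_only(layout=Int4CPULayout()))

# The two models store differently packed weights, so a state_dict saved
# from one cannot be loaded on the other without a conversion step.
```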
Hi @jerryzh168. Do you have any comments on that? The proposal is that we save the torchao model in a general format and decide the data_layout / tensor_impl when loading. Do you know if we can support this feature?
Hey, get_plain is defined for the layout as the data, scale, and zero_point that are being stored; we don't capture the original shape without padding. The expectation is that you quantize and run the model on the same device. However, we do support having the model on CPU and then moving the layers to CUDA as they get quantized: normally, if you have a CPU model you can do quantize_(model, config, device='cuda') and it will move each layer to CUDA as it quantizes it (see the sketch below). I don't think this functionality is available in Hugging Face, though, since they have their own model-loading system, and I don't think they support loading on CPU and then quantizing on CUDA. We've discussed how annoying the UX is for int4 CUDA/CPU, and I believe @jerryzh168 was planning to implement it soon; not sure about the ETA.
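Roughly something like this (a minimal sketch, assuming a torchao version where `quantize_` accepts a `device` argument):

```python
import torch
from torchao.quantization import quantize_, int4_weight_only

# Model starts on CPU (e.g. just loaded from a checkpoint).
model = torch.nn.Sequential(torch.nn.Linear(768, 768)).to(torch.bfloat16)

# quantize_ moves each layer to CUDA as it quantizes it, so the full
# high-precision model never has to fit on the GPU at once.
quantize_(model, int4_weight_only(), device="cuda")
```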
Thanks @HDCharles! Referring to your second point: can I expect torchao to support the feature I proposed in the future, i.e. loading a quantized model on different device types (CPU/XPU/CUDA)?
Yes, the current gap is more about transferring from one device to another than about loading, but that would be possible.
@jiqing-feng for padding, I feel we should just drop it. I'm planning to replace the current int4wo kernel, which is powered by tinygemm and requires padding externally, with the gemlite kernels from @mobicham, since those don't require padding; the main reason is that padding causes additional issues like slicing: #2174. For a "standard" layout, I think it will just be the plain layout; we can store that and implement it in torchao/quantization/quant_api.py (line 1743 at 4d5f657).
Another related aspect is device: I think ideally we can implement all these conversions on CPU, but if that's not possible, we can always move the plain layout to the target device (like CUDA or XPU) and then run the conversion (packing) there. Also cc @metascroy, this is related to the retractability discussion as well. Another consideration here is that the requirement for using…
I tried to quantize a model with int4_weight_only and wanted to get the plain weight, but found that the weight has been padded. To reproduce it, run the following script:
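A minimal sketch of the reproduction (the original script isn't shown here; the module structure and the `weight.tensor_impl` attribute path are assumptions and may differ by torchao version):

```python
import torch
from torchao.quantization import quantize_, int4_weight_only

# A single 768x768 linear layer, quantized with the default int4 weight-only config.
model = torch.nn.Sequential(torch.nn.Linear(768, 768)).to(torch.bfloat16).to("cuda")
quantize_(model, int4_weight_only())

# get_plain() returns the unpacked int data, scale, and zero_point.
int_data, scale, zero_point = model[0].weight.tensor_impl.get_plain()
print(int_data.shape)  # reported as (768, 1024) instead of the expected (768, 768)
```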
Output: the original shape should be `(768, 768)`, but the plain weight shape is `(768, 1024)`. Can we have a padding-removal step in the `get_plain()` function?