
Commit abc4c4e

jianan-gu and MingxuZh authored
support qconfig summary load for llama static int8 (#2280)
* Update run_llama_quantization.py
* Update run_gpt-j_quantization.py
* Update run.py
* Update run_accuracy.py
* Update run_accuracy_with_deepspeed.py
* Update env_setup.sh
* Update README.md

---------

Co-authored-by: MingxuZh <[email protected]>
1 parent c561cd9 commit abc4c4e

File tree

7 files changed (+300, -278 lines)


examples/cpu/inference/python/llm/README.md

Lines changed: 9 additions & 8 deletions
@@ -124,17 +124,18 @@ OMP_NUM_THREADS=56 numactl -m 0 -C 0-55 python run.py --benchmark -m meta-llama
 #### Static quantization (int8):
 ```bash
 # general command:
-OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <physical cores list> python run.py --benchmark -m <MODEL_ID> --ipex-smooth-quant --alpha <Tuned alpha for specific models> --output-dir "saved_results" --int8
-# For the best alpha values (range [0, 1.0], float) tuned for specific models, we verified good accuracy: "EleutherAI/gpt-j-6b" with alpha=1.0, "meta-llama/Llama-2-7b-chat-hf" with alpha=0.8.
-# For more recipes, please refer to https://github.com/intel/neural-compressor/blob/master/docs/source/smooth_quant.md#validated-models
-# Note: by default, we use "--int8" to run int8 mixed fp32 mode, while for peak performance of static quantization, please use "--int8-bf16-mixed" instead (may impact accuracy).
+OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <physical cores list> python run.py --benchmark -m <MODEL_ID> --ipex-smooth-quant --qconfig-summary-file <path to specific model qconfig> --output-dir "saved_results" --int8
+# We provide tuned qconfig recipe files for "meta-llama/Llama-2-7b-hf", "meta-llama/Llama-2-7b-chat-hf" and "EleutherAI/gpt-j-6b".
+# For other models, you can run your model_id with the IPEX default recipes by removing "--qconfig-summary-file <path to specific model qconfig>".
+# If the IPEX default recipes do not meet your accuracy requirements, please refer to https://github.com/intel/neural-compressor/blob/master/docs/source/smooth_quant.md#validated-models to tune more recipes.
+# Note: by default, we use "--int8" to run int8 mixed fp32 inference, while for the peak performance of static quantization, please use "--int8-bf16-mixed" instead (may impact accuracy).
 
 # An example of llama2 7b model:
-OMP_NUM_THREADS=56 numactl -m 0 -C 0-55 python run.py --benchmark -m meta-llama/Llama-2-7b-chat-hf --ipex-smooth-quant --alpha 0.8 --output-dir "saved_results" --int8
+OMP_NUM_THREADS=56 numactl -m 0 -C 0-55 python run.py --benchmark -m meta-llama/Llama-2-7b-hf --ipex-smooth-quant --qconfig-summary-file <path to meta-llama/Llama-2-7b-hf model qconfig> --output-dir "saved_results" --int8
 ```
 *Notes for all quantizations:*
 
-(1) for quantization benchmarks, the first runs will auto-generate the quantized model named "best_model.pt" in the "--output-dir" path, you can reuse these quantized models for inference-only benchmarks by adding "--quantized-model-path <output_dir + "best_model.pt">".
+(1) <a name="generation_sq">for all quantization benchmarks</a>, the first runs will auto-generate the quantized model named "best_model.pt" in the "--output-dir" path; you can reuse these quantized models for inference-only benchmarks by adding "--quantized-model-path <output_dir + "best_model.pt">". Specifically for static quantization, if "--qconfig-summary-file" is not used, a qconfig recipe will also be generated in the "--output-dir" path.
 
 (2) for Falcon quantizations, "--config-file <CONFIG_FILE>" is needed and example of <CONFIG_FILE>: "utils/model_config/tiiuae_falcon-40b_config.json".
 
@@ -245,11 +246,11 @@ OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <physical cores list
 OMP_NUM_THREADS=56 numactl -m 0 -C 0-55 python run_accuracy.py --accuracy-only -m meta-llama/Llama-2-7b-hf --dtype bfloat16 --ipex --jit --tasks lambada_openai
 ```
 ### Quantizations:
+For the quantized models to be used in accuracy tests, we can reuse the model files named "best_model.pt" in the "--output-dir" path ([generated during inference performance tests](#generation_sq)).
 ```bash
 # general command:
-# For the quantized models to be used in accuracy tests, we can reuse the model files that are named "best_model.pt" in the "--output-dir" path (generated during inference performance tests).
 OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <cpu list> python run_accuracy.py --model <MODEL ID> --quantized-model-path "./saved_results/best_model.pt" --dtype int8 --accuracy-only --jit --tasks {TASK_NAME}
-# please also add "--int8-bf16-mixed" if your model is quantized with this flag
+# Please also add "--int8-bf16-mixed" if your model is quantized with this flag.
 
 # An example of llama2 7b model:
 OMP_NUM_THREADS=56 numactl -m 0 -C 0-55 python run_accuracy.py -m meta-llama/Llama-2-7b-hf --quantized-model-path "./saved_results/best_model.pt" --dtype int8 --accuracy-only --jit --int8 --tasks lambada_openai
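For context on what the new "--qconfig-summary-file" flag feeds into, the sketch below outlines a typical IPEX SmoothQuant static-int8 flow: prepare the model with a SmoothQuant qconfig mapping, either load a tuned qconfig summary or calibrate and save one, then convert and JIT-trace. This is a minimal illustration, not the code from this commit; it assumes the `load_qconf_summary`/`save_qconf_summary` methods on the prepared model (as used in IPEX's SmoothQuant recipe-tuning docs), and `model`, `example_inputs`, `calib_dataloader`, and the output file names are placeholders.

```python
# Minimal sketch (not the commit's run_llama_quantization.py) of consuming/producing
# a qconfig summary with IPEX SmoothQuant static int8. Assumes IPEX >= 2.1;
# `model`, `example_inputs`, and `calib_dataloader` are placeholders.
import torch
import intel_extension_for_pytorch as ipex


def static_int8_quantize(model, example_inputs, calib_dataloader,
                         qconfig_summary_file="", alpha=0.5,
                         output_dir="./saved_results"):
    qconfig = ipex.quantization.get_smooth_quant_qconfig_mapping(alpha=alpha)
    prepared = ipex.quantization.prepare(
        model.eval(), qconfig, example_inputs=example_inputs, inplace=True
    )

    if qconfig_summary_file:
        # Reuse a tuned recipe: load the per-op qconfig summary instead of calibrating.
        prepared.load_qconf_summary(qconf_summary=qconfig_summary_file)
    else:
        # No recipe given: calibrate on a few batches with the default recipe,
        # then save the generated summary so it can be reused later.
        with torch.no_grad():
            for i, batch in enumerate(calib_dataloader):
                prepared(*batch)
                if i == 7:
                    break
        prepared.save_qconf_summary(qconf_summary=f"{output_dir}/qconfig_summary.json")

    converted = ipex.quantization.convert(prepared)
    with torch.no_grad():
        traced = torch.jit.trace(converted, example_inputs, strict=False, check_trace=False)
        traced = torch.jit.freeze(traced)
    traced.save(f"{output_dir}/best_model.pt")
    return traced
```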

examples/cpu/inference/python/llm/distributed/run_accuracy_with_deepspeed.py

Lines changed: 0 additions & 3 deletions
@@ -582,7 +582,6 @@ def _model_call(
             enabled=True
             if args.int8_bf16_mixed or self._dtype == torch.bfloat16
             else False,
-            dtype=torch.bfloat16,
         ):
             if self._dtype != "int8":
                 if (
@@ -680,7 +679,6 @@ def _model_call(
             enabled=True
             if args.int8_bf16_mixed or self._dtype == torch.bfloat16
             else False,
-            dtype=torch.bfloat16,
         ):
             if self._with_jit:
                 output = self.model(
@@ -697,7 +695,6 @@ def _model_call(
             enabled=True
             if args.int8_bf16_mixed or self._dtype == torch.bfloat16
             else False,
-            dtype=torch.bfloat16,
         ):
             if self._with_jit:
                 output = self.model(
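The three deletions above (and the matching ones in single_instance/run_accuracy.py below) drop the explicit `dtype=torch.bfloat16` argument from the `torch.cpu.amp.autocast(...)` guard in `_model_call`. Since torch.bfloat16 is already the default dtype for CPU autocast, the bf16 path should behave the same; the simplified sketch below (placeholder names, not the exact class code) shows the resulting guard.

```python
# Simplified sketch of the autocast guard after this change; `int8_bf16_mixed` and
# `dtype` stand in for the real argparse flag and class attribute.
import torch


def model_call(model, inputs, int8_bf16_mixed: bool, dtype):
    enabled = int8_bf16_mixed or dtype == torch.bfloat16
    # torch.bfloat16 is the default CPU autocast dtype, so no explicit dtype is needed.
    with torch.cpu.amp.autocast(enabled=enabled):
        return model(*inputs)
```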

examples/cpu/inference/python/llm/run.py

Lines changed: 14 additions & 9 deletions
@@ -49,16 +49,16 @@ def main(args_in: Optional[List[str]] = None) -> None:
     parser.add_argument("--output-dir", nargs="?", default="./saved_results")
 
     # quantization related arguments.
-    parser.add_argument("--int8", action="store_true")
+    parser.add_argument("--int8", action="store_true", help="default static int8 path (fp32 mixed)")
     parser.add_argument(
         "--int8-bf16-mixed",
         action="store_true",
-        help="by default it is int8-fp32 mixed, to enable int8 mixed amp bf16 (work on platforms like SPR)",
+        help="by default static quant is int8-fp32 mixed, to enable int8 mixed amp bf16 (work on platforms like SPR)",
     )
-    parser.add_argument("--quantized-model-path", default="")
-
+    parser.add_argument("--quantized-model-path", default="", help="path to the quantized model file")
+    parser.add_argument("--qconfig-summary-file", default="", help="qconfig for static quantization")
     parser.add_argument("--dataset", nargs="?", default="NeelNanda/pile-10k")
-    parser.add_argument("--ipex-smooth-quant", action="store_true")
+    parser.add_argument("--ipex-smooth-quant", action="store_true", help="smoothquant for static quantization")
     parser.add_argument("--alpha", default=0.5, type=float, help="alpha value for smoothquant")
     parser.add_argument(
         "--ipex-weight-only-quantization",
@@ -154,8 +154,9 @@ def main(args_in: Optional[List[str]] = None) -> None:
         if args.config_file is not None:
             infer_cmd.extend(["--config-file", str(args.config_file)])
 
-        print("running model generation...")
+        print("LLM RUNTIME INFO: running model generation...")
         subprocess.run(infer_cmd)
+        print("LLM RUNTIME INFO: finished generation process, exiting...")
     else:
         if args.config_file is None:
             config = AutoConfig.from_pretrained(
@@ -231,7 +232,8 @@ def main(args_in: Optional[List[str]] = None) -> None:
             quant_cmd.extend(["--ipex-smooth-quant"])
             quant_cmd.extend(["--alpha", str(args.alpha)])
             quant_cmd.extend(["--dataset", str(args.dataset)])
-            print("quantizing model ...")
+            quant_cmd.extend(["--qconfig-summary-file", str(args.qconfig_summary_file)])
+            print("LLM RUNTIME INFO: quantizing model ...")
             subprocess.run(quant_cmd)
             infer_cmd.extend(
                 ["--quantized-model-path", str(args.output_dir) + "/best_model.pt"]
@@ -268,8 +270,9 @@ def main(args_in: Optional[List[str]] = None) -> None:
         if args.config_file is not None:
             infer_cmd.extend(["--config-file", str(args.config_file)])
 
-        print("running model generation...")
+        print("LLM RUNTIME INFO: running model generation...")
         subprocess.run(infer_cmd)
+        print("LLM RUNTIME INFO: finished generation process, exiting...")
 
     else:
         path = Path(parent_path, "distributed/run_generation_with_deepspeed.py")
@@ -296,6 +299,7 @@ def main(args_in: Optional[List[str]] = None) -> None:
                 Path.mkdir(model_path)
             shard_cmd.extend(["--save-path", str(args.output_dir)+str(MODEL_CLASSES[model_type])])
             shard_cmd.extend(["--local_rank", str(args.local_rank)])
+            print("LLM RUNTIME INFO: sharding model...")
             subprocess.run(shard_cmd)
             infer_cmd.extend(["-m", str(args.output_dir)+str(MODEL_CLASSES[model_type])])
         else:
@@ -333,8 +337,9 @@ def main(args_in: Optional[List[str]] = None) -> None:
         if args.int8_bf16_mixed:
             infer_cmd.extend(["--int8-bf16-mixed"])
 
-        print("running model generation with deepspeed (autotp)...")
+        print("LLM RUNTIME INFO: running model generation with deepspeed (autotp)...")
         subprocess.run(infer_cmd)
+        print("LLM RUNTIME INFO: finished generation process, exiting...")
 
 
 if __name__ == "__main__":
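Putting the run.py changes together: the launcher gains a `--qconfig-summary-file` option and forwards it to the quantization subprocess, where an empty string (the default) means "no tuned recipe". The sketch below is a condensed illustration of that wiring; the quantization-script path and the reduced set of forwarded options are simplified placeholders, not the full command that run.py actually builds.

```python
# Condensed sketch of the launcher-side wiring (placeholder script path and options).
import argparse
import subprocess

parser = argparse.ArgumentParser()
parser.add_argument("--ipex-smooth-quant", action="store_true", help="smoothquant for static quantization")
parser.add_argument("--qconfig-summary-file", default="", help="qconfig for static quantization")
parser.add_argument("--alpha", default=0.5, type=float, help="alpha value for smoothquant")
parser.add_argument("--output-dir", nargs="?", default="./saved_results")
args = parser.parse_args()

quant_cmd = ["python", "single_instance/run_llama_quantization.py", "--output-dir", str(args.output_dir)]
if args.ipex_smooth_quant:
    quant_cmd.extend(["--ipex-smooth-quant"])
    quant_cmd.extend(["--alpha", str(args.alpha)])
    # An empty value means "no tuned recipe": the downstream script then calibrates with
    # the IPEX default recipe and also writes a qconfig summary into --output-dir.
    quant_cmd.extend(["--qconfig-summary-file", str(args.qconfig_summary_file)])

print("LLM RUNTIME INFO: quantizing model ...")
subprocess.run(quant_cmd)
```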

examples/cpu/inference/python/llm/single_instance/run_accuracy.py

Lines changed: 0 additions & 3 deletions
@@ -419,7 +419,6 @@ def _model_call(
             enabled=True
             if args.int8_bf16_mixed or self._dtype == torch.bfloat16
             else False,
-            dtype=torch.bfloat16,
         ):
             if self._dtype != "int8":
                 if (
@@ -517,7 +516,6 @@ def _model_call(
             enabled=True
             if args.int8_bf16_mixed or self._dtype == torch.bfloat16
             else False,
-            dtype=torch.bfloat16,
         ):
             if self._with_jit:
                 output = self.model(
@@ -534,7 +532,6 @@ def _model_call(
             enabled=True
             if args.int8_bf16_mixed or self._dtype == torch.bfloat16
             else False,
-            dtype=torch.bfloat16,
         ):
             if self._with_jit:
                 output = self.model(
