make eval script also handle performance measurement #3473

vkuzo · 2025-12-09T20:47:51Z

Summary:

1. refactors the eval script to also handle performance measurement in vllm
2. adds a simple `vllm bench latency` script to bench in vllm for prefill and decode

Also, add convenience flags to skip model creation, lm_eval, vllm as
needed to enable running just a single model + single step.
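
As a rough illustration of how such skip flags can gate a pipeline, here is a minimal Python sketch. The variable names `SKIP_MODEL_CREATE`, `SKIP_LM_EVAL`, and `SKIP_VLLM` come from the test-plan command in this PR's commit message. The step names, the `enabled_steps` helper, and the convention that an unset variable means "run the step" are assumptions for illustration only. The real logic lives in the shell script and may differ.

```python
import os


def enabled_steps(env=None):
    """Return the pipeline steps to run, honoring SKIP_* variables.

    Assumption: a step is skipped only when its variable is exactly "1";
    unset or any other value means the step runs.
    """
    env = os.environ if env is None else env
    # (hypothetical step name, environment variable that skips it)
    steps = [
        ("model_create", "SKIP_MODEL_CREATE"),
        ("lm_eval", "SKIP_LM_EVAL"),
        ("vllm", "SKIP_VLLM"),
    ]
    return [name for name, flag in steps if env.get(flag, "0") != "1"]
```

With `SKIP_MODEL_CREATE=1 SKIP_LM_EVAL=1 SKIP_VLLM=0`, this sketch would leave only the vllm benchmarking step enabled, which matches the "single step" use case described above.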

Test Plan:

with-proxy ./benchmarks/quantization/measure_accuracy_and_performance.sh h100
// full output: https://www.internalfb.com/phabricator/paste/view/P2094641791

Results (on H100):

Library Versions:
================================================================================
torch.__version__: 2.9.0+cu128
torch.cuda.get_device_name(): NVIDIA H100
torchao.__version__: 0.14.0+git5c8a14207
vllm.__version__: 0.13.0

Quantization Recipe Results:
================================================================================
+--------------------------+--------------+--------------+--------------+--------------+-----------+----------+-----------+-----------+
| Recipe                   |   Checkpoint |     Wikitext |   Winogrande |   Winogrande |   Prefill |   Decode |   Speedup |   Speedup |
|                          |         (GB) |   Perplexity |          Acc |       Stderr |    toks/s |   toks/s |   Prefill |    Decode |
+==========================+==============+==============+==============+==============+===========+==========+===========+===========+
| None                     |        16.08 |       7.5435 |       0.7419 |       0.0123 |   30946.5 |  6612    |     1     |     1     |
+--------------------------+--------------+--------------+--------------+--------------+-----------+----------+-----------+-----------+
| float8_rowwise           |         9.1  |       7.5919 |       0.7348 |       0.0124 |   45312.5 |  8025.95 |     1.464 |     1.214 |
+--------------------------+--------------+--------------+--------------+--------------+-----------+----------+-----------+-----------+
| int8_rowwise_weight_only |         9.11 |       7.5561 |       0.7427 |       0.0123 |   28231.9 |  4309.8  |     0.912 |     0.652 |
+--------------------------+--------------+--------------+--------------+--------------+-----------+----------+-----------+-----------+
| int8_rowwise             |         9.1  |       7.6567 |       0.738  |       0.0124 |           |          |           |           |
+--------------------------+--------------+--------------+--------------+--------------+-----------+----------+-----------+-----------+
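
The speedup columns are consistent with a simple ratio of each recipe's throughput to the unquantized (`None`) baseline, rounded to three decimals. The sketch below reproduces them from the throughput numbers in the table. The `throughput` dict and `speedups` helper are illustrative names, not code from the PR, and the ratio formula is inferred from the reported values rather than confirmed against the script.

```python
# (prefill toks/s, decode toks/s), copied from the results table above.
# int8_rowwise is omitted because its performance cells are empty.
throughput = {
    "None": (30946.5, 6612.0),
    "float8_rowwise": (45312.5, 8025.95),
    "int8_rowwise_weight_only": (28231.9, 4309.8),
}


def speedups(results, baseline="None"):
    """Return {recipe: (prefill_speedup, decode_speedup)} relative to baseline."""
    base_prefill, base_decode = results[baseline]
    return {
        recipe: (round(prefill / base_prefill, 3), round(decode / base_decode, 3))
        for recipe, (prefill, decode) in results.items()
    }


for recipe, (prefill_x, decode_x) in speedups(throughput).items():
    print(f"{recipe}: prefill {prefill_x}x, decode {decode_x}x")
```

A value below 1, as for `int8_rowwise_weight_only`, means the recipe ran slower than the baseline in this measurement.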


vkuzo · 2025-12-09T20:47:52Z

Stack from ghstack (oldest at bottom):

-> make eval script also handle performance measurement #3473

pytorch-bot · 2025-12-09T20:47:55Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3473

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 6896527 with merge base 486fe0d:

BROKEN TRUNK - The following job failed but was also present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

Run Regression Tests / test-nightly (CUDA Nightly, linux.g5.12xlarge.nvidia.gpu, --pre torch --index-url https://downloa... / linux-job (gh) (trunk failure)
test/test_low_bit_optim.py::TestFSDP2::test_fsdp2

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Summary:

1. refactors the eval script to also handle performance measurement in vllm
2. adds a simple `vllm bench latency` script to bench in vllm

The script is broken on every single recipe, we'll have to fix and enable things in future PRs, will update the performance tables afterwards.

Also, add convenience flags to skip model creation, lm_eval, vllm as needed to enable running just a single model + single step.

Test Plan:

```
SKIP_MODEL_CREATE=1 SKIP_LM_EVAL=1 SKIP_VLLM=0 with-proxy ./benchmarks/quantization/measure_accuracy_and_performance.sh h100
```

Reviewers:
Subscribers:
Tasks:
Tags:

ghstack-source-id: e1d713e
ghstack-comment-id: 3634216524
Pull-Request: #3473

jainapurva

LGTM, thanks!

vkuzo added 4 commits December 9, 2025 06:30

Update

cbc18b3

[ghstack-poisoned]

Update

cf212a9

[ghstack-poisoned]

Update

d4f3afd

[ghstack-poisoned]

Update

9450d20

[ghstack-poisoned]

meta-cla bot added the CLA Signed label Dec 9, 2025

This was referenced Dec 9, 2025

simplify accuracy eval #3470

Merged

refactor accuracy eval script to be organized by hardware #3472

Merged

vkuzo added the topic: for developers label Dec 9, 2025

vkuzo requested review from jainapurva and jerryzh168 December 10, 2025 11:24

vkuzo added 2 commits December 10, 2025 10:08

Update

d99f6d8

[ghstack-poisoned]

Update

17ef1f7

[ghstack-poisoned]

Update

8aba356

[ghstack-poisoned]

vkuzo changed the base branch from gh/vkuzo/182/head to main December 10, 2025 18:09

Update

ebde070

[ghstack-poisoned]

Update

86304cb

[ghstack-poisoned]

Update

d0f8a00

[ghstack-poisoned]

Update

1eb4438

[ghstack-poisoned]

Update

55c2ab4

[ghstack-poisoned]

Update

6896527

[ghstack-poisoned]

jainapurva approved these changes Dec 23, 2025
