Commit ed452ff

add eval config for Qwen3-235B-A22B-Thinking-2507-FP8

Signed-off-by: Huamin Li <[email protected]>
1 parent 99722d5

4 files changed: +32 −3 lines changed
.buildkite/lm-eval-harness/configs/Qwen3-235B-A22B-Thinking-2507-FP8.yaml

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+model_name: "Qwen/Qwen3-235B-A22B-Thinking-2507-FP8"
+backend: "vllm"
+tasks:
+- name: "gsm8k"
+  metrics:
+  - name: "exact_match,strict-match"
+    value: 0.8
+num_fewshot: 5
+limit: 1000
+max_model_len: 8096
+gen_kwargs: "top_p=1,top_k=0,max_gen_toks=1536"
+apply_chat_template: true
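
For orientation, the gsm8k metric in this config acts as a pass/fail threshold for CI. A minimal sketch of that check, assuming a plain YAML loader and an illustrative tolerance constant (RTOL and the hard-coded measured value below are stand-ins, not the harness's actual code):

import yaml

RTOL = 0.08  # illustrative tolerance; the CI test defines its own

with open("Qwen3-235B-A22B-Thinking-2507-FP8.yaml") as f:
    eval_config = yaml.safe_load(f)

for task in eval_config["tasks"]:
    for metric in task["metrics"]:
        expected = metric["value"]   # 0.8 for exact_match,strict-match
        measured = 0.83              # stand-in for the value lm-eval reports
        assert measured >= expected - RTOL, (
            f"{task['name']} {metric['name']}: {measured} vs expected {expected}"
        )
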
.buildkite/lm-eval-harness/configs/models-medium-h100.txt

Lines changed: 1 addition & 0 deletions

@@ -0,0 +1 @@
+Qwen3-235B-A22B-Thinking-2507-FP8.yaml
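
This one-line list file is what the pipeline's --config-list-file option points at: each line names a YAML config that the correctness test should run. A rough sketch of that lookup, assuming the configs live next to the list file (the helper name here is illustrative, not the test's actual API):

from pathlib import Path

import yaml


def iter_eval_configs(config_list_file: Path):
    """Yield one parsed eval config per non-empty, non-comment line."""
    configs_dir = config_list_file.parent
    for line in config_list_file.read_text().splitlines():
        name = line.strip()
        if not name or name.startswith("#"):
            continue
        yield yaml.safe_load((configs_dir / name).read_text())


# e.g. iter_eval_configs(Path("configs/models-medium-h100.txt"))
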

.buildkite/lm-eval-harness/test_lm_eval_correctness.py

Lines changed: 6 additions & 2 deletions

@@ -37,10 +37,14 @@ def launch_lm_eval(eval_config, tp_size):
         limit=eval_config["limit"],
         # TODO(yeq): using chat template w/ fewshot_as_multiturn is supposed help
         # text models. however, this is regressing measured strict-match for
-        # existing text models in CI, so only apply it for mm.
-        apply_chat_template=backend == "vllm-vlm",
+        # existing text models in CI, so only apply it for mm or specified in config.
+        apply_chat_template=eval_config.get(
+            "apply_chat_template", backend == "vllm-vlm"
+        ),
         batch_size=batch_size,
+        gen_kwargs=eval_config.get("gen_kwargs", None),
     )
+
     return results
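
The net effect of the apply_chat_template change is a per-config override with the previous behaviour as the fallback; gen_kwargs is likewise forwarded only when a config sets it. A small illustration of the fallback logic (not the CI code itself):

def resolve_apply_chat_template(eval_config: dict, backend: str) -> bool:
    # An explicit value in the YAML wins; otherwise keep the old rule of
    # enabling the chat template only for the multimodal backend.
    return eval_config.get("apply_chat_template", backend == "vllm-vlm")


assert resolve_apply_chat_template({}, "vllm") is False
assert resolve_apply_chat_template({}, "vllm-vlm") is True
assert resolve_apply_chat_template({"apply_chat_template": True}, "vllm") is True
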

.buildkite/test-pipeline.yaml

Lines changed: 13 additions & 1 deletion

@@ -1084,7 +1084,7 @@ steps:
   - tests/weight_loading
   commands:
   - bash weight_loading/run_model_weight_loading_test.sh -c weight_loading/models-large.txt
-
+
 - label: NixlConnector PD accuracy tests (Distributed) # 30min
   timeout_in_minutes: 30
   working_dir: "/vllm-workspace/tests"

@@ -1126,6 +1126,18 @@ steps:
   - export VLLM_WORKER_MULTIPROC_METHOD=spawn
   - pytest -s -v test_lm_eval_correctness.py --config-list-file=configs/models-large.txt --tp-size=4

+##### H100 test #####
+- label: LM Eval Medium Models (H100) # optional
+  gpu: h100
+  optional: true
+  num_gpus: 4
+  working_dir: "/vllm-workspace/.buildkite/lm-eval-harness"
+  source_file_dependencies:
+  - csrc/
+  - vllm/model_executor/layers/quantization
+  commands:
+  - pytest -s -v test_lm_eval_correctness.py --config-list-file=configs/models-medium-h100.txt --tp-size=4
+
 ##### H200 test #####
 - label: Distributed Tests (H200) # optional
   gpu: h200
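
The new step runs only on the optional H100 queue with tensor parallelism of 4. To reproduce it outside Buildkite, something like the following should work, assuming 4 visible GPUs and an environment with vLLM and lm-eval installed (a sketch that mirrors the CI command, not an official entry point):

import subprocess

# Mirror the CI command from the new "LM Eval Medium Models (H100)" step.
subprocess.run(
    [
        "pytest", "-s", "-v", "test_lm_eval_correctness.py",
        "--config-list-file=configs/models-medium-h100.txt",
        "--tp-size=4",
    ],
    cwd=".buildkite/lm-eval-harness",
    check=True,
)
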
