Hello
I quantized the Qwen3 235B MoE model with FP8 Dynamic and FP8 Static (weights, activations, and KV cache) using the llm-compressor library, and benchmarked the online serving performance with vLLM's benchmarking scripts on H200 GPUs (a rough sketch of the quantization step is included after the results below). I was hoping to see a significant performance improvement, but:
- For model serving with tensor_parallel_size 8, there is barely any throughput or latency improvement; in fact, latencies are slightly worse.
- For model serving with tensor_parallel_size 4, a throughput improvement only appears in very specific scenarios, with inputs in the 5k-7k token range and outputs around 1k tokens. There are no latency improvements.
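
For reference, the FP8-Dynamic quantization step roughly followed llm-compressor's standard `QuantizationModifier` recipe. This is only a minimal sketch: the model ID, output directory, and the MoE router `ignore` pattern are assumptions for illustration, not the exact script I ran.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "Qwen/Qwen3-235B-A22B"  # assumed model ID

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8 dynamic: weights are quantized offline, activations are quantized
# per-token at runtime. The lm_head and MoE router/gate layers are kept in
# higher precision (the gate regex is an assumption based on MoE examples).
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head", "re:.*mlp.gate$"],
)

oneshot(model=model, recipe=recipe)

SAVE_DIR = MODEL_ID.split("/")[-1] + "-FP8-Dynamic"
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
```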
Considering that the weights, many of the activations, and the KV cache are all stored at half the precision, shouldn't there be a significant performance improvement?
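
To make that expectation concrete, here is the back-of-envelope arithmetic behind it (the ~235B parameter count is only used for illustration):

```python
# Rough weight-footprint comparison, BF16 vs. FP8 (illustrative only).
params = 235e9                         # assumed total parameter count
bf16_weights_gb = params * 2 / 1e9     # 2 bytes/param -> ~470 GB
fp8_weights_gb = params * 1 / 1e9      # 1 byte/param  -> ~235 GB

print(f"BF16 weights: ~{bf16_weights_gb:.0f} GB")
print(f"FP8 weights:  ~{fp8_weights_gb:.0f} GB")
# Since decode is largely memory-bandwidth-bound, halving the bytes read per
# step (weights and KV cache) is what led me to expect a visible speedup.
```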
Could you help me understand where my reasoning might be off?
thanks!