[Performance]: Quantized Qwen3 MoE #2069

@sneha5gsm

Description

Hello,

I quantized the Qwen3 235B MoE model using FP8 Dynamic and FP8 Static (weights, activations, and KV cache) with the llm-compressor library, and benchmarked online serving performance using the vLLM benchmarking scripts on H200 GPUs (quantization recipe and benchmark commands are sketched below). I was hoping to see a significant performance improvement, but:

  • For model serving with tensor_parallel_size 8, there is barely any throughput/latency improvement; in fact, latencies are slightly worse.
  • For model serving with tensor_parallel_size 4, throughput improvements appear only in very specific scenarios, with inputs in the range of 5k-7k tokens and outputs around 1k tokens. There are no latency improvements.
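
For reference, here is a minimal sketch of the FP8-Dynamic recipe, based on the standard llm-compressor examples. The model ID, output path, and the MoE gate pattern in the ignore list are illustrative and may not match my exact run (the FP8-Static variant additionally uses calibration data and a KV-cache scheme):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3-235B-A22B"          # illustrative model ID
SAVE_DIR = "Qwen3-235B-A22B-FP8-Dynamic"   # illustrative output path

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8-Dynamic: FP8 weights with dynamic per-token FP8 activations.
# lm_head and the MoE router/gate layers are left in higher precision.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head", "re:.*mlp.gate$"],  # gate regex is illustrative for Qwen3 MoE
)

oneshot(model=model, recipe=recipe)

model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```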

Considering that the weights, many of the activations, and the KV cache are all stored at half the precision, shouldn't there be a significant performance improvement?
Could you help me understand where my understanding might be off?
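
For completeness, the serving and benchmarking setup was along these lines; the token lengths and request count shown are just one point from the input/output sweep, so treat this as a sketch rather than the exact commands:

```bash
# Serve the quantized checkpoint (FP8 KV cache relevant to the static variant)
vllm serve Qwen3-235B-A22B-FP8-Dynamic \
    --tensor-parallel-size 8 \
    --kv-cache-dtype fp8

# Online serving benchmark from the vLLM repo, e.g. ~6k input / 1k output tokens
python benchmarks/benchmark_serving.py \
    --backend vllm \
    --model Qwen3-235B-A22B-FP8-Dynamic \
    --dataset-name random \
    --random-input-len 6000 \
    --random-output-len 1000 \
    --num-prompts 200 \
    --request-rate inf
```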

Thanks!
