Conversation

@Lucaskabela (Contributor) commented Oct 31, 2025

## Purpose

We want to speed up inference for mllama4 by applying torch.compile to the most compute-intensive workload, similar to what was done in #23207. We start by experimenting with the VisionEncoderLayer + PixelShuffle.
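For context, the idea is simply to wrap the hot vision modules in torch.compile so their forward passes run through fused, optimized kernels. Below is a minimal sketch of the pattern, not the actual vLLM integration; `ToyVisionEncoderLayer` is a hypothetical stand-in for the real VisionEncoderLayer.

```python
import torch
import torch.nn as nn


class ToyVisionEncoderLayer(nn.Module):
    """Hypothetical stand-in for the Llama4 vision encoder layer."""

    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))


layer = ToyVisionEncoderLayer()
# The first call traces and compiles the layer; subsequent calls reuse the
# compiled kernels instead of re-running eager-mode dispatch per op.
compiled_layer = torch.compile(layer)
out = compiled_layer(torch.randn(2, 16, 64))  # (batch, image tokens, dim)
```

In the PR itself the wrapping happens inside vLLM's model code; the sketch only illustrates the mechanism.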

## Test Plan

```bash
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
    --tensor-parallel-size=8 --gpu_memory_utilization=.8 --max_model_len=8192

vllm bench serve \
    --backend openai-chat \
    --model meta-llama/Llama-4-Scout-17B-16E-Instruct \
    --endpoint /v1/chat/completions \
    --dataset-name hf \
    --dataset-path lmarena-ai/VisionArena-Chat \
    --hf-split train \
    --num-prompts 1000
```

## Test Result

| Metric | Baseline (main) | This PR |
|---|---|---|
| Successful requests | 998 | 998 |
| Benchmark duration (s) | 72.52 | 62.15 |
| Total generated tokens | 117376 | 117504 |
| Request throughput (req/s) | 13.76 | 16.06 |
| Output token throughput (tok/s) | 1618.52 | 1890.73 |
| Mean TTFT (ms) | 35483.34 | 28623.5 |
| Mean TPOT (ms) | 264.74 | 233.7 |
| Mean ITL (ms) | 256.56 | 227.07 |



Signed-off-by: Lucas Kabela <[email protected]>
@mergify bot added the llama (Related to Llama models) label on Oct 31, 2025
@Lucaskabela changed the title from "[Misc][LLaMa4] Compile LLaMa Vision Encoder layers" to "[Draft][DO NOT MERGE][Misc][LLaMa4] Compile LLaMa Vision Encoder layers" on Oct 31, 2025
@Lucaskabela (Contributor, Author) commented Oct 31, 2025

Updated to use dynamic dims; it seems we still get a very good speedup here!
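For context, "dynamic dims" here refers to compiling with symbolic shapes so one compiled graph serves varying image-token counts instead of recompiling per shape. A minimal sketch of that idea, assuming the `torch._dynamo.mark_dynamic` API (the PR's exact mechanism may differ):

```python
import torch


def double(t: torch.Tensor) -> torch.Tensor:
    return t * 2


compiled = torch.compile(double)

x = torch.randn(2, 16, 64)
# Mark dim 1 (the image-token dimension) as dynamic so the compiled graph
# uses a symbolic size there rather than specializing on 16.
torch._dynamo.mark_dynamic(x, 1)
_ = compiled(x)

# A later call with a different token count reuses the same compiled graph
# instead of triggering a recompile.
y = torch.randn(2, 32, 64)
_ = compiled(y)
```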

@Lucaskabela force-pushed the lucaskabela/compile_llama4 branch from 29dec46 to d807a34 on November 4, 2025 17:56
@Lucaskabela marked this pull request as ready for review on November 4, 2025 18:17
@Lucaskabela (Contributor, Author) commented:
cc @zou3519 @ProExpertProg @ywang96

@Lucaskabela changed the title from "[Draft][DO NOT MERGE][Misc][LLaMa4] Compile LLaMa Vision Encoder layers" to "[Misc][LLaMa4] Compile LLaMa Vision Encoder layers" on Nov 4, 2025
@ywang96 added the ready (ONLY add when PR is ready to merge/full CI is needed) label on Nov 7, 2025