[Bug]: LoRA Support Missing for Encoder-Decoder Models in TensorRT-LLM CPP Implementation #10258

@xiaoxiaoyuwen

Description

System Info

NVIDIA Driver: 580.105.08

CUDA Version: 13.0

GPU: RTX 3090

Triton Server Image: nvcr.io/nvidia/tritonserver:25.05-trtllm-python-py3

Who can help?

@laikhtewari

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Build the TensorRT-LLM engine using the following command:
trtllm-build --checkpoint_dir $TLLM_MODEL_DIR \
             --output_dir $TLLM_ENGINE_DIR \
             --moe_plugin disable \
             --max_beam_width ${MAX_BEAM_WIDTH} \
             --max_batch_size 64 \
             --max_input_len 1 \
             --max_seq_len 300 \
             --max_encoder_input_len 1000 \
             --gemm_plugin ${INFERENCE_PRECISION} \
             --bert_attention_plugin ${INFERENCE_PRECISION} \
             --gpt_attention_plugin ${INFERENCE_PRECISION} \
             --lora_plugin float16 \
             --max_lora_rank 64 \
             --lora_target_modules attn_q attn_k attn_v attn_dense mlp_h_to_4h mlp_4h_to_h cross_attn_q cross_attn_k cross_attn_v cross_attn_dense
  2. Deploy the generated engine with Triton Server using the tensorrtllm backend and the inflight_fused_batching batching strategy.
  3. Send a request to the deployed model (a minimal request sketch follows below).
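
For step 3, a minimal request might look like the sketch below. The model name (ensemble), HTTP port (8000), and field names (text_input, max_tokens, lora_task_id) are assumptions based on the default tensorrtllm backend model repository and may differ in your deployment.

# Assumed defaults: model "ensemble", Triton HTTP port 8000, generate-endpoint
# field names text_input / max_tokens / lora_task_id (adjust to your repo).
# lora_task_id refers to a LoRA adapter previously registered with the server.
curl -s -X POST localhost:8000/v2/models/ensemble/generate -d '{
  "text_input": "translate English to German: The house is wonderful.",
  "max_tokens": 64,
  "lora_task_id": 1
}'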

Expected behavior

The TensorRT-LLM engine should support LoRA for encoder-decoder models, and the deployed Triton model should serve requests successfully.

Actual behavior

Triton Server fails the request with the following error:

[TensorRT-LLM][ERROR] Encountered an error in forwardAsync function: Input tensor 'host_encoder_input_lengths' not found; expected shape: (-1) (/workspace/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:524)
1       0x7f467669df2b tensorrt_llm::runtime::TllmRuntime::setInputTensorsImpl(int, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::shared_ptr<tensorrt_llm::runtime::ITensor>, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::shared_ptr<tensorrt_llm::runtime::ITensor> > > > const&, bool) + 827
2       0x7f46766a0ed6 tensorrt_llm::runtime::TllmRuntime::setInputTensors(int, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::shared_ptr<tensorrt_llm::runtime::ITensor>, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::shared_ptr<tensorrt_llm::runtime::ITensor> > > > const&) + 70

Additional notes

  1. The same configuration works for decoder-only models with LoRA.
  2. The issue specifically affects encoder-decoder architectures when LoRA is enabled.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.


Labels

  • Lora/P-tuning: Parameter-Efficient Fine-Tuning (PEFT) like LoRA/P-tuning in TRT-LLM; adapter use & perf.
  • Triton backend<NV>: Related to NVIDIA Triton Inference Server backend
  • bug: Something isn't working
