Labels: Lora/P-tuning (Parameter-Efficient Fine-Tuning (PEFT) like LoRA/P-tuning in TRTLLM: adapter use & perf.), Triton backend <NV> (Related to NVIDIA Triton Inference Server backend), bug (Something isn't working)
Description
System Info
NVIDIA Driver: 580.105.08
CUDA Version: 13.0
GPU: RTX 3090
Triton Server Image: nvcr.io/nvidia/tritonserver:25.05-trtllm-python-py3
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
- Build the TensorRT-LLM engine using the following command:
trtllm-build --checkpoint_dir $TLLM_MODEL_DIR \
--output_dir $TLLM_ENGINE_DIR \
--moe_plugin disable \
--max_beam_width ${MAX_BEAM_WIDTH} \
--max_batch_size 64 \
--max_input_len 1 \
--max_seq_len 300 \
--max_encoder_input_len 1000 \
--gemm_plugin ${INFERENCE_PRECISION} \
--bert_attention_plugin ${INFERENCE_PRECISION} \
--gpt_attention_plugin ${INFERENCE_PRECISION} \
--lora_plugin float16 \
--max_lora_rank 64 \
--lora_target_modules attn_q attn_k attn_v attn_dense mlp_h_to_4h mlp_4h_to_h cross_attn_q cross_attn_k cross_attn_v cross_attn_dense
- Deploy the generated engine using Triton Server with the tensorrtllm backend and inflight_fused_batching (a deployment sketch is shown after this list).
- Send a request to the deployed model (see the sample request below).
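For reference, a minimal sketch of the deployment step. It assumes the standard model repository layout and the fill_template.py helper from the tensorrtllm_backend repository; the parameter names, paths, and the encoder engine variable below are placeholders taken from those examples, not the exact values used in this report, and may differ between releases.

# Assumption: triton_model_repo was copied from
# tensorrtllm_backend/all_models/inflight_batcher_llm and its other
# config.pbtxt templates are already filled in. Parameter names follow
# the tensorrtllm_backend examples; check your backend version.
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt \
    "triton_backend:tensorrtllm,triton_max_batch_size:64,decoupled_mode:False,engine_dir:${TLLM_ENGINE_DIR},encoder_engine_dir:${TLLM_ENCODER_ENGINE_DIR},batching_strategy:inflight_fused_batching,max_beam_width:${MAX_BEAM_WIDTH}"

# Start Triton Server inside the 25.05-trtllm container.
tritonserver --model-repository=triton_model_repo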
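And a sample request sketch, assuming the ensemble model from the tensorrtllm_backend examples is exposed over HTTP on port 8000; the model name, the prompt, and the lora_task_id value are illustrative only.

# Assumption: ensemble model and default HTTP port from the
# tensorrtllm_backend examples; adjust the model name and fields as needed.
curl -X POST localhost:8000/v2/models/ensemble/generate -d '{
  "text_input": "translate English to German: How are you?",
  "max_tokens": 64,
  "lora_task_id": 0
}'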
Expected behavior
The TensorRT-LLM engine should support LoRA for encoder-decoder models, and the request should complete successfully.
Actual behavior
Triton Server reports the following error:
[TensorRT-LLM][ERROR] Encountered an error in forwardAsync function: Input tensor 'host_encoder_input_lengths' not found; expected shape: (-1) (/workspace/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:524)
1 0x7f467669df2b tensorrt_llm::runtime::TllmRuntime::setInputTensorsImpl(int, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::shared_ptr<tensorrt_llm::runtime::ITensor>, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::shared_ptr<tensorrt_llm::runtime::ITensor> > > > const&, bool) + 827
2 0x7f46766a0ed6 tensorrt_llm::runtime::TllmRuntime::setInputTensors(int, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::shared_ptr<tensorrt_llm::runtime::ITensor>, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::shared_ptr<tensorrt_llm::runtime::ITensor> > > > const&) + 70
Additional notes
- The same configuration works for decoder-only models with LoRA.
- The issue specifically affects encoder-decoder architectures when LoRA is enabled.
Before submitting a new issue...
- Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.