
Conversation

@hadiFute
Contributor

Description

Allow the user to specify auxiliary streams for the TensorRT RTX EP.

Motivation and Context

In some use cases, we want to have full control over all the streams used by TRT-RTX, even auxiliary ones.
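
A minimal sketch of the intended usage, assuming the provider option keys mirror the info fields discussed later in this thread (user_aux_stream_array, auxiliary_streams), that pointer-valued options are passed as the address rendered to a string (the user_compute_stream pattern), and that the EP registers under the name below; all of these are assumptions, not confirmed API:

#include <cuda_runtime.h>
#include <onnxruntime_cxx_api.h>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

int main() {
  // The application owns the auxiliary streams end to end.
  std::vector<cudaStream_t> aux_streams(2);
  for (auto& s : aux_streams) {
    cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
  }

  // Provider options are string key/value pairs, so the array pointer is
  // passed as its numeric address. Key names are assumptions based on the
  // info fields reviewed below.
  std::unordered_map<std::string, std::string> ep_options{
      {"user_aux_stream_array",
       std::to_string(reinterpret_cast<uintptr_t>(aux_streams.data()))},
      {"auxiliary_streams", "2"}};

  Ort::Env env;
  Ort::SessionOptions so;
  so.AppendExecutionProvider("NvTensorRtRtx", ep_options);  // EP name is an assumption
  // Ort::Session session(env, ORT_TSTR("model.onnx"), so);  // hypothetical model path

  for (auto& s : aux_streams) {
    cudaStreamDestroy(s);
  }
  return 0;
}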

@hadiFute
Contributor Author

For review @chilo-ms @edgchen1 @ishwar-raut1

@anskumar01

@chilo-ms, can you please review and merge it?

@chilo-ms
Contributor

chilo-ms commented Dec 2, 2025

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows x64 QNN CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@yuslepukhin
Member

Looks good to me.

@yuslepukhin
Member

/azp run PR Test Linux CUDA x64 Release, Test Linux TensorRT x64 Release, web_Debug/build_onnxruntime_web, web_Release/build_onnxruntime_web

@azure-pipelines

No pipelines are associated with this pull request.

@yuslepukhin
Member

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows x64 QNN CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@yuslepukhin
Member

/azp run Windows CPU CI Pipeline, Windows GPU CI Pipeline, Windows GPU TensorRT CI Pipeline, ONNX Runtime Web CI Pipeline, Windows ARM64 QNN CI Pipeline, Windows x64 QNN CI Pipeline, onnxruntime-python-checks-ci-pipeline

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@yuslepukhin
Member

/azp run Build and Test OpenVINO EP, Build Linux arm64 Debug, Build Linux arm64 Release

@azure-pipelines

No pipelines are associated with this pull request.

@yuslepukhin
Member

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU CUDA CI Pipeline, Windows GPU DML CI Pipeline, Windows GPU Doc Gen CI Pipeline, Windows GPU TensorRT CI Pipeline, Windows OpenVINO CI Pipeline, Windows x64 QNN CI Pipeline,

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@yuslepukhin
Member

/azp run xnnpack / build-and-test

@azure-pipelines

No pipelines are associated with this pull request.

Contributor

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@tianleiwu
Contributor

Below is an AI analysis:

This is a significant functional update for the TensorRT EP. While the configuration logic (parsing options) looks standard, there is a critical implementation bug in how the stream array pointers are handled in nv_execution_provider.cc and nv_execution_provider.h. This will likely cause undefined behavior (crashes or invalid stream usage) when auxiliary_streams > 1.

Here is the detailed review:

1. Critical Fix Required: Pointer Indirection Logic

The current implementation defines aux_streams_ as a single cudaStream_t, but tries to force a user-provided array into it.

The Issue:
In nv_execution_provider.h:

// [Current] This allocates space for exactly ONE stream handle
cudaStream_t aux_streams_ = nullptr;

In nv_execution_provider.cc:

// [Current] You cast the POINTER to the array (void*) into a SINGLE stream handle.
// This sets the value of 'aux_streams_' to the memory address of the user's array.
aux_streams_ = static_cast<cudaStream_t>(info.user_aux_stream_array);

// ... later in compute_func ...

// [Current] You pass the address of 'aux_streams_'.
// setAuxStreams expects 'cudaStream_t*'.
// TensorRT will read the first element as the VALUE of 'aux_streams_' (which is the address of the user array).
// If count > 1, TensorRT will read past the 'aux_streams_' member variable into garbage memory.
trt_context->setAuxStreams(&aux_streams_, (int32_t)auxiliary_streams_);

The Fix:
You must store the pointer to the user's array, not the array's address cast to a stream.

In nv_execution_provider.h:

// Change type to a pointer to a stream (representing the array)
cudaStream_t* aux_streams_ = nullptr; 

In nv_execution_provider.cc (Constructor):

// Cast void* to cudaStream_t* (pointer to array of streams)
aux_streams_ = reinterpret_cast<cudaStream_t*>(info.user_aux_stream_array);

In nv_execution_provider.cc (Compute):

// Pass the pointer directly. Do not use '&'.
trt_context->setAuxStreams(aux_streams_, (int32_t)auxiliary_streams_);

2. Configuration & Validation

Redundant Validation:
You have validation logic in two places: inside the Constructor (NvExecutionProvider) and inside compute_func.

  • Location 1 (Constructor): This is good. It fails early if the config is bad.
  • Location 2 (Compute Func): This check (if(auxiliary_streams_ <= 0)) is redundant and adds branching to the hot path (inference). Since you already validated this in the constructor, you can remove it from compute_func.

Constructor Logic (Line 101):

if (info.user_aux_stream_array != nullptr) {
  if(info.auxiliary_streams <= 0){
     // Error...
  }
  // Logic OK, but ensure you apply the pointer fix mentioned in point #1
}

3. Compilation & Type Safety

Casting style:
In nv_execution_provider_info.cc:

// Line 114
user_aux_stream_array = reinterpret_cast<void*>(address);

This follows the standard ORT convention for parsing pointer addresses from strings.

However, in nv_execution_provider.cc:

// Line 102
aux_streams_ = static_cast<cudaStream_t>(info.user_aux_stream_array);

If you adopt the fix in Point #1 (changing aux_streams_ to cudaStream_t*), you should use reinterpret_cast here, as cudaStream_t is an opaque pointer structure (struct CUstream_st *), and you are converting from void*. static_cast generally cannot convert void* to other pointer types.


Summary of Recommended Changes

Here is how the corrected code blocks should look:

nv_execution_provider.h

  bool external_aux_streams_ = false;
  cudaStream_t* aux_streams_ = nullptr; // Changed from cudaStream_t to cudaStream_t*
  int auxiliary_streams_ = -1;

nv_execution_provider.cc

// Constructor
  if (info.user_aux_stream_array != nullptr) {
    if(info.auxiliary_streams <= 0){
      ORT_THROW_IF_ERROR(ORT_MAKE_STATUS(ONNXRUNTIME, EP_FAIL, "Auxiliary streams must be greater than 0 when using external auxiliary streams"));
    }
    external_aux_streams_ = true;
    // Cast to pointer to streams
    aux_streams_ = reinterpret_cast<cudaStream_t*>(info.user_aux_stream_array);
  } else {
    external_aux_streams_ = false;
    aux_streams_ = nullptr;
  }

// ...

// Compute Function (remove the redundant <= 0 check here for speed)
    if (external_aux_streams_ && aux_streams_ != nullptr) {
      // Pass the pointer directly, do not use &
      trt_context->setAuxStreams(aux_streams_, (int32_t)auxiliary_streams_);
    }

@hadiFute
Contributor Author

hadiFute commented Dec 3, 2025

If you adopt the fix in Point #1 (changing aux_streams_ to cudaStream_t*), you should use reinterpret_cast here, as cudaStream_t is an opaque pointer structure (struct CUstream_st *), and you are converting from void*. static_cast generally cannot convert void* to other pointer types.

cudaStream_t is already a pointer (to CUstream_st), so I would say the AI is not right here?

@tianleiwu
Contributor

tianleiwu commented Dec 3, 2025

cudaStream_t is already a pointer (to CUstream_st), so I would say the AI is not right here?

The array is a list of cudaStream_t handles (pointers), and we should pass the address of the array to ONNX Runtime. If you pass a cudaStream_t (the first element of the array), it will only work when there is one aux stream.

Could you do some end-to-end test with two aux streams to verify?

Example usage for TensorRT:

    #include <cuda_runtime.h>
    #include <vector>

    // CHECK_CUDA is the usual CUDA error-checking macro;
    // 'context' is an nvinfer1::IExecutionContext*.
    int numAuxStreams = 4;
    std::vector<cudaStream_t> auxStreams(numAuxStreams);

    for (int i = 0; i < numAuxStreams; ++i) {
        CHECK_CUDA(cudaStreamCreateWithFlags(&auxStreams[i], cudaStreamNonBlocking));
    }

    context->setAuxStreams(auxStreams.data(), numAuxStreams);

For onnxruntime, we should pass auxStreams.data() to the TRT EP.
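
To make the levels of indirection concrete, here is a minimal standalone sketch (illustrative only, not the PR's exact code):

#include <cuda_runtime.h>

// cudaStream_t is itself a pointer (CUstream_st*), which is the source of
// the confusion: an array of streams still needs one more level of
// indirection to address its elements.
int main() {
  cudaStream_t streams[2];
  cudaStreamCreateWithFlags(&streams[0], cudaStreamNonBlocking);
  cudaStreamCreateWithFlags(&streams[1], cudaStreamNonBlocking);

  // What setAuxStreams expects: a pointer to the first element.
  cudaStream_t* aux = streams;

  // Storing streams[0] in a single cudaStream_t member and passing that
  // member's address happens to work for one stream, but element [1] would
  // then read adjacent object memory instead of the user's second stream.
  (void)aux;

  cudaStreamDestroy(streams[0]);
  cudaStreamDestroy(streams[1]);
  return 0;
}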

@yuslepukhin
Member

Some small comments, otherwise looks good.

@hadiFute
Contributor Author

hadiFute commented Dec 5, 2025

Thanks @tianleiwu for the feedback! I pushed new changes based on it. Let me know if you see anything else that needs improvement.

@yuslepukhin
Member

Need to run lintrunner -a

@tianleiwu
Contributor

tianleiwu commented Dec 5, 2025

The code looks good to me.

Please format the code to pass CI pipeline:
https://github.com/microsoft/onnxruntime/blob/main/docs/Coding_Conventions_and_Standards.md#linting

@hadiFute
Contributor Author

hadiFute commented Dec 8, 2025

Linting errors should be fixed now!

@yuslepukhin
Member

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU CUDA CI Pipeline, Windows GPU DML CI Pipeline, Windows GPU Doc Gen CI Pipeline, Windows GPU TensorRT CI Pipeline, Windows OpenVINO CI Pipeline, Windows x64 QNN CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@tianleiwu enabled auto-merge (squash) December 8, 2025 19:28
@tianleiwu merged commit 0b8180e into microsoft:main Dec 8, 2025
89 checks passed
