
Conversation

@hadiFute
Contributor

Description

Allow the user to specify auxiliary streams for the TensorRT RTX EP.

Motivation and Context

In some use cases, we want to have full control over all the streams used by TRT-RTX, even auxiliary ones.
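
A minimal sketch of the intended usage, assuming the provider option keys mirror the info fields discussed later in this thread (user_aux_stream_array, auxiliary_streams), that pointer-valued options are passed as the address rendered to a string (the user_compute_stream pattern), and that the EP registers under the name below; all of these are assumptions, not confirmed API:

#include <cuda_runtime.h>
#include <onnxruntime_cxx_api.h>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

int main() {
  // The application owns the auxiliary streams end to end.
  std::vector<cudaStream_t> aux_streams(2);
  for (auto& s : aux_streams) {
    cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
  }

  // Provider options are string key/value pairs, so the array pointer is
  // passed as its numeric address. Key names are assumptions based on the
  // info fields reviewed below.
  std::unordered_map<std::string, std::string> ep_options{
      {"user_aux_stream_array",
       std::to_string(reinterpret_cast<uintptr_t>(aux_streams.data()))},
      {"auxiliary_streams", "2"}};

  Ort::Env env;
  Ort::SessionOptions so;
  so.AppendExecutionProvider("NvTensorRtRtx", ep_options);  // EP name is an assumption
  // Ort::Session session(env, ORT_TSTR("model.onnx"), so);  // hypothetical model path

  for (auto& s : aux_streams) {
    cudaStreamDestroy(s);
  }
  return 0;
}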

@hadiFute
Contributor Author

For review @chilo-ms @edgchen1 @ishwar-raut1

@anskumar01

@chilo-ms, can you please review and merge it?

@chilo-ms
Contributor

chilo-ms commented Dec 2, 2025

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows x64 QNN CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@yuslepukhin
Member

Looks good to me.

@yuslepukhin
Member

/azp run PR Test Linux CUDA x64 Release, Test Linux TensorRT x64 Release, web_Debug/build_onnxruntime_web, web_Release/build_onnxruntime_web

@azure-pipelines

No pipelines are associated with this pull request.

@yuslepukhin
Member

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows x64 QNN CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@yuslepukhin
Member

/azp run Windows CPU CI Pipeline, Windows GPU CI Pipeline, Windows GPU TensorRT CI Pipeline, ONNX Runtime Web CI Pipeline, Windows ARM64 QNN CI Pipeline, Windows x64 QNN CI Pipeline, onnxruntime-python-checks-ci-pipeline

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@yuslepukhin
Member

/azp run Build and Test OpenVINO EP, Build Linux arm64 Debug, Build Linux arm64 Release

@azure-pipelines

No pipelines are associated with this pull request.

@yuslepukhin
Member

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU CUDA CI Pipeline, Windows GPU DML CI Pipeline, Windows GPU Doc Gen CI Pipeline, Windows GPU TensorRT CI Pipeline, Windows OpenVINO CI Pipeline, Windows x64 QNN CI Pipeline,

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@yuslepukhin
Member

/azp run xnnpack / build-and-test

@azure-pipelines

No pipelines are associated with this pull request.

Contributor

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@tianleiwu
Contributor

Below is an AI analysis:

This is a significant functional update for the TensorRT EP. While the configuration logic (parsing options) looks standard, there is a critical implementation bug in how the stream array pointers are handled in nv_execution_provider.cc and nv_execution_provider.h. This will likely cause undefined behavior (crashes or invalid stream usage) when auxiliary_streams > 1.

Here is the detailed review:

1. Critical Fix Required: Pointer Indirection Logic

The current implementation defines aux_streams_ as a single cudaStream_t, but tries to force a user-provided array into it.

The Issue:
In nv_execution_provider.h:

// [Current] This allocates space for exactly ONE stream handle
cudaStream_t aux_streams_ = nullptr;

In nv_execution_provider.cc:

// [Current] You cast the POINTER to the array (void*) into a SINGLE stream handle.
// This sets the value of 'aux_streams_' to the memory address of the user's array.
aux_streams_ = static_cast<cudaStream_t>(info.user_aux_stream_array);

// ... later in compute_func ...

// [Current] You pass the address of 'aux_streams_'.
// setAuxStreams expects 'cudaStream_t*'.
// TensorRT will read the first element as the VALUE of 'aux_streams_' (which is the address of the user array).
// If count > 1, TensorRT will read past the 'aux_streams_' member variable into garbage memory.
trt_context->setAuxStreams(&aux_streams_, (int32_t)auxiliary_streams_);

The Fix:
You must store the pointer to the user's array, not the array's address cast to a stream.

In nv_execution_provider.h:

// Change type to a pointer to a stream (representing the array)
cudaStream_t* aux_streams_ = nullptr; 

In nv_execution_provider.cc (Constructor):

// Cast void* to cudaStream_t* (pointer to array of streams)
aux_streams_ = reinterpret_cast<cudaStream_t*>(info.user_aux_stream_array);

In nv_execution_provider.cc (Compute):

// Pass the pointer directly. Do not use '&'.
trt_context->setAuxStreams(aux_streams_, (int32_t)auxiliary_streams_);

2. Configuration & Validation

Redundant Validation:
You have validation logic in two places: inside the Constructor (NvExecutionProvider) and inside compute_func.

  • Location 1 (Constructor): This is good. It fails early if the config is bad.
  • Location 2 (Compute Func): This check (if(auxiliary_streams_ <= 0)) is redundant and adds branching to the hot path (inference). Since you already validated this in the constructor, you can remove it from compute_func.

Constructor Logic (Line 101):

if (info.user_aux_stream_array != nullptr) {
  if(info.auxiliary_streams <= 0){
     // Error...
  }
  // Logic OK, but ensure you apply the pointer fix mentioned in point #1
}

3. Compilation & Type Safety

Casting style:
In nv_execution_provider_info.cc:

// Line 114
user_aux_stream_array = reinterpret_cast<void*>(address);

This follows the standard ORT convention for parsing pointer addresses from strings.

However, in nv_execution_provider.cc:

// Line 102
aux_streams_ = static_cast<cudaStream_t>(info.user_aux_stream_array);

If you adopt the fix in Point #1 (changing aux_streams_ to cudaStream_t*), you should use reinterpret_cast here, as cudaStream_t is an opaque pointer structure (struct CUstream_st *), and you are converting from void*. static_cast generally cannot convert void* to other pointer types.


Summary of Recommended Changes

Here is how the corrected code blocks should look:

nv_execution_provider.h

  bool external_aux_streams_ = false;
  cudaStream_t* aux_streams_ = nullptr; // Changed from cudaStream_t to cudaStream_t*
  int auxiliary_streams_ = -1;

nv_execution_provider.cc

// Constructor
  if (info.user_aux_stream_array != nullptr) {
    if(info.auxiliary_streams <= 0){
      ORT_THROW_IF_ERROR(ORT_MAKE_STATUS(ONNXRUNTIME, EP_FAIL, "Auxiliary streams must be greater than 0 when using external auxiliary streams"));
    }
    external_aux_streams_ = true;
    // Cast to pointer to streams
    aux_streams_ = reinterpret_cast<cudaStream_t*>(info.user_aux_stream_array);
  } else {
    external_aux_streams_ = false;
    aux_streams_ = nullptr;
  }

// ...

// Compute Function (remove the redundant <= 0 check here for speed)
    if (external_aux_streams_ && aux_streams_ != nullptr) {
      // Pass the pointer directly, do not use &
      trt_context->setAuxStreams(aux_streams_, (int32_t)auxiliary_streams_);
    }

@hadiFute
Contributor Author

hadiFute commented Dec 3, 2025

If you adopt the fix in Point #1 (changing aux_streams_ to cudaStream_t*), you should use reinterpret_cast here, as cudaStream_t is an opaque pointer structure (struct CUstream_st *), and you are converting from void*. static_cast generally cannot convert void* to other pointer types.

cudaStream_t is already a pointer (to CUstream_st), so I would say the AI is not right here?

@tianleiwu
Contributor

tianleiwu commented Dec 3, 2025

cudaStream_t is already a pointer (to CUstream_st), so I would say the AI is not right here?

The array is a list of cudaStream_t handles (pointers), and we should pass the address of the array to ONNX Runtime. If you pass a cudaStream_t (the first element of the array), it will only work when there is one aux stream.

Could you do some end-to-end test with two aux streams to verify?

Example usage for TensorRT:

    #include <cuda_runtime.h>
    #include <vector>

    // CHECK_CUDA is the usual CUDA error-checking macro;
    // 'context' is an nvinfer1::IExecutionContext*.
    int numAuxStreams = 4;
    std::vector<cudaStream_t> auxStreams(numAuxStreams);

    for (int i = 0; i < numAuxStreams; ++i) {
        CHECK_CUDA(cudaStreamCreateWithFlags(&auxStreams[i], cudaStreamNonBlocking));
    }

    context->setAuxStreams(auxStreams.data(), numAuxStreams);

For onnxruntime, we should pass auxStreams.data() to the TRT EP.
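
To make the levels of indirection concrete, here is a minimal standalone sketch (illustrative only, not the PR's exact code):

#include <cuda_runtime.h>

// cudaStream_t is itself a pointer (CUstream_st*), which is the source of
// the confusion: an array of streams still needs one more level of
// indirection to address its elements.
int main() {
  cudaStream_t streams[2];
  cudaStreamCreateWithFlags(&streams[0], cudaStreamNonBlocking);
  cudaStreamCreateWithFlags(&streams[1], cudaStreamNonBlocking);

  // What setAuxStreams expects: a pointer to the first element.
  cudaStream_t* aux = streams;

  // Storing streams[0] in a single cudaStream_t member and passing that
  // member's address happens to work for one stream, but element [1] would
  // then read adjacent object memory instead of the user's second stream.
  (void)aux;

  cudaStreamDestroy(streams[0]);
  cudaStreamDestroy(streams[1]);
  return 0;
}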

@yuslepukhin
Member

Some small comments, otherwise looks good.

@hadiFute
Contributor Author

hadiFute commented Dec 5, 2025

Thanks @tianleiwu for the feedback! I pushed new changes based on it. Let me know if you see anything else that needs improvement.

@yuslepukhin
Member

Need to run lintrunner -a

@tianleiwu
Contributor

tianleiwu commented Dec 5, 2025

The code looks good to me.

Please format the code to pass CI pipeline:
https://github.com/microsoft/onnxruntime/blob/main/docs/Coding_Conventions_and_Standards.md#linting

@hadiFute
Contributor Author

hadiFute commented Dec 8, 2025

Linting errors should be fixed now!

@yuslepukhin
Member

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU CUDA CI Pipeline, Windows GPU DML CI Pipeline, Windows GPU Doc Gen CI Pipeline, Windows GPU TensorRT CI Pipeline, Windows OpenVINO CI Pipeline, Windows x64 QNN CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@tianleiwu enabled auto-merge (squash) December 8, 2025 19:28
@tianleiwu merged commit 0b8180e into microsoft:main Dec 8, 2025
89 checks passed
