[TensorRT RTX EP] Be able to specify aux streams #26569
Conversation
For review @chilo-ms @edgchen1 @ishwar-raut1

@chilo-ms, can you please review and merge it?

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows x64 QNN CI Pipeline

Azure Pipelines successfully started running 4 pipeline(s).

Looks good to me.

/azp run PR Test Linux CUDA x64 Release, Test Linux TensorRT x64 Release, web_Debug/build_onnxruntime_web, web_Release/build_onnxruntime_web

No pipelines are associated with this pull request.

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows x64 QNN CI Pipeline

Azure Pipelines successfully started running 4 pipeline(s).

/azp run Windows CPU CI Pipeline, Windows GPU CI Pipeline, Windows GPU TensorRT CI Pipeline, ONNX Runtime Web CI Pipeline, Windows ARM64 QNN CI Pipeline, Windows x64 QNN CI Pipeline, onnxruntime-python-checks-ci-pipeline

Azure Pipelines successfully started running 1 pipeline(s).

/azp run Build and Test OpenVINO EP, Build Linux arm64 Debug, Build Linux arm64 Release

No pipelines are associated with this pull request.

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU CUDA CI Pipeline, Windows GPU DML CI Pipeline, Windows GPU Doc Gen CI Pipeline, Windows GPU TensorRT CI Pipeline, Windows OpenVINO CI Pipeline, Windows x64 QNN CI Pipeline,

Azure Pipelines successfully started running 4 pipeline(s).

/azp run xnnpack / build-and-test

No pipelines are associated with this pull request.
Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
Below is AI analysis:

This is a significant functional update for the TensorRT EP. While the configuration logic (parsing options) looks standard, there is a critical implementation bug in how the stream array pointers are handled. Here is the detailed review:

1. Critical Fix Required: Pointer Indirection Logic

The current implementation declares `aux_streams_` as a single stream handle:

```cpp
// [Current] This allocates space for exactly ONE stream handle
cudaStream_t aux_streams_ = nullptr;
```

In the constructor:

```cpp
// [Current] You cast the POINTER to the array (void*) into a SINGLE stream handle.
// This sets the value of 'aux_streams_' to the memory address of the user's array.
aux_streams_ = static_cast<cudaStream_t>(info.user_aux_stream_array);
```

```cpp
// ... later in compute_func ...
// [Current] You pass the address of 'aux_streams_'.
// setAuxStreams expects 'cudaStream_t*'.
// TensorRT will read the first element as the VALUE of 'aux_streams_' (which is the address of the user array).
// If count > 1, TensorRT will read past the 'aux_streams_' member variable into garbage memory.
trt_context->setAuxStreams(&aux_streams_, (int32_t)auxiliary_streams_);
```

The fix. In the member declaration:

```cpp
// Change type to a pointer to a stream (representing the array)
cudaStream_t* aux_streams_ = nullptr;
```

In the constructor:

```cpp
// Cast void* to cudaStream_t* (pointer to array of streams)
aux_streams_ = reinterpret_cast<cudaStream_t*>(info.user_aux_stream_array);
```

In `compute_func`:

```cpp
// Pass the pointer directly. Do not use '&'.
trt_context->setAuxStreams(aux_streams_, (int32_t)auxiliary_streams_);
```

2. Configuration & Validation

Redundant validation: the `auxiliary_streams <= 0` check is performed both at configuration time and again on the compute path; validating once in the constructor is sufficient.

Constructor logic (Line 101):

```cpp
if (info.user_aux_stream_array != nullptr) {
  if (info.auxiliary_streams <= 0) {
    // Error...
  }
  // Logic OK, but ensure you apply the pointer fix mentioned in point #1
}
```

3. Compilation & Type Safety

Casting style:

```cpp
// Line 114
user_aux_stream_array = reinterpret_cast<void*>(address);
```

This is the correct standard for ORT when parsing pointer addresses from strings. However, in the constructor:

```cpp
// Line 102
aux_streams_ = static_cast<cudaStream_t>(info.user_aux_stream_array);
```

If you adopt the fix in point #1 (changing `aux_streams_` to `cudaStream_t*`), this `static_cast` must become the `reinterpret_cast` shown above.

Summary of Recommended Changes

Here is how the corrected code blocks should look:

```cpp
bool external_aux_streams_ = false;
cudaStream_t* aux_streams_ = nullptr;  // Changed from cudaStream_t to cudaStream_t*
int auxiliary_streams_ = -1;

// Constructor
if (info.user_aux_stream_array != nullptr) {
  if (info.auxiliary_streams <= 0) {
    ORT_THROW_IF_ERROR(ORT_MAKE_STATUS(ONNXRUNTIME, EP_FAIL, "Auxiliary streams must be greater than 0 when using external auxiliary streams"));
  }
  external_aux_streams_ = true;
  // Cast to pointer to streams
  aux_streams_ = reinterpret_cast<cudaStream_t*>(info.user_aux_stream_array);
} else {
  external_aux_streams_ = false;
  aux_streams_ = nullptr;
}
// ...
// Compute Function (remove the redundant <= 0 check here for speed)
if (external_aux_streams_ && aux_streams_ != nullptr) {
  // Pass the pointer directly, do not use &
  trt_context->setAuxStreams(aux_streams_, (int32_t)auxiliary_streams_);
}
```
`cudaStream_t` is already a pointer (to a `CUstream_st`), so I would say the AI is not right here?
The array is a list of `cudaStream_t` (pointers), and we shall pass the address of the array to ONNX Runtime. If you pass a `cudaStream_t` (the first element of the array), it will work only when there is one aux stream. Could you do an end-to-end test with two aux streams to verify? Example usage for TensorRT is sketched below. For onnxruntime, we shall pass `auxStreams.data()` to the TRT EP.
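A minimal sketch of the TensorRT-side usage being described here, assuming a valid `nvinfer1::IExecutionContext*` whose engine was built with auxiliary streams enabled (the variable and function names are illustrative, not from the PR):

```cpp
#include <vector>
#include <cuda_runtime.h>
#include <NvInfer.h>

// Hypothetical helper: creates two aux streams and hands the array to TensorRT.
void set_aux_streams(nvinfer1::IExecutionContext* context) {
  std::vector<cudaStream_t> auxStreams(2);
  for (auto& s : auxStreams) {
    cudaStreamCreate(&s);
  }
  // setAuxStreams expects a pointer to the first element of the stream array
  // (cudaStream_t*), not the address of a single handle.
  context->setAuxStreams(auxStreams.data(),
                         static_cast<int32_t>(auxStreams.size()));
}
```

This is why passing `&aux_streams_` only appears to work with a single stream: with two streams, TensorRT would read past the single member handle, which is what the suggested end-to-end test would catch.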
Some small comments, otherwise looks good.

Thanks @tianleiwu for the feedback! I pushed new changes based on it. Let me know if you see anything else that needs improvement.

Need to run

The code looks good to me. Please format the code to pass the CI pipeline:

Linting errors should be fixed now!

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU CUDA CI Pipeline, Windows GPU DML CI Pipeline, Windows GPU Doc Gen CI Pipeline, Windows GPU TensorRT CI Pipeline, Windows OpenVINO CI Pipeline, Windows x64 QNN CI Pipeline

Azure Pipelines successfully started running 4 pipeline(s).
Description

Add the ability to specify auxiliary streams for the TensorRT RTX EP.

Motivation and Context

In some use cases, we want full control over all the streams used by TRT-RTX, even the auxiliary ones.
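For context, a hypothetical application-side sketch of how such a feature could be exercised. The provider name "NvTensorRtRtx" and the option keys "user_aux_stream_array" and "auxiliary_streams" here simply mirror the member names discussed in the review; the actual public option names may differ:

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>
#include <cuda_runtime.h>
#include <onnxruntime_cxx_api.h>

int main() {
  // Create the auxiliary streams the application wants TRT-RTX to use.
  std::vector<cudaStream_t> aux_streams(2);
  for (auto& s : aux_streams) cudaStreamCreate(&s);

  // Pass the address of the stream array as a string; per the review, the EP
  // parses it back with reinterpret_cast<void*>(address).
  std::unordered_map<std::string, std::string> options{
      {"user_aux_stream_array",  // hypothetical key
       std::to_string(reinterpret_cast<uintptr_t>(aux_streams.data()))},
      {"auxiliary_streams", std::to_string(aux_streams.size())}};  // hypothetical key

  Ort::SessionOptions session_options;
  session_options.AppendExecutionProvider("NvTensorRtRtx", options);
  // ... create the Ort::Session with session_options and run inference ...
  return 0;
}
```

The application owns the streams' lifetime in this scheme, so they must stay valid for as long as the session may run.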