[ci/testing]: ensure the gpu memory is cleaned when exiting the remote openAI remote server #24258
Open: AzizCode92 wants to merge 7 commits into vllm-project:main from AzizCode92:fix-oom-test-entrypoints
Conversation
…AI server Signed-off-by: AzizCode92 <[email protected]>
Signed-off-by: AzizCode92 <[email protected]>
tests/utils.py (Outdated)

    self.proc.kill()
    # GPU memory cleanup
    try:
        if torch.cuda.is_available():
Member
Let's use current_platform so that the tests can run on other platforms as well
Contributor
Author
Good point!
Signed-off-by: AzizCode92 <[email protected]>
Centralizes the GPU memory cleanup logic into a single static method to prevent flaky test failures from OOM errors. Signed-off-by: AzizCode92 <[email protected]>
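As a rough illustration of that commit, here is a minimal sketch of what a centralized, platform-aware cleanup helper could look like. It assumes current_platform from vllm.platforms exposes an is_cuda() predicate, and the helper name _cleanup_gpu_memory is hypothetical; this is not the PR's actual diff.

```python
# Hypothetical sketch of a centralized GPU cleanup helper for tests/utils.py.
# The helper name and the current_platform.is_cuda() check are assumptions;
# consult vllm/platforms for the real interface.
import contextlib
import gc

import torch

from vllm.platforms import current_platform


class RemoteOpenAIServer:
    ...

    @staticmethod
    def _cleanup_gpu_memory() -> None:
        """Best-effort release of accelerator memory after the server exits."""
        with contextlib.suppress(Exception):
            gc.collect()
            if current_platform.is_cuda() and torch.cuda.is_available():
                torch.cuda.empty_cache()
                torch.cuda.synchronize()
```

Keeping this in one static method means every exit path (normal shutdown, test failure, timeout) can call the same best-effort cleanup instead of duplicating torch.cuda calls across tests.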
Purpose
The __exit__ method of the RemoteOpenAIServer is not sufficient to clean up GPU memory. This creates a race condition between the different tests in CI.
This PR ensures we properly clean up GPU memory when exiting the OpenAI remote server.
Solves: #24144
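For context, here is a sketch (again with assumed method and attribute names, not the PR's exact code) of how the server's exit path could chain process teardown with such a cleanup call so that the next test starts with freed memory:

```python
# Illustrative only: how __exit__ might combine process teardown with the
# hypothetical GPU cleanup helper shown above.
import subprocess


class RemoteOpenAIServer:
    ...

    def __exit__(self, exc_type, exc_value, traceback):
        # Stop the API server process first, escalating to kill on timeout.
        self.proc.terminate()
        try:
            self.proc.wait(timeout=30)
        except subprocess.TimeoutExpired:
            self.proc.kill()
            self.proc.wait()
        # Then release accelerator memory so later tests do not hit OOM.
        self._cleanup_gpu_memory()
```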
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.