@PrometheusPi
Member

I encountered an issue on OLCF Frontier: the memtest loop ran and detected a defective node/GPU, but the error file could not be written. @psychocoderHPC and I suspect that the file system was not available or not up to date on the defective node, which prevented any useful error log from being written.

To prevent this kind of problem in the future, this PR adds a check that writes to stderr if the directory for the error log is not available.
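
A minimal sketch of what such a check could look like, for illustration only. This is not the code from this PR: the function name `report_memtest_error`, the log file name, and the use of `uname -n` for the host name are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not the PR's actual code): store the memtest output
# in a log directory, or fall back to stderr if that directory is unusable.
#   $1: directory for the error log
#   $2: captured memtest output
report_memtest_error() {
    local log_dir="$1"
    local output="$2"
    local host_name
    host_name="$(uname -n)"

    if [ ! -d "$log_dir" ] || [ ! -w "$log_dir" ]; then
        # Directory missing or not writable (e.g. file system not mounted):
        # print everything to stderr so the information is not lost.
        echo "Error: did not find writable directory: $log_dir (on host: $host_name)" >&2
        echo "error message of memtest is:" >&2
        printf '%s\n' "$output" >&2
        return 2
    fi

    # The log file name is an assumption made for this example.
    printf '%s\n' "$output" > "$log_dir/memtest_${host_name}.err"
}
```

A caller could invoke it as `report_memtest_error "$some_dir" "$output" || exit $?`, so that a missing directory still ends the script with a non-zero exit code.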

@PrometheusPi added this to the 0.9.0 / next stable milestone Dec 9, 2025
@PrometheusPi added the labels "component: tools" (scripts, python libs and CMake) and "CI:no-compile" (CI is skipping compile/runtime tests but runs PICMI tests) Dec 9, 2025
ikbuibui previously approved these changes Dec 9, 2025
@ikbuibui force-pushed the add_checkFilesystem_memtest branch from dce8646 to af1c9f3 December 10, 2025 10:55
@ikbuibui
Contributor

ikbuibui commented Dec 10, 2025

Force-pushed to re-trigger the CI because the GitLab runner was stuck.

@psychocoderHPC removed their assignment Dec 10, 2025
@psychocoderHPC
Member

@PrometheusPi you need to force-push again. You should not assign anyone to the pull request; otherwise the CI bot cannot handle the PR.
This is a bug in the CI bot that has been known for at least 5 years, and it will not be fixed by our IT.

@PrometheusPi force-pushed the add_checkFilesystem_memtest branch from af1c9f3 to 018fd29 December 10, 2025 20:36
@chillenzer
Contributor

Why don't you also print the output in the failing case? There might still be valuable information in there.

@PrometheusPi
Member Author

@chillenzer good point, I will add that.

echo "Error: $0 did not find directory: $old_path (on host: $host_name with rank: $host_rank)" >&2
echo "error message of memtest is:" >&2
echo -e "$output" >&2
exit 2
Member

A space is missing.
