add check for existence of output dir for memtest #5574
base: dev
Force-pushed from dce8646 to af1c9f3
Force-pushed to re-trigger the CI because the GitLab runner was stuck.
@PrometheusPi you need to force-push again. You should not assign someone to the pull request, otherwise the CI bot cannot handle the PR.
Force-pushed from af1c9f3 to 018fd29
Why don't you also print the output in the failing case? There might still be valuable information in there.
@chillenzer good point - I will add that
    echo "Error: $0 did not find directory: $old_path (on host: $host_name with rank: $host_rank)" >&2
    echo "error message of memtest is:" >&2
    echo -e "$output" >&2
    exit 2
a space is missing
I encountered an issue on OLCF Frontier where the memtest loop ran, detected a defective node/GPU, but could not write out the error file. @psychocoderHPC and I suspect that the file system was not available or not up to date on the defective node, which prevented any useful error log from being written.
To avoid losing this information in the future, this PR adds a check: if the directory for the error log is not available, the error is reported on stderr instead.
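For readers who want to see the idea in isolation, here is a minimal sketch of such a check, not the actual code of this PR. The variable names `old_path`, `output`, `host_name` and `host_rank` mirror the review snippet; the function name `report_memtest_failure`, the way the rank is passed in, and the log file name are assumptions made for illustration only.

```shell
#!/usr/bin/env bash
# Sketch only: report_memtest_failure and the log file name are hypothetical.

report_memtest_failure() {
    local old_path="$1"          # directory where the error log should be written
    local output="$2"            # captured output of the failed memtest run
    local host_rank="${3:-unknown}"  # rank, assumed to be passed in by the caller
    local host_name
    host_name="$(uname -n)"

    # If the target directory is not visible on this node, fall back to stderr
    # so the memtest output is not lost.
    if [ ! -d "$old_path" ]; then
        echo "Error: $0 did not find directory: $old_path (on host: $host_name with rank: $host_rank)" >&2
        echo "error message of memtest is:" >&2
        printf '%s\n' "$output" >&2
        return 2
    fi

    # Directory exists: write the error log as usual (file name is an assumption).
    printf '%s\n' "$output" > "$old_path/memtest_${host_name}_${host_rank}.err"
}
```

A caller could use it as `report_memtest_failure "$old_path" "$output" "$host_rank" || exit $?`, which reproduces the `exit 2` behaviour from the snippet above when the directory is missing.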