DAOS-18196 object: large stack for collective object RPC ULT #17147
Ticket title is 'daos_engine segfaults when we perform concurrent IO operations on multiple pools on mdonssd testing.'
2aadeff to 536e91e
536e91e to 563beed
Test stage Unit Test bdev on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17147/5/display/redirect
Test stage Unit Test bdev with memcheck on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17147/5/display/redirect
Test stage Unit Test on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17147/5/display/redirect
Test stage Unit Test with memcheck on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17147/5/display/redirect
Test stage NLT on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17147/5/display/redirect
To avoid potential ULT stack overflow.
Allow-unstable-test: true
Signed-off-by: Fan Yong <[email protected]>
563beed to 5d3541e
wangshilong
left a comment
Just curious: did we confirm that this PR helps fix the issue?
- rc = dss_thread_collective_reduce(&coll_ops, &coll_args, DSS_USE_CURRENT_ULT);
+ rc = dss_thread_collective_reduce(&coll_ops, &coll_args,
+                                   DSS_ULT_DEEP_STACK | DSS_USE_CURRENT_ULT);
I think this one-line change is the whole purpose of this PR. The other changes look unnecessary (and incorrect) to me.
BTW, because of the DSS_USE_CURRENT_ULT flag, the collective function executed on the current xstream won't be able to run on a deep stack. I think that is something that needs to be fixed.
> I think this one-line change is the whole purpose of this PR. The other changes look unnecessary (and incorrect) to me.
No. The current ULT is itself part of the collective operation, so it also needs a deep stack when it is created. That is what the other changes in this patch are for.
> BTW, because of the DSS_USE_CURRENT_ULT flag, the collective function executed on the current xstream won't be able to run on a deep stack. I think that is something that needs to be fixed.
If the current ULT can do the task itself, why would we need to create a new ULT on the same XS?
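To make the disagreement above concrete, here is a minimal toy model in C. It is only a sketch and does not use any DAOS code: the `TOY_*` flags, `toy_pick_stack` and `toy_count_deep` are hypothetical names that merely mirror the flag names in the diff. It assumes, as both comments suggest, that a "use current ULT" flag makes the caller run the local target's callback inline, so a deep-stack flag can only affect ULTs that are newly created for the other targets.

```c
#include <assert.h>

/* Hypothetical flags for this sketch; they only mirror the names in the diff. */
enum {
	TOY_USE_CURRENT_ULT = 1 << 0, /* run the local target's work inline */
	TOY_ULT_DEEP_STACK  = 1 << 1  /* give newly created ULTs a large stack */
};

/* Which stack a target's callback ends up running on in this toy model. */
enum toy_stack { TOY_STACK_CALLER, TOY_STACK_DEFAULT, TOY_STACK_DEEP };

/* Decide the stack for one target, given the flags and the caller's target. */
static enum toy_stack
toy_pick_stack(unsigned int flags, int target, int current_target)
{
	/* Inline execution: no ULT is created, so the deep-stack flag is moot. */
	if ((flags & TOY_USE_CURRENT_ULT) && target == current_target)
		return TOY_STACK_CALLER;

	return (flags & TOY_ULT_DEEP_STACK) ? TOY_STACK_DEEP : TOY_STACK_DEFAULT;
}

/* Count how many targets would run their callback on a deep stack. */
static int
toy_count_deep(unsigned int flags, int ntargets, int current_target)
{
	int deep = 0;

	assert(ntargets >= 0);
	for (int t = 0; t < ntargets; t++) {
		if (toy_pick_stack(flags, t, current_target) == TOY_STACK_DEEP)
			deep++;
	}
	return deep;
}
```

In this model, with four targets and the caller on target 2, `toy_count_deep(TOY_ULT_DEEP_STACK | TOY_USE_CURRENT_ULT, 4, 2)` counts three deep-stack callbacks. The remaining one inherits whatever stack the calling ULT was created with, which is the case both comments are arguing about.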
NiuYawei
left a comment
To avoid potential ULT stack overflow.
Allow-unstable-test: true