DAOS-17843 rebuild: fix potential migrate ULT counter leak #17149
base: master
|
Ticket title is 'Pool rebuild stuck in pulling for 5+ hours' |
|
Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17149/1/execution/node/301/log |
|
Test stage Build on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17149/1/execution/node/317/log |
|
Test stage Build on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17149/1/execution/node/309/log |
d0c05e3 to b09d6d8
|
Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17149/2/execution/node/302/log |
|
Test stage Build on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17149/2/execution/node/318/log |
|
Test stage Build on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17149/2/execution/node/310/log |
|
Test stage Build on Leap 15.5 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17149/2/execution/node/406/log |
b09d6d8 to f1672ff
After the counter is incremented, the ULT may not yet have been scheduled when the rebuild is aborted. The abort causes migrate_pool_tls to be destroyed by migrate_fini_one_ult, so migrate_obj_ult/migrate_one_ult can no longer drop the ULT counter, and the rebuild is never treated as complete because total_ult_cnt stays non-zero. This PR fixes it by passing the ULT counter pointer to the migrate ULT, so that dropping the counter no longer depends on a migrate_pool_tls lookup. Signed-off-by: Xuezhao Liu <[email protected]>
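The leak described in the commit message can be illustrated with a small standalone sketch. This is not DAOS code: `struct tracker`, `struct pool_tls`, `tls_lookup`, `struct ult_arg`, and `simulate_abort` are hypothetical names, and the "ULT" is a plain function call made after the TLS has been freed. The sketch also assumes the counter lives in an object that outlives the TLS; if the counter were stored inside the TLS itself, a pointer to it would dangle after the TLS is destroyed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical long-lived object that owns the ULT counter. */
struct tracker {
	unsigned int total_ult_cnt;
};

/* Hypothetical stand-in for the per-pool TLS that ULTs look up. */
struct pool_tls {
	struct tracker *trk;
};

/* Becomes NULL once the TLS has been destroyed (rebuild aborted). */
static struct pool_tls *tls_table;

static struct pool_tls *tls_lookup(void)
{
	return tls_table;
}

/* Hypothetical argument handed to the ULT when it is created. */
struct ult_arg {
	unsigned int *ult_cnt; /* counter to drop when the ULT finishes */
};

/* Leaky variant: the counter is reachable only through the TLS lookup. */
static void ult_lookup_variant(struct ult_arg *arg)
{
	struct pool_tls *tls = tls_lookup();

	(void)arg;
	if (tls == NULL)
		return; /* TLS already gone, counter is never dropped */
	tls->trk->total_ult_cnt--;
}

/* Pointer variant: the ULT drops the counter through its own argument. */
static void ult_pointer_variant(struct ult_arg *arg)
{
	assert(*arg->ult_cnt > 0);
	(*arg->ult_cnt)--;
}

/* Returns the counter value left after an abort that precedes the ULT. */
unsigned int simulate_abort(int use_pointer)
{
	struct tracker   trk = { 0 };
	struct pool_tls *tls = malloc(sizeof(*tls));
	struct ult_arg   arg;

	if (tls == NULL)
		abort();
	tls->trk = &trk;
	tls_table = tls;

	/* The counter is bumped when the ULT is created. */
	trk.total_ult_cnt++;
	arg.ult_cnt = &trk.total_ult_cnt;

	/* The rebuild is aborted before the ULT runs: the TLS is destroyed. */
	free(tls);
	tls_table = NULL;

	/* The ULT is finally scheduled. */
	if (use_pointer)
		ult_pointer_variant(&arg);
	else
		ult_lookup_variant(&arg);

	return trk.total_ult_cnt;
}
```

In this sketch `simulate_abort(0)` returns 1 (the counter is stuck at a non-zero value), while `simulate_abort(1)` returns 0.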
f1672ff to fa533b1
|
Test stage Unit Test bdev with memcheck on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17149/3/display/redirect |
|
Test stage NLT on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17149/3/display/redirect |
|
Test stage Unit Test bdev on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17149/3/display/redirect |
|
Test stage Unit Test with memcheck on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17149/3/display/redirect |
kccain
left a comment
The changes to decrement the counts look good. I'm not totally following though where in the code a rebuild with nonzero ULT count(s) would be treated as not complete (also does that mean it would hang?)
This PR cannot fix the problem and is not going to land. |