@liuxuezhao
Contributor

After the ULT counter is incremented, the ULT may not yet have been scheduled when the rebuild is aborted. The abort causes migrate_fini_one_ult to destroy migrate_pool_tls, so migrate_obj_ult/migrate_one_ult can no longer look it up to drop the ULT counter, and the rebuild is never treated as complete because total_ult_cnt stays non-zero.
This PR fixes that by passing the ULT counter pointer to the migrate ULT, so dropping the counter no longer depends on the migrate_pool_tls lookup.
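
The idea can be illustrated with a minimal, self-contained sketch. It is not the DAOS code: `pool_tls`, `rebuild_state`, `ult_arg`, and every function below are hypothetical stand-ins, and the sketch assumes the counter lives in storage that outlives the per-pool TLS. It only contrasts a decrement that depends on a lookup with one that uses a pointer captured when the ULT is created.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for migrate_pool_tls; not the real DAOS structure. */
struct pool_tls {
	int unused;
};

/* Hypothetical rebuild-wide state. ASSUMPTION: the counter is stored here,
 * in memory that outlives the per-pool TLS. */
struct rebuild_state {
	unsigned int     total_ult_cnt;
	struct pool_tls *tls; /* NULL once the abort path has destroyed the TLS */
};

/* Argument handed to a migrate-style ULT when it is created. */
struct ult_arg {
	struct rebuild_state *state;   /* used only by the lookup variant */
	unsigned int         *ult_cnt; /* direct pointer to the counter */
};

/* Create path: bump the counter before the ULT has been scheduled. */
static struct ult_arg ult_create(struct rebuild_state *state)
{
	struct ult_arg arg = { .state = state, .ult_cnt = &state->total_ult_cnt };

	state->total_ult_cnt++;
	return arg;
}

/* Abort path: the TLS goes away while the ULT is still waiting to run. */
static void rebuild_abort(struct rebuild_state *state)
{
	state->tls = NULL;
}

/* Lookup variant: the decrement is skipped when the TLS is gone. */
static void ult_run_with_lookup(struct ult_arg *arg)
{
	if (arg->state->tls == NULL)
		return; /* the counter is never dropped */
	arg->state->total_ult_cnt--;
}

/* Pointer variant: the decrement does not depend on the TLS. */
static void ult_run_with_pointer(struct ult_arg *arg)
{
	assert(arg->ult_cnt != NULL && *arg->ult_cnt > 0);
	(*arg->ult_cnt)--;
}

/* Run create -> abort -> ULT body and return the leftover counter. */
unsigned int leftover_count(int use_pointer)
{
	struct pool_tls      tls = { 0 };
	struct rebuild_state state = { .total_ult_cnt = 0, .tls = &tls };
	struct ult_arg       arg = ult_create(&state);

	rebuild_abort(&state);
	if (use_pointer)
		ult_run_with_pointer(&arg);
	else
		ult_run_with_lookup(&arg);
	return state.total_ult_cnt;
}
```

In this sketch, `leftover_count(0)` returns 1 (the counter is stranded after the abort) and `leftover_count(1)` returns 0. The pointer variant is only safe under the stated assumption that the counter's storage is still valid when the ULT finally runs.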

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

@liuxuezhao liuxuezhao requested review from a team as code owners November 19, 2025 04:13
@github-actions

github-actions bot commented Nov 19, 2025

Ticket title is 'Pool rebuild stuck in pulling for 5+ hours'
Status is 'In Progress'
Labels: 'ALCF,hpe_cluster'
https://daosio.atlassian.net/browse/DAOS-17843

@daosbuild3
Collaborator

Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17149/1/execution/node/301/log

@daosbuild3
Collaborator

Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17149/2/execution/node/302/log

After the ULT counter is incremented, the ULT may not yet have been
scheduled when the rebuild is aborted. The abort causes
migrate_fini_one_ult to destroy migrate_pool_tls, so
migrate_obj_ult/migrate_one_ult can no longer look it up to drop the
ULT counter, and the rebuild is never treated as complete because
total_ult_cnt stays non-zero.
This PR fixes that by passing the ULT counter pointer to the migrate
ULT, so dropping the counter no longer depends on the
migrate_pool_tls lookup.

Signed-off-by: Xuezhao Liu <[email protected]>

@daosbuild3
Collaborator

Test stage Unit Test bdev with memcheck on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17149/3/display/redirect

@daosbuild3
Collaborator

Test stage NLT on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17149/3/display/redirect

@daosbuild3
Collaborator

Test stage Unit Test bdev on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17149/3/display/redirect

@daosbuild3
Collaborator

Test stage Unit Test with memcheck on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17149/3/display/redirect

@liuxuezhao liuxuezhao requested review from NiuYawei, kccain and wangshilong and removed request for NiuYawei, kccain and wangshilong November 19, 2025 04:58
@liuxuezhao liuxuezhao removed request for a team November 19, 2025 09:27
Contributor

@kccain kccain left a comment

The changes to decrement the counts look good. I'm not totally following though where in the code a rebuild with nonzero ULT count(s) would be treated as not complete (also does that mean it would hang?)

@liuxuezhao
Contributor Author

> The changes to decrement the counts look good. I'm not totally following though where in the code a rebuild with nonzero ULT count(s) would be treated as not complete (also does that mean it would hang?)

This PR does not fix the problem, so it is not going to land.
