Contributor

@Nasf-Fan Nasf-Fan commented Nov 19, 2025

To avoid potential ULT stack overflow.

Allow-unstable-test: true

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

@github-actions

Ticket title is 'daos_engine segfaults when we perform concurrent IO operations on multiple pools on mdonssd testing.'
Status is 'In Progress'
Labels: 'aurora_post_at,md_on_ssd'
https://daosio.atlassian.net/browse/DAOS-18196

@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-18196_1 branch from 2aadeff to 536e91e Compare November 19, 2025 03:39
@Nasf-Fan Nasf-Fan changed the title DAOS-18196 object: collective object RPC handler uses deep stack DAOS-18196 object: large stack for collective object RPC ULT Nov 19, 2025
@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-18196_1 branch from 536e91e to 563beed Compare November 19, 2025 07:58
@daosbuild3
Collaborator

Test stage Unit Test bdev on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17147/5/display/redirect

@daosbuild3
Collaborator

Test stage Unit Test bdev with memcheck on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17147/5/display/redirect

@daosbuild3
Collaborator

Test stage Unit Test on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17147/5/display/redirect

@daosbuild3
Collaborator

Test stage Unit Test with memcheck on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17147/5/display/redirect

@daosbuild3
Collaborator

Test stage NLT on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17147/5/display/redirect

To avoid potential ULT stack overflow.

Allow-unstable-test: true

Signed-off-by: Fan Yong <[email protected]>
@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-18196_1 branch from 563beed to 5d3541e Compare November 20, 2025 02:41
@Nasf-Fan Nasf-Fan marked this pull request as ready for review November 21, 2025 01:15
@Nasf-Fan Nasf-Fan requested review from a team as code owners November 21, 2025 01:15
Contributor

@wangshilong wangshilong left a comment

Just curious: did we confirm that this PR helps fix the issue?


- rc = dss_thread_collective_reduce(&coll_ops, &coll_args, DSS_USE_CURRENT_ULT);
+ rc = dss_thread_collective_reduce(&coll_ops, &coll_args,
+                                   DSS_ULT_DEEP_STACK | DSS_USE_CURRENT_ULT);
Contributor

I think this one-line change is the whole purpose of this PR. The other changes look unnecessary (and incorrect) to me.

BTW, because of the DSS_USE_CURRENT_ULT flag, the collective function executed on the current xstream won't be able to run on a deep stack. I think that is something that needs to be fixed.
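
To illustrate the concern, here is a minimal, self-contained model of the dispatch decision. It is not the actual DAOS code: the `MODEL_*` names, the flag values, and the stack sizes are all made up for this sketch and only stand in for `DSS_ULT_DEEP_STACK` and `DSS_USE_CURRENT_ULT`. It assumes that, with the current-ULT flag set, the callback for the local target runs inline on the caller's stack, while every other target gets a newly created ULT.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for DSS_ULT_DEEP_STACK and DSS_USE_CURRENT_ULT; values are assumed. */
enum {
	MODEL_ULT_DEEP_STACK  = 1 << 0,
	MODEL_USE_CURRENT_ULT = 1 << 1,
};

/* Hypothetical stack sizes, for illustration only. */
#define MODEL_DEFAULT_STACK ((size_t)16 * 1024)
#define MODEL_DEEP_STACK    ((size_t)64 * 1024)

/*
 * Return the stack size (in bytes) on which the collective callback would run
 * for target 'tgt', given that the caller runs on target 'cur_tgt' with a
 * stack of 'cur_stack' bytes.
 */
static size_t
model_stack_for_target(int tgt, int cur_tgt, size_t cur_stack, unsigned int flags)
{
	assert(cur_stack > 0);

	/* Local target with the current-ULT flag: no new ULT is created, so the
	 * callback runs on whatever stack the caller already has. */
	if (tgt == cur_tgt && (flags & MODEL_USE_CURRENT_ULT))
		return cur_stack;

	/* Any other case: a new ULT is created and the deep-stack flag applies. */
	return (flags & MODEL_ULT_DEEP_STACK) ? MODEL_DEEP_STACK : MODEL_DEFAULT_STACK;
}
```

In this model, passing both flags from a caller that has a default-sized stack gives the deep stack to remote targets only; the local target still runs on the caller's default stack.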

Contributor Author

> I think this one-line change is the whole purpose of this PR. The other changes look unnecessary (and incorrect) to me.

No. The current ULT is itself part of the collective operation, so it also needs a deep stack when it is created. That is what the other changes in this patch are for.

> BTW, because of the DSS_USE_CURRENT_ULT flag, the collective function executed on the current xstream won't be able to run on a deep stack. I think that is something that needs to be fixed.

If the current ULT can do the task by itself, why would we need to create a new ULT on the same XS?
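
The reply above can be sketched as follows, again as a model rather than the real code path. The sketch assumes the RPC handler ULT is created with a deep stack whenever the RPC is a collective one, so that the part of the collective that runs inline inherits that stack. Every name here (`model_rpc_kind`, `model_handler_ult_flags`, `model_local_part_has_deep_stack`, and the `MODEL_*` flags with their values) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical classification of incoming RPCs. */
enum model_rpc_kind { MODEL_RPC_REGULAR, MODEL_RPC_COLLECTIVE };

/* Stand-ins for DSS_ULT_DEEP_STACK and DSS_USE_CURRENT_ULT; values are assumed. */
enum {
	MODEL_ULT_DEEP_STACK  = 1 << 0,
	MODEL_USE_CURRENT_ULT = 1 << 1,
};

/* Flags used when the RPC handler ULT itself is created. */
static unsigned int
model_handler_ult_flags(enum model_rpc_kind kind)
{
	assert(kind == MODEL_RPC_REGULAR || kind == MODEL_RPC_COLLECTIVE);
	return kind == MODEL_RPC_COLLECTIVE ? MODEL_ULT_DEEP_STACK : 0;
}

/* Does the local part of the collective end up running on a deep stack? */
static bool
model_local_part_has_deep_stack(unsigned int handler_flags, unsigned int coll_flags)
{
	/* Inline execution inherits the stack the handler ULT was created with. */
	if (coll_flags & MODEL_USE_CURRENT_ULT)
		return (handler_flags & MODEL_ULT_DEEP_STACK) != 0;

	/* Otherwise a new ULT is created, so the collective's own flag decides. */
	return (coll_flags & MODEL_ULT_DEEP_STACK) != 0;
}
```

Under these assumptions, a collective RPC whose handler was created with the deep-stack flag covers the inline case without creating a second ULT on the same XS.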

@Nasf-Fan Nasf-Fan requested a review from NiuYawei November 24, 2025 02:03
Contributor

@NiuYawei NiuYawei left a comment

@Nasf-Fan I suppose this one has been replaced by #17165?

@Nasf-Fan Nasf-Fan closed this Nov 27, 2025
@Nasf-Fan Nasf-Fan deleted the Nasf-Fan/DAOS-18196_1 branch November 27, 2025 04:02