Sealing fixes #7437
Conversation
Pull Request Overview
This PR modifies the recovery local unsealing test to improve test identification and fix the order of operations. The key change ensures that the nodes are stopped before the sealed ledger secrets are saved, so the copy cannot race with an in-progress write, which is critical for the recovery test to function correctly.
- Updated the test label to include the `rekey` and `recovery_shares_refresh` parameters for better test identification
- Fixed a bug where `network.stop_all_nodes()` was called after `save_sealed_ledger_secret()`, when it should be called before
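
A minimal sketch of the reordered steps, reusing the helper names quoted above. The wrapper function and its signature are hypothetical; the real CCF test code may differ in detail:

```python
def run_recovery_local_unsealing_step(network, primary, recovery_shares_refresh):
    """Hypothetical wrapper illustrating the corrected ordering only."""
    if recovery_shares_refresh:
        network.consortium.trigger_recovery_shares_refresh(primary)

    # Stop the nodes first, so any in-progress write of the sealed secret
    # files is finished before a copy is taken.
    network.stop_all_nodes()

    # Only then copy the sealed ledger secret for the later recovery step.
    save_sealed_ledger_secret()
```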
That should never happen; do we have evidence to the contrary?
I think you're right, it should never happen, but it's either that or filesystem corruption or... From the logs we have the following facts:
This implies that something weird was happening with the disk. But the write of `1.sealed.json` completed, and my understanding was that we are guaranteed that reading a file which has been written and closed does not require an fsync? (The read will either hit the cached block in kernel memory or the data will already be written to disk.)
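
A small self-contained illustration of the distinction being discussed here (not CCF code, and the file name/contents are placeholders): a file that has been written and closed is visible to subsequent reads on a running system, with or without an fsync; `os.fsync` only matters for durability across a crash or power loss.

```python
import os
import tempfile

# Write and close a file without fsync.
path = os.path.join(tempfile.mkdtemp(), "1.sealed.json")
with open(path, "w") as f:
    f.write('{"wrapped_secret": "placeholder"}')
    # No os.fsync(f.fileno()) here: durability across a crash is not
    # guaranteed, but visibility to other readers on this system is.

# The read below sees the full contents, served from the page cache if the
# blocks have not yet reached disk.
with open(path) as f:
    assert f.read() == '{"wrapped_secret": "placeholder"}'
```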
@cjen1-msft I wonder if this is one of those situations where it is necessary to …
    if recovery_shares_refresh:
        network.consortium.trigger_recovery_shares_refresh(primary)

    network.stop_all_nodes()
I don't understand why moving the stop_all_nodes() call will help. From my reading of the other comments, it sounds like trigger_recovery_shares_refresh starts some async process which eventually results in a new sealed secret file being written to disk, and races with a future step that tries to read that file. Stopping the nodes will terminate any in-progress writes, sure, but it seems like it's still possible that the sealed secret doesn't reach disk? If that file is required for the future save/recovery, I think we should be explicitly waiting for and checking it here.
If this is the risk, then we could force a repro by changing the trigger_ step (or some step downstream of it) to be even more async: dispatch its work to a delayed task rather than executing immediately, and we'll be forced to wait here.
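
If explicit waiting is the route taken, a sketch of what that could look like is below. The helper, its name, and the timeout are illustrative and not part of the CCF test infrastructure:

```python
import os
import time

def wait_for_sealed_secret(path, timeout_s=10.0, poll_s=0.1):
    """Poll until the file at `path` exists and is non-empty, or raise.

    Illustrative helper only; not part of the CCF test harness.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if os.path.exists(path) and os.path.getsize(path) > 0:
            return
        time.sleep(poll_s)
    raise TimeoutError(f"{path} not written within {timeout_s}s")
```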
Possible fix for #7431
The two hypotheses are:
- the copy of the sealed ledger secret races with an in-progress (async) write of that file, so the copy sees a missing or partial file, or
- the working directories overlap, so files from one step clobber or shadow files from another.
This PR addresses both of these options, reordering the stop and the copy, and making sure working directories don't overlap.
I don't have a repro of this, so I cannot say for certain which (if either) was the issue, but I have an ACI instance looking for repros currently, which should shed some light on this.
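
For the working-directory half of this, a generic sketch of one way to guarantee isolation; this is not the actual change in this PR, and the directory names are made up:

```python
import os
import tempfile

# Give each test run, and each node within it, its own unique working
# directory so sealed-secret files from different runs or nodes can never
# collide.
run_dir = tempfile.mkdtemp(prefix="recovery_local_unsealing_")
node_dirs = [os.path.join(run_dir, f"node_{i}") for i in range(3)]
for d in node_dirs:
    os.makedirs(d)
```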