Conversation

@cjen1-msft
Contributor

Possible fix for #7431

The two hypotheses are:

  • The secrets are copied before the rekey is complete, so the recovering node misses the right secret (Tx 28)
  • There is aliasing between the working directory of different test runs, so I can imagine writes of different tests interfering.

This PR addresses both of these hypotheses, reordering the stop and the copy, and making sure working directories don't overlap.
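
A minimal sketch of what the two fixes amount to (the directory layout and the save_sealed_ledger_secret signature are illustrative assumptions about the harness, not the actual test code):

```python
import os

def run_recovery_test(network, primary, rekey, recovery_shares_refresh):
    # Fix 2: give each parameterisation its own working directory, so
    # different test runs can never alias each other's files.
    label = f"recovery_local_unsealing_rekey={rekey}_refresh={recovery_shares_refresh}"
    common_dir = os.path.join("workspace", label)  # illustrative layout
    os.makedirs(common_dir, exist_ok=True)

    if recovery_shares_refresh:
        network.consortium.trigger_recovery_shares_refresh(primary)

    # Fix 1: stop the nodes *before* copying the sealed secrets, so no node
    # can still be writing e.g. 28.sealed.json while the copy is in flight.
    network.stop_all_nodes()
    save_sealed_ledger_secret(network, common_dir)  # illustrative helper
```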

I don't have a repro of this, so I cannot say for certain which (if either) was the issue, but I have an ACI instance looking for repros at the moment, which should shed some light on this.

Copilot AI review requested due to automatic review settings November 5, 2025 14:10
@cjen1-msft cjen1-msft requested a review from a team as a code owner November 5, 2025 14:10
Contributor

Copilot AI left a comment

Pull Request Overview

This PR modifies the recovery local unsealing test to improve test identification and fix the order of operations. The key change ensures that nodes are stopped before their sealed ledger secrets are saved, which is critical for the recovery test to function correctly.

  • Updated the test label to include rekey and recovery_shares_refresh parameters for better test identification
  • Fixed a bug where network.stop_all_nodes() was called after save_sealed_ledger_secret(), when it should be called before

@achamayou
Member

There is aliasing between the working directory of different test runs, so I can imagine writes of different tests interfering.

That should never happen, do we have evidence to the contrary?

@cjen1-msft
Contributor Author

cjen1-msft commented Nov 5, 2025

There is aliasing between the working directory of different test runs, so I can imagine writes of different tests interfering.

That should never happen, do we have evidence to the contrary?

I think you're right, it should never happen, but it's either that, or filesystem corruption, or...
After ~100 runs so far I haven't managed to repro this (using main, not this branch); that machine is currently burning the midnight oil, so we'll see tomorrow.

From the logs we have the following facts:

  • At 19:58.57, node 1 completed a write of 1.sealed.json to disk via files::dump (but with no fsync)
  • Between 20:02.735 and 20:02.736, the secrets were copied into the common directory
  • Between 20:02.720 and 20:02.737, node 1's secret at 28.sealed.json was written to disk
  • Then, when node 4 recovers using node 1's secrets from the common dir, it sees 1.sealed.json, not 28.sealed.json

This implies that something odd was happening with the disk.
Since the copy of the directory and the write of 28.sealed.json were concurrent, it is plausible that 28.sealed.json simply did not get copied.
Hence this PR is required in order to fix this specific problem.
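
To illustrate that race concretely, here is a toy, self-contained sketch (paths, filenames and timings are illustrative, not the actual test harness): copying a node's directory while a new sealed secret is still in flight can produce a copy that lacks the newest file.

```python
import os
import shutil
import tempfile
import threading
import time

node_dir = tempfile.mkdtemp(prefix="node1_")
common_dir = tempfile.mkdtemp(prefix="common_")

# The old secret is already on disk before the copy starts.
with open(os.path.join(node_dir, "1.sealed.json"), "w") as f:
    f.write("{}")

def delayed_write(path: str) -> None:
    # Simulates the rekeyed secret still being written when the copy runs.
    time.sleep(0.3)
    with open(path, "w") as f:
        f.write("{}")

writer = threading.Thread(
    target=delayed_write, args=(os.path.join(node_dir, "28.sealed.json"),)
)
writer.start()

# The copy races with the in-flight write, so it may only see 1.sealed.json.
shutil.copytree(node_dir, os.path.join(common_dir, "node1"))
writer.join()

print(os.listdir(os.path.join(common_dir, "node1")))  # likely ['1.sealed.json']
```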

But the write of 1.sealed.json completed, and my understanding was that we are guaranteed that a read of a written-and-closed file does not require an fsync (it will either hit the cached block in kernel memory or already be written to disk).
So unless we had random corruption on the disk, we're left speculating about filesystem-level bugs, aliasing between the files, and so on.

@achamayou
Member

But the write of 1.sealed.json completed, and my understanding was that we are guaranteed that a read of a written-and-closed file does not require an fsync (it will either hit the cached block in kernel memory or already be written to disk).
So unless we had random corruption on the disk, we're left speculating about filesystem-level bugs, aliasing between the files, and so on.

@cjen1-msft I wonder if this is one of those situations where it is necessary to fsync() the directory, to make sure that the next reader of the directory does see 28 (see #7030).
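
For reference, a minimal sketch of what fsync'ing the containing directory looks like (POSIX semantics assumed; this is illustrative, not how files::dump is implemented):

```python
import os

def write_durably(path: str, data: bytes) -> None:
    # Write and flush the file contents themselves.
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    # Also fsync the containing directory, so the new directory entry is
    # durable and visible to subsequent readers of the directory.
    dir_fd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```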

if recovery_shares_refresh:
    network.consortium.trigger_recovery_shares_refresh(primary)

network.stop_all_nodes()
Member

I don't understand why moving the stop_all_nodes() call will help. From my reading of the other comments, it sounds like trigger_recovery_shares_refresh starts some async process which eventually results in a new sealed secret file being written to disk, and races with a later step that tries to read that file. Stopping the nodes will terminate any in-progress writes, sure, but it seems like it's still possible that the sealed secret doesn't reach disk. If that's required for the later save/recovery, I think we should be explicitly waiting for it and checking it here.

If this is the risk, then we could force a repro by changing the trigger_ (or some step downstream of it) to be even more async: dispatch its work to a delayed task rather than executing it immediately, and we'll be forced to wait here.
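
A minimal sketch of such an explicit wait-and-check (the filename pattern, directory layout and timeout are illustrative assumptions):

```python
import os
import time

def wait_for_sealed_secret(node_dir: str, txid: int, timeout: float = 10.0) -> str:
    """Block until <txid>.sealed.json exists in node_dir, or raise."""
    expected = os.path.join(node_dir, f"{txid}.sealed.json")
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(expected):
            return expected
        time.sleep(0.1)
    raise TimeoutError(f"{expected} never appeared on disk")
```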

@cjen1-msft cjen1-msft marked this pull request as draft November 6, 2025 11:50
@cjen1-msft cjen1-msft closed this Nov 10, 2025