Conversation

@cjen1-msft
Contributor

Possible fix for #7431

The two hypotheses are:

  • The secrets are copied before the rekey is complete, so the recovering node misses the right secret (Tx 28)
  • There is aliasing between the working directory of different test runs, so I can imagine writes of different tests interfering.

This PR addresses both of these hypotheses, reordering the stop and the copy, and making sure working directories don't overlap.
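
A minimal sketch of what the two fixes amount to (the directory layout and the save_sealed_ledger_secret signature are illustrative assumptions about the harness, not the actual test code):

```python
import os

def run_recovery_test(network, primary, rekey, recovery_shares_refresh):
    # Fix 2: give each parameterisation its own working directory, so
    # different test runs can never alias each other's files.
    label = f"recovery_local_unsealing_rekey={rekey}_refresh={recovery_shares_refresh}"
    common_dir = os.path.join("workspace", label)  # illustrative layout
    os.makedirs(common_dir, exist_ok=True)

    if recovery_shares_refresh:
        network.consortium.trigger_recovery_shares_refresh(primary)

    # Fix 1: stop the nodes *before* copying the sealed secrets, so no node
    # can still be writing e.g. 28.sealed.json while the copy is in flight.
    network.stop_all_nodes()
    save_sealed_ledger_secret(network, common_dir)  # illustrative helper
```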

I don't have a repro of this, so I cannot say for certain which (if either) was the issue, but I have an ACI instance looking for repros at the moment, which should shed some light on this.

Copilot AI review requested due to automatic review settings November 5, 2025 14:10
@cjen1-msft cjen1-msft requested a review from a team as a code owner November 5, 2025 14:10
Contributor

Copilot AI left a comment

Pull Request Overview

This PR modifies the recovery local unsealing test to improve test identification and fix the order of operations. The key change ensures that nodes are stopped before their sealed ledger secrets are saved, which is critical for the recovery test to function correctly.

  • Updated the test label to include rekey and recovery_shares_refresh parameters for better test identification
  • Fixed a bug where network.stop_all_nodes() was called after save_sealed_ledger_secret(), when it should be called before

@achamayou
Member

There is aliasing between the working directory of different test runs, so I can imagine writes of different tests interfering.

That should never happen, do we have evidence to the contrary?

@cjen1-msft
Contributor Author

cjen1-msft commented Nov 5, 2025

There is aliasing between the working directory of different test runs, so I can imagine writes of different tests interfering.

That should never happen, do we have evidence to the contrary?

I think you're right, it should never happen, but it's either that, or filesystem corruption, or...
After ~100 runs so far I haven't managed to repro this (using main, not this branch); that machine is currently burning the midnight oil, so we'll see tomorrow.

From the logs we have the following facts:

  • At 19:58.57, node 1 completed a write of 1.sealed.json to disk via files::dump (but with no fsync)
  • Between 20:02.735 and 20:02.736, the secrets were copied into the common directory
  • Between 20:02.720 and 20:02.737, node 1's secret at 28.sealed.json was written to disk
  • Then, when node 4 recovers using node 1's secrets from the common dir, it sees 1.sealed.json, not 28.sealed.json

This implies that something odd was happening with the disk.
Since the copy of the directory and the write of 28.sealed.json were concurrent, it is plausible that 28.sealed.json simply did not get copied.
Hence this PR is required in order to fix this specific problem.
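
To illustrate that race concretely, here is a toy, self-contained sketch (paths, filenames and timings are illustrative, not the actual test harness): copying a node's directory while a new sealed secret is still in flight can produce a copy that lacks the newest file.

```python
import os
import shutil
import tempfile
import threading
import time

node_dir = tempfile.mkdtemp(prefix="node1_")
common_dir = tempfile.mkdtemp(prefix="common_")

# The old secret is already on disk before the copy starts.
with open(os.path.join(node_dir, "1.sealed.json"), "w") as f:
    f.write("{}")

def delayed_write(path: str) -> None:
    # Simulates the rekeyed secret still being written when the copy runs.
    time.sleep(0.3)
    with open(path, "w") as f:
        f.write("{}")

writer = threading.Thread(
    target=delayed_write, args=(os.path.join(node_dir, "28.sealed.json"),)
)
writer.start()

# The copy races with the in-flight write, so it may only see 1.sealed.json.
shutil.copytree(node_dir, os.path.join(common_dir, "node1"))
writer.join()

print(os.listdir(os.path.join(common_dir, "node1")))  # likely ['1.sealed.json']
```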

But the write of 1.sealed.json completed, and my understanding was that we are guaranteed that a read of a written-and-closed file does not require an fsync (it will either hit the cached block in kernel memory or already be written to disk).
So unless we had random corruption on the disk, we're left speculating about filesystem-level bugs, aliasing between the files, and so on.

@achamayou
Member

But the write of 1.sealed.json completed, and my understanding was that we are guaranteed that a read of a written-and-closed file does not require an fsync (it will either hit the cached block in kernel memory or already be written to disk).
So unless we had random corruption on the disk, we're left speculating about filesystem-level bugs, aliasing between the files, and so on.

@cjen1-msft I wonder if this is one of those situations where it is necessary to fsync() the directory, to make sure that the next reader of the directory does see 28 (see #7030).
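
For reference, a minimal sketch of what fsync'ing the containing directory looks like (POSIX semantics assumed; this is illustrative, not how files::dump is implemented):

```python
import os

def write_durably(path: str, data: bytes) -> None:
    # Write and flush the file contents themselves.
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    # Also fsync the containing directory, so the new directory entry is
    # durable and visible to subsequent readers of the directory.
    dir_fd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```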

if recovery_shares_refresh:
    network.consortium.trigger_recovery_shares_refresh(primary)

network.stop_all_nodes()
Member

I don't understand why moving the stop_all_nodes() call will help. From my reading of the other comments, it sounds like trigger_recovery_shares_refresh starts some async process which eventually results in a new sealed secret file being written to disk, and races with a later step that tries to read that file. Stopping the nodes will terminate any in-progress writes, sure, but it seems like it's still possible that the sealed secret doesn't reach disk. If that's required for the later save/recovery, I think we should be explicitly waiting for it and checking it here.

If this is the risk, then we could force a repro by changing the trigger_ (or some step downstream of it) to be even more async: dispatch its work to a delayed task rather than executing it immediately, and we'll be forced to wait here.
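
A minimal sketch of such an explicit wait-and-check (the filename pattern, directory layout and timeout are illustrative assumptions):

```python
import os
import time

def wait_for_sealed_secret(node_dir: str, txid: int, timeout: float = 10.0) -> str:
    """Block until <txid>.sealed.json exists in node_dir, or raise."""
    expected = os.path.join(node_dir, f"{txid}.sealed.json")
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(expected):
            return expected
        time.sleep(0.1)
    raise TimeoutError(f"{expected} never appeared on disk")
```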

@cjen1-msft cjen1-msft marked this pull request as draft November 6, 2025 11:50
@cjen1-msft cjen1-msft closed this Nov 10, 2025