
Conversation

@cjqzhao (Contributor) commented Aug 22, 2025

Add graceful switch load balancing policy.

@dfawley @easwars

@cjqzhao force-pushed the add-gracefulswitch branch 2 times, most recently from 551c605 to ecd0f18 on August 22, 2025 at 17:16
@dfawley changed the title from "feat(grpc): add gracefulswitch" to "feat(grpc): add gracefulswitch load balancing policy" on Aug 22, 2025
@cjqzhao force-pushed the add-gracefulswitch branch from ecd0f18 to 28869a7 on August 24, 2025 at 05:35
rename mock picker and remove spaces

make test picker private
@cjqzhao force-pushed the add-gracefulswitch branch from 28869a7 to 962fb29 on August 24, 2025 at 05:52
Comment on lines +46 to +48
/**
Struct for Graceful Switch.
*/
Collaborator:
Please write a good docstring for this.

Struct for Graceful Switch.
*/
pub struct GracefulSwitchPolicy {
subchannel_to_policy: HashMap<WeakSubchannel, ChildKind>,
Collaborator:
Can we not use the ChildManager for this policy? I was really hoping we could.

The goal was to make it so that no policy would need to manage its own tracking of children.

Ideally graceful switch could do something with the incoming resolver update so that the update sharder would produce 1 or 2 children with the proper configs.

If this is too much to change at the last minute for you, then we can figure out what TODOs to add.

@cjqzhao (author) commented Aug 25, 2025:
When I looked into using ChildManager, I found it a lot easier to manage children directly, but I can look into redesigning. The challenge is that the ResolverUpdate from shard_update would not be used, so some other design would be needed. I also think changes will need to be made to the ChildManager, because it currently forwards resolver updates to children even when unnecessary. I will draft an appropriate design and get approval from you and Easwar.
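For illustration, a rough sketch of the "produce 1 or 2 children" idea discussed above, using a hypothetical ChildSpec type and a standalone shard_update function rather than the crate's actual ChildManager/sharder API: the sharder emits a single child in the steady state, and two children (old active plus new pending) only when the incoming config selects a different policy.

// Hypothetical types for illustration only; the real ChildManager and
// sharder interfaces differ.
struct ChildSpec {
    policy_name: String,
    config: Option<String>, // stand-in for a per-child LbConfig
}

// Emit one child in steady state, or two (current active + new pending)
// when the incoming config names a different policy.
fn shard_update(active: Option<ChildSpec>, incoming: ChildSpec) -> Vec<ChildSpec> {
    match active {
        // Same policy as before: a single child receives the new config.
        Some(a) if a.policy_name == incoming.policy_name => vec![incoming],
        // Different policy: keep the old child and add the new one as pending.
        Some(a) => vec![a, incoming],
        // No child yet: just create the new one.
        None => vec![incoming],
    }
}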

Comment on lines +67 to +70
match config.unwrap().convert_to::<Arc<GracefulSwitchLbConfig>>() {
    Ok(cfg) => (*cfg).clone(),
    Err(e) => panic!("convert_to failed: {e}"),
};
Collaborator:
Why the match on the Result instead of .unwrap()?
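For reference, the closest equivalent written without the match, as a sketch that reuses the names from the diff above and assumes the same convert_to signature:

// Panic with the same error message on failure; unwrap_or_else keeps the
// Display-based message (plain .unwrap()/.expect() would require the error
// type to implement Debug).
let cfg = config
    .unwrap()
    .convert_to::<Arc<GracefulSwitchLbConfig>>()
    .unwrap_or_else(|e| panic!("convert_to failed: {e}"));
let cfg = (*cfg).clone();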

if update.service_config.as_ref().is_ok_and(|sc| sc.is_some()) {
    return Err("can't do service configs yet".into());
}
let cfg: Arc<GracefulSwitchLbConfig> =
Collaborator:
I think you should be able to elide this type?
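That is, something like the following sketch (reusing the diff's names): the turbofish on convert_to already fixes the type, so the annotation on the binding is redundant.

// Type inferred from the conversion on the right-hand side.
let cfg = match config.unwrap().convert_to::<Arc<GracefulSwitchLbConfig>>() {
    Ok(cfg) => (*cfg).clone(),
    Err(e) => panic!("convert_to failed: {e}"),
};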

Comment on lines +63 to +65
if update.service_config.as_ref().is_ok_and(|sc| sc.is_some()) {
    return Err("can't do service configs yet".into());
}
Collaborator:
This seems wrong. Graceful switch shouldn't need to do anything special with service configs. It should just forward it to all children.
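A minimal sketch of the "just forward it" behavior, using placeholder types rather than the crate's actual child API:

// Placeholder child handle; the real policy would hold child LB policies.
struct Child;
impl Child {
    fn resolver_update(&mut self, _update: &str) {
        // Forward the update (including any service config) unchanged.
    }
}

fn forward_update(children: &mut [Child], update: &str) {
    for child in children.iter_mut() {
        child.resolver_update(update);
    }
}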

Comment on lines +74 to +76
// Determine if we can switch the new policy in. If there is no children
// yet or the new policy isn't the same as the latest policy, then
// we can swap.
Collaborator:
Suggested change
- // Determine if we can switch the new policy in. If there is no children
- // yet or the new policy isn't the same as the latest policy, then
- // we can swap.
+ // Determine if we can switch to the new policy. If there is no child
+ // yet or the new policy isn't the same as the previous policy, then
+ // we can swap.

if needs_switch {
    target_child_kind = self.switch_to(config);
}
{
Collaborator:
Why add a scope here?

@cjqzhao (author):

Due to locking. I wanted to lock the managing_policy just to call resolve_update and then release it for resolve_child_controller.
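The pattern being described, as a standalone sketch with placeholder field and method names (not the actual struct in the PR):

use std::sync::Mutex;

struct Example {
    managing_policy: Mutex<Vec<u8>>, // placeholder for the wrapped child policy
}

impl Example {
    fn handle_resolver_update(&self) {
        {
            // The guard lives only inside this block, so the mutex is
            // released as soon as the braced scope ends.
            let _guard = self.managing_policy.lock().unwrap();
            // ... call the child's resolver_update while holding the lock ...
        }
        // The lock is no longer held here, so the follow-up call
        // (resolve_child_controller in the PR) can run without it.
    }
}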

dfawley added a commit to dfawley/tonic that referenced this pull request Nov 6, 2025
General:

- Add Debug to many traits and derive/impl in structs.
- Pass LB config to LB policies via `Option<LbConfig>` instead of
  `Option<&LbConfig>`.  It should be rare that policies want to store
  a config except for the leaf policy.

Child manager:

The original assumption was that all children would be the same
type/configuration, but several policies (including gracefulswitch)
will not have that property.  So, several changes are made:

- Children are considered unique by both their identifier and their
  LbPolicyBuilder's name().
- Make it so the sharder also can shard LbConfig and provide it via
  the ChildUpdate.child_update field in addition to the
  ResolverUpdate.
- Make ResolverUpdateSharder a generic instead of Box<dyn>.
- Add booleans so users of child manager can easily tell
  whether any child policies updated themselves, and which ones did.
- Pass &mut self for sharder so that it can maintain and update its
  state if needed.
- Change the sharder's output ChildUpdate.child_update field to an
  Option; if None then the child will not be called during the
  resolver update, but will remain in the child manager.
- Change child_states into children and provide the whole Child
  struct, exposing the fields it contains.
- Provide mutable access to the sharder.
- Change the LB config to be a flat JSON array to facilitate use
  within another LB policy, which should not need a wrapper struct to
  contain the children.
- Minor test cleanups

Graceful switch:

The previous implementation in hyperium#2399 contained a lot of logic to
manage child policy delegation.  It was intended that only
ChildManager should need to have this kind of logic.

- Create a new implementation of this policy that delegates to
  ChildManager.
- Uses a Sharder that simply emits the active policy with no update
  alongside any new policy in the new LbConfig.
- maybe_swap is called after every call into the ChildManager to
  determine if child updates necessitate a swap.
- This logic is simple: if the active policy is not Ready, or if there
  is a new policy and it is not Connecting, then set the new policy as
  the active policy and call resolver_update on the ChildManager.  The
  sharder will see that no LbConfig is provided and just emit the
  active policy with no config, causing the ChildManager to drop the
  previously active policy.  If no swap is needed, update the picker
  of the active policy if it had an update.
- Minor test cleanups/fixes vs. hyperium#2399.
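A condensed sketch of the swap rule described in the commit message above, using a hypothetical two-variant connectivity-state enum rather than the crate's actual types (the real implementation tracks full connectivity states and acts through the ChildManager):

#[derive(PartialEq)]
enum ConnState {
    Connecting,
    Ready,
}

// Returns true when the pending (new) policy should replace the active one.
fn should_swap(active: ConnState, pending: Option<ConnState>) -> bool {
    match pending {
        // No new policy: nothing to swap to.
        None => false,
        // Swap once the active policy is no longer Ready, or once the
        // pending policy has left Connecting.
        Some(p) => active != ConnState::Ready || p != ConnState::Connecting,
    }
}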
@dfawley closed this Nov 6, 2025