
Conversation

@zhanggbj
Contributor

@zhanggbj zhanggbj commented Oct 28, 2025

What this PR does / why we need it:
Support Node Auto Placement and Node AF/AAF

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

zhanggbj and others added 11 commits October 22, 2025 14:19
- Bump VMOP including Node AF/AAF support
- Add NodeAutoPlacement Feature Gate

(cherry picked from commit 700c8ae)
Removes the extra cases for VMG creation, such that VMG is created for:
1. Multiple zones, multiple MDs with no failureDomain
2. Multiple zones, multiple MDs with failureDomain
3. Single zone, existing cluster with no failureDomain MDs

Signed-off-by: Sagar Muchhal <[email protected]>
- Updates VMOP API dependency

Misc VMG fixes
- Use namingStrategy to calculate VM names
- Use MachineDeployment names for VMG placement label
- Include all MachineDeployments to generate the node-pool -> zone
  mapping

Fixes VMG webhook validation error
- Adds cluster-name label to AF/AAF spec
- Re-adds the zone topology key to the anti-affinity spec

Signed-off-by: Sagar Muchhal <[email protected]>
…gs#71)

* Refine VMG controller when generating per-MD zone labels

- Skip legacy already-placed VMs which do not have placement info
- Skip VMs which do not have zone info (see the sketch after this commit message)

* Apply suggestions from code review

---------

Co-authored-by: Sagar Muchhal <[email protected]>
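A minimal sketch of the skip logic described in the commit above, using hypothetical stand-in types for the VirtualMachineGroup member status (the real VM Operator fields may differ):

```go
package sketch

// Hypothetical stand-ins for the VM Operator VirtualMachineGroup member status.
type memberPlacement struct {
	Zone string
}

type groupMember struct {
	Name      string
	Placement *memberPlacement
}

// zonesFromMembers builds a member-name -> zone map, skipping legacy
// already-placed VMs that carry no placement info and VMs without a zone.
func zonesFromMembers(members []groupMember) map[string]string {
	zones := map[string]string{}
	for _, m := range members {
		if m.Placement == nil || m.Placement.Zone == "" {
			continue
		}
		zones[m.Name] = m.Placement.Zone
	}
	return zones
}
```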
- Sync VSphereMachines during day-2 operations in VMG controller
- Only wait for all intended VSphereMachines during initial Cluster creation
- Use annotations in VMG for per-MD zone info (see the sketch below)

Signed-off-by: Gong Zhang <[email protected]>
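A hedged sketch of the per-MD zone annotations mentioned above. The prefix constant and helper are illustrative assumptions; the tests in this PR use `vmoperator.ZoneAnnotationPrefix` for the real key prefix.

```go
package sketch

import "fmt"

// zoneAnnotationPrefix is an assumed placeholder for vmoperator.ZoneAnnotationPrefix.
const zoneAnnotationPrefix = "zone.vmoperator.example"

// setPerMDZoneAnnotations records the placement decision for each
// MachineDeployment as an annotation on the VirtualMachineGroup, so that
// VMs created later (day-2 operations) can reuse the same zone.
func setPerMDZoneAnnotations(vmgAnnotations, zoneByMD map[string]string) {
	for md, zone := range zoneByMD {
		vmgAnnotations[fmt.Sprintf("%s/%s", zoneAnnotationPrefix, md)] = zone
	}
}
```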
- Add VMG reconciler unit test
- Bump VMOP due to API change
- Filter out VSphereMachine events except create/delete events (a predicate sketch follows below)

Signed-off-by: Gong Zhang <[email protected]>
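For the event filtering, a controller-runtime predicate along these lines is one way to let only create/delete events through; this is a sketch, not necessarily the exact predicate used in the PR.

```go
package sketch

import (
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// vSphereMachineCreateDelete drops update and generic events so the VMG
// reconciler is only triggered when VSphereMachines are created or deleted.
var vSphereMachineCreateDelete = predicate.Funcs{
	CreateFunc:  func(event.CreateEvent) bool { return true },
	DeleteFunc:  func(event.DeleteEvent) bool { return true },
	UpdateFunc:  func(event.UpdateEvent) bool { return false },
	GenericFunc: func(event.GenericEvent) bool { return false },
}
```

Such a predicate would typically be attached to the VSphereMachine watch, for example via builder.WithPredicates.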
@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 28, 2025
@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Oct 28, 2025
@zhanggbj zhanggbj changed the title [WIP][DoNotReview]Support Node Auto Placement and Node AF/AAF [WIP][DoNotReview]✨ Support Node Auto Placement and Node AF/AAF Oct 28, 2025
@zhanggbj zhanggbj force-pushed the node_auto_placement branch 3 times, most recently from 1160ce9 to 1cd61f9 on October 28, 2025 10:31
@zhanggbj
Contributor Author

/test ?

@k8s-ci-robot
Contributor

@zhanggbj: The following commands are available to trigger required jobs:

/test pull-cluster-api-provider-vsphere-e2e-govmomi-blocking-main
/test pull-cluster-api-provider-vsphere-e2e-govmomi-conformance-ci-latest-main
/test pull-cluster-api-provider-vsphere-e2e-govmomi-conformance-main
/test pull-cluster-api-provider-vsphere-e2e-govmomi-main
/test pull-cluster-api-provider-vsphere-e2e-govmomi-upgrade-1-34-1-35-main
/test pull-cluster-api-provider-vsphere-e2e-supervisor-blocking-main
/test pull-cluster-api-provider-vsphere-e2e-supervisor-conformance-ci-latest-main
/test pull-cluster-api-provider-vsphere-e2e-supervisor-conformance-main
/test pull-cluster-api-provider-vsphere-e2e-supervisor-main
/test pull-cluster-api-provider-vsphere-e2e-supervisor-upgrade-1-34-1-35-main
/test pull-cluster-api-provider-vsphere-e2e-vcsim-govmomi-main
/test pull-cluster-api-provider-vsphere-e2e-vcsim-supervisor-main
/test pull-cluster-api-provider-vsphere-test-main
/test pull-cluster-api-provider-vsphere-verify-main

The following commands are available to trigger optional jobs:

/test pull-cluster-api-provider-vsphere-apidiff-main
/test pull-cluster-api-provider-vsphere-janitor-main

Use /test all to run the following jobs that were automatically triggered:

pull-cluster-api-provider-vsphere-apidiff-main
pull-cluster-api-provider-vsphere-e2e-govmomi-blocking-main
pull-cluster-api-provider-vsphere-e2e-supervisor-blocking-main
pull-cluster-api-provider-vsphere-e2e-vcsim-govmomi-main
pull-cluster-api-provider-vsphere-e2e-vcsim-supervisor-main
pull-cluster-api-provider-vsphere-test-main
pull-cluster-api-provider-vsphere-verify-main

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@zhanggbj
Contributor Author

/test pull-cluster-api-provider-vsphere-e2e-govmomi-blocking-main
/test pull-cluster-api-provider-vsphere-e2e-govmomi-conformance-ci-latest-main
/test pull-cluster-api-provider-vsphere-e2e-govmomi-conformance-main
/test pull-cluster-api-provider-vsphere-e2e-govmomi-main
/test pull-cluster-api-provider-vsphere-e2e-govmomi-upgrade-1-34-1-35-main
/test pull-cluster-api-provider-vsphere-e2e-supervisor-blocking-main
/test pull-cluster-api-provider-vsphere-e2e-supervisor-conformance-ci-latest-main
/test pull-cluster-api-provider-vsphere-e2e-supervisor-conformance-main
/test pull-cluster-api-provider-vsphere-e2e-supervisor-main
/test pull-cluster-api-provider-vsphere-e2e-supervisor-upgrade-1-34-1-35-main
/test pull-cluster-api-provider-vsphere-e2e-vcsim-govmomi-main
/test pull-cluster-api-provider-vsphere-e2e-vcsim-supervisor-main
/test pull-cluster-api-provider-vsphere-test-main
/test pull-cluster-api-provider-vsphere-verify-main
/test pull-cluster-api-provider-vsphere-apidiff-main
/test pull-cluster-api-provider-vsphere-janitor-main

@zhanggbj zhanggbj force-pushed the node_auto_placement branch 7 times, most recently from 22604bf to 7c7c1a7 on November 3, 2025 08:55
@zhanggbj
Contributor Author

Hey @fabriziopandini @sbueringer,
All comments are addressed, including the ones missed by mistake last time. CI is now green. Please take another look, much appreciated!

Regarding the concern about race conditions, I have implemented checks to mitigate them during initial and post-placement, and added unit tests to cover those scenarios.

@zhanggbj zhanggbj force-pushed the node_auto_placement branch from 3ed2e93 to 9b1754a on November 25, 2025 06:44
@zhanggbj
Contributor Author

Pushed a small change and checking UT failures.

@zhanggbj zhanggbj force-pushed the node_auto_placement branch from 9b1754a to 2ac47d4 on November 25, 2025 09:25
@zhanggbj
Contributor Author

/test pull-cluster-api-provider-vsphere-e2e-supervisor-blocking-main

@zhanggbj
Contributor Author

/test pull-cluster-api-provider-vsphere-e2e-govmomi-blocking-main

Since Encryption Class requires the API from this newer version, bump the vm-operator package

Signed-off-by: Gong Zhang <[email protected]>
@zhanggbj zhanggbj force-pushed the node_auto_placement branch from fd72f09 to c41abeb on November 26, 2025 07:49
@zhanggbj zhanggbj force-pushed the node_auto_placement branch from b8c569e to 08b8304 on November 27, 2025 09:23
continue
}
if member.Placement != nil && member.Placement.Zone != "" {
log.Info(fmt.Sprintf("MachineDeployment %s has been placed to failure domanin %s", md.Name, member.Placement.Zone), "MachineDeployment", klog.KObj(&md))
Member

nit

Suggested change
log.Info(fmt.Sprintf("MachineDeployment %s has been placed to failure domanin %s", md.Name, member.Placement.Zone), "MachineDeployment", klog.KObj(&md))
log.Info(fmt.Sprintf("MachineDeployment %s has been placed to failure domain %s", md.Name, member.Placement.Zone), "MachineDeployment", klog.KObj(&md))

want bool
}{
{
name: "Should create a VMG if all the expected VSphereMachines exists",
Member

Let's add a unit test ensuring that a deleting vSphereMachine blocks VMG creation.
I would suggest using this test case as a setup and adding an additional deleting machine.

Contributor Author

@fabriziopandini
Trying to understand this scenario, does it mean a deleting vSphereMachine won't be added to the VMGroup member list?

If yes, this case should cover it: name: "Should create a VMG if all the expected VSphereMachines exists, deleting MD should be ignored"

Member

a deleting machine should not be ignored, it should block creation of a VMG

Contributor Author

I cannot recall that we discussed having a deleting machine block the creation of a VMG.

Is this the scaling-down case? IIRC, we agreed that scaling down an MD is allowed during any phase (before the VMs are created, or after the VMG is created but still during placement).

Member

This is what is in the current implementation, and it makes sense to me because it makes sure that we start a placement decision only when the list of vSphereMachines is stable.

Let's please add a unit test and do not change the code

Contributor Author

@fabriziopandini Oh, I get your point: we won't create a VMG if the expected VSphereMachines don't all exist yet (e.g. a VM is still being deleted). I misread it.

Added case:
name: "Should not create a VMG if the expected VSphereMachines do not exist yet"

clusterv1.ClusterNameLabel: cluster.Name,
},
Annotations: map[string]string{
fmt.Sprintf("%s/%s", vmoperator.ZoneAnnotationPrefix, "md3"): "zone1", // failureDomain for md3 is explicitly set by the user
Member

Let's add an annotation without the ZoneAnnotationPrefix and check it is preserved

Contributor Author

Updated.
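A hedged fragment of the kind of assertion asked for here, assuming the existing Gomega-based test and a hypothetical pre-existing annotation key (`user.example/keep-me`); the surrounding fixtures come from the existing test and are not shown:

```go
// An annotation that does not use the ZoneAnnotationPrefix must survive
// reconciliation untouched.
Expect(vmGroup.Annotations).To(HaveKeyWithValue("user.example/keep-me", "original-value"))
// The per-MD zone annotation written by the controller is still present.
Expect(vmGroup.Annotations).To(HaveKeyWithValue(
	fmt.Sprintf("%s/%s", vmoperator.ZoneAnnotationPrefix, "md3"), "zone1"))
```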

Namespace: cluster.Namespace,
Name: cluster.Name,
UID: types.UID("uid"),
Labels: map[string]string{
Member

Let's add a label different from the ClusterNameLabel and check it is preserved

Contributor Author

Updated.

Reason: infrav1.VSphereMachineVirtualMachineWaitingForVirtualMachineGroupV1Beta2Reason,
Message: fmt.Sprintf("Waiting for VSphereMachine's VirtualMachineGroup %s to exist", key),
})
log.V(4).Info(fmt.Sprintf("Waiting for VirtualMachineGroup %s, requeueing", key.Name), "VirtualMachineGroup", klog.KRef(key.Namespace, key.Name))
Member

Suggested change
log.V(4).Info(fmt.Sprintf("Waiting for VirtualMachineGroup %s, requeueing", key.Name), "VirtualMachineGroup", klog.KRef(key.Namespace, key.Name))
log.V(4).Info(fmt.Sprintf("Waiting for VirtualMachineGroup %s to exist, requeueing", key.Name), "VirtualMachineGroup", klog.KRef(key.Namespace, key.Name))

Contributor Author

Updated.

}

// Set the zone label using the annotation of the per-md zone mapping from VirtualMachineGroup.
// This is for new VMs created during day-2 operations when Node Auto Placement is enabled.
Member

Suggested change
// This is for new VMs created during day-2 operations when Node Auto Placement is enabled.
// This is for new VMs created after initial placement decision/with a failureDomain defined by the user.

Contributor Author

Updated.
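A hedged sketch of the logic this comment documents: for a VM created after the initial placement decision (or with a user-defined failureDomain), look up the MachineDeployment's zone from the VMG annotation and pin the VM via the topology label. Helper names and the prefix constant are assumptions; `corev1.LabelTopologyZone` and the prefixed annotation key follow the code in this PR.

```go
package sketch

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// zoneAnnotationPrefix stands in for vmoperator.ZoneAnnotationPrefix.
const zoneAnnotationPrefix = "zone.vmoperator.example"

// applyZoneLabel copies the per-MD zone recorded as an annotation on the
// VirtualMachineGroup onto a new VM's labels, so VMs created during day-2
// operations land in the zone chosen by the initial placement decision.
func applyZoneLabel(vmLabels, vmgAnnotations map[string]string, mdName string) {
	zone, ok := vmgAnnotations[fmt.Sprintf("%s/%s", zoneAnnotationPrefix, mdName)]
	if !ok || zone == "" {
		return // no recorded placement for this MachineDeployment
	}
	vmLabels[corev1.LabelTopologyZone] = zone
}
```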

client.MatchingLabels{clusterv1.ClusterNameLabel: supervisorMachineCtx.Cluster.Name}); err != nil {
return false, err
}
othermMDNames := []string{}
Member

nit

Suggested change
othermMDNames := []string{}
otherMDNames := []string{}

Contributor Author

Good eye:)

})

// control plane machine is the machine with the control plane label set
Specify("Reconcile valid control plane Machine", func() {
Member

Let's add a check that the CP machine is placed in the expected zone

Contributor Author

Added an additional check that the CP machine's vmopVM has the Labels[corev1.LabelTopologyZone] label set, which is the contract between CAPV and VMOP.
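The added control-plane check might look roughly like this Gomega fragment, assuming the reconciled vmopVM and an `expectedZone` value (hypothetical name) are in scope:

```go
// The topology zone label on the VM Operator VirtualMachine is the
// CAPV<->VMOP contract, so the control-plane machine must carry the
// expected failure domain.
Expect(vmopVM.Labels).To(HaveKeyWithValue(corev1.LabelTopologyZone, expectedZone))
```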

vmGroup *vmoprv1.VirtualMachineGroup
)

BeforeEach(func() {
Member

Looks like we are missing a test for VMG not created yet

Contributor Author

@zhanggbj zhanggbj Dec 15, 2025

Added a new case to test
"requeue the Machine waiting for VirtualMachineGroup creation"


machineNotMember.Spec.Bootstrap.DataSecretName = &secretName

By("VirtualMachine is not created")
Member

Let's check the VM actually does not exist, not only the result

Contributor Author

@zhanggbj zhanggbj Dec 15, 2025

Updated.

@zhanggbj
Contributor Author

/test pull-cluster-api-provider-vsphere-e2e-vcsim-govmomi-main

@zhanggbj
Contributor Author

@fabriziopandini @sbueringer
Except for the comment that is still open here: #3655 (comment),

I've addressed the other comments. Could you please take a look at the rest first? Thanks!

@fabriziopandini
Member

I have answered the comment; please fix it first.

@zhanggbj zhanggbj force-pushed the node_auto_placement branch from 68f5b16 to ea10895 on December 16, 2025 07:26
- Add missing test cases
- Miscellaneous fixes

Signed-off-by: Gong Zhang <[email protected]>
@zhanggbj zhanggbj force-pushed the node_auto_placement branch from ea10895 to 6d76fbd on December 16, 2025 08:08
@zhanggbj
Contributor Author

zhanggbj commented Dec 17, 2025

pull-cluster-api-provider-vsphere-e2e-supervisor-blocking-main failed due to an infra failure and I'm not able to re-trigger it right now. I need to retry later.

@kubernetes-sigs kubernetes-sigs deleted a comment from k8s-ci-robot Dec 17, 2025
@sbueringer
Member

@zhanggbj You have to use / test

/test pull-cluster-api-provider-vsphere-e2e-supervisor-blocking-main

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 19, 2025
@k8s-ci-robot
Contributor

PR needs rebase.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
