Skip to content

Conversation

@aamijar
Copy link
Member

@aamijar aamijar commented Oct 21, 2025

Resolves #7071

This branch should be retargeted to 25.12

This PR introduces the SpectralClustering estimator which has python bindings that is similar to sklearn. The spectral clustering implementation from cuvs is called under the hood.

Here is a plot comparison of sklearn vs cuml.

spectral_clustering_comparison

@copy-pr-bot
Copy link

copy-pr-bot bot commented Oct 21, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions github-actions bot added Cython / Python Cython or Python issue CMake CUDA/C++ labels Oct 21, 2025
@aamijar aamijar added algo: spectral-embedding non-breaking Non-breaking change feature request New feature or request labels Oct 21, 2025
@aamijar aamijar marked this pull request as ready for review November 21, 2025 10:42
@aamijar aamijar requested review from a team as code owners November 21, 2025 10:42
@aamijar aamijar requested review from KyleFromNVIDIA, betatim and jinsolp and removed request for KyleFromNVIDIA, betatim and jinsolp November 21, 2025 10:42
Comment on lines 284 to 285
eigen_tol : float, default=0.0
Tolerance for the eigensolver. 0.0 uses the default solver tolerance.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use default: "auto"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in d8aeb06

Comment on lines 286 to 291
affinity : {'nearest_neighbors', 'precomputed'}, default='precomputed'
How to construct the affinity matrix.
- 'nearest_neighbors' : construct the affinity matrix by computing a
graph of nearest neighbors from the input data.
- 'precomputed' : interpret X as a precomputed affinity matrix,
where larger values indicate greater similarity between instances.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the default for the affinity parameter should be nearest_neighbors to match the user expectation that typically, we do not have to perform additional computations to train an estimator and to more closely match the scikit-learn API.

That would also be in line with the estimator docs:

When calling ``fit``, an affinity matrix is constructed using a
k-nearest neighbors connectivity matrix.

Alternatively, a user-provided affinity matrix can be specified by
setting ``affinity='precomputed'``.

Copy link
Member Author

@aamijar aamijar Nov 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Changing to default of nearest_neighbors solved the test_input_estimators CI fail. Addressed in 99f7268

Comment on lines 54 to 65
sk_spectral = skSpectralClustering(
n_clusters=n_clusters,
affinity="precomputed",
random_state=42,
)
y_sklearn = sk_spectral.fit_predict(knn_graph)

cuml_spectral = SpectralClustering(
n_clusters=n_clusters,
affinity="precomputed",
random_state=42,
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest we place shared parameters into a params dict. In this way it's a bit easier to see where there are differences between sklearn's and cuml's API.

Copy link
Member Author

@aamijar aamijar Nov 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, also removes repeated code. Addressed in 78e74e6

Comment on lines 115 to 121
knn_graph = kneighbors_graph(
X_np,
n_neighbors=30,
mode="connectivity",
include_self=True,
)
knn_graph = 0.5 * (knn_graph + knn_graph.T)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be much easier to run these tests with affinity=nearest_neighbors? Or why would the output_type handling be affected somehow? If that's the case, then we should run this for both affinity="nearest_neighbors" and affinity="precomputed".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 309e006

cdef int affinity_nnz = 0

if affinity == "nearest_neighbors":
from cuml.internals.input_utils import input_to_cupy_array
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for a delayed import IMO.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 05dce17

Comment on lines 134 to 137
>>> from sklearn.neighbors import kneighbors_graph
>>> from cuml.cluster import spectral_clustering
>>> X = cp.random.rand(100, 10, dtype=cp.float32)
>>> A = kneighbors_graph(cp.asnumpy(X), n_neighbors=10, include_self=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd recommend to have an example that does not require scikit-learn and that perhaps avoids the need for a pre-computed graph.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 86a3cb1

Comment on lines 164 to 169
if A.dtype != np.float32:
A = A.astype("float32")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd recommend to issue a UserWarning when we trigger the conversion. User can avoid it by converting prior to the call.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 30db43a

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to issue that warning before the if: else: block since we might be coercing on lines 171, and 173 as well.

@aamijar aamijar removed request for a team and gforsyth November 25, 2025 06:57
@aamijar
Copy link
Member Author

aamijar commented Nov 25, 2025

/ok to test

@copy-pr-bot
Copy link

copy-pr-bot bot commented Nov 25, 2025

/ok to test

@aamijar, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

@aamijar
Copy link
Member Author

aamijar commented Nov 25, 2025

/ok to test 78e89dd

@aamijar aamijar added the New Algorithm For tracking new algorithms that will be added to our existing collection label Nov 25, 2025
@csadorf
Copy link
Contributor

csadorf commented Nov 25, 2025

/ok to test 3b07729

Comment on lines 164 to 169
if A.dtype != np.float32:
A = A.astype("float32")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to issue that warning before the if: else: block since we might be coercing on lines 171, and 173 as well.

Comment on lines 124 to 125
labels : cupy.ndarray of shape (n_samples,)
Cluster labels for each sample.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this actually true? Or are we returning numpy arrays etc. if X is provided as a numpy array?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 241e2ff

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about the return value of the spectral_clustering() function?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 3c9316a

Copy link
Contributor

@jinsolp jinsolp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @aamijar cpp side looks good! Small comments on the python side

@aamijar
Copy link
Member Author

aamijar commented Nov 25, 2025

/ok to test

@copy-pr-bot
Copy link

copy-pr-bot bot commented Nov 25, 2025

/ok to test

@aamijar, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

@aamijar
Copy link
Member Author

aamijar commented Nov 25, 2025

/ok to test 9a2818e

@aamijar
Copy link
Member Author

aamijar commented Nov 25, 2025

/ok to test f74949d

1 similar comment
@aamijar
Copy link
Member Author

aamijar commented Nov 26, 2025

/ok to test f74949d

@csadorf
Copy link
Contributor

csadorf commented Nov 26, 2025

/ok to test 8f69c01

Copy link
Contributor

@csadorf csadorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Nice work!

Copy link
Contributor

@jinsolp jinsolp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @aamijar 👍

@csadorf
Copy link
Contributor

csadorf commented Nov 26, 2025

/merge

@rapids-bot rapids-bot bot merged commit 63e26e7 into rapidsai:release/25.12 Nov 26, 2025
107 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

algo: spectral-embedding CMake CUDA/C++ Cython / Python Cython or Python issue feature request New feature or request New Algorithm For tracking new algorithms that will be added to our existing collection non-breaking Non-breaking change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEA] Implement SpectralClustering estimator

5 participants