-
Notifications
You must be signed in to change notification settings - Fork 606
SpectralClustering estimator #7372
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SpectralClustering estimator #7372
Conversation
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
| eigen_tol : float, default=0.0 | ||
| Tolerance for the eigensolver. 0.0 uses the default solver tolerance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use default: "auto"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in d8aeb06
| affinity : {'nearest_neighbors', 'precomputed'}, default='precomputed' | ||
| How to construct the affinity matrix. | ||
| - 'nearest_neighbors' : construct the affinity matrix by computing a | ||
| graph of nearest neighbors from the input data. | ||
| - 'precomputed' : interpret X as a precomputed affinity matrix, | ||
| where larger values indicate greater similarity between instances. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the default for the affinity parameter should be nearest_neighbors to match the user expectation that typically, we do not have to perform additional computations to train an estimator and to more closely match the scikit-learn API.
That would also be in line with the estimator docs:
When calling ``fit``, an affinity matrix is constructed using a k-nearest neighbors connectivity matrix. Alternatively, a user-provided affinity matrix can be specified by setting ``affinity='precomputed'``.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! Changing to default of nearest_neighbors solved the test_input_estimators CI fail. Addressed in 99f7268
| sk_spectral = skSpectralClustering( | ||
| n_clusters=n_clusters, | ||
| affinity="precomputed", | ||
| random_state=42, | ||
| ) | ||
| y_sklearn = sk_spectral.fit_predict(knn_graph) | ||
|
|
||
| cuml_spectral = SpectralClustering( | ||
| n_clusters=n_clusters, | ||
| affinity="precomputed", | ||
| random_state=42, | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest we place shared parameters into a params dict. In this way it's a bit easier to see where there are differences between sklearn's and cuml's API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, also removes repeated code. Addressed in 78e74e6
| knn_graph = kneighbors_graph( | ||
| X_np, | ||
| n_neighbors=30, | ||
| mode="connectivity", | ||
| include_self=True, | ||
| ) | ||
| knn_graph = 0.5 * (knn_graph + knn_graph.T) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't it be much easier to run these tests with affinity=nearest_neighbors? Or why would the output_type handling be affected somehow? If that's the case, then we should run this for both affinity="nearest_neighbors" and affinity="precomputed".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in 309e006
| cdef int affinity_nnz = 0 | ||
|
|
||
| if affinity == "nearest_neighbors": | ||
| from cuml.internals.input_utils import input_to_cupy_array |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for a delayed import IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in 05dce17
| >>> from sklearn.neighbors import kneighbors_graph | ||
| >>> from cuml.cluster import spectral_clustering | ||
| >>> X = cp.random.rand(100, 10, dtype=cp.float32) | ||
| >>> A = kneighbors_graph(cp.asnumpy(X), n_neighbors=10, include_self=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd recommend to have an example that does not require scikit-learn and that perhaps avoids the need for a pre-computed graph.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in 86a3cb1
| if A.dtype != np.float32: | ||
| A = A.astype("float32") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd recommend to issue a UserWarning when we trigger the conversion. User can avoid it by converting prior to the call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in 30db43a
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to issue that warning before the if: else: block since we might be coercing on lines 171, and 173 as well.
This PR updates the repository to version 26.02. This is part of the 25.12 release burndown process.
|
/ok to test |
@aamijar, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/ |
|
/ok to test 78e89dd |
|
/ok to test 3b07729 |
| if A.dtype != np.float32: | ||
| A = A.astype("float32") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to issue that warning before the if: else: block since we might be coercing on lines 171, and 173 as well.
| labels : cupy.ndarray of shape (n_samples,) | ||
| Cluster labels for each sample. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this actually true? Or are we returning numpy arrays etc. if X is provided as a numpy array?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in 241e2ff
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about the return value of the spectral_clustering() function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in 3c9316a
jinsolp
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @aamijar cpp side looks good! Small comments on the python side
Co-authored-by: Simon Adorf <[email protected]>
Co-authored-by: Simon Adorf <[email protected]>
|
/ok to test |
@aamijar, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/ |
|
/ok to test 9a2818e |
|
/ok to test f74949d |
1 similar comment
|
/ok to test f74949d |
|
/ok to test 8f69c01 |
csadorf
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Nice work!
jinsolp
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @aamijar 👍
|
/merge |
Resolves #7071
This branch should be retargeted to 25.12
This PR introduces the SpectralClustering estimator which has python bindings that is similar to sklearn. The spectral clustering implementation from cuvs is called under the hood.
Here is a plot comparison of sklearn vs cuml.