Fix PCA sign flip #7331
Conversation
…element of each eigen-vector should be positive.
lowener
left a comment
This looks pretty good, thanks for the PR.
Can you also modify the assertion in test_pca.py to see if we can re-establish with_sign=True in the comparison tests?
https://github.com/rapidsai/cuml/blob/branch-25.12/python/cuml/tests/test_pca.py#L187
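For illustration, a sign-aware comparison along these lines could look like the following NumPy sketch (the helper name, tolerance, and alignment strategy are hypothetical, not the actual test code in test_pca.py):

```python
import numpy as np

def assert_components_close(a, b, with_sign=True, tol=1e-3):
    """Hypothetical helper: compare component matrices exactly when
    with_sign=True, or only up to a per-row sign flip when with_sign=False."""
    if not with_sign:
        # Align each row of b to the sign of the matching row in a
        # before comparing, so opposite-sign rows still match.
        signs = np.sign(np.sum(a * b, axis=1))
        b = b * signs[:, None]
    np.testing.assert_allclose(a, b, atol=tol)

a = np.array([[1.0, -2.0], [0.5, 3.0]])
assert_components_close(a, -a, with_sign=False)  # passes: rows re-aligned
```

With `with_sign=True` the same call would fail, which is the stricter behavior the comparison tests would re-establish.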
Ok, working on it.
…e `raft::linalg::reduce()`.
…change tolerance to 1e-3
…n flip works properly.
Done. Now the tests pass with `with_sign=True`.
/ok to test 711ac27
cpp/src/tsvd/tsvd.cuh (Outdated)

    raft::handle_t handle{stream};
    raft::linalg::map_offset(
      handle,
      raft::make_device_matrix_view<math_t, std::size_t>(components, n_rows, n_cols),
Could you please use the `device_matrix_view` based API above in `raft::linalg::reduce` as well?
Resolved in #4992cef
In #43660b8, I found a possible bug in the `device_matrix_view` overload of `raft::linalg::reduce` and filed an issue. For now, I added `raft::linalg::reduce2` to compute the max-absolute value per row, listed here:
This should fix the error in computing max_vals in #4992cef.
Looks like the bug was resolved?
@divyegala do you want this to be changed prior to merge?
@zhuxr11 hi, the check-style CI job is failing. Please run …
Greptile Overview
Greptile Summary
This PR corrects PCA's sign-flip algorithm to match scikit-learn's behavior: ensuring the maximum absolute value in each component row is positive, while leaving transformed data unchanged. Previously, cuML flipped signs on transformed outputs, causing discrepancies with sklearn. The fix introduces signFlipComponents in cpp/src/tsvd/tsvd.cuh, calls it after computing components in both single-GPU (pca.cuh) and multi-GPU (pca_mg.cu) code paths, and removes incorrect sign flipping from transformed data. Tests now enforce exact sign matching (with_sign=True) instead of ignoring signs. The multi-GPU version reuses the single-GPU implementation because components are not partitioned.
Potential Issues
Critical: Non-deterministic sign behavior in reduction lambda (cpp/src/tsvd/tsvd.cuh:155-159)
The reduction lambda returns the element with max absolute value by comparing abs_a > abs_b ? a : b. When multiple elements share the same max absolute value with opposite signs (e.g., [3.0, -3.0]), which element is selected depends on evaluation order. This creates non-deterministic signs across runs.
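One order-independent way to express the selection (a NumPy sketch of the idea, not the actual CUDA reduction) is to pick by index rather than by pairwise comparison, so ties always resolve to the first occurrence:

```python
import numpy as np

def max_abs_per_row(m):
    # np.argmax returns the FIRST index at which the maximum occurs,
    # so a tie like [3.0, -3.0] always resolves to 3.0, independent
    # of any evaluation order.
    idx = np.argmax(np.abs(m), axis=1)
    return m[np.arange(m.shape[0]), idx]

vals = max_abs_per_row(np.array([[3.0, -3.0], [-2.0, 1.0]]))
# → [3.0, -2.0]
```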
Critical: Inconsistent test expectation (python/cuml/tests/test_pca.py:81)
Line 81 explicitly sets with_sign=False for components_ comparisons: with_sign = False if attr in ["components_"] else True. This contradicts the PR's goal of matching sklearn's sign convention and will allow incorrect sign flipping to pass undetected. This appears to be leftover from the old implementation.
High: Pointer aliasing in map_offset lambda (cpp/src/tsvd/tsvd.cuh:167-170)
The lambda captures both the raw components pointer and a device_matrix_view wrapping the same data. Reading from max_vals[row] and writing to components[idx] creates potential aliasing issues, though CUDA's memory model may make this safe in practice.
Medium: Handle construction from stream only (cpp/src/tsvd/tsvd.cuh:163)
Constructing a new raft::handle_t from just a stream (raft::handle_t handle{stream}) may lose synchronization context or resource manager state from the original handle passed to parent functions.
Low: Matrix dimension semantics unclear (cpp/src/tsvd/tsvd.cuh:148-149)
The code passes n_rows, n_cols to reduce, but whether this matches the actual component matrix layout (column-major vs row-major) should be verified against the rest of the PCA/TSVD implementation.
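To illustrate why the layout matters, the same flat buffer reduced "per row" lands on different elements depending on whether it is interpreted as column-major or row-major (a NumPy sketch, not the cuML code):

```python
import numpy as np

buf = np.arange(6, dtype=np.float64)        # one flat device-like buffer
col_major = buf.reshape((2, 3), order="F")  # [[0, 2, 4], [1, 3, 5]]
row_major = buf.reshape((2, 3), order="C")  # [[0, 1, 2], [3, 4, 5]]

# A "max abs per row" reduction gives different answers
# for the two interpretations of the same memory:
print(np.abs(col_major).max(axis=1))  # [4. 5.]
print(np.abs(row_major).max(axis=1))  # [2. 5.]
```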
Confidence: 3/5
The core concept is sound, but the reduction lambda's non-determinism and the test inconsistency at line 81 could cause intermittent failures or mask bugs. Verify the reduction logic handles ties correctly and update the test to enforce with_sign=True for components.
Additional Comments (1)
- python/cuml/tests/test_pca.py, line 81 (link) logic: the test assertion was changed from `with_sign=False` to `with_sign=True`, but line 81 still sets it to `False`. This contradicts the PR's intent to enable sign checking.
4 files reviewed, 3 comments
The behavior in cuML standard is no longer dependent on the installed scikit-learn version. For cuml.accel we adjust behavior based on the emulated scikit-learn version.
/ok to test 40e4746

/ok to test c444788
    # Exposed to support sklearn's `get_feature_names_out`
    return self.components_.shape[0]

    def _flip_sign_u_based(self, components, X):
This logic is implemented in C++ code, so you may not need to re-implement it in Python, except for sparse matrices, where SVD is computed via CuPy. Here, you can control whether to flip by U or V (components) by setting `flip_signs_based_on_U` in `pcaFit()`.
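For background, the two conventions mentioned here can be sketched in NumPy (the helper and its `based_on_u` flag are illustrative, not the cuML or `pcaFit()` API): the sign reference comes either from the largest-magnitude entry in each column of U, or from each row of V^T, and the same flip is applied to both factors so the product is unchanged.

```python
import numpy as np

def flip_signs(u, vt, based_on_u=True):
    # Illustrative sketch: choose each component's sign reference
    # from U's columns or from V^T's rows, then flip BOTH factors
    # so that (u * s) @ vt is unchanged.
    if based_on_u:
        ref = u[np.argmax(np.abs(u), axis=0), np.arange(u.shape[1])]
    else:
        ref = vt[np.arange(vt.shape[0]), np.argmax(np.abs(vt), axis=1)]
    signs = np.where(ref < 0, -1.0, 1.0)
    return u * signs, vt * signs[:, None]

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 4))
u, s, vt = np.linalg.svd(x, full_matrices=False)
u2, vt2 = flip_signs(u, vt, based_on_u=False)

# The reconstruction is unchanged by the flip.
assert np.allclose((u2 * s) @ vt2, x)
```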
/ok to test afb6c2c

/ok to test ed7a860
csadorf
left a comment
This is good to go IMO, assuming that tests pass.
This reverts commit 5ca1361.
/ok to test 1916077

/ok to test 23d3801

/merge

@zhuxr11 Thanks a lot for pulling through on this one! This has been a long-standing issue. Very much appreciated!!

Thanks to all who have helped me along the way. Cheers!
The sign normalization behavior in `PCA`/`TruncatedSVD` now matches that of sklearn (#7331), so we no longer need this callout in the `cuml.accel` limitations docs page.

Authors:
- Jim Crist-Harif (https://github.com/jcrist)

Approvers:
- Simon Adorf (https://github.com/csadorf)

URL: #7492
- Don't branch the sign flipping behavior based on the version of sklearn installed. This somehow slipped through in #7331. We always want `cuml` behavior to be the same regardless of sklearn version; the only thing we branch on is the testing, where we don't assert sign matches for sklearn < 1.5 (this matches the single-GPU testing strategy as well).
- Adds a sync point in multi-GPU PCA before calling `signFlipComponents`. The multi-GPU implementation makes use of multiple streams, but before, only the first stream was passed to `signFlipComponents` (without any sync beforehand), leading to potential stream ordering issues. It's hard to prove a negative, but with this change I can no longer reproduce an issue reported in `rapids_singlecell`. A similar fix isn't needed for `TruncatedSVD` since that implementation only uses one stream.

Authors:
- Jim Crist-Harif (https://github.com/jcrist)

Approvers:
- Simon Adorf (https://github.com/csadorf)

URL: #7560
Close #4560

This PR implemented a PCA sign flip algorithm such that the max absolute value in each row is always positive in `components`, leaving `trans_input` unchanged, just like `sklearn.decomposition.PCA`. In the `_mg` version of PCA, `components` is not chunked. Thus, the `_mg` version can reuse the sign flipping from the single-GPU version (by setting `stream = streams[0]`). In this PR, `cuml\decomposition\sign_flip_mg.hpp` and `cpp\src\pca\sign_flip_mg.cu` are not in use (but the files are not removed).
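As a quick illustration of the convention described above (a NumPy sketch with made-up values, not the cuML kernel): after the flip, the largest-magnitude entry in every row of `components` is positive, and rows whose max-abs entry was already positive are left untouched.

```python
import numpy as np

# Made-up component matrix; values are illustrative only.
components = np.array([[0.2, -0.8, 0.1],
                       [-0.6, 0.3, 0.7],
                       [0.9, 0.1, -0.4]])

idx = np.argmax(np.abs(components), axis=1)     # max-abs column per row
signs = np.sign(components[np.arange(3), idx])  # sign of that entry
flipped = components * signs[:, None]           # flip whole rows at once

# Every row's largest-magnitude entry is now positive.
assert (flipped[np.arange(3), idx] > 0).all()
```

Only row 0 is flipped here (its max-abs entry, -0.8, is negative); rows 1 and 2 already satisfy the convention and pass through unchanged.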