Conversation

Contributor

@zhuxr11 zhuxr11 commented Oct 12, 2025

Close #4560

This PR implements the PCA sign-flip algorithm so that the entry with the maximum absolute value in each row of components is always positive, leaving trans_input unchanged, just like sklearn.decomposition.PCA.

In the _mg version of PCA, components is not chunked, so the _mg version can reuse the sign flipping from the single-GPU version (by setting stream = streams[0]). After this PR, cuml/decomposition/sign_flip_mg.hpp and cpp/src/pca/sign_flip_mg.cu are no longer in use (but the files are not removed).
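The target convention can be sketched in NumPy (the function name here is hypothetical; the actual implementation lives in CUDA C++):

```python
import numpy as np

def sign_flip_components(components):
    # For each row, find the index of the entry with maximum absolute value.
    max_abs_idx = np.argmax(np.abs(components), axis=1)
    rows = np.arange(components.shape[0])
    # Sign of that entry; treat an exact zero as positive.
    signs = np.sign(components[rows, max_abs_idx])
    signs[signs == 0] = 1.0
    # Multiply each row by its sign so the max-abs entry becomes positive.
    return components * signs[:, None]
```

Because each row is multiplied by ±1, the spanned subspace and explained variance are unchanged; only the sign convention is normalized.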

@zhuxr11 zhuxr11 requested review from a team as code owners October 12, 2025 05:57
@zhuxr11 zhuxr11 requested a review from dantegd October 12, 2025 05:57

copy-pr-bot bot commented Oct 12, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


Contributor

@lowener lowener left a comment

This looks pretty good, thanks for the PR.
Can you also modify the assertion in test_pca.py to see if we can re-establish with_sign=True in the comparison tests?
https://github.com/rapidsai/cuml/blob/branch-25.12/python/cuml/tests/test_pca.py#L187

@csadorf csadorf added the improvement, non-breaking, and bug labels and removed the improvement label on Oct 20, 2025
Contributor Author

zhuxr11 commented Oct 22, 2025

This looks pretty good, thanks for the PR. Can you also modify the assertion in test_pca.py to see if we can re-establish with_sign=True in the comparison tests? https://github.com/rapidsai/cuml/blob/branch-25.12/python/cuml/tests/test_pca.py#L187

Ok, working on it.

Contributor Author

zhuxr11 commented Oct 22, 2025

This looks pretty good, thanks for the PR. Can you also modify the assertion in test_pca.py to see if we can re-establish with_sign=True in the comparison tests? https://github.com/rapidsai/cuml/blob/branch-25.12/python/cuml/tests/test_pca.py#L187

Done. Now the tests pass with with_sign=True on my computer.

@zhuxr11 zhuxr11 requested a review from lowener October 22, 2025 16:44
@divyegala
Member

/ok to test 711ac27

raft::handle_t handle{stream};
raft::linalg::map_offset(
handle,
raft::make_device_matrix_view<math_t, std::size_t>(components, n_rows, n_cols),
Member

Could you please use the device_matrix_view based API above in raft::linalg::reduce as well?

Contributor Author

Could you please use the device_matrix_view based API above in raft::linalg::reduce as well?

Resolved in #4992cef

Contributor Author

Could you please use the device_matrix_view based API above in raft::linalg::reduce as well?

In #43660b8, I found a possible bug in the device_matrix_view overload of raft::linalg::reduce and filed an issue. For now, I added raft::linalg::reduce2 to compute the max-absolute-value entry per row, listed here:

https://github.com/zhuxr11/cuml/blob/43660b8e6b45b32acb566f087d50bf929bf9bc33/cpp/src/tsvd/tsvd.cuh#L46-L92

This should fix the error in computing max_vals in #4992cef.

Contributor

Looks like the bug was resolved?

Contributor

@divyegala do you want this to be changed prior to merge?

@divyegala
Member

@zhuxr11 hi, the check-style CI job is failing. Please run pre-commit to automatically fix.


@greptile-apps greptile-apps bot left a comment


Greptile Overview

Greptile Summary

This PR corrects PCA's sign-flip algorithm to match scikit-learn's behavior: ensuring the maximum absolute value in each component row is positive, while leaving transformed data unchanged. Previously, cuML flipped signs on transformed outputs, causing discrepancies with sklearn. The fix introduces signFlipComponents in cpp/src/tsvd/tsvd.cuh, calls it after computing components in both single-GPU (pca.cuh) and multi-GPU (pca_mg.cu) code paths, and removes incorrect sign flipping from transformed data. Tests now enforce exact sign matching (with_sign=True) instead of ignoring signs. The multi-GPU version reuses the single-GPU implementation because components are not partitioned.

Potential Issues

Critical: Non-deterministic sign behavior in reduction lambda (cpp/src/tsvd/tsvd.cuh:155-159)
The reduction lambda returns the element with max absolute value by comparing abs_a > abs_b ? a : b. When multiple elements share the same max absolute value with opposite signs (e.g., [3.0, -3.0]), which element is selected depends on evaluation order. This creates non-deterministic signs across runs.
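The tie case flagged above can be reproduced on the CPU. This sketch emulates the pairwise reduction `abs_a > abs_b ? a : b` under two different evaluation orders (names are illustrative, not the PR's code):

```python
from functools import reduce

def maxabs_pick(values, order):
    # Emulate the reduction lambda: keep a if |a| > |b|, else keep b.
    # On a tie in absolute value, the second operand wins, so the result
    # depends on the order in which the reduction visits the elements.
    return reduce(lambda a, b: a if abs(a) > abs(b) else b,
                  (values[i] for i in order))

row = [3.0, -3.0]
left_to_right = maxabs_pick(row, [0, 1])  # tie resolved to -3.0
right_to_left = maxabs_pick(row, [1, 0])  # tie resolved to 3.0
```

A parallel GPU reduction has no fixed visitation order, which is why the selected sign can vary across runs; breaking ties by column index would make it deterministic.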

Critical: Inconsistent test expectation (python/cuml/tests/test_pca.py:81)
Line 81 explicitly sets with_sign=False for components_ comparisons: with_sign = False if attr in ["components_"] else True. This contradicts the PR's goal of matching sklearn's sign convention and will allow incorrect sign flipping to pass undetected. This appears to be leftover from the old implementation.

High: Pointer aliasing in map_offset lambda (cpp/src/tsvd/tsvd.cuh:167-170)
The lambda captures both the raw components pointer and a device_matrix_view wrapping the same data. Reading from max_vals[row] and writing to components[idx] creates potential aliasing issues, though CUDA's memory model may make this safe in practice.

Medium: Handle construction from stream only (cpp/src/tsvd/tsvd.cuh:163)
Constructing a new raft::handle_t from just a stream (raft::handle_t handle{stream}) may lose synchronization context or resource manager state from the original handle passed to parent functions.

Low: Matrix dimension semantics unclear (cpp/src/tsvd/tsvd.cuh:148-149)
The code passes n_rows, n_cols to reduce, but whether this matches the actual component matrix layout (column-major vs row-major) should be verified against the rest of the PCA/TSVD implementation.

Confidence: 3/5

The core concept is sound, but the reduction lambda's non-determinism and the test inconsistency at line 81 could cause intermittent failures or mask bugs. Verify the reduction logic handles ties correctly and update the test to enforce with_sign=True for components.

Additional Comments (1)

  1. python/cuml/tests/test_pca.py, line 81 (link)

    logic: test assertion was changed from with_sign=False to with_sign=True but line 81 still sets it to False - this contradicts the PR's intent to enable sign checking

4 files reviewed, 3 comments


The behavior in standard cuML is no longer dependent on the installed scikit-learn version. For cuml.accel we adjust behavior based on the emulated scikit-learn version.
@csadorf
Contributor

csadorf commented Nov 7, 2025

/ok to test 40e4746

@jcrist
Member

jcrist commented Nov 7, 2025

/ok to test c444788

# Exposed to support sklearn's `get_feature_names_out`
return self.components_.shape[0]

def _flip_sign_u_based(self, components, X):
Contributor Author

@zhuxr11 zhuxr11 Nov 8, 2025

This logic is implemented in C++, so you may not need to re-implement it in Python, except for sparse matrices, where the SVD is computed via CuPy. Here, you can control whether to flip based on U or V (components) by setting flip_signs_based_on_U in pcaFit().
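For context, the two conventions (flipping based on U columns versus V rows) can be sketched in NumPy, along the lines of sklearn.utils.extmath.svd_flip; this is an illustration of the idea, not the cuML C++ code:

```python
import numpy as np

def svd_flip_sketch(u, vt, u_based=True):
    if u_based:
        # Sign of the max-abs entry in each column of U.
        idx = np.argmax(np.abs(u), axis=0)
        signs = np.sign(u[idx, np.arange(u.shape[1])])
    else:
        # Sign of the max-abs entry in each row of Vt.
        idx = np.argmax(np.abs(vt), axis=1)
        signs = np.sign(vt[np.arange(vt.shape[0]), idx])
    signs[signs == 0] = 1.0
    # Flipping matching U columns and Vt rows leaves U @ diag(S) @ Vt unchanged.
    return u * signs, vt * signs[:, None]
```

Either choice yields a deterministic sign convention; the two differ only in which factor anchors the decision, and flipping both factors together preserves the reconstruction.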

@csadorf
Contributor

csadorf commented Nov 13, 2025

/ok to test afb6c2c

@csadorf
Contributor

csadorf commented Nov 14, 2025

/ok to test ed7a860

Contributor

@csadorf csadorf left a comment

This is good to go IMO, assuming that tests pass.

@csadorf
Contributor

csadorf commented Nov 14, 2025

/ok to test 1916077

@csadorf
Contributor

csadorf commented Nov 14, 2025

/ok to test 23d3801

@csadorf
Contributor

csadorf commented Nov 14, 2025

/merge

@rapids-bot rapids-bot bot merged commit ccf23dc into rapidsai:main Nov 14, 2025
106 checks passed
@csadorf
Contributor

csadorf commented Nov 14, 2025

@zhuxr11 Thanks a lot for pulling through on this one! This has been a long-standing issue. Very much appreciated!!

@zhuxr11
Contributor Author

zhuxr11 commented Nov 14, 2025

@zhuxr11 Thanks a lot for pulling through on this one! This has been a long-standing issue. Very much appreciated!!

Thanks for all who have helped me along the way. Cheers!

rapids-bot bot pushed a commit that referenced this pull request Nov 17, 2025
The sign normalization behavior in `PCA`/`TruncatedSVD` now matches that of sklearn (#7331), we no longer need this callout in the `cuml.accel` limitations docs page.

Authors:
  - Jim Crist-Harif (https://github.com/jcrist)

Approvers:
  - Simon Adorf (https://github.com/csadorf)

URL: #7492
rapids-bot bot pushed a commit that referenced this pull request Dec 3, 2025
- Don't branch the sign flipping behavior based on the version of sklearn installed. This somehow slipped through in #7331. We always want `cuml` behavior to be the same regardless of sklearn version - the only thing we branch on is the testing where we don't assert sign matches for sklearn < 1.5 (this matches the single-gpu testing strategy as well).
- Adds a sync point in multi-gpu PCA before calling `signFlipComponents`. The multi-gpu implementation makes use of multiple streams, but before only the first stream was passed to `signFlipComponents` (without any sync beforehand) leading to potential stream ordering issues. It's hard to prove a negative, but with this change I can no longer reproduce an issue reported in `rapids_singlecell`. A similar fix isn't needed for `TruncatedSVD` since that implementation only uses one stream.

Authors:
  - Jim Crist-Harif (https://github.com/jcrist)

Approvers:
  - Simon Adorf (https://github.com/csadorf)

URL: #7560

Labels

bug Something isn't working CUDA/C++ Cython / Python Cython or Python issue non-breaking Non-breaking change

Development

Successfully merging this pull request may close these issues.

[BUG] PCA / TSVD sign flipping is unused and not correct

5 participants