Conversation

@matbesancon
Contributor

No description provided.

@matbesancon
Contributor Author

I can't run the GPU tests locally but wanted to see if this fixed the CI

@matbesancon
Contributor Author

ping @devmotion if you can allow CI :)

@matbesancon
Contributor Author

I was about to fix the tolerance in the simple GPU quadratic regularization, but this might actually be an algorithmic issue rather than a numerical one: both the CPU and GPU versions issue the warning that the semi-smooth Newton method didn't converge.

This also occurs if I increase `maxiter` to 1000, which is already far too high for a Newton-type method.
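For context on why 1000 iterations is "too high": near the solution, a Newton-type method converges quadratically, so the error roughly squares at each step and convergence takes a handful of iterations, not hundreds. A minimal self-contained sketch (plain scalar Newton, not the package's semi-smooth variant):

```julia
# Plain Newton's method on f(x) = x^2 - 2, illustrating quadratic convergence.
# Returns the approximate root and the number of iterations used.
function newton_sqrt2(; maxiter=50, tol=1e-12)
    f(x) = x^2 - 2
    f′(x) = 2x
    x = 1.0
    for k in 1:maxiter
        abs(f(x)) ≤ tol && return x, k - 1  # converged
        x -= f(x) / f′(x)                   # Newton update
    end
    return x, maxiter                        # hit the iteration cap
end

x, iters = newton_sqrt2()
```

Plain Newton on `x² − 2 = 0` reaches near machine precision in about five steps; a solver still making no progress at iteration 500 is stuck, not slow.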

@matbesancon
Contributor Author

TIL sources in Project.toml 😮
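For readers who, like the author here, had not seen this feature: recent Julia/Pkg versions (1.11+, if I recall correctly) support a `[sources]` section in `Project.toml` for pinning where unregistered or development dependencies come from. A hypothetical sketch (the package names, URL, and path below are placeholders):

```toml
[sources]
SomePkg = {url = "https://github.com/SomeOrg/SomePkg.jl", rev = "main"}
LocalPkg = {path = "../LocalPkg"}
```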

@matbesancon
Contributor Author

The semi-smooth Newton method is stalling at this value:

```text
┌ Debug: Semi-smooth Newton algorithm (486/500: absolute error of source marginal = 0.0075981272
└ @ OptimalTransport ~/.julia/dev/OptimalTransport/src/quadratic_newton.jl:200
┌ Debug: Semi-smooth Newton algorithm (487/500: absolute error of source marginal = 0.0075981272
└ @ OptimalTransport ~/.julia/dev/OptimalTransport/src/quadratic_newton.jl:200
⋮  (iterations 488–499: identical output)
┌ Debug: Semi-smooth Newton algorithm (500/500: absolute error of source marginal = 0.0075981272
└ @ OptimalTransport ~/.julia/dev/OptimalTransport/src/quadratic_newton.jl:200
┌ Warning: Semi-smooth Newton algorithm (500/500): not converged
```

@matbesancon
Contributor Author

alright, slowly getting there

```julia
G[(M + 1):end, 1:M] .= σ'
# G[diagind(G)] .+= δ # regularise cg
G += δ * I
view(G, diagind(G)) .+= δ # regularise cg
```
Member

@matbesancon this shouldn't matter, but here `G` wasn't updated in-place.

Contributor Author

So `G` wasn't regularized at all on the current master then?

Member

It was; it's only used in this function, so it doesn't matter whether it's updated in-place or not. It just unintentionally caused an additional allocation in every iteration, which one would like to get rid of completely by caching `G` upfront.
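The distinction can be sketched in a few lines, assuming a toy matrix `G` and regularization strength `δ` (the names only mirror the snippet above; nothing here is taken from the package):

```julia
using LinearAlgebra

δ = 1e-3
G = zeros(3, 3)  # stand-in for a pre-allocated/cached matrix

# Out-of-place: `G + δ * I` allocates a fresh matrix and rebinds G,
# so a cached buffer would be replaced on every iteration.
H = G + δ * I

# In-place: writes δ directly onto the diagonal of the existing G,
# with no intermediate allocation.
view(G, diagind(G)) .+= δ

G == H  # both yield the same regularised matrix
```

Both forms produce identical results; the in-place one just lets a pre-allocated `G` be reused across iterations.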

@matbesancon
Contributor Author

still no convergence of the semi-smooth Newton in the GPU example but well, might just be inherent to the method? 🤷

@devmotion
Member

> still no convergence of the semi-smooth Newton in the GPU example but well, might just be inherent to the method? 🤷

Yeah, I'm not completely sure. My impression is that the tests are quite bad - other methods also show convergence warnings and even POT fails/warns on some test examples.

@codecov

codecov bot commented Aug 6, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 94.28%. Comparing base (9da044c) to head (3394f67).
⚠️ Report is 1 commit behind head on master.

Additional details and impacted files
```diff
@@            Coverage Diff             @@
##           master     #194      +/-   ##
==========================================
- Coverage   95.30%   94.28%   -1.02%     
==========================================
  Files          14       14              
  Lines         681      683       +2     
==========================================
- Hits          649      644       -5     
- Misses         32       39       +7     
```

☔ View full report in Codecov by Sentry.

@matbesancon
Contributor Author

I'm a bit puzzled that this is still not converging (from the Buildkite logs); I got it to work fine locally with the latest version.

@matbesancon
Contributor Author

But at least CI is green and we can update to the latest packages 🥳

@devmotion
Member

> I'm a bit puzzled that this is still not converging (from buildkite logs), I got it to work fine locally with the latest version

Did you run the GPU test locally (without GPUs)? Or the non-GPU test? The GPU and non-GPU tests are different, e.g., the former uses random histograms whereas the latter uses uniform marginal measures, and they use different regularization parameters.

@matbesancon matbesancon changed the title Update compat - fixed tolerances Update compatibility bounds, fixed tolerances and issues in quadratic regularization Aug 6, 2025
@matbesancon
Contributor Author

> Did you run the GPU test locally (without GPUs)?

I ran the CPU part of the GPU test. Since both warnings are produced, the algorithm doesn't converge regardless of the hardware it runs on.

@devmotion
Member

I just checked out the PR locally. I don't see any convergence issues in the GPU tests: everything works and converges as expected 😕 Only if I decrease the regularization parameter to very small values (e.g. `1f-5`) do I see convergence issues.

@matbesancon
Contributor Author

I would be surprised if Buildkite were caching anything. For now, we could merge this and check later whether this is still an issue?

@devmotion
Member

No, I'm not worried about caches; most likely it's just due to the different hardware architecture (I'm running it on an Apple M2). But it shows that there most likely isn't a general problem with the algorithm and/or the test, but at most a slightly brittle test.

I went through the tests more generally and fixed and improved a few things. I think the regularization parameters were generally unnecessarily small (which caused convergence failures in a few cases, both in OptimalTransport and POT). Additionally, some comparisons with POT were failing only due to a too-high threshold/tolerance in POT. Moreover, the comparisons of the unbalanced Sinkhorn algorithm with POT were actually failing because POT uses a different regularization by default; so instead of adjusting the tolerances, it was sufficient to tell POT to also use entropy regularization.

@matbesancon
Contributor Author

@devmotion I think that's good to go?

@devmotion devmotion merged commit af842b5 into JuliaOptimalTransport:master Aug 6, 2025
16 of 17 checks passed
@matbesancon matbesancon deleted the update_compat branch August 6, 2025 19:58
@matbesancon matbesancon mentioned this pull request Aug 6, 2025