-
Notifications
You must be signed in to change notification settings - Fork 137
[torch] Use safer 'devrocm' version suffixes instead of 'rocmsdk' #2392
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
HereThereBeDragons
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think i am missing a few explanations here. e.g. why you change to devrocm.
and we probably can put the derive_suffix just as part of the build_prod_wheels?
| type: string | ||
| rocm_version: | ||
| description: ROCm version to pip install | ||
| description: ROCm version to pip install (e.g. "7.10.0a20251124") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if i see it correctly these changes are already included in #2394
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I made this PR first but decided the description changes would be a good fit there too. I'll remove them from this PR.
| self.assertEqual( | ||
| optional_build_prod_arguments, | ||
| "--rocm-sdk-version ==7.0.0.dev0+515115ea2cb85a0b71b5507ce56a627d14c7ae73 --version-suffix +rocm7.0.0.dev0-515115ea2cb85a0b71b5507ce56a627d14c7ae73", | ||
| "--rocm-sdk-version ==7.0.0.dev0+515115ea2cb85a0b71b5507ce56a627d14c7ae73 --version-suffix +devrocm7.0.0.dev0-515115ea2cb85a0b71b5507ce56a627d14c7ae73", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no? i dont think this is correct to change to devrocm. i think it needs to stay rocm
| # | ----------- | ------------------ | -------------------------- | | ||
| # | final | 7.10.0 | +rocm7.10.0 | | ||
| # | nightly | 7.10.0a20251124 | +rocm7.10.0a20251124 | | ||
| # | dev | 7.10.0.dev0+efed3c | +devrocm7.10.0.dev0-efed3c | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as i wrote above, i dont understand why devrocm is used. we are not using it.
e.g.
torch-2.10.0a0+rocm7.10.0.dev0.c7bfa586d12394ce5585c974d969985a72145a92-cp311-cp311-win_amd64.whl
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That dev version is a problem:
>>> from packaging.version import Version, parse
>>> Version("2.10.0a0+rocm7.10.0.dev0.c7bfa58") > Version("2.10.0a0+rocm7.10.0")
TrueFinal versions should be considered newer than dev versions, but the text after the + follows the rules of https://packaging.python.org/en/latest/specifications/version-specifiers/#local-version-identifiers
Additionally a local version with a great number of segments will always compare as greater than a local version with fewer segments, as long as the shorter local version’s segments match the beginning of the longer local version’s segments exactly.
Nightly versions do not have the same issue:
>>> from packaging.version import Version, parse
>>> Version("2.10.0a0+rocm7.10.0a20251203") > Version("2.10.0a0+rocm7.10.0")
FalseThe difference there is the 0a20251203 component is newer than 0 due to
When comparing a numeric and lexicographic segment, the numeric section always compares as greater than the lexicographic segment.
We could do this:
-7.10.0.dev0.c7bfa58
+7.10.0dev0.c7bfa58but then a dev version would still be newer than an alpha/nightly version:
>>> from packaging.version import Version, parse
>>> Version("2.10.0a+7.10.0dev0.c7bfa58") > Version("2.10.0a0+rocm7.10.0a20251203")
TrueBack on #965, @marbre said about using rocmsdk instead of rocm for dev versions:
The only downside of the proposed local version identifier is that the ones used for the dev packages result in a version that is evaluated larger as the ones of the nightly packages. With this, dev packages and nightly releases cannot be easily mixed and we might want to revise this later.
So here I'm flipping that "evaluated larger" to "evaluated smaller" by picking devrocm which is lower than rocm and rocmsdk.
| "--version-suffix", | ||
| default=f"+rocmsdk{formatted_date}", | ||
| help="PyTorch version suffix", | ||
| default=f"+devrocm{formatted_date}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
devrocm
| default=f"+rocmsdk{formatted_date}", | ||
| help="PyTorch version suffix", | ||
| default=f"+devrocm{formatted_date}", | ||
| help="PyTorch version suffix (favor computing with build_tools/github_actions/determine_version.py and passing explicitly)", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from my understanding it was a bit different what we wanted to do:
- if --rocm-sdk-version is given, derive the suffix ourselves (just add
+rocmas a prefix) - if --rocm-sdk-version is not given, extract it from the py venv?
or did i miss here something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I can/should still make that change to remove the default here and infer from the installed packages.
Could move the logic from build_tools/github_actions/determine_version.py to this script, though then we might want to copy/paste the same logic for JAX and other packages...
| # | | ||
| # |--- devrocm sorts older than rocm | ||
|
|
||
| parsed_version = parse(rocm_version) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what does parse() do here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://packaging.pypa.io/en/stable/version.html#packaging.version.parse
I use it pretty interchangeably with the Version() class constructor
…2394) We've had a few developers trigger "nightly" releases that should have been "dev" releases and we're still working on putting more technical barriers in place to prevent that. See for example #2227 (comment) and #2280 (comment). Until then, this adds documentation showing the canonical triggering patterns for "dev" releases and adds stronger language to workflow input descriptions guiding developers towards the safe defaults. See also #2392 that will make misconfigured release builds cause fewer issues.
Motivation
On #2368, we found that a design decision made back on #965 resulted in the default release version suffix
+rocmsdk{formatted_date}sorting to "more recent" than nightly release version suffixes like+rocm7.10.0a20251124.This changes the default and dev release versions to use a lower sorting 'devrocm' version suffix instead of 'rocmsdk'.
Technical Details
See the spec at https://packaging.python.org/en/latest/specifications/version-specifiers/#local-version-identifiers.
Notes on how we use the versions:
When a workflow is triggered without specifying
rocm_version, the default is used since--version-suffixis not set:TheRock/.github/workflows/build_portable_linux_pytorch_wheels.yml
Lines 183 to 188 in 51e83db
A related scenario is possible (and will be more common in the future) when a "dev release" with a version like
torch-2.9.1+rocm7.11.0.dev0.c9ae912c3887ae8538f12a99704fc3973a186e83is installed:More versions to consider:
Test Plan
New unit tests show the sorting behavior.