-
Notifications
You must be signed in to change notification settings - Fork 683
Feat/posix sem migration #2683
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: criu-dev
Are you sure you want to change the base?
Feat/posix sem migration #2683
Conversation
Fixes: checkpoint-restore#2121 Signed-off-by: Pengda Yang <[email protected]>
This patch fixes the following warnings that appear when building an RPM package: + /usr/lib/rpm/redhat/brp-mangle-shebangs *** WARNING: ./usr/src/debug/criu-4.0-1.fc42.x86_64/plugins/amdgpu/amdgpu_plugin_util.c is executable but has no shebang, removing executable bit *** WARNING: ./usr/src/debug/criu-4.0-1.fc42.x86_64/plugins/amdgpu/amdgpu_plugin_util.h is executable but has no shebang, removing executable bit Signed-off-by: Radostin Stoyanov <[email protected]>
By default, CRIU uses the path "/usr/lib/criu" to install and load plugins at runtime. This path is defined by the `PLUGINDIR` variable in Makefile.install and `CR_PLUGIN_DEFAULT` in `criu/include/plugin.h`. However, some distribution packages might install the CRIU plugins at "/usr/lib64/criu" instead. This patch updates the makefile to align the path defined by `CR_PLUGIN_DEFAULT` with the value of `PLUGINDIR`. Signed-off-by: Radostin Stoyanov <[email protected]>
We only use the last pid from the list in NSpid entry (from /proc/<pid>/fdinfo/<pidfd>) while restoring pidfds. The last pid refers to the pid of the process in the most deeply nested pid namespace. Since CRIU does not currently support nested pid namespaces, this entry is the one we want. After Linux 6.9, inode numbers can be used to compare pidfds. pidfds referring to the same process will have the same inode numbers. We use inode numbers to restore pidfds that point to dead processes. Signed-off-by: Bhavik Sachdev <[email protected]>
Process file descriptors (pidfds) were introduced to provide a stable handle on a process. They solve the problem of pid recycling. For a detailed explanation, see https://lwn.net/Articles/801319/ and http://www.corsix.org/content/what-is-a-pidfd Before Linux 6.9, anonymous inodes were used for the implementation of pidfds. So, we detect them in a fashion similiar to other fd types that use anonymous inodes by calling `readlink()`. After 6.9, pidfs (a file system for pidfds) was introduced. In 6.9 `S_ISREG()` returned true for pidfds, but this again changed with 6.10. (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/pidfs.c?h=v6.11-rc2#n285) After this change, pidfs inodes have no file type in st_mode in userspace. We use `PID_FS_MAGIC` to detect pidfds for kernel >= 6.9 Hence, check for pidfds occurs before the check for regular files. For pidfds that refer to dead processes, we lose the pid of the process as the Pid and NSpid fields in /proc/<pid>/fdinfo/<pidfd> change to -1. So, we create a temporary process for each unique inode and open pidfds that refer to this process. After all pidfds have been opened we kill this temporary process. This commit does not include support for pidfds that point to a specific thread, i.e pidfds opened with `PIDFD_THREAD` flag. Fixes: checkpoint-restore#2258 Signed-off-by: Bhavik Sachdev <[email protected]>
Ensures that entries in /proc/<pid>/fdinfo/<pidfd> are same. Signed-off-by: Bhavik Sachdev <[email protected]>
Ensure `pidfd_send_signal()` syscall works as expected after C/R. Signed-off-by: Bhavik Sachdev <[email protected]>
Validate that pidfds can been used to send signals to different processes after C/R using the `pidfd_send_signal()` syscall. Signed-off-by: Bhavik Sachdev <[email protected]>
After, C/R of pidfds that point to dead processes their inodes might change. But if two pidfds point to same dead process they should continue to do so after C/R. This test ensures that this happens by calling `statx()` on pidfds after C/R and then comparing their inode numbers. Support for comparing pidfds by using `statx()` and inode numbers was introduced alongside pidfs. So if `f_type` of pidfd is not equal to `PID_FS_MAGIC` then we skip this test. signed-off-by: Bhavik Sachdev <[email protected]>
We get the read end of a pipe using `pidfd_getfd` and check if we can read from it after C/R. signed-off-by: Bhavik Sachdev <[email protected]>
We open a pidfd to a thread using `PIDFD_THREAD` flag and after C/R ensure that we can send signals using it with `PIDFD_SIGNAL_THREAD`. signed-off-by: Bhavik Sachdev <[email protected]>
The command `ruff <path>` has been deprecated and removed: https://astral.sh/blog/ruff-v0.5.0#removed-deprecated-features Signed-off-by: Radostin Stoyanov <[email protected]>
This patch fixes the following errors reported by ruff:
lib/pycriu/images/pb2dict.py:307:24: E721 Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks
|
305 | elif field.type in _basic_cast:
306 | cast = _basic_cast[field.type]
307 | if pretty and (cast == int):
| ^^^^^^^^^^^ E721
308 | if is_hex:
309 | # Fields that have (criu).hex = true option set
|
lib/pycriu/images/pb2dict.py:379:13: E721 Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks
|
377 | elif field.type in _basic_cast:
378 | cast = _basic_cast[field.type]
379 | if (cast == int) and is_string(value):
| ^^^^^^^^^^^ E721
380 | if _marked_as_dev(field):
381 | return encode_dev(field, value)
|
Signed-off-by: Radostin Stoyanov <[email protected]>
This patch extends the inventory image with a `plugins` field that contains an array of plugins which were used during checkpoint, for example, to save GPU state. In particular, the CUDA and AMDGPU plugins are added to this field only when the checkpoint contains GPU state. This allows to disable unnecessary plugins during restore, show appropriate error messages if required CRIU plugin are missing, and migrate a process that does not use GPU from a GPU-enabled system to CPU-only environment. We use the `optional plugins_entry` for backwards compatibility. This entry allows us to distinguish between *unset* and *missing* field: - When the field is missing, it indicates that the checkpoint was created with a previous version of CRIU, and all plugins should be *enabled* during restore. - When the field is empty, it indicates that no plugins were used during checkpointing. Thus, all plugins can be *disabled* during restore. Signed-off-by: Radostin Stoyanov <[email protected]>
This patch adds two test plugins to verify that CRIU plugins listed in the inventory image are enabled, while those that are not listed can be disabled. Signed-off-by: Radostin Stoyanov <[email protected]>
This patch blocks SIGCHLD during temporary process creation to prevent a race condition between kill() and waitpid() where sigchld_handler() causes `criu restore` to fail with an error. Fixes: checkpoint-restore#2490 Signed-off-by: Bhavik Sachdev <[email protected]> Signed-off-by: Radostin Stoyanov <[email protected]>
Co-authored-by: Yixue Zhao <[email protected]> Co-authored-by: stove <[email protected]> Signed-off-by: Haorong Lu <[email protected]> --- - rebased - imported a page_size() type fix (authored by Cryolitia PukNgae) Signed-off-by: PukNgae Cryolitia <[email protected]> Signed-off-by: Alexander Mikhalitsyn <[email protected]>
Co-authored-by: Yixue Zhao <[email protected]> Co-authored-by: stove <[email protected]> Signed-off-by: Haorong Lu <[email protected]> --- - rebased - added a membarrier() to syscall table (fix authored by Cryolitia PukNgae) Signed-off-by: PukNgae Cryolitia <[email protected]> Signed-off-by: Alexander Mikhalitsyn <[email protected]>
Co-authored-by: Yixue Zhao <[email protected]> Co-authored-by: stove <[email protected]> Signed-off-by: Haorong Lu <[email protected]>
Co-authored-by: Yixue Zhao <[email protected]> Co-authored-by: stove <[email protected]> Signed-off-by: Haorong Lu <[email protected]>
Signed-off-by: Haorong Lu <[email protected]>
Signed-off-by: Haorong Lu <[email protected]>
Link: SerenityOS/serenity@e300da4 Signed-off-by: PukNgae Cryolitia <[email protected]> --- - cherry-picked Signed-off-by: Alexander Mikhalitsyn <[email protected]>
After a fork, both the child and parent processes may trigger a page fault (#PF) at the same virtual address, referencing the same position in the page image. If deduplication is enabled, the last process to trigger the page fault will fail. Therefore, deduplication should be disabled after a fork to prevent this issue. Signed-off-by: Liu Hua <[email protected]>
When restoring dumps in new mount + pid namespaces where multiple dumps share the same network namespace, CRIU may fail due to conflicting unix socket names. This happens because the service worker creates sockets using a pattern that includes criu_run_id, but util_init() is called after cr_service_work() starts. The socket naming pattern "crtools-fd-%d-%d" uses the restore PID and criu_run_id, however criu_run_id is always 0 when not initialized, leading to conflicts when multiple restores run simultaneously either in the same CRIU process or because of multiple CRIU processes doing the same operation in different PID namespaces. Fix this by: - Moving util_init() before cr_service_work() starts - Adding a second util_init() call in the service worker fork to ensure unique IDs across multiple worker runs - Making sure that dump and restore operations have util_init() called early to generate unique socket names With this fix, socket names always include the namespace ID, preventing conflicts when multiple processes with the same pid share a network namespace. Fixes checkpoint-restore#2499 [ avagin: minore code changes ] Signed-off-by: Lorenzo Fontana <[email protected]> Signed-off-by: Andrei Vagin <[email protected]>
When `check_freezer_cgroup()` has non-zero return value, `goto err` calls `return ret`. However, the value of `ret` has been set to `0` in the lines above and CRIU does not handle the error properly. This problem is related to checkpoint-restore#2508 Signed-off-by: Radostin Stoyanov <[email protected]>
Container runtimes like CRI-O and containerd utilize the freezer cgroup to create a consistent snapshot of container root filesystem (rootfs) changes. In this case, the container is frozen before invoking CRIU. After CRIU successfully completes, a copy of the container rootfs diff is saved, and the container is then unfrozen. However, the `cuda-checkpoint` tool is not able to perform a 'lock' action on frozen threads. To support GPU checkpointing with these container runtimes, we need to unfreeze the cgroup and return it to its original state once the checkpointing is complete. To reflect this new behavior, the following changes are applied: - `dont_use_freeze_cgroup(void)` -> `set_compel_interrupt_only_mode(void)` - `bool freeze_cgroup_disabled` -> `bool compel_interrupt_only_mode` - `check_freezer_cgroup(void)` -> `prepare_freezer_for_interrupt_only_mode(void)` Note that when `compel_interrupt_only_mode` is set to `true`, `compel_interrupt_task()` is used instead of `freeze_processes()` to prevent tasks from running during `criu dump`. Fixes: checkpoint-restore#2508 Signed-off-by: Radostin Stoyanov <[email protected]>
Signed-off-by: Radostin Stoyanov <[email protected]>
The check for `/dev/nvidiactl` to determine if the CUDA plugin can be used is unreliable because in some cases the default path for driver installation is different [1]. This patch changes the logic to check if a GPU device is available in `/proc/driver/nvidia/gpus/`. This approach is similar to `torch.cuda.is_available()` and it is a more accurate indicator. The subsequent check for support of the `cuda-checkpoint --action` option would confirm if the driver supports checkpoint/restore. [1] https://github.com/NVIDIA/gpu-operator Fixes: checkpoint-restore#2509 Signed-off-by: Radostin Stoyanov <[email protected]>
Currently, the `waitpid()` call on the tmp process can be made by a process which is not its parent. This causes restore to fail. This patch instead selects one process to create the tmp process and open all the fds that point to it. These fds are sent to the correct process(es). Fixes: checkpoint-restore#2496 Signed-off-by: Andrei Vagin <[email protected]> Signed-off-by: Bhavik Sachdev <[email protected]>
Signed-off-by: Brandon Smith <[email protected]>
…ix semaphores for migration Signed-off-by: Brandon Smith <[email protected]>
Signed-off-by: Brandon Smith <[email protected]>
b794348 to
3c67c0d
Compare
Signed-off-by: Brandon Smith <[email protected]>
Signed-off-by: Brandon Smith <[email protected]>
b66e06f to
30492d6
Compare
| * The semaphore file should already exist from FD restoration. | ||
| * If the VMA doesn't have an fd, we'll try to find semaphore files -> Same node | ||
| */ | ||
| void *addr; |
Check notice
Code scanning / CodeQL
Declaration hides variable
| if (bytes_read == sizeof(uint64_t)) { | ||
| /* Extract value from lower 32 bits */ | ||
| value = (int)(data64 & SEM_VALUE_MASK); | ||
| if (value >= 0 && value <= SEM_VALUE_MAX) { |
Check warning
Code scanning / CodeQL
Comparison result is always the same
| bytes_read = read(fd, &uvalue, sizeof(unsigned int)); | ||
| if (bytes_read == sizeof(unsigned int)) { | ||
| int shifted_value = (int)(uvalue >> SEM_VALUE_SHIFT); | ||
| if (shifted_value >= 0 && shifted_value <= SEM_VALUE_MAX) { |
Check warning
Code scanning / CodeQL
Comparison result is always the same
| bytes_read = read(fd, &uvalue, sizeof(unsigned int)); | ||
| if (bytes_read == sizeof(unsigned int)) { | ||
| int shifted_value = (int)(uvalue >> SEM_VALUE_SHIFT); | ||
| if (shifted_value >= 0 && shifted_value <= SEM_VALUE_MAX) { |
Check warning
Code scanning / CodeQL
Comparison result is always the same
|
|
||
| if (lseek(fd, 0, SEEK_SET) >= 0) { | ||
| bytes_read = read(fd, &int_value, sizeof(int)); | ||
| if (bytes_read == sizeof(int) && int_value >= 0 && int_value <= SEM_VALUE_MAX) { |
Check warning
Code scanning / CodeQL
Comparison result is always the same
30492d6 to
2303327
Compare
fix: unused vma image definition chore: bool opt for posix-sem-migration feature Signed-off-by: Brandon Smith <[email protected]>
2303327 to
350d9dc
Compare
|
@bsmithai I think you might be over-complicating this. First, POSIX semaphores are implemented in libc, and their behavior can vary across different libc implementations. In CRIU, we rely solely on kernel behavior. Second, I think it can be a good idea to dump all deleted shm files as ghost files: |
|
@avagin How does ghost file work w/ libc semaphores since they aren't regular files? I'm using the semaphore api to recreate their state. If CRIU doesn't want libc support that's fine I can put this elsewhere but I do want to c/r semaphore state |
Under the hood, POSIX semaphores are a user-space abstraction. sem_open() opens or creates a /dev/shm file and establishes a shared mapping. If you save/restore the content of this mapping, you will save/restore the semaphore state. Look at this program and its strace output: |
got it thank you |
|
@bsmithai why did you close this PR? Right now, deleted files on tmpfs are not dumped as ghost files because the kernel allows us to use linkat on them. So, we need to figure out the right user interface to explain to CRIU how to properly dump /shm files. |
|
Ahh okay my bad, I misunderstood. Honestly I didn’t read the ghost file
code so I just assumed it worked for tmpfs files. This makes sense
…On Fri, Jun 20, 2025 at 1:24 PM Andrei Vagin ***@***.***> wrote:
*avagin* left a comment (checkpoint-restore/criu#2683)
<#2683 (comment)>
@bsmithai <https://github.com/bsmithai> why did you close this PR? Right
now, deleted files on tmpfs are not dumped as ghost files because the
kernel allows us to use linkat on them. So, we need to figure out the right
user interface to explain to CRIU how to properly dump /shm files.
—
Reply to this email directly, view it on GitHub
<#2683 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/BBXBXMCQDJB4TPJCL6KKTGD3ERGV3AVCNFSM6AAAAAB7IXOSTSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDSOJSGQ3DMMZYGY>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
|
A friendly reminder that this PR had no activity for 30 days. |
Problem:
There's a fairly common problem in testing various training workloads that make use of semaphores in multi process multi gpu training. Forced to use
--link-remapfor all posix semaphores. This is because glibc semaphore library causes the semaphore mapping to be in a deleted state as the temp file w/ a different naming format from the programs original creation of semaphore file in/dev/shm. This feature works fine for same node checkpoint restore but will obviously fail for migration.Solution:
Instead of treating semaphores as regular files, the implementation:
Detection:
proc_parse.cValue extraction
posix-sem.cSerialize
Restore Recreation
posix-sem.cVMA Restore
pie/restorer.cUsage:
criu dump -t PID --posix-sem-migration -D <dir>migrate
criu restore --posix-sem-migration -D <dir>Tested:
CRIU before the patch:
command used:
sudo "$ORIGINAL_CRIU" dump -t $SEM_PID -D "$TEST_DIR_ORIG" -v4 > "$TEST_DIR_ORIG/dump.log" 2>&1CRIU after the patch:
command used:
sudo "$MODIFIED_CRIU" dump -t $SEM_PID -D "$TEST_DIR_MOD" --posix-sem-migration -v4 > "$TEST_DIR_MOD/dump.log" 2>&1And restore: