Feat/posix sem migration #2683

bsmithai · 2025-06-13T17:58:38Z

Problem:

Multi-process, multi-GPU training workloads commonly use POSIX semaphores, and checkpointing them currently requires --link-remap for every semaphore. The reason is that glibc creates a named semaphore through a temporary file in /dev/shm whose name differs from the one the program requested, so the mapping appears under that temporary name in a deleted state. --link-remap works for checkpoint/restore on the same node, but it fails for migration to another host.

Solution:

Instead of treating semaphores as regular files, the implementation:

- Detects POSIX semaphore files and VMAs during checkpoint
- Extracts semaphore values and metadata
- Recreates semaphores with original values during restore

Detection:

proc_parse.c

if (file_path[0] == '/' && strstr(file_path, "/dev/shm/sem.")) {
    vma_area->e->status = VMA_AREA_REGULAR | VMA_AREA_POSIX_SEM;
}
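
Note that the `strstr()` check above matches `/dev/shm/sem.` anywhere in the path, so a file such as `/tmp/dev/shm/sem.x` would also be flagged. A stricter variant could anchor the match at the start of the path. The sketch below is only an illustration: `is_posix_sem_path` and `SEM_PATH_PREFIX` are hypothetical names, not part of this patch, and the prefix reflects glibc's naming convention rather than anything the kernel guarantees.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical constant: glibc keeps named semaphores in /dev/shm and
 * prefixes the file name with "sem.". */
#define SEM_PATH_PREFIX "/dev/shm/sem."

/* Return true only if the path starts with the prefix and has a
 * non-empty name after it. A trailing " (deleted)" tag is tolerated
 * because only the beginning of the path is compared. */
bool is_posix_sem_path(const char *path)
{
	size_t plen = strlen(SEM_PATH_PREFIX);

	if (!path || strncmp(path, SEM_PATH_PREFIX, plen) != 0)
		return false;
	return path[plen] != '\0';
}
```

With this variant, `/dev/shm/sem.ZdOdCM (deleted)` is accepted, while `/tmp/dev/shm/sem.x` and a bare `/dev/shm/sem.` are rejected.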

Value extraction

posix-sem.c

/* Try the POSIX API first, then fall back to reading the file directly.
 * Handles the 64-bit, 32-bit, and legacy formats. */
static int get_semaphore_value_from_fd(int fd, const char *sem_name, bool is_deleted);
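
The direct-read fallback depends on how glibc lays out the semaphore in memory. As a rough sketch, and assuming glibc's internal representation (which is not a public ABI and may change), the counter can be decoded as follows. With 64-bit atomics the value sits in the low 32 bits of the first 8-byte word; otherwise the first 4-byte word holds the value shifted left by one, with bit 0 flagging waiters. The function names and the `DEMO_` constants are hypothetical.

```c
#include <limits.h>
#include <stdint.h>

/* Assumed glibc internals (not a public ABI):
 * - 64-bit layout: value in the low 32 bits, waiter count in the high 32.
 * - 32-bit layout: value stored shifted left by 1, bit 0 flags waiters. */
#define DEMO_SEM_VALUE_MASK  0xffffffffULL
#define DEMO_SEM_VALUE_SHIFT 1

/* Decode the counter from the first 8 bytes of the semaphore file.
 * Returns -1 if the result does not fit the range of an int. */
int sem_value_from_data64(uint64_t data)
{
	uint64_t value = data & DEMO_SEM_VALUE_MASK;

	return value <= (uint64_t)INT_MAX ? (int)value : -1;
}

/* Decode the counter from the first 4 bytes in the 32-bit layout. */
int sem_value_from_data32(uint32_t word)
{
	uint32_t value = word >> DEMO_SEM_VALUE_SHIFT;

	return value <= (uint32_t)INT_MAX ? (int)value : -1;
}
```

Because this reads libc-private data, a real implementation would probably want to prefer `sem_getvalue()` where it is usable and treat the raw decode as a last resort.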

Serialize

pse.name = sem_name;
pse.value = sem_value;
pse.mode = p->stat.st_mode;
pse.uid = p->stat.st_uid;
pse.gid = p->stat.st_gid;

Restore Recreation

posix-sem.c

sem = sem_open(pse->name, O_CREAT | O_EXCL, pse->mode, value);
fd = open(sem_path, O_RDWR);
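
For illustration, the recreation step can be written with nothing but the POSIX API. This is a minimal sketch, not the code from the patch: `recreate_sem` is a hypothetical helper that creates the semaphore with the dumped value and reads it back as a sanity check.

```c
#include <fcntl.h>
#include <semaphore.h>
#include <sys/stat.h>

/* Create a named semaphore that must not exist yet, seed it with the
 * dumped counter, and read the counter back as a sanity check.
 * Returns the value reported by sem_getvalue(), or -1 on any failure.
 * On success the semaphore is left in place (it is not unlinked). */
int recreate_sem(const char *name, mode_t mode, unsigned int value)
{
	sem_t *sem;
	int current = -1;

	sem = sem_open(name, O_CREAT | O_EXCL, mode, value);
	if (sem == SEM_FAILED)
		return -1; /* e.g. EEXIST: the name is already taken */

	if (sem_getvalue(sem, &current) != 0)
		current = -1;

	sem_close(sem);
	return current;
}
```

Because of `O_EXCL`, the call fails when a semaphore with the same name already exists on the destination host, so a stale semaphore is never silently reused with a different value.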

VMA Restore

pie/restorer.c

if (vma_entry_is(vma_entry, VMA_AREA_POSIX_SEM)) {
    addr = sys_mmap(vma_entry->start, vma_entry_len(vma_entry),
                    vma_entry->prot, vma_entry->flags | MAP_FIXED, 
                    sem_fd, vma_entry->pgoff);
}
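
The restorer runs without libc, hence the raw `sys_mmap()` wrapper. The same idea expressed with the libc `mmap()` is sketched below. `map_fd_at` and `demo_map_fd_at` are hypothetical names, and a temporary file stands in for the semaphore file. The point to note is that `MAP_FIXED` replaces whatever is already mapped in the target range, so the caller has to own that range.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for mkstemp(), pread(), MAP_ANONYMOUS under strict -std */
#endif
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Map `len` bytes of `fd` at exactly `addr` as a shared mapping.
 * Returns 0 on success, -1 on failure. */
int map_fd_at(void *addr, size_t len, int prot, int fd, off_t offset)
{
	void *got = mmap(addr, len, prot, MAP_SHARED | MAP_FIXED, fd, offset);

	return got == addr ? 0 : -1;
}

/* Self-check: reserve one page, back it with a temporary file at the
 * same address, and confirm a store through the mapping reaches the file. */
int demo_map_fd_at(void)
{
	size_t len = (size_t)sysconf(_SC_PAGESIZE);
	char path[] = "/tmp/gr_map_XXXXXX";
	char byte = 0;
	void *slot;
	int fd, ret = -1;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path); /* the open fd keeps the file alive */

	if (ftruncate(fd, (off_t)len) != 0)
		goto out_close;

	/* Reserve an address range we own, standing in for the dumped VMA. */
	slot = mmap(NULL, len, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (slot == MAP_FAILED)
		goto out_close;

	if (map_fd_at(slot, len, PROT_READ | PROT_WRITE, fd, 0) == 0) {
		((volatile char *)slot)[0] = 42;
		if (pread(fd, &byte, 1, 0) == 1 && byte == 42)
			ret = 0;
	}
	munmap(slot, len);
out_close:
	close(fd);
	return ret;
}
```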

Usage:

criu dump -t PID --posix-sem-migration -D <dir>
migrate (copy <dir> to the destination host)
criu restore --posix-sem-migration -D <dir>

Tested:

CRIU before the patch:

command used: sudo "$ORIGINAL_CRIU" dump -t $SEM_PID -D "$TEST_DIR_ORIG" -v4 > "$TEST_DIR_ORIG/dump.log" 2>&1

(00.021101) Handling VMA with the following smaps entry: 71d674d91000-71d674d93000 r--p 00000000 fc:01 17565282                   /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
(00.021108) Found regular file mapping, OK
(00.021120) Dumping path for -3 fd via self 12 [/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2]
(00.021137) Handling VMA with the following smaps entry: 71d674d93000-71d674dbd000 r-xp 00002000 fc:01 17565282                   /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
(00.021143) vma 71d674d93000 borrows vfi from previous 71d674d91000
(00.021146) Handling VMA with the following smaps entry: 71d674dbd000-71d674dc8000 r--p 0002c000 fc:01 17565282                   /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
(00.021151) vma 71d674dbd000 borrows vfi from previous 71d674d93000
(00.021154) Handling VMA with the following smaps entry: 71d674dc8000-71d674dc9000 rw-s 00000000 00:1d 281125                     /dev/shm/sem.ZdOdCM (deleted)
(00.021162) Found regular file mapping, OK
(00.021174) Dumping path for -3 fd via self 12 [/dev/shm/sem.ZdOdCM (deleted)]
(00.021179) Strip ' (deleted)' tag from './dev/shm/sem.ZdOdCM (deleted)'
(00.021181) Error (criu/files-reg.c:1122): Can't create link remap for /dev/shm/sem.ZdOdCM. Use link-remap option.
(00.021186) Error (criu/cr-dump.c:1570): Collect mappings (pid: 762439) failed with -1
(00.021227) net: Unlock network
(00.021230) Unfreezing tasks into 1
(00.021232) 	Unseizing 762439 into 1
(00.021241) Error (criu/cr-dump.c:2111): Dumping FAILED.

CRIU after the patch:

command used: sudo "$MODIFIED_CRIU" dump -t $SEM_PID -D "$TEST_DIR_MOD" --posix-sem-migration -v4 > "$TEST_DIR_MOD/dump.log" 2>&1

(00.052742) Handling VMA with the following smaps entry: 71d674d93000-71d674dbd000 r-xp 00002000 fc:01 17565282                   /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
(00.052745) vma 71d674d93000 borrows vfi from previous 71d674d91000
(00.052749) Handling VMA with the following smaps entry: 71d674dbd000-71d674dc8000 r--p 0002c000 fc:01 17565282                   /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
(00.052753) vma 71d674dbd000 borrows vfi from previous 71d674d93000
(00.052755) Handling VMA with the following smaps entry: 71d674dc8000-71d674dc9000 rw-s 00000000 00:1d 281125                     /dev/shm/sem.ZdOdCM (deleted)
(00.052763) Found POSIX semaphore VMA mapping: /dev/shm/sem.ZdOdCM (deleted)
(00.052764) POSIX semaphore migration mode enabled, dumping as object: /dev/shm/sem.ZdOdCM (deleted)
(00.052769) POSIX semaphore VMA mapping for deleted semaphore: /dev/shm/sem.ZdOdCM (deleted)
(00.052772) Detected POSIX semaphore file: dev/shm/sem.ZdOdCM (deleted)
(00.052773) Dumping POSIX semaphore fd 12 with id 0x44a25
(00.052795) Extracted semaphore name: 'ZdOdCM' from path: '/dev/shm/sem.ZdOdCM (deleted)'
(00.052796) Looking for real semaphore name for ino=281125, dev=29
(00.052814) Found real semaphore name 'test_sem_762439' for inode 281125 (path: /dev/shm/sem.test_sem_762439)
(00.052821) Using real semaphore name 'test_sem_762439' (extracted name was 'ZdOdCM')
(00.052823) Attempting to read semaphore value directly from fd 12
(00.052825) Semaphore file size: 32 bytes
(00.052829) Read semaphore value 1 from 64-bit data field (fd 12)
(00.052834) Successfully dumped POSIX semaphore test_sem_762439 (value=1, deleted=yes)
(00.052837) Skipping regular file processing for POSIX semaphore VMA
(00.052840) Handling VMA with the following smaps entry: 71d674dc9000-71d674dcb000 r--p 00037000 fc:01 17565282                   /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
(00.052849) Found regular file mapping, OK
(00.052865) Handling VMA with the following smaps entry: 71d674dcb000-71d674dcd000 rw-p 00039000 fc:01 17565282                   /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
(00.052871) vma 71d674dcb000 borrows vfi from previous 71d674dc9000
(00.052874) Handling VMA with the following smaps entry: 7ffd2dd21000-7ffd2dd43000 rw-p 00000000 00:00 0                          [stack]
(00.052882) Handling VMA with the following smaps entry: ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0                  [vsyscall]
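
The "Extracted semaphore name" step in the log above amounts to stripping the `/dev/shm/sem.` prefix and the ` (deleted)` tag from the mapping path. A minimal sketch of that step follows; `sem_name_from_path` is a hypothetical helper, not the function used in the patch. As the log shows, the name obtained this way is glibc's temporary name (`ZdOdCM`), which is why the patch then looks up the real name by inode.

```c
#include <stddef.h>
#include <string.h>

/* Copy the semaphore name out of a mapping path such as
 * "/dev/shm/sem.ZdOdCM (deleted)" into `out`.
 * Returns 0 on success, -1 if the path is not a semaphore path or the
 * name does not fit in `out` (including the terminating NUL). */
int sem_name_from_path(const char *path, char *out, size_t outlen)
{
	static const char prefix[] = "/dev/shm/sem.";
	static const char tag[] = " (deleted)";
	size_t plen = sizeof(prefix) - 1, tlen = sizeof(tag) - 1;
	size_t nlen;

	if (!path || !out || strncmp(path, prefix, plen) != 0)
		return -1;

	path += plen;
	nlen = strlen(path);
	/* Drop the tag the kernel appends to unlinked paths, if present. */
	if (nlen >= tlen && strcmp(path + nlen - tlen, tag) == 0)
		nlen -= tlen;

	if (nlen == 0 || nlen >= outlen)
		return -1;

	memcpy(out, path, nlen);
	out[nlen] = '\0';
	return 0;
}
```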

And restore:

.
.
(00.000472) Collected POSIX semaphore test_sem_762439 (value=1, id=0x44a25)
.
.
(00.001218) 762439: Opening 0x0071d674d91000-0x0071d674d93000 0000000000000000 (40000041) vma
(00.001229) 762439: Opening 0x0071d674d93000-0x0071d674dbd000 0x00000000002000 (40000041) vma
(00.001233) 762439: Opening 0x0071d674dbd000-0x0071d674dc8000 0x0000000002c000 (40000041) vma
(00.001236) 762439: Opening 0x0071d674dc8000-0x0071d674dc9000 0000000000000000 (11) vma
(00.001237) 762439: Opening POSIX semaphore VMA 71d674dc8000-71d674dc9000 (shmid=44a25)
(00.001239) 762439: Found matching POSIX semaphore test_sem_762439 for VMA (ino=44a25)
(00.001242) 762439: Restoring POSIX semaphore test_sem_762439 (value=1) for cross-host migration
(00.001277) 762439: Successfully created POSIX semaphore test_sem_762439 with initial value 1
(00.001282) 762439: Semaphore file /dev/shm/sem.test_sem_762439: mode=0100644, uid=0, gid=0
(00.001290) 762439: Successfully restored POSIX semaphore test_sem_762439 with fd 6 (cross-host migration)
(00.001293) 762439: Assigned fd 6 to POSIX semaphore VMA 71d674dc8000-71d674dc9000
(00.001295) 762439: Opening 0x0071d674dc9000-0x0071d674dcb000 0x00000000037000 (41) vma
(00.001296) 762439: Opening 0x0071d674dcb000-0x0071d674dcd000 0x00000000039000 (41) vma
.
.
(00.001848) pie: 762439: 	mmap(0x71d674d93000 -> 0x71d674dbd000, 0x5 0x12 5)
(00.001852) pie: 762439: 	mmap(0x71d674dbd000 -> 0x71d674dc8000, 0x1 0x12 5)
(00.001855) pie: 762439: Restoring POSIX semaphore VMA at 0x71d674dc8000-0x71d674dc9000 (fd=6)
(00.001860) pie: 762439: Successfully mapped semaphore VMA 0x71d674dc8000-0x71d674dc9000 (fd=6)
(00.001863) pie: 762439: 	mmap(0x71d674dc9000 -> 0x71d674dcb000, 0x3 0x12 5)
(00.001867) pie: 762439: 	mmap(0x71d674dcb000 -> 0x71d674dcd000, 0x3 0x12 5)
.
.

Fixes: checkpoint-restore#2121 Signed-off-by: Pengda Yang <[email protected]>

This patch fixes the following warnings that appear when building an RPM package: + /usr/lib/rpm/redhat/brp-mangle-shebangs *** WARNING: ./usr/src/debug/criu-4.0-1.fc42.x86_64/plugins/amdgpu/amdgpu_plugin_util.c is executable but has no shebang, removing executable bit *** WARNING: ./usr/src/debug/criu-4.0-1.fc42.x86_64/plugins/amdgpu/amdgpu_plugin_util.h is executable but has no shebang, removing executable bit Signed-off-by: Radostin Stoyanov <[email protected]>

By default, CRIU uses the path "/usr/lib/criu" to install and load plugins at runtime. This path is defined by the `PLUGINDIR` variable in Makefile.install and `CR_PLUGIN_DEFAULT` in `criu/include/plugin.h`. However, some distribution packages might install the CRIU plugins at "/usr/lib64/criu" instead. This patch updates the makefile to align the path defined by `CR_PLUGIN_DEFAULT` with the value of `PLUGINDIR`. Signed-off-by: Radostin Stoyanov <[email protected]>

We only use the last pid from the list in NSpid entry (from /proc/<pid>/fdinfo/<pidfd>) while restoring pidfds. The last pid refers to the pid of the process in the most deeply nested pid namespace. Since CRIU does not currently support nested pid namespaces, this entry is the one we want. After Linux 6.9, inode numbers can be used to compare pidfds. pidfds referring to the same process will have the same inode numbers. We use inode numbers to restore pidfds that point to dead processes. Signed-off-by: Bhavik Sachdev <[email protected]>

Process file descriptors (pidfds) were introduced to provide a stable handle on a process. They solve the problem of pid recycling. For a detailed explanation, see https://lwn.net/Articles/801319/ and http://www.corsix.org/content/what-is-a-pidfd
Before Linux 6.9, anonymous inodes were used for the implementation of pidfds. So, we detect them in a fashion similar to other fd types that use anonymous inodes by calling `readlink()`. After 6.9, pidfs (a file system for pidfds) was introduced. In 6.9 `S_ISREG()` returned true for pidfds, but this again changed with 6.10. (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/pidfs.c?h=v6.11-rc2#n285) After this change, pidfs inodes have no file type in st_mode in userspace. We use `PID_FS_MAGIC` to detect pidfds for kernel >= 6.9. Hence, the check for pidfds occurs before the check for regular files.
For pidfds that refer to dead processes, we lose the pid of the process as the Pid and NSpid fields in /proc/<pid>/fdinfo/<pidfd> change to -1. So, we create a temporary process for each unique inode and open pidfds that refer to this process. After all pidfds have been opened we kill this temporary process.
This commit does not include support for pidfds that point to a specific thread, i.e. pidfds opened with the `PIDFD_THREAD` flag.
Fixes: checkpoint-restore#2258
Signed-off-by: Bhavik Sachdev <[email protected]>

Ensures that entries in /proc/<pid>/fdinfo/<pidfd> are same. Signed-off-by: Bhavik Sachdev <[email protected]>

Ensure `pidfd_send_signal()` syscall works as expected after C/R. Signed-off-by: Bhavik Sachdev <[email protected]>

Validate that pidfds can be used to send signals to different processes after C/R using the `pidfd_send_signal()` syscall. Signed-off-by: Bhavik Sachdev <[email protected]>

After C/R of pidfds that point to dead processes, their inodes might change. But if two pidfds point to the same dead process, they should continue to do so after C/R. This test ensures that this happens by calling `statx()` on pidfds after C/R and then comparing their inode numbers. Support for comparing pidfds by using `statx()` and inode numbers was introduced alongside pidfs. So if `f_type` of pidfd is not equal to `PID_FS_MAGIC` then we skip this test. signed-off-by: Bhavik Sachdev <[email protected]>

We get the read end of a pipe using `pidfd_getfd` and check if we can read from it after C/R. signed-off-by: Bhavik Sachdev <[email protected]>

We open a pidfd to a thread using `PIDFD_THREAD` flag and after C/R ensure that we can send signals using it with `PIDFD_SIGNAL_THREAD`. signed-off-by: Bhavik Sachdev <[email protected]>

The command `ruff <path>` has been deprecated and removed: https://astral.sh/blog/ruff-v0.5.0#removed-deprecated-features Signed-off-by: Radostin Stoyanov <[email protected]>

This patch fixes the following errors reported by ruff:
lib/pycriu/images/pb2dict.py:307:24: E721 Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks
    |
305 | elif field.type in _basic_cast:
306 |     cast = _basic_cast[field.type]
307 |     if pretty and (cast == int):
    |                    ^^^^^^^^^^^ E721
308 |         if is_hex:
309 |             # Fields that have (criu).hex = true option set
    |
lib/pycriu/images/pb2dict.py:379:13: E721 Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks
    |
377 | elif field.type in _basic_cast:
378 |     cast = _basic_cast[field.type]
379 |     if (cast == int) and is_string(value):
    |         ^^^^^^^^^^^ E721
380 |         if _marked_as_dev(field):
381 |             return encode_dev(field, value)
    |
Signed-off-by: Radostin Stoyanov <[email protected]>

This patch extends the inventory image with a `plugins` field that contains an array of plugins which were used during checkpoint, for example, to save GPU state. In particular, the CUDA and AMDGPU plugins are added to this field only when the checkpoint contains GPU state. This makes it possible to disable unnecessary plugins during restore, show appropriate error messages if required CRIU plugins are missing, and migrate a process that does not use a GPU from a GPU-enabled system to a CPU-only environment.
We use the `optional plugins_entry` for backwards compatibility. This entry allows us to distinguish between an *empty* and a *missing* field:
- When the field is missing, it indicates that the checkpoint was created with a previous version of CRIU, and all plugins should be *enabled* during restore.
- When the field is empty, it indicates that no plugins were used during checkpointing. Thus, all plugins can be *disabled* during restore.
Signed-off-by: Radostin Stoyanov <[email protected]>

This patch adds two test plugins to verify that CRIU plugins listed in the inventory image are enabled, while those that are not listed can be disabled. Signed-off-by: Radostin Stoyanov <[email protected]>

This patch blocks SIGCHLD during temporary process creation to prevent a race condition between kill() and waitpid() where sigchld_handler() causes `criu restore` to fail with an error. Fixes: checkpoint-restore#2490 Signed-off-by: Bhavik Sachdev <[email protected]> Signed-off-by: Radostin Stoyanov <[email protected]>

Co-authored-by: Yixue Zhao <[email protected]> Co-authored-by: stove <[email protected]> Signed-off-by: Haorong Lu <[email protected]> --- - rebased - imported a page_size() type fix (authored by Cryolitia PukNgae) Signed-off-by: PukNgae Cryolitia <[email protected]> Signed-off-by: Alexander Mikhalitsyn <[email protected]>

Co-authored-by: Yixue Zhao <[email protected]> Co-authored-by: stove <[email protected]> Signed-off-by: Haorong Lu <[email protected]> --- - rebased - added a membarrier() to syscall table (fix authored by Cryolitia PukNgae) Signed-off-by: PukNgae Cryolitia <[email protected]> Signed-off-by: Alexander Mikhalitsyn <[email protected]>

Co-authored-by: Yixue Zhao <[email protected]> Co-authored-by: stove <[email protected]> Signed-off-by: Haorong Lu <[email protected]>

Signed-off-by: Haorong Lu <[email protected]>

Link: SerenityOS/serenity@e300da4 Signed-off-by: PukNgae Cryolitia <[email protected]> --- - cherry-picked Signed-off-by: Alexander Mikhalitsyn <[email protected]>

After a fork, both the child and parent processes may trigger a page fault (#PF) at the same virtual address, referencing the same position in the page image. If deduplication is enabled, the last process to trigger the page fault will fail. Therefore, deduplication should be disabled after a fork to prevent this issue. Signed-off-by: Liu Hua <[email protected]>

When restoring dumps in new mount + pid namespaces where multiple dumps share the same network namespace, CRIU may fail due to conflicting unix socket names. This happens because the service worker creates sockets using a pattern that includes criu_run_id, but util_init() is called after cr_service_work() starts. The socket naming pattern "crtools-fd-%d-%d" uses the restore PID and criu_run_id; however, criu_run_id is always 0 when not initialized, leading to conflicts when multiple restores run simultaneously, either in the same CRIU process or because multiple CRIU processes are doing the same operation in different PID namespaces.
Fix this by:
- Moving util_init() before cr_service_work() starts
- Adding a second util_init() call in the service worker fork to ensure unique IDs across multiple worker runs
- Making sure that dump and restore operations have util_init() called early to generate unique socket names
With this fix, socket names always include the namespace ID, preventing conflicts when multiple processes with the same pid share a network namespace.
Fixes checkpoint-restore#2499
[ avagin: minor code changes ]
Signed-off-by: Lorenzo Fontana <[email protected]>
Signed-off-by: Andrei Vagin <[email protected]>

When `check_freezer_cgroup()` has non-zero return value, `goto err` calls `return ret`. However, the value of `ret` has been set to `0` in the lines above and CRIU does not handle the error properly. This problem is related to checkpoint-restore#2508 Signed-off-by: Radostin Stoyanov <[email protected]>

Container runtimes like CRI-O and containerd utilize the freezer cgroup to create a consistent snapshot of container root filesystem (rootfs) changes. In this case, the container is frozen before invoking CRIU. After CRIU successfully completes, a copy of the container rootfs diff is saved, and the container is then unfrozen. However, the `cuda-checkpoint` tool is not able to perform a 'lock' action on frozen threads. To support GPU checkpointing with these container runtimes, we need to unfreeze the cgroup and return it to its original state once the checkpointing is complete.
To reflect this new behavior, the following changes are applied:
- `dont_use_freeze_cgroup(void)` -> `set_compel_interrupt_only_mode(void)`
- `bool freeze_cgroup_disabled` -> `bool compel_interrupt_only_mode`
- `check_freezer_cgroup(void)` -> `prepare_freezer_for_interrupt_only_mode(void)`
Note that when `compel_interrupt_only_mode` is set to `true`, `compel_interrupt_task()` is used instead of `freeze_processes()` to prevent tasks from running during `criu dump`.
Fixes: checkpoint-restore#2508
Signed-off-by: Radostin Stoyanov <[email protected]>

Signed-off-by: Radostin Stoyanov <[email protected]>

The check for `/dev/nvidiactl` to determine if the CUDA plugin can be used is unreliable because in some cases the default path for driver installation is different [1]. This patch changes the logic to check if a GPU device is available in `/proc/driver/nvidia/gpus/`. This approach is similar to `torch.cuda.is_available()` and it is a more accurate indicator. The subsequent check for support of the `cuda-checkpoint --action` option would confirm if the driver supports checkpoint/restore. [1] https://github.com/NVIDIA/gpu-operator Fixes: checkpoint-restore#2509 Signed-off-by: Radostin Stoyanov <[email protected]>

Currently, the `waitpid()` call on the tmp process can be made by a process which is not its parent. This causes restore to fail. This patch instead selects one process to create the tmp process and open all the fds that point to it. These fds are sent to the correct process(es). Fixes: checkpoint-restore#2496 Signed-off-by: Andrei Vagin <[email protected]> Signed-off-by: Bhavik Sachdev <[email protected]>

Signed-off-by: Brandon Smith <[email protected]>

…ix semaphores for migration Signed-off-by: Brandon Smith <[email protected]>

Signed-off-by: Brandon Smith <[email protected]>

criu/config.c

criu/include/image.h

criu/pie/restorer.c

+		 * The semaphore file should already exist from FD restoration.
+		 * If the VMA doesn't have an fd, we'll try to find semaphore files -> Same node
+		 */
+		void *addr;


criu/posix-sem.c

+			if (bytes_read == sizeof(uint64_t)) {
+				/* Extract value from lower 32 bits */
+				value = (int)(data64 & SEM_VALUE_MASK);
+				if (value >= 0 && value <= SEM_VALUE_MAX) {


criu/posix-sem.c

+			bytes_read = read(fd, &uvalue, sizeof(unsigned int));
+			if (bytes_read == sizeof(unsigned int)) {
+				int shifted_value = (int)(uvalue >> SEM_VALUE_SHIFT);
+				if (shifted_value >= 0 && shifted_value <= SEM_VALUE_MAX) {


criu/posix-sem.c

+			bytes_read = read(fd, &uvalue, sizeof(unsigned int));
+			if (bytes_read == sizeof(unsigned int)) {
+				int shifted_value = (int)(uvalue >> SEM_VALUE_SHIFT);
+				if (shifted_value >= 0 && shifted_value <= SEM_VALUE_MAX) {


criu/posix-sem.c

+
+		if (lseek(fd, 0, SEEK_SET) >= 0) {
+			bytes_read = read(fd, &int_value, sizeof(int));
+			if (bytes_read == sizeof(int) && int_value >= 0 && int_value <= SEM_VALUE_MAX) {


fix: unused vma image definition chore: bool opt for posix-sem-migration feature Signed-off-by: Brandon Smith <[email protected]>

avagin · 2025-06-18T16:29:15Z

@bsmithai I think you might be over-complicating this. First, POSIX semaphores are implemented in libc, and their behavior can vary across different libc implementations. In CRIU, we rely solely on kernel behavior. Second, I think it can be a good idea to dump all deleted shm files as ghost files:
https://github.com/checkpoint-restore/criu/blob/criu-dev/criu/files-reg.c#L1022

bsmithai · 2025-06-18T16:34:46Z

@avagin How does ghost file work w/ libc semaphores since they aren't regular files? I'm using the semaphore api to recreate their state. If CRIU doesn't want libc support that's fine I can put this elsewhere but I do want to c/r semaphore state

avagin · 2025-06-18T20:20:50Z

Under the hood, POSIX semaphores are a user-space abstraction. sem_open() opens or creates a /dev/shm file and establishes a shared mapping. If you save/restore the content of this mapping, you will save/restore the semaphore state.

Look at this program and its strace output:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>      
#include <sys/mman.h>   
#include <sys/stat.h>   
#include <semaphore.h>  
#include <unistd.h>     

#define SEM_NAME "/my_sem_eяxample"

int main() {
    int shm_fd;    
    volatile int *data;
    sem_t *sem;

    sem = sem_open(SEM_NAME, O_CREAT, 0666, 0);
    if (sem == SEM_FAILED) {
      return 1;
    }
    sem_init(sem, 1, 0);
    //sem_unlink(SEM_NAME);

    if (fork() == 0) {
        sem_post(sem);
        return 0;
    }
    sem_wait(sem);
    return 0;
}

openat(AT_FDCWD, "/dev/shm/sem.my_sem_e\321\217xample", O_RDWR|O_NOFOLLOW|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0640, st_size=32, ...}) = 0
getrandom("\x21\xb4\x29\xb5\x19\x89\x1b\x65", 8, GRND_NONBLOCK) = 8
brk(NULL)                               = 0x55d42208d000
brk(0x55d4220ae000)                     = 0x55d4220ae000
mmap(NULL, 32, PROT_READ|PROT_WRITE, MAP_SHARED, 3, 0) = 0x7f7bce617000
close(3)                                = 0
rt_sigprocmask(SIG_BLOCK, ~[], [], 8)   = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLDstrace: Process 934950 attached
, child_tidptr=0x7f7bce408a10) = 934950
[pid 934950] set_robust_list(0x7f7bce408a20, 24 <unfinished ...>
[pid 934949] rt_sigprocmask(SIG_SETMASK, [],  <unfinished ...>
[pid 934950] <... set_robust_list resumed>) = 0
[pid 934949] <... rt_sigprocmask resumed>NULL, 8) = 0
[pid 934949] futex(0x7f7bce617000, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 934950] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
[pid 934950] futex(0x7f7bce617000, FUTEX_WAKE, 1 <unfinished ...>
[pid 934949] <... futex resumed>)       = 0
[pid 934950] <... futex resumed>)       = 1
[pid 934949] exit_group(0 <unfinished ...>
[pid 934950] exit_group(0 <unfinished ...>
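
The claim that the semaphore state is just the content of the mapping can be checked with a small experiment: initialise a semaphore, copy its raw bytes elsewhere, and read the counter from the copy. This is only an illustration under an assumption. POSIX leaves the use of copies of a `sem_t` undefined, so the sketch relies on the libc keeping the whole state inside `sem_t`, as glibc does. `sem_value_survives_copy` is a hypothetical name.

```c
#include <semaphore.h>
#include <string.h>

/* Initialise a semaphore with `value`, copy its raw bytes into a second
 * sem_t, and return the counter read from the copy (or -1 on error).
 * Assumption: the libc keeps the whole semaphore state inside sem_t,
 * which holds for glibc; POSIX itself does not promise this. */
int sem_value_survives_copy(unsigned int value)
{
	sem_t original, copy;
	int seen = -1;

	if (sem_init(&original, 1, value) != 0)
		return -1;

	/* Stands in for dumping and restoring the bytes of the mapping. */
	memcpy(&copy, &original, sizeof(copy));

	if (sem_getvalue(&copy, &seen) != 0)
		seen = -1;

	sem_destroy(&original);
	return seen;
}
```

In a real checkpoint the bytes are restored at the same address and into a shared mapping of the same file, which is a weaker requirement than the copy to a new address made here.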

bsmithai · 2025-06-18T20:46:17Z

got it thank you

avagin · 2025-06-20T18:24:07Z

@bsmithai why did you close this PR? Right now, deleted files on tmpfs are not dumped as ghost files because the kernel allows us to use linkat on them. So, we need to figure out the right user interface to explain to CRIU how to properly dump /shm files.

bsmithai · 2025-06-20T18:37:04Z

Ahh okay my bad, I misunderstood. Honestly I didn’t read the ghost file code so I just assumed it worked for tmpfs files. This makes sense


github-actions · 2025-07-24T00:10:17Z

A friendly reminder that this PR had no activity for 30 days.

Daz-3ux and others added 30 commits October 26, 2024 22:17

limit the field width of 'scanf'

810f52e

Fixes: checkpoint-restore#2121 Signed-off-by: Pengda Yang <[email protected]>

zdtm: Check pidfd fdinfo entry is consistent

f9fdfca

Ensures that entries in /proc/<pid>/fdinfo/<pidfd> are same. Signed-off-by: Bhavik Sachdev <[email protected]>

zdtm: Check pidfd can send signal after C/R

487853f

Ensure `pidfd_send_signal()` syscall works as expected after C/R. Signed-off-by: Bhavik Sachdev <[email protected]>

zdtm: Check pidfd can kill descendant processes

032a822

Validate that pidfds can be used to send signals to different processes after C/R using the `pidfd_send_signal()` syscall. Signed-off-by: Bhavik Sachdev <[email protected]>

zdtm: Check fd from pidfd_getfd is C/Red correctly

98c49d0

We get the read end of a pipe using `pidfd_getfd` and check if we can read from it after C/R. signed-off-by: Bhavik Sachdev <[email protected]>

zdtm: Check pidfd for thread is valid after C/R

b88d40e

We open a pidfd to a thread using `PIDFD_THREAD` flag and after C/R ensure that we can send signals using it with `PIDFD_SIGNAL_THREAD`. signed-off-by: Bhavik Sachdev <[email protected]>

make/lint: use 'ruff check <path>'

0e78014

The command `ruff <path>` has been deprecated and removed: https://astral.sh/blog/ruff-v0.5.0#removed-deprecated-features Signed-off-by: Radostin Stoyanov <[email protected]>

zdtm: add inventory test plugins

e6ce8f4

This patch adds two test plugins to verify that CRIU plugins listed in the inventory image are enabled, while those that are not listed can be disabled. Signed-off-by: Radostin Stoyanov <[email protected]>

images: add riscv64 core image

1a42f63

Co-authored-by: Yixue Zhao <[email protected]> Co-authored-by: stove <[email protected]> Signed-off-by: Haorong Lu <[email protected]>

criu: add riscv64 support to parasite and restorer

35b3077

Co-authored-by: Yixue Zhao <[email protected]> Co-authored-by: stove <[email protected]> Signed-off-by: Haorong Lu <[email protected]>

zdtm: add riscv64 support

6636782

Signed-off-by: Haorong Lu <[email protected]>

ci: add workflow for riscv64

9863769

Signed-off-by: Haorong Lu <[email protected]>

include: don't use GCC's __builtin_ffs on riscv64

f6baf81

Link: SerenityOS/serenity@e300da4 Signed-off-by: PukNgae Cryolitia <[email protected]> --- - cherry-picked Signed-off-by: Alexander Mikhalitsyn <[email protected]>

ci: test interrupt-only mode with frozen cgroup

31b38d6

Signed-off-by: Radostin Stoyanov <[email protected]>

bsmithai added 3 commits June 13, 2025 18:44

init: posix sem state

191c488

Signed-off-by: Brandon Smith <[email protected]>

feat: --posix-sem-migration, proper c/r and vma handling of glibc pos…

3aa881d

…ix semaphores for migration Signed-off-by: Brandon Smith <[email protected]>

fix: proper support of vma mapping for restored posix semaphore

3c67c0d

Signed-off-by: Brandon Smith <[email protected]>

bsmithai force-pushed the feat/posix-sem-migration branch from b794348 to 3c67c0d Compare June 13, 2025 23:44

bsmithai added 2 commits June 13, 2025 19:21

fix: broken aioring

843f7de

Signed-off-by: Brandon Smith <[email protected]>

fix: proper filesystem name of original semaphore (not the temp)

2f2046e

Signed-off-by: Brandon Smith <[email protected]>

bsmithai mentioned this pull request Jun 14, 2025

Feat/posix sem migration cedana/criu#4

Closed

bsmithai force-pushed the feat/posix-sem-migration branch 2 times, most recently from b66e06f to 30492d6 Compare June 14, 2025 04:47

adrianreber reviewed Jun 14, 2025

criu/config.c (outdated, resolved)

criu/include/image.h (outdated, resolved)

criu/pie/restorer.c (resolved)

github-advanced-security bot found potential problems Jun 14, 2025

bsmithai force-pushed the feat/posix-sem-migration branch from 30492d6 to 2303327 Compare June 16, 2025 20:02

feat: add zdtm test for posix sem c/r

350d9dc

fix: unused vma image definition chore: bool opt for posix-sem-migration feature Signed-off-by: Brandon Smith <[email protected]>

bsmithai force-pushed the feat/posix-sem-migration branch from 2303327 to 350d9dc Compare June 16, 2025 20:30

bsmithai requested a review from adrianreber June 16, 2025 20:32

bsmithai mentioned this pull request Jun 17, 2025

fix: link_remap files get left over on failed dumps, unique file names to avoid collisions #2681

Open

rst0git self-assigned this Jun 17, 2025

avagin self-assigned this Jun 18, 2025

avagin marked this pull request as draft June 18, 2025 05:45

bsmithai closed this Jun 18, 2025

bsmithai reopened this Jun 23, 2025

github-actions bot added the stale-pr label Jul 24, 2025

avagin force-pushed the criu-dev branch from a1fd7e6 to 9204372 Compare November 14, 2025 18:30
