-
Notifications
You must be signed in to change notification settings - Fork 681
restore: do mount proc in usernsd on certain conditions #2600
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: criu-dev
Are you sure you want to change the base?
Conversation
Fixes: checkpoint-restore#2121 Signed-off-by: Pengda Yang <[email protected]>
This patch fixes the following warnings that appear when building an RPM package: + /usr/lib/rpm/redhat/brp-mangle-shebangs *** WARNING: ./usr/src/debug/criu-4.0-1.fc42.x86_64/plugins/amdgpu/amdgpu_plugin_util.c is executable but has no shebang, removing executable bit *** WARNING: ./usr/src/debug/criu-4.0-1.fc42.x86_64/plugins/amdgpu/amdgpu_plugin_util.h is executable but has no shebang, removing executable bit Signed-off-by: Radostin Stoyanov <[email protected]>
By default, CRIU uses the path "/usr/lib/criu" to install and load plugins at runtime. This path is defined by the `PLUGINDIR` variable in Makefile.install and `CR_PLUGIN_DEFAULT` in `criu/include/plugin.h`. However, some distribution packages might install the CRIU plugins at "/usr/lib64/criu" instead. This patch updates the makefile to align the path defined by `CR_PLUGIN_DEFAULT` with the value of `PLUGINDIR`. Signed-off-by: Radostin Stoyanov <[email protected]>
We only use the last pid from the list in NSpid entry (from /proc/<pid>/fdinfo/<pidfd>) while restoring pidfds. The last pid refers to the pid of the process in the most deeply nested pid namespace. Since CRIU does not currently support nested pid namespaces, this entry is the one we want. After Linux 6.9, inode numbers can be used to compare pidfds. pidfds referring to the same process will have the same inode numbers. We use inode numbers to restore pidfds that point to dead processes. Signed-off-by: Bhavik Sachdev <[email protected]>
Process file descriptors (pidfds) were introduced to provide a stable handle on a process. They solve the problem of pid recycling. For a detailed explanation, see https://lwn.net/Articles/801319/ and http://www.corsix.org/content/what-is-a-pidfd Before Linux 6.9, anonymous inodes were used for the implementation of pidfds. So, we detect them in a fashion similiar to other fd types that use anonymous inodes by calling `readlink()`. After 6.9, pidfs (a file system for pidfds) was introduced. In 6.9 `S_ISREG()` returned true for pidfds, but this again changed with 6.10. (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/pidfs.c?h=v6.11-rc2#n285) After this change, pidfs inodes have no file type in st_mode in userspace. We use `PID_FS_MAGIC` to detect pidfds for kernel >= 6.9 Hence, check for pidfds occurs before the check for regular files. For pidfds that refer to dead processes, we lose the pid of the process as the Pid and NSpid fields in /proc/<pid>/fdinfo/<pidfd> change to -1. So, we create a temporary process for each unique inode and open pidfds that refer to this process. After all pidfds have been opened we kill this temporary process. This commit does not include support for pidfds that point to a specific thread, i.e pidfds opened with `PIDFD_THREAD` flag. Fixes: checkpoint-restore#2258 Signed-off-by: Bhavik Sachdev <[email protected]>
Ensures that entries in /proc/<pid>/fdinfo/<pidfd> are same. Signed-off-by: Bhavik Sachdev <[email protected]>
Ensure `pidfd_send_signal()` syscall works as expected after C/R. Signed-off-by: Bhavik Sachdev <[email protected]>
Validate that pidfds can been used to send signals to different processes after C/R using the `pidfd_send_signal()` syscall. Signed-off-by: Bhavik Sachdev <[email protected]>
After, C/R of pidfds that point to dead processes their inodes might change. But if two pidfds point to same dead process they should continue to do so after C/R. This test ensures that this happens by calling `statx()` on pidfds after C/R and then comparing their inode numbers. Support for comparing pidfds by using `statx()` and inode numbers was introduced alongside pidfs. So if `f_type` of pidfd is not equal to `PID_FS_MAGIC` then we skip this test. signed-off-by: Bhavik Sachdev <[email protected]>
We get the read end of a pipe using `pidfd_getfd` and check if we can read from it after C/R. signed-off-by: Bhavik Sachdev <[email protected]>
We open a pidfd to a thread using `PIDFD_THREAD` flag and after C/R ensure that we can send signals using it with `PIDFD_SIGNAL_THREAD`. signed-off-by: Bhavik Sachdev <[email protected]>
The command `ruff <path>` has been deprecated and removed: https://astral.sh/blog/ruff-v0.5.0#removed-deprecated-features Signed-off-by: Radostin Stoyanov <[email protected]>
This patch fixes the following errors reported by ruff:
lib/pycriu/images/pb2dict.py:307:24: E721 Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks
|
305 | elif field.type in _basic_cast:
306 | cast = _basic_cast[field.type]
307 | if pretty and (cast == int):
| ^^^^^^^^^^^ E721
308 | if is_hex:
309 | # Fields that have (criu).hex = true option set
|
lib/pycriu/images/pb2dict.py:379:13: E721 Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks
|
377 | elif field.type in _basic_cast:
378 | cast = _basic_cast[field.type]
379 | if (cast == int) and is_string(value):
| ^^^^^^^^^^^ E721
380 | if _marked_as_dev(field):
381 | return encode_dev(field, value)
|
Signed-off-by: Radostin Stoyanov <[email protected]>
This patch extends the inventory image with a `plugins` field that contains an array of plugins which were used during checkpoint, for example, to save GPU state. In particular, the CUDA and AMDGPU plugins are added to this field only when the checkpoint contains GPU state. This allows to disable unnecessary plugins during restore, show appropriate error messages if required CRIU plugin are missing, and migrate a process that does not use GPU from a GPU-enabled system to CPU-only environment. We use the `optional plugins_entry` for backwards compatibility. This entry allows us to distinguish between *unset* and *missing* field: - When the field is missing, it indicates that the checkpoint was created with a previous version of CRIU, and all plugins should be *enabled* during restore. - When the field is empty, it indicates that no plugins were used during checkpointing. Thus, all plugins can be *disabled* during restore. Signed-off-by: Radostin Stoyanov <[email protected]>
This patch adds two test plugins to verify that CRIU plugins listed in the inventory image are enabled, while those that are not listed can be disabled. Signed-off-by: Radostin Stoyanov <[email protected]>
This patch blocks SIGCHLD during temporary process creation to prevent a race condition between kill() and waitpid() where sigchld_handler() causes `criu restore` to fail with an error. Fixes: checkpoint-restore#2490 Signed-off-by: Bhavik Sachdev <[email protected]> Signed-off-by: Radostin Stoyanov <[email protected]>
Co-authored-by: Yixue Zhao <[email protected]> Co-authored-by: stove <[email protected]> Signed-off-by: Haorong Lu <[email protected]> --- - rebased - imported a page_size() type fix (authored by Cryolitia PukNgae) Signed-off-by: PukNgae Cryolitia <[email protected]> Signed-off-by: Alexander Mikhalitsyn <[email protected]>
Co-authored-by: Yixue Zhao <[email protected]> Co-authored-by: stove <[email protected]> Signed-off-by: Haorong Lu <[email protected]> --- - rebased - added a membarrier() to syscall table (fix authored by Cryolitia PukNgae) Signed-off-by: PukNgae Cryolitia <[email protected]> Signed-off-by: Alexander Mikhalitsyn <[email protected]>
Co-authored-by: Yixue Zhao <[email protected]> Co-authored-by: stove <[email protected]> Signed-off-by: Haorong Lu <[email protected]>
Co-authored-by: Yixue Zhao <[email protected]> Co-authored-by: stove <[email protected]> Signed-off-by: Haorong Lu <[email protected]>
Signed-off-by: Haorong Lu <[email protected]>
Signed-off-by: Haorong Lu <[email protected]>
Link: SerenityOS/serenity@e300da4 Signed-off-by: PukNgae Cryolitia <[email protected]> --- - cherry-picked Signed-off-by: Alexander Mikhalitsyn <[email protected]>
After a fork, both the child and parent processes may trigger a page fault (#PF) at the same virtual address, referencing the same position in the page image. If deduplication is enabled, the last process to trigger the page fault will fail. Therefore, deduplication should be disabled after a fork to prevent this issue. Signed-off-by: Liu Hua <[email protected]>
When restoring dumps in new mount + pid namespaces where multiple dumps share the same network namespace, CRIU may fail due to conflicting unix socket names. This happens because the service worker creates sockets using a pattern that includes criu_run_id, but util_init() is called after cr_service_work() starts. The socket naming pattern "crtools-fd-%d-%d" uses the restore PID and criu_run_id, however criu_run_id is always 0 when not initialized, leading to conflicts when multiple restores run simultaneously either in the same CRIU process or because of multiple CRIU processes doing the same operation in different PID namespaces. Fix this by: - Moving util_init() before cr_service_work() starts - Adding a second util_init() call in the service worker fork to ensure unique IDs across multiple worker runs - Making sure that dump and restore operations have util_init() called early to generate unique socket names With this fix, socket names always include the namespace ID, preventing conflicts when multiple processes with the same pid share a network namespace. Fixes checkpoint-restore#2499 [ avagin: minore code changes ] Signed-off-by: Lorenzo Fontana <[email protected]> Signed-off-by: Andrei Vagin <[email protected]>
When `check_freezer_cgroup()` has non-zero return value, `goto err` calls `return ret`. However, the value of `ret` has been set to `0` in the lines above and CRIU does not handle the error properly. This problem is related to checkpoint-restore#2508 Signed-off-by: Radostin Stoyanov <[email protected]>
Container runtimes like CRI-O and containerd utilize the freezer cgroup to create a consistent snapshot of container root filesystem (rootfs) changes. In this case, the container is frozen before invoking CRIU. After CRIU successfully completes, a copy of the container rootfs diff is saved, and the container is then unfrozen. However, the `cuda-checkpoint` tool is not able to perform a 'lock' action on frozen threads. To support GPU checkpointing with these container runtimes, we need to unfreeze the cgroup and return it to its original state once the checkpointing is complete. To reflect this new behavior, the following changes are applied: - `dont_use_freeze_cgroup(void)` -> `set_compel_interrupt_only_mode(void)` - `bool freeze_cgroup_disabled` -> `bool compel_interrupt_only_mode` - `check_freezer_cgroup(void)` -> `prepare_freezer_for_interrupt_only_mode(void)` Note that when `compel_interrupt_only_mode` is set to `true`, `compel_interrupt_task()` is used instead of `freeze_processes()` to prevent tasks from running during `criu dump`. Fixes: checkpoint-restore#2508 Signed-off-by: Radostin Stoyanov <[email protected]>
Signed-off-by: Radostin Stoyanov <[email protected]>
The check for `/dev/nvidiactl` to determine if the CUDA plugin can be used is unreliable because in some cases the default path for driver installation is different [1]. This patch changes the logic to check if a GPU device is available in `/proc/driver/nvidia/gpus/`. This approach is similar to `torch.cuda.is_available()` and it is a more accurate indicator. The subsequent check for support of the `cuda-checkpoint --action` option would confirm if the driver supports checkpoint/restore. [1] https://github.com/NVIDIA/gpu-operator Fixes: checkpoint-restore#2509 Signed-off-by: Radostin Stoyanov <[email protected]>
Currently, the `waitpid()` call on the tmp process can be made by a process which is not its parent. This causes restore to fail. This patch instead selects one process to create the tmp process and open all the fds that point to it. These fds are sent to the correct process(es). Fixes: checkpoint-restore#2496 Signed-off-by: Andrei Vagin <[email protected]> Signed-off-by: Bhavik Sachdev <[email protected]>
On aarch64 the test cmdlinenv00 was failing with:
FAIL: cmdlinenv00.c:120: auxv corrupted on restore (errno = 11 (Resource temporarily unavailable))
Starting with Linux kernel version 6.3 the size of AUXV was changed:
commit 28c8e088427ad30b4260953f3b6f908972b77c2d
Author: Mathieu Desnoyers <[email protected]>
Date: Wed Jan 4 14:20:54 2023 -0500
rseq: Increase AT_VECTOR_SIZE_BASE to match rseq auxvec entries
Two new auxiliary vector entries are introduced for rseq without
matching increment of the AT_VECTOR_SIZE_BASE, which causes failures
with CONFIG_HARDENED_USERCOPY=y.
Fixes: 317c8194e6ae ("rseq: Introduce feature size and alignment ELF auxiliary vector entries")
With this change AT_VECTOR_SIZE increases from 40 to 50 on aarch64. CRIU
uses AT_VECTOR_SIZE to read the content of /proc/PID/auxv
auxv_t mm_saved_auxv[AT_VECTOR_SIZE];
ret = read(fd, mm_saved_auxv, sizeof(mm_saved_auxv));
Now the tests works again on aarch64.
Signed-off-by: Adrian Reber <[email protected]>
Right now, CRIU skips timers non-periodic timers. This change addresses this issue. Signed-off-by: Austin Kuo <[email protected]>
It creates a few timers with log expiration intervals, waites for C/R and check that timers are armed and their intervals have been restored. Signed-off-by: Austin Kuo <[email protected]>
* Switch to v4 actions/checkout (from v3) * Use our apt wrapper to gracefully handle temporary repository errors Signed-off-by: Adrian Reber <[email protected]>
criu_run_id will be used in upcoming changes to create and remove network rules for network locking. Instead of trying to come up with a way to create unique IDs, just use an existing library. libuuid should be installed on most systems as it is indirectly required by systemd (via libmount). Signed-off-by: Adrian Reber <[email protected]>
Using libnftables the chain to lock the network is composed of
("CRIU-%d", real_pid). This leads to around 40 zdtm tests failing
with errors like this:
Error: No such file or directory; did you mean table 'CRIU-62' in family inet?
delete table inet CRIU-86
The reason is that as soon as a process is running in a namespace the
real PID can be anything and only the PID in the namespace is restored
correctly. Relying on the real PID does not work for the chain name.
Using the PID of the innermost namespace would lead to the chain be
called 'CRIU-1' most of the time which is also not really unique.
With this commit the change is now named using the already existing CRIU
run ID. To be able to correctly restore the process and delete the
locking table, the CRIU run id during checkpointing is now stored in the
inventory as dump_criu_run_id.
Signed-off-by: Adrian Reber <[email protected]>
When creating a checkpoint of large models, the `checkpoint` action of `cuda-checkpoint` can exceed the CRIU timeout. This causes CRIU to fail with the following error, leaving the CUDA task in a locked state: cuda_plugin: Checkpointing CUDA devices on pid 84145 restore_tid 84202 Error (criu/cr-dump.c:1791): Timeout reached. Try to interrupt: 0 Error (cuda_plugin.c:139): cuda_plugin: Unable to read output of cuda-checkpoint: Interrupted system call Error (cuda_plugin.c:396): cuda_plugin: CHECKPOINT_DEVICES failed with net: Unlock network cuda_plugin: finished cuda_plugin stage 0 err -1 cuda_plugin: resuming devices on pid 84145 cuda_plugin: Restore thread pid 84202 found for real pid 84145 Unfreezing tasks into 1 Unseizing 84145 into 1 Error (criu/cr-dump.c:2111): Dumping FAILED. To fix this, we set `task_info->checkpointed` before invoking the `checkpoint` action to ensure that the CUDA task is resumed even if CRIU times out. Signed-off-by: Radostin Stoyanov <[email protected]>
Move `run_plugins(CHECKPOINT_DEVICES)` out of `collect_pstree()` to ensure that the function's sole responsibility is to use the cgroup freezer for the process tree. This allows us to avoid a time-out error when checkpointing applications with large GPU state. v2: This patch calls `checkpoint_devices()` only for `criu dump`. Support for GPU checkpointing with `pre-dump` will be introduced in a separate patch. Suggested-by: Andrei Vagin <[email protected]> Suggested-by: Jesus Ramos <[email protected]> Signed-off-by: Radostin Stoyanov <[email protected]>
Temporarily disable CUDA plugin for `criu pre-dump`. pre-dump currently fails with the following error: Handling VMA with the following smaps entry: 1822c000-18da5000 rw-p 00000000 00:00 0 [heap] Handling VMA with the following smaps entry: 200000000-200200000 ---p 00000000 00:00 0 Handling VMA with the following smaps entry: 200200000-200400000 rw-s 00000000 00:06 895 /dev/nvidia0 Error (criu/proc_parse.c:116): handle_device_vma plugin failed: No such file or directory Error (criu/proc_parse.c:632): Can't handle non-regular mapping on 705693's map 200200000 Error (criu/cr-dump.c:1486): Collect mappings (pid: 705693) failed with -1 We plan to enable support for pre-dump by skipping nvidia mappings in a separate patch. Signed-off-by: Radostin Stoyanov <[email protected]>
For two cases libcriu was setting the RPC protobuf field `has_*` before checking if the given parameter is valid. This can lead to situations, if the caller doesn't check the return value, that we pass as RPC struct to CRIU which has the `has_*` protobuf field set to true, but does not have a verified value (or non at all) set for the actual RPC entry. Signed-off-by: Adrian Reber <[email protected]>
CI pulls in a newer version of codespell. This fixes complaints from that codespell version. Signed-off-by: Adrian Reber <[email protected]>
Also add an explanation of the logic behind this calculation. Signed-off-by: Pavel Tikhomirov <[email protected]>
It is per net namespace, we need it to allow creation of unprivileged ICMP sockets. Note: in case this sysctl was disabled after unprivileged ICMP socket was created we still need to somehow handle it on restore. Signed-off-by: Pavel Tikhomirov <[email protected]>
Signed-off-by: Pavel Tikhomirov <[email protected]>
Trying to run latest CRIU on CentOS Stream 10 or Ubuntu 24.04 (aarch64)
fails like this:
# criu/criu check -v4
[...]
(00.096460) vdso: Parsing at ffffb2e2a000 ffffb2e2c000
(00.096539) vdso: PT_LOAD p_vaddr: 0
(00.096567) vdso: DT_STRTAB: 1d0
(00.096592) vdso: DT_SYMTAB: 128
(00.096616) vdso: DT_STRSZ: 8a
(00.096640) vdso: DT_SYMENT: 18
(00.096663) Error (criu/pie-util-vdso.c:193): vdso: Not all dynamic entries are present
(00.096688) Error (criu/vdso.c:627): vdso: Failed to fill self vdso symtable
(00.096713) Error (criu/kerndat.c:1906): kerndat_vdso_fill_symtable failed when initializing kerndat.
(00.096812) Found mmap_min_addr 0x10000
(00.096881) files stat: fs/nr_open 1073741816
(00.096908) Error (criu/crtools.c:267): Could not initialize kernel features detection.
This seems to be related to the kernel (6.12.0-41.el10.aarch64). The
Ubuntu user-space is running in a container on the same kernel.
Looking at the kernel this seems to be related to:
commit 48f6430505c0b0498ee9020ce3cf9558b1caaaeb
Author: Fangrui Song <[email protected]>
Date: Thu Jul 18 10:34:23 2024 -0700
arm64/vdso: Remove --hash-style=sysv
glibc added support for .gnu.hash in 2006 and .hash has been obsoleted
for more than one decade in many Linux distributions. Using
--hash-style=sysv might imply unaddressed issues and confuse readers.
Just drop the option and rely on the linker default, which is likely
"both", or "gnu" when the distribution really wants to eliminate sysv
hash overhead.
Similar to commit 6b7e26547fad ("x86/vdso: Emit a GNU hash").
The commit basically does:
-ldflags-y := -shared -soname=linux-vdso.so.1 --hash-style=sysv \
+ldflags-y := -shared -soname=linux-vdso.so.1 \
Which results in only a GNU hash being added to the ELF header. This
change has been merged with 6.11.
Looking at the referenced x86 commit:
commit 6b7e26547fad7ace3dcb27a5babd2317fb9d1e12
Author: Andy Lutomirski <[email protected]>
Date: Thu Aug 6 14:45:45 2015 -0700
x86/vdso: Emit a GNU hash
Some dynamic loaders may be slightly faster if a GNU hash is
available. Strangely, this seems to have no effect at all on
the vdso size.
This is unlikely to have any measurable effect on the time it
takes to resolve vdso symbols (since there are so few of them).
In some contexts, it can be a win for a different reason: if
every DSO has a GNU hash section, then libc can avoid
calculating SysV hashes at all. Both musl and glibc appear to
have this optimization.
It's plausible that this breaks some ancient glibc version. If
so, then, depending on what glibc versions break, we could
either require COMPAT_VDSO for them or consider reverting.
Which is also a really simple change:
-VDSO_LDFLAGS = -fPIC -shared $(call cc-ldoption, -Wl$(comma)--hash-style=sysv) \
+VDSO_LDFLAGS = -fPIC -shared $(call cc-ldoption, -Wl$(comma)--hash-style=both) \
The big difference here is that for x86 both hash sections are
generated. For aarch64 only the newer GNU hash is generated. That is why
we only see this error on kernel >= 6.11 and aarch64.
Changing from DT_HASH to DT_GNU_HASH seems to work on aarch64. The test
suite runs without any errors.
Unfortunately I am not aware of all implication of this change and if a
successful test suite run means that it still works.
Looking at the kernel I see following hash styles for the VDSO:
aarch64: not specified (only GNU hash style)
arm: --hash-style=sysv
loongarch: --hash-style=sysv
mips: --hash-style=sysv
powerpc: --hash-style=both
riscv: --hash-style=both
s390: --hash-style=both
x86: --hash-style=both
Only aarch64 on kernels >= 6.11 is a problem right now, because all
other platforms provide the old style hashing.
Signed-off-by: Adrian Reber <[email protected]>
Co-developed-by: Dmitry Safonov <[email protected]>
Co-authored-by: Dmitry Safonov <[email protected]>
Signed-off-by: Dmitry Safonov <[email protected]>
This way, - Makefile is less cluttered; - one can run codespell from the command line. Fixes: fd7e97f ("lint: exclude tags file from codespell") Signed-off-by: Kir Kolyshkin <[email protected]>
Currently Fedora rawhide based CI runs fail with: /bin/sh: line 1: awk: command not found Let's install it. Signed-off-by: Adrian Reber <[email protected]>
82068d3 to
cbe9e92
Compare
| int fd, ret; | ||
|
|
||
| if ((root_ns_mask & CLONE_NEWUSER) && !(root_ns_mask & CLONE_NEWNS)) | ||
| fd = ret = userns_call(userns_mount_proc, UNS_FDOUT, NULL, sizeof(NULL), -1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will it be mounted in the target pid namespace?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No i don't think it will be mounted and it's gonna be rejected. I fixed this now. The corner case is only when the user namespace is isolated while keeping both mount namespace and pid namespace be shared with host. If either of two namespaces are isolated, the proc fs will be mounted successfully.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If a target workload is restored in just separate userns and pid namespaces, userns_mount_proc will fail, because the current process doesn't have CAP_SYS_ADMIN in the current mount namespace. The only way to workaround that is to mount proc from usernsd, and the userns mount callback has to enter the target pid namespace before mounting proc.
cbe9e92 to
89d1e1c
Compare
43dbd40 to
eda5ab0
Compare
|
Rebased to criu-dev |
When the user namespace is separate from both pid and mount namespace in the given task, cr-restore will fail to mount the procfs in the new user namespace because of the loss of privileges. Delegate usernsd to do the mount job if that corner case arises. Signed-off-by: Yiyang Wu <[email protected]>
|
A friendly reminder that this PR had no activity for 30 days. |
When only the user namespace is isolated in the given task, cr-restore will fail to mount the procfs in the new user namespace because of the loss of privileges.
Delegate usernsd to do the mount job if that corner case arises.
Fixes: #2597