@asafpamzn

Summary

I'm implementing a COW-based live migration feature for CRIU that uses userfaultfd write-protection to track memory modifications while the process continues running. The goal is to combine it with the existing lazy support so that a process can be duplicated to a remote instance with less downtime than the traditional dump modes.

Overview

  1. Write-protecting all writable memory pages using userfaultfd
  2. Resuming the process immediately after protection
  3. Capturing page contents on write faults before they're modified
  4. Transferring pages to destination while process continues running
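
Step 1 of the overview could be sketched roughly as below. This is a minimal illustration, not the code from the PR: `demo_uffd_wp_protect` is a hypothetical name, the sketch protects a range in the calling process (in CRIU this would presumably have to run inside the target, for example from parasite code), and it assumes kernel headers that provide `UFFDIO_WRITEPROTECT`.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper (not CRIU code): write-protect [addr, addr + len)
 * through userfaultfd. Returns the userfaultfd descriptor on success,
 * or -1 with errno set on failure.
 */
int demo_uffd_wp_protect(void *addr, size_t len)
{
	long page = sysconf(_SC_PAGESIZE);
	struct uffdio_api api;
	struct uffdio_register reg;
	struct uffdio_writeprotect wp;
	int uffd, saved_errno;

	/* The range must be non-empty and page-aligned. */
	if (!addr || !len || page <= 0 ||
	    (uintptr_t)addr % (size_t)page || len % (size_t)page) {
		errno = EINVAL;
		return -1;
	}

	uffd = (int)syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	if (uffd < 0)
		return -1;

	/* API handshake: ask for write-protect fault reporting. */
	memset(&api, 0, sizeof(api));
	api.api = UFFD_API;
	api.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP;
	if (ioctl(uffd, UFFDIO_API, &api))
		goto err;

	/* Register the range in write-protect mode. */
	memset(&reg, 0, sizeof(reg));
	reg.range.start = (uintptr_t)addr;
	reg.range.len = len;
	reg.mode = UFFDIO_REGISTER_MODE_WP;
	if (ioctl(uffd, UFFDIO_REGISTER, &reg))
		goto err;

	/* Arm the protection: writes to the range now raise uffd events. */
	memset(&wp, 0, sizeof(wp));
	wp.range.start = (uintptr_t)addr;
	wp.range.len = len;
	wp.mode = UFFDIO_WRITEPROTECT_MODE_WP;
	if (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp))
		goto err;

	return uffd;
err:
	saved_errno = errno;
	close(uffd);
	errno = saved_errno;
	return -1;
}
```

The returned descriptor is what a monitoring thread would later read fault events from.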

High level flow

  1. Instead of dumping the entire memory, mark the VMAs as write-protected.
    See https://github.com/asafpamzn/criu/blob/criu-cow/criu/cr-dump.c#L1720

A new parasite does the job:
https://github.com/asafpamzn/criu/blob/criu-dev/criu/cow-dump.c#L197C1-L198C1
https://github.com/asafpamzn/criu/blob/a59a151c1e2fb6edfe899ab940698c5a412f75b1/criu/pie/parasite.c#L963

Question: I want to dump small VMAs directly and write-protect only the large ones. How can I do that? I don't fully understand how to combine the two kinds of VMAs, since they are all pushed to the same page image file.
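
To make the question concrete, the size-based split could be expressed as below. This is only a sketch of the classification itself and says nothing about how CRIU's page image layout would accommodate it; `struct demo_vma`, `DEMO_WP_THRESHOLD`, and the 2 MiB cut-off are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical VMA descriptor; CRIU's own VMA structures differ. */
struct demo_vma {
	uint64_t start;
	uint64_t end; /* exclusive */
};

/* Assumed cut-off: VMAs of at least this size get write-protected. */
#define DEMO_WP_THRESHOLD (2ULL * 1024 * 1024)

/* True if the VMA is non-empty and at least `threshold` bytes long. */
static bool demo_vma_use_wp(const struct demo_vma *v, uint64_t threshold)
{
	return v->end > v->start && v->end - v->start >= threshold;
}

/*
 * Fill use_wp[i] for every VMA: true means "write-protect and track",
 * false means "dump eagerly". Returns the number of VMAs to protect.
 */
size_t demo_classify_vmas(const struct demo_vma *vmas, size_t n,
			  uint64_t threshold, bool *use_wp)
{
	size_t nr_wp = 0;

	for (size_t i = 0; i < n; i++) {
		use_wp[i] = demo_vma_use_wp(&vmas[i], threshold);
		if (use_wp[i])
			nr_wp++;
	}
	return nr_wp;
}
```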

  2. Next, a new thread receives the page faults and transfers the process.
    https://github.com/asafpamzn/criu/blob/criu-cow/criu/cr-dump.c#L1728
    https://github.com/asafpamzn/criu/blob/a59a151c1e2fb6edfe899ab940698c5a412f75b1/criu/cow-dump.c#L423
    https://github.com/asafpamzn/criu/blob/a59a151c1e2fb6edfe899ab940698c5a412f75b1/criu/cow-dump.c#L444

  3. Wake the source process
    https://github.com/asafpamzn/criu/blob/a59a151c1e2fb6edfe899ab940698c5a412f75b1/criu/cow-dump.c#L414
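
The per-fault work of the monitoring thread described above might look roughly like this. It is a sketch under assumptions, not the PR's code: `demo_handle_wp_fault` is a hypothetical name, `page_size` is assumed to be a power of two, and the page is copied with `memcpy`, which only works when the handler shares the address space with the faulting thread (a separate dumper process would need another way to read the target's memory).

```c
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/*
 * Hypothetical handler (not CRIU code) for one userfaultfd message.
 * Returns 0 on success, -1 if the message is not a write-protect
 * fault, -2 if removing the protection failed.
 */
int demo_handle_wp_fault(int uffd, const struct uffd_msg *msg,
			 unsigned char *snapshot, size_t page_size)
{
	struct uffdio_writeprotect wp;
	uint64_t page_addr;

	if (msg->event != UFFD_EVENT_PAGEFAULT ||
	    !(msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP))
		return -1;

	/* Round the faulting address down to its page boundary. */
	page_addr = msg->arg.pagefault.address & ~((uint64_t)page_size - 1);

	/* Save the still-unmodified contents before the writer proceeds. */
	memcpy(snapshot, (const void *)(uintptr_t)page_addr, page_size);

	/* Mode 0 drops the protection and wakes the faulting thread. */
	memset(&wp, 0, sizeof(wp));
	wp.range.start = page_addr;
	wp.range.len = page_size;
	wp.mode = 0;
	if (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp))
		return -2;

	return 0;
}
```

A real loop would `poll()` the descriptor, `read()` one `struct uffd_msg` at a time, and queue the captured page for transfer.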

I'm in the early stages of learning the code and would be happy to get some guidance and advice.
Please let me know if this makes sense. I'm most concerned about how to combine the memory areas, since I want to write-protect only large VMAs.

rst0git and others added 30 commits May 23, 2025 08:33
The `criu cpuinfo check` command calls cpu_validate_cpuinfo(), which
attempts to open the cpuinfo.img file using `open_image()`. If the
image file is not found, `open_image()` returns an "empty image"
object. As a result, `cpu_validate_cpuinfo()` tries to read from it
and fails with the following error:

(00.002473) Error (criu/protobuf.c:72): Unexpected EOF on (empty-image)

This patch adds a check for an empty image and appropriate error message.

Signed-off-by: Radostin Stoyanov <[email protected]>
Fixes a clang compile-time error:
"argument unused during compilation: '-c'".

Signed-off-by: Andrei Vagin <[email protected]>
Use shared first error buffer to return correct
first error in rpc.

Fixes: checkpoint-restore#338

Signed-off-by: Ivan Pravdin <[email protected]>
Having CTL_FLAGS_IPC_EACCES_SKIP == (CTL_FLAGS_OPTIONAL |
CTL_FLAGS_READ_EIO_SKIP) is probably not what we want. So let's make it
a real distinct flag.

Fixes: 840735a ("ipc_sysctl: Prioritize restoring IPC variables using non usernsd approach")
Signed-off-by: Pavel Tikhomirov <[email protected]>
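
The flag collision described in this commit can be illustrated with a small sketch. The `DEMO_*` names and numeric values are assumptions chosen to reproduce the situation; they are not copied from CRIU's headers.

```c
#include <stdbool.h>

/* Illustrative values only; the real CRIU definitions may differ. */
#define DEMO_FLAG_OPTIONAL        0x1
#define DEMO_FLAG_READ_EIO_SKIP   0x4
#define DEMO_FLAG_EACCES_SKIP_OLD 0x5 /* equals OPTIONAL | READ_EIO_SKIP */
#define DEMO_FLAG_EACCES_SKIP_NEW 0x8 /* a bit of its own */

/* True when any bit of `flag` is set in `flags`. */
static inline bool demo_has_flag(unsigned int flags, unsigned int flag)
{
	return (flags & flag) != 0;
}
```

With the old value, a request marked only as optional also appears to carry the EACCES-skip flag; with a distinct bit it does not.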
Fixes: f38e588 ("net/sysctl: c/r ipv4/ping_group_range value")
Signed-off-by: Pavel Tikhomirov <[email protected]>
We have ability to skip sysctl if there is no value, but we still give
n requests to sysctl_op, that is not correct and probably can segfault
on nullptr access. Fix it by adding ri to count non skipped requests.

To be on the safe side, let's add a check that ri == n on read, as we
should not do any skips there.

While at it, let's fix a bad error message prefix: s/unix/ipv4/.

Remove excess has_iarg set, and add sarg reset to NULL for the case
sysctl_op skipped it.

Signed-off-by: Andrei Vagin <[email protected]>
Signed-off-by: Pavel Tikhomirov <[email protected]>
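
The counting fix in this commit follows a common pattern: compact the request array and pass only the number of entries actually filled in. A minimal sketch, with the hypothetical `struct demo_req` standing in for CRIU's real request type:

```c
#include <stddef.h>

/* Hypothetical request type; CRIU's sysctl request has other fields. */
struct demo_req {
	const char *name;
	const char *value; /* NULL: nothing was dumped for this sysctl */
};

/*
 * Copy only the requests that carry a value into out[] and return their
 * count (the "ri" of the commit message). out[] must hold n entries.
 */
size_t demo_collect_reqs(const struct demo_req *in, size_t n,
			 struct demo_req *out)
{
	size_t ri = 0;

	for (size_t i = 0; i < n; i++) {
		if (!in[i].value)
			continue; /* skipped: must not reach the sysctl operation */
		out[ri++] = in[i];
	}
	return ri;
}
```

The caller would then hand `ri`, not `n`, to the operation that walks `out[]`.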
We dump sysctls from criu user namespace, but restore from restored user
namespace. So group id values should be mapped to the restored user
namespace gid space to restore correctly.

Signed-off-by: Andrei Vagin <[email protected]>
Signed-off-by: Pavel Tikhomirov <[email protected]>
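
The gid translation this commit refers to can be sketched with the three-column layout of `/proc/<pid>/gid_map` (ID inside the namespace, ID outside, range length). The names below are hypothetical and the lookup is a simplified illustration, not CRIU's implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* One gid_map line: first ID inside, first ID outside, range length. */
struct demo_gid_extent {
	uint32_t inside;
	uint32_t outside;
	uint32_t count;
};

/*
 * Translate a gid as seen outside the user namespace into the gid seen
 * inside it. Returns -1 if no extent covers the gid.
 */
int64_t demo_gid_to_inside(const struct demo_gid_extent *map, size_t n,
			   uint32_t gid)
{
	for (size_t i = 0; i < n; i++) {
		if (gid >= map[i].outside &&
		    gid - map[i].outside < map[i].count)
			return (int64_t)map[i].inside + (gid - map[i].outside);
	}
	return -1;
}
```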
net/unix/max_dgram_qlen can't be tuned from non-root userns before:
v5.17-rc1~170^2~215 ("net: Enable max_dgram_qlen unix sysctl to be
configurable by non-init user namespaces")

Signed-off-by: Andrei Vagin <[email protected]>
Currently there is no option to checkpoint/restore programs that use
ICMP sockets, such as `ping`. This patch adds support for the same.

Fixes checkpoint-restore#2557

Signed-off-by: समीर सिंह Sameer Singh <[email protected]>
Add ZDTM static tests for IP4/ICMP and IP6/ICMP
socket feature.

Signed-off-by: समीर सिंह Sameer Singh <[email protected]>
Signed-off-by: Andrei Vagin <[email protected]>
E.g. I have a /etc/hosts in workspace mounted from the host, and get the following message.

(00.141008)      1: mnt-v2: Create plain mountpoint /tmp/.criu.mntns.K1biY1/mnt-0000000938 for 938
(00.141546)      1: mnt-v2:     Mounting unsupported @938 (0)
(00.141887)      1: mnt-v2:     Bind /tmp/agent/1-d8c746c6fda3a8b2/workspace/etc/hosts/ to /tmp/.criu.mntns.K1biY1/mnt-0000000938
(00.142179)      1: Error (criu/mount-v2.c:319): mnt-v2: Failed to open_tree /tmp/agent/1-d8c746c6fda3a8b2/workspace/etc/hosts/: Not a directory
(00.143774) Error (criu/cr-restore.c:2320): Restoring FAILED.

Signed-off-by: Chuan Qiu <[email protected]>
The test creates a file bindmount in criu mntns and binds it into test
mntns, this external file bindmount is autodetected and restored via
"--external mnt[]" criu option.

Note: In previous patch we fix the problem on this code path where file
bindmount restore fails as there is excess "/" in source path.

Signed-off-by: Pavel Tikhomirov <[email protected]>
Currently the build scripts create the following symlink:

  criu-4.1/images/google/protobuf/descriptor.proto -> /usr/include/google/protobuf/descriptor.proto

This symlink points to a system-wide absolute-path target. Also,
this symlink ends up in the release tarball. The tarball may later be
downloaded and unpacked by e.g. OS distributions. If unpacking is
done using Python 3.14+, it will fail.

This happens because Python 3.14 will switch the default behavior of
extractall() from "fully trusting the content of archive" to
"disallow common attack vectors while extracting the archive".
With this new behavior, extractall() raises an exception when at
least one file in the archive extracts or points to outside of the
extraction directory (these are called path traversal attacks and
zip slip attacks).

Reported-by: Dmitrii Kuvaiskii <[email protected]>
Signed-off-by: Radostin Stoyanov <[email protected]>
Commit 68f92b5 used `$$(Q)` instead of `$(Q)` in the Makefile target,
which resulted in the following error:

$(Q) echo "Generating descriptor.pb-c.c"
/bin/sh: 1: Q: not found
Generating descriptor.pb-c.c
$(Q) protoc --proto_path=/usr/include --proto_path=images/ --c_out=images/ /usr/include/google/protobuf/descriptor.proto
/bin/sh: 1: Q: not found

as well as:

$(Q) rm -rf images/google
/bin/sh: line 1: Q: command not found

Fix it.

Signed-off-by: Kir Kolyshkin <[email protected]>
Commit 68f92b5 removed images/google/protobuf directory, so it is
re-created each time during the build process.

This resulted in a weird behavior change. Previously, one could do
something like this:

	git clone $CRURL criu
	(cd criu && sudo make install-criu)
	rm -rf criu

This worked fine, including running rm -rf as a non-root user, since no
new directories were created under criu -- all directories were still
owned by the original user.

Since commit 68f92b5 the same sequence fails:

	rm: cannot remove '/home/runner/criu/images/google/protobuf/descriptor.pb-c.c': Permission denied
	rm: cannot remove '/home/runner/criu/images/google/protobuf/descriptor.pb-c.d': Permission denied
	rm: cannot remove '/home/runner/criu/images/google/protobuf/descriptor.pb-c.h': Permission denied

A workaround is to keep empty images/google/protobuf directory,
which is what this commit does.

Signed-off-by: Kir Kolyshkin <[email protected]>
In general, we use "$(E)" instead of "$(Q) echo", but we also have
a msg-gen macro which can be used here.

Signed-off-by: Kir Kolyshkin <[email protected]>
After the CRIU process saves the parasite code for the target thread in
the shared mmap, it is necessary to call __clear_cache before the target
thread executes the code.

Without this step, the target thread may not see the correct code to
execute, which can result in a SIGILL signal.

For the specific arm64 case, this is important so that the newly copied
code is flushed from d-cache to RAM, so that the target thread sees the
new code.

The change is based on commit 6be10a2 by @fu.lin and on input received
from @adrianreber.

[ avagin: tweak code comment ]

Signed-off-by: Ignacio Moreno Gonzalez <[email protected]>
Signed-off-by: Andrei Vagin <[email protected]>
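
The copy-then-flush step from this commit can be sketched as follows. `demo_copy_code` is a hypothetical helper; it uses the GCC/Clang builtin `__builtin___clear_cache` rather than reproducing CRIU's actual call site.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy machine code into dst and make sure the
 * instruction stream observes it before anything jumps there.
 */
void demo_copy_code(void *dst, const void *src, size_t len)
{
	memcpy(dst, src, len);
	/* Effectively a no-op on x86; required on arm64 and similar. */
	__builtin___clear_cache((char *)dst, (char *)dst + len);
}
```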
See the previous commit for rationale and architecture-specific details.

[ avagin: tweak code comment ]

Signed-off-by: Ignacio Moreno Gonzalez <[email protected]>
Signed-off-by: Andrei Vagin <[email protected]>
A kernel change (commit 12f147ddd6de, "do_change_type(): refuse to
operate on unmounted/not ours mounts") modified how mount propagation
properties can be changed. Previously, these properties could be changed
from any mount namespace. Now, they can only be modified from the
specific mount namespace where the target mount is actually mounted

This commit addresses this new restriction by ensuring that CRIU enters the
correct mount namespace before attempting to restore mount propagation
properties (MS_SLAVE or MS_SHARED) for a mount.

Signed-off-by: Andrei Vagin <[email protected]>
Installing this package currently fails with the following message:

  Package qemu is not available, but is referred to by another package.
  This may mean that the package is missing, has been obsoleted, or
  is only available from another source

  E: Package 'qemu' has no installation candidate

Signed-off-by: Radostin Stoyanov <[email protected]>
Signed-off-by: Radostin Stoyanov <[email protected]>
The tar command was failing with the following message:

  $ tar cf criu.tar ../../../criu
  tar: Removing leading `../../../' from member names
  tar: ../../../criu/scripts/ci/criu.tar: archive cannot contain itself; not dumped

In addition, /vagrant no longer exists in the new Fedora images.

  bash: line 1: cd: /vagrant: No such file or directory

Signed-off-by: Radostin Stoyanov <[email protected]>
Send large chunks to fill socket buffers.

Signed-off-by: Andrei Vagin <[email protected]>
The arm64 tests are currently being executed on both actuated and GitHub
runners. This change removes the actuated runner to avoid redundancy and
streamline our CI process.

Signed-off-by: Andrei Vagin <[email protected]>
Make should_dump_page to return int to indicate failure, also
return useful data back through the struct page_info structure
passed as a pointer.

Also, correspondingly convert all call sites.

No functional changes intended, except fixing a bug in
should_dump_page() as it could return (-1) when pmc_fill()
fails, while caller didn't expect that before.

Signed-off-by: Alexander Mikhalitsyn <[email protected]>
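
The calling convention described in this commit (status in the return value, data through an out-parameter) looks roughly like the sketch below. The structure fields and the `demo_` function are assumptions for illustration; the real `struct page_info` and `should_dump_page()` are not reproduced here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PME_PRESENT (1ULL << 63) /* "present" bit of a pagemap entry */

/* Hypothetical result structure; the field is illustrative only. */
struct demo_page_info {
	bool dump; /* should this page be written to the image? */
};

/*
 * Returns 0 and fills *pi on success, -1 on failure (here: a lookup
 * outside the cached pagemap window). With a bool return type, the
 * failure value could not be told apart from "true".
 */
int demo_should_dump_page(const uint64_t *pagemap, size_t nr_entries,
			  size_t idx, struct demo_page_info *pi)
{
	if (!pagemap || !pi || idx >= nr_entries)
		return -1;

	pi->dump = (pagemap[idx] & DEMO_PME_PRESENT) != 0;
	return 0;
}
```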
@rst0git
Member

rst0git commented Nov 8, 2025

@asafpamzn There are too many patches in this pull request and it would be difficult for someone to comment on the changes.

The following document provides more information on how to contribute to CRIU:
https://github.com/checkpoint-restore/criu/blob/criu-dev/CONTRIBUTING.md

> I'm implementing a COW-based live migration feature for CRIU that uses userfaultfd write-protection to track memory modifications while the process continues running. The goal is to combine it with the lazy support in order to be able to duplicate a process to remote instance while minimizing downtime compared to traditional dump modes.

I believe Mike Rapoport (@rppt) might be able to provide some advice about the idea.

@asafpamzn
Author

Thanks @rst0git,

Since it is a big change, I would like to get advice about the general direction before starting to implement. I can provide a design doc if that works better. What is the best path going forward?
Should I consult with @rppt?

@rst0git
Member

rst0git commented Nov 8, 2025

> Since it is a big change I would like to get advice about the general direction before starting to implement. I can provide a design doc if it works better. What is the best path going forward?

Creating a GitHub issue with more information about the use-case and why this functionality is important will help us to understand the proposed design.

> Should I consult with @rppt?

There are multiple people in the community who can provide feedback. Mike is an MM maintainer for the Linux kernel and contributed many of the patches that enable post-copy migration with userfaultfd.

@avagin
Member

avagin commented Nov 9, 2025

> Since it is a big change I would like to get advice about the general direction before starting to implement. I can provide a design doc if it works better. What is the best path going forward?

Let's start with a design doc.

@asafpamzn
Author

Ack, working on a design doc
