Skip to content

Conversation

@umanwizard
Copy link
Contributor

Data retrieved from the mmap-backed store must not outlive f, or reading it can cause segfaults.

Data retrieved from the mmap-backed store must not outlive `f`, or
reading it can cause segfaults.
@umanwizard umanwizard requested review from a team as code owners October 22, 2025 18:31
@fabled
Copy link
Contributor

fabled commented Oct 22, 2025

Is the problem that some mmapped data is leaked from these functions? If so, this might not be a full fix.

But also, we are currently looking into reverting the mmaping completely. While it solved the problem of allowing to access large Golang execulables without running out of memory, it introduced other subtle issues issues and as side effect still increased the RSS usage so that process resource usage limits trigger anyway. So hopefully within few weeks time we (or I) get time to draft a PR that removes mmap, and fixes the Golang large binary issues by other means.

Meanwhile, workarounds like this are ok if they fix the root cause of the segfault correctly.

@umanwizard
Copy link
Contributor Author

Calling a method on a struct does not prevent that struct from being garbage-collected/finalized.

I.e. if you have some

func (f *Foo) Bar() {
  // ...
}

f can die while Bar is still running. It is ONLY kept alive until the last time f is used within the function.

So if you have code like:

ef, err := pfelf.Open("foo")
if err != nil { return err }
syms, err := ef.ReadSymbols()
if err != nil { return err }

And then ef is never mentioned again, it can be cleaned up during the call to ReadSymbols, because that has the form:

// 1. get a pointer to some data from `f`, backed by mmap
// 2. process the data (without mentioning `f`)

So, the mmap can be cleaned up between steps 1 and 2, and since the finalizer calls munmap, this leads to a segfault.

@umanwizard
Copy link
Contributor Author

And in fact I am actually seeing this segfault if I write a test of that form, if the file is big enough (thus ReadSymbols takes long enough for the finalizer to actually run)

@christos68k
Copy link
Member

christos68k commented Oct 23, 2025

Calling a method on a struct does not prevent that struct from being garbage-collected/finalized.

I.e. if you have some

func (f *Foo) Bar() {
  // ...
}

f can die while Bar is still running. It is ONLY kept alive until the last time f is used within the function.

Interesting, I was aware of the behavior described in KeepAlive but that doesn't specifically mention pointer receivers and I've so far assumed that a pointer receiver would keep an instance alive for the duration of a method call.

The only relevant reference to the behavior you described that I could find is here (lines 46-51).

Good to know.

@christos68k
Copy link
Member

christos68k commented Oct 23, 2025

@umanwizard For completeness, could you post a stacktrace / panic log of the crash (or a reproducible test if you have it?)

@umanwizard
Copy link
Contributor Author

Interesting, I was aware of the behavior described in KeepAlive but that doesn't specifically mention pointer receivers and I've so far assumed that a pointer receiver would keep an instance alive for the duration of a method call.

I also thought that, but here is Keith Randall from the Go team confirming the behavior: https://groups.google.com/g/golang-nuts/c/LIWj6Gl--es?pli=1

@umanwizard
Copy link
Contributor Author

@christos68k

Here is a branch with a test that segfaults reliably on my machine: https://github.com/parca-dev/opentelemetry-ebpf-profiler/tree/pfelf-crash-example

I haven't submitted the test as part of this PR, because it's a bit weird to be downloading random binaries from the internet in a test, and this particular binary is too large to embed in the repo.

@umanwizard
Copy link
Contributor Author

For posterity in case that branch gets deleted someday, here is the patch:

diff --git a/libpf/pfelf/file_test.go b/libpf/pfelf/file_test.go
index 85faebbe1..f9d7a8882 100644
--- a/libpf/pfelf/file_test.go
+++ b/libpf/pfelf/file_test.go
@@ -109,3 +109,16 @@ func TestGetGoBuildID(t *testing.T) {
 	expectedBuildID := strings.TrimRight(string(out), "\n")
 	assert.Equal(t, expectedBuildID, buildID)
 }
+
+func getNodeBinary(t *testing.T) {
+	cmd := exec.Command("sh", "-c", "curl https://nodejs.org/download/nightly/v26.0.0-nightly20251023f819aec288/node-v26.0.0-nightly20251023f819aec288-linux-x64.tar.gz | tar xzf - --wildcards --strip-components=2 -C testdata node'*'/bin/node")
+	_, err := cmd.Output()
+	require.NoError(t, err)
+}
+
+func TestReadSymbols(t *testing.T) {
+	getNodeBinary(t)
+	ef := getPFELF("testdata/node", t)
+	_, err := ef.ReadSymbols()
+	require.NoError(t, err)
+}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason to place runtime.KeepAlive in the selected public API functions but leave other functions of *File without it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those were the only functions I happened to notice where we are getting a pointer to mmap'ed space, then possibly dropping the file (causing an munmap), then trying to access the mmap'ed pointer.

Is there another such function I missed?

@fabled
Copy link
Contributor

fabled commented Nov 2, 2025

Also, note that #909 will remove the mmap.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants