-
Notifications
You must be signed in to change notification settings - Fork 350
Prevent finalization of pfelf files while using their data.
#877
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Data retrieved from the mmap-backed store must not outlive `f`, or reading it can cause segfaults.
|
Is the problem that some mmapped data is leaked from these functions? If so, this might not be a full fix. But also, we are currently looking into reverting the mmaping completely. While it solved the problem of allowing to access large Golang execulables without running out of memory, it introduced other subtle issues issues and as side effect still increased the RSS usage so that process resource usage limits trigger anyway. So hopefully within few weeks time we (or I) get time to draft a PR that removes mmap, and fixes the Golang large binary issues by other means. Meanwhile, workarounds like this are ok if they fix the root cause of the segfault correctly. |
|
Calling a method on a struct does not prevent that struct from being garbage-collected/finalized. I.e. if you have some func (f *Foo) Bar() {
// ...
}
So if you have code like: ef, err := pfelf.Open("foo")
if err != nil { return err }
syms, err := ef.ReadSymbols()
if err != nil { return err }And then So, the mmap can be cleaned up between steps 1 and 2, and since the finalizer calls |
|
And in fact I am actually seeing this segfault if I write a test of that form, if the file is big enough (thus |
Interesting, I was aware of the behavior described in KeepAlive but that doesn't specifically mention pointer receivers and I've so far assumed that a pointer receiver would keep an instance alive for the duration of a method call. The only relevant reference to the behavior you described that I could find is here (lines 46-51). Good to know. |
|
@umanwizard For completeness, could you post a stacktrace / panic log of the crash (or a reproducible test if you have it?) |
I also thought that, but here is Keith Randall from the Go team confirming the behavior: https://groups.google.com/g/golang-nuts/c/LIWj6Gl--es?pli=1 |
|
Here is a branch with a test that segfaults reliably on my machine: https://github.com/parca-dev/opentelemetry-ebpf-profiler/tree/pfelf-crash-example I haven't submitted the test as part of this PR, because it's a bit weird to be downloading random binaries from the internet in a test, and this particular binary is too large to embed in the repo. |
|
For posterity in case that branch gets deleted someday, here is the patch: diff --git a/libpf/pfelf/file_test.go b/libpf/pfelf/file_test.go
index 85faebbe1..f9d7a8882 100644
--- a/libpf/pfelf/file_test.go
+++ b/libpf/pfelf/file_test.go
@@ -109,3 +109,16 @@ func TestGetGoBuildID(t *testing.T) {
expectedBuildID := strings.TrimRight(string(out), "\n")
assert.Equal(t, expectedBuildID, buildID)
}
+
+func getNodeBinary(t *testing.T) {
+ cmd := exec.Command("sh", "-c", "curl https://nodejs.org/download/nightly/v26.0.0-nightly20251023f819aec288/node-v26.0.0-nightly20251023f819aec288-linux-x64.tar.gz | tar xzf - --wildcards --strip-components=2 -C testdata node'*'/bin/node")
+ _, err := cmd.Output()
+ require.NoError(t, err)
+}
+
+func TestReadSymbols(t *testing.T) {
+ getNodeBinary(t)
+ ef := getPFELF("testdata/node", t)
+ _, err := ef.ReadSymbols()
+ require.NoError(t, err)
+} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the reason to place runtime.KeepAlive in the selected public API functions but leave other functions of *File without it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those were the only functions I happened to notice where we are getting a pointer to mmap'ed space, then possibly dropping the file (causing an munmap), then trying to access the mmap'ed pointer.
Is there another such function I missed?
|
Also, note that #909 will remove the mmap. |
Data retrieved from the mmap-backed store must not outlive
f, or reading it can cause segfaults.