
Commit affc3c7

fix(profiling): correctly unwind on-CPU Tasks (#15562)
## What does this PR do?

More details and investigation results are available in https://datadoghq.atlassian.net/browse/PROF-13137

Related PRs:

* Dependency: #15552
* Dependent: #15579
* Echion PR: P403n1x87/echion#198

This PR fixes the way we sample and capture stacks for on-CPU asyncio Tasks. It is a first step towards fixing _all_ the weird (sub-)stacks we are seeing (both in Echion and dd-trace-py), for example this one:

```
background_math_function
Task-background_wait
background_wait_function
sleep
```

In practice, `background_math_function` was executed as part of a Task that was completely unrelated to `Task-background_wait` and should never have appeared in the same stack. This change by itself doesn't fix everything (see **The fix(es, actually)**), but, as I said, it's a first step.

See [this document](https://docs.google.com/document/d/10HmKhY6tM5j7ojdk7CG8anWsN7qP7IyeZlj3-zAXudM/edit?pli=1&tab=t.9pzrl0uli4g) for more explanation of how this works / is supposed to work; the fix itself is explained below.

## The fix(es, actually)

### Flawed existing logic...

The fix logic is the following (a minimal sketch follows this list):

* To correctly interleave Task Stacks and Task names, we need to determine the "synchronous Python top-of-stack" that goes on top of the on-CPU Task (if there is an on-CPU Task).
  * This includes the Python stack that eventually led to something like `asyncio.run`, as well as the internal `asyncio` Runner machinery.
  * We have to push this on top of every Task Stack we capture (note that the on-CPU Stack already has it by definition).
* To do that, we need to determine which Task is "the on-CPU Task" (if any) and process it first, so that we know how "high" the "upper Python stack" is and can re-use it for subsequent Task unwindings.
  * To ensure the first Task we unwind is the on-CPU one, we first sort the Leaf Tasks array by on-CPU-ness.
  * Note that I'm saying _the_ on-CPU Task because, as always in Python, only one thing can be running at a time.
  * (That is the theory, at least – in practice, on-CPU-ness may change as we sample, so we could end up with more than one on-CPU Task, but that is fine because it won't change how deep the "upper stack" is.)
* Once we've done this, we know how high the "top of stack" is, because it is the height of the "pure Python" Stack we get by unwinding the first (top-most) Frame of the on-CPU Task. We need to remember that height so we can push that many items from the Python Thread Stack on top of each Task's Stack.
* One final thing to keep in mind is that we "stack" Stacks (hehe) such that if `Task-1` is awaiting `Task-2`, the Stack for `Task-2` goes under that for `Task-1`. We only ever want to add the "pure Python top-of-stack" once per Stack, so we need to do that _once per leaf Task_, after we've recursively unwound the awaiting Tasks for that leaf Task.
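To make the terms above concrete, here is a minimal, hypothetical asyncio program (the function names are my own illustration, not taken from this PR's tests), annotated with the rough Stack shape we want to reconstruct for the leaf Task while it is on-CPU:

```python
import asyncio


async def leaf_work():
    # Leaf Task: while this runs, it is "the on-CPU Task" and its coroutine
    # chain is unwound first.
    await asyncio.sleep(0)


async def parent_work():
    # Awaiting Task: its frames are appended after the leaf Task's frames,
    # between the leaf Task and the entry point.
    await asyncio.create_task(leaf_work())


def main():
    # Synchronous entry point: main() plus the asyncio Runner machinery form
    # the shared "upper Python stack" that must be added exactly once per
    # leaf Task.
    asyncio.run(parent_work())


# From the innermost frame outwards, the reconstructed Stack for the leaf Task
# should look roughly like (Task-name frames that the profiler interleaves are
# omitted here):
#   leaf_work                    <- frames of the leaf (on-CPU) Task
#   parent_work                  <- frames of the awaiting Task
#   <asyncio Runner machinery>   <- "upper Python stack", shared by every
#   main                         <- leaf Task and added once per leaf Task
main()
```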
### ... and a race condition?

On top of that – because it's not enough – I think I discovered a rare but very real race condition (which was actually the root cause of that weird, rare-but-common-in-CI bug with the stack shown at the top). The race condition happens when the pure Python Stack we capture involves some asynchronous Frames, but the associated Task/Coroutine completes or yields before we get to unwinding it.

In that case, Stack interleaving assumes the Task is running, and by doing so it includes some Frames from the Python Stack that actually _are not_ in the Stack generated by unwinding the Task. Because we share the same Python Stack for all Tasks (we only use its "upper part", which is the same for every Task), some Frames from the previously-running-but-now-paused-or-completed Coroutine leak into other Tasks.

Working around this is not simple; honestly, we should also discuss whether we want to properly work around it (which is going to be somewhat difficult, imperfect and probably somewhat costly) or just give up on sampling when we detect something is off. In the interest of keeping things somewhat simple, this will be handled in a separate PR. I've described the bug here anyway, for clarity (temporary PR: KowalskiThomas/echion#43).

### Caveats

I realised after the fact that an assumption we implicitly make (that _the on-CPU Task – if it exists – is a Leaf Task_) is not generally true. With certain `asyncio` utilities, the Parent Task will from time to time execute code / be on-CPU to manage the Child Tasks it is awaiting on (illustrated by the sketch below). This is an edge case and happens rarely, but saying it can never happen is false (and it shows in Profiles). We will also need to figure something out for that – it probably isn't _that_ hard, but it will require some refactoring (probably splitting the unwinding in two steps: (1) determine only the upper Python Stack, (2) unwind Tasks strictly speaking, starting from Leaf Tasks).
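Purely as an illustration of this edge case (a hypothetical example of mine, not code from this PR's tests), here is a parent Task that periodically runs on-CPU while it is still awaiting children and therefore is not a Leaf Task:

```python
import asyncio


async def child(i: int) -> int:
    # Leaf Tasks: most samples should catch one of these (or the idle event
    # loop) on-CPU.
    await asyncio.sleep(0.01 * (i + 1))
    return i


async def parent() -> None:
    pending = {asyncio.create_task(child(i)) for i in range(3)}
    while pending:
        # Each time a child finishes, this parent Task wakes up and briefly
        # runs on-CPU to do bookkeeping, even though it is not a Leaf Task:
        # it is still awaiting the remaining children.
        done, pending = await asyncio.wait(pending, return_when=asyncio.FIRST_COMPLETED)


if __name__ == "__main__":
    asyncio.run(parent())
```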
## Testing

The PR adds new tests around on-CPU Tasks and what their Stacks look like.

The PR also updates a lot of existing tests to output their `DataSummary` as JSON. This is immensely useful for understanding what is actually going on when we don't see the expected Stacks (or see unexpected ones), and I do think it'd be worth keeping.

After these changes, the tests that still fail locally and in CI (flakily) are unrelated to on-CPU Tasks which, I guess, means _mission failed successfully_?

1 parent e34ab9d commit affc3c7

3 files changed (+81, −29 lines)

ddtrace/internal/datadog/profiling/stack_v2/echion/echion/tasks.h

Lines changed: 31 additions & 9 deletions
```diff
@@ -182,27 +182,26 @@ class TaskInfo
     PyObject* origin = NULL;
     PyObject* loop = NULL;
 
-    GenInfo::Ptr coro = nullptr;
-
     StringTable::Key name;
+    bool is_on_cpu = false;
+    GenInfo::Ptr coro = nullptr;
 
     // Information to reconstruct the async stack as best as we can
     TaskInfo::Ptr waiter = nullptr;
-    bool is_on_cpu = false;
 
     [[nodiscard]] static Result<TaskInfo::Ptr> create(TaskObj*);
     TaskInfo(PyObject* origin, PyObject* loop, GenInfo::Ptr coro, StringTable::Key name, TaskInfo::Ptr waiter)
       : origin(origin)
       , loop(loop)
-      , coro(std::move(coro))
       , name(name)
+      , is_on_cpu(coro && coro->is_running)
+      , coro(std::move(coro))
       , waiter(std::move(waiter))
-      , is_on_cpu(this->coro && this->coro->is_running)
     {
     }
 
     [[nodiscard]] static Result<TaskInfo::Ptr> current(PyObject*);
-    inline size_t unwind(FrameStack&);
+    inline size_t unwind(FrameStack&, size_t& upper_python_stack_size);
 };
 
 inline std::unordered_map<PyObject*, PyObject*> task_link_map;
@@ -344,7 +343,7 @@ inline std::vector<std::unique_ptr<StackInfo>> current_tasks;
 // ----------------------------------------------------------------------------
 
 inline size_t
-TaskInfo::unwind(FrameStack& stack)
+TaskInfo::unwind(FrameStack& stack, size_t& upper_python_stack_size)
 {
     // TODO: Check for running task.
     std::stack<PyObject*> coro_frames;
@@ -355,14 +354,37 @@ TaskInfo::unwind(FrameStack& stack)
         coro_frames.push(py_coro->frame);
     }
 
-    int count = 0;
+    // Total number of frames added to the Stack
+    size_t count = 0;
 
     // Unwind the coro frames
     while (!coro_frames.empty()) {
        PyObject* frame = coro_frames.top();
        coro_frames.pop();
 
-        count += unwind_frame(frame, stack);
+        auto new_frames = unwind_frame(frame, stack);
+
+        // If we failed to unwind the Frame, stop unwinding the coroutine chain; otherwise we could
+        // end up with Stacks with missing Frames between two coroutines Frames.
+        if (new_frames == 0) {
+            break;
+        }
+
+        // If this is the first Frame being unwound (we have not added any Frames to the Stack yet),
+        // use the number of Frames added to the Stack to determine the size of the upper Python stack.
+        if (count == 0) {
+            // The first Frame is the coroutine Frame, so the Python stack size is the number of Frames - 1
+            upper_python_stack_size = new_frames - 1;
+
+            // Remove the Python Frames from the Stack (they will be added back later)
+            // We cannot push those Frames now because otherwise they would be added once per Task,
+            // we only want to add them once per Leaf Task, and on top of all non-leaf Tasks.
+            for (size_t i = 0; i < upper_python_stack_size; i++) {
+                stack.pop_back();
+            }
+        }
+
+        count += new_frames;
     }
 
     return count;
```

ddtrace/internal/datadog/profiling/stack_v2/echion/echion/threads.h

Lines changed: 47 additions & 20 deletions
```diff
@@ -264,31 +264,45 @@ ThreadInfo::unwind_tasks()
         }
     }
 
+    // Make sure the on CPU task is first
+    for (size_t i = 0; i < leaf_tasks.size(); i++) {
+        if (leaf_tasks[i].get().is_on_cpu) {
+            if (i > 0) {
+                std::swap(leaf_tasks[i], leaf_tasks[0]);
+            }
+            break;
+        }
+    }
+
+    // The size of the "pure Python" stack (before asyncio Frames), computed later by TaskInfo::unwind
+    size_t upper_python_stack_size = 0;
+    // Unused variable, will be used later by TaskInfo::unwind
+    size_t unused;
+
+    bool on_cpu_task_seen = false;
     for (auto& leaf_task : leaf_tasks) {
+        on_cpu_task_seen = on_cpu_task_seen || leaf_task.get().is_on_cpu;
+
         auto stack_info = std::make_unique<StackInfo>(leaf_task.get().name, leaf_task.get().is_on_cpu);
         auto& stack = stack_info->stack;
+
         for (auto current_task = leaf_task;;) {
             auto& task = current_task.get();
 
-            size_t stack_size = task.unwind(stack);
-
+            // The task_stack_size includes both the coroutines frames and the "upper" Python synchronous frames
+            size_t task_stack_size = task.unwind(stack, task.is_on_cpu ? upper_python_stack_size : unused);
             if (task.is_on_cpu) {
-                // Undo the stack unwinding
-                // TODO[perf]: not super-efficient :(
-                for (size_t i = 0; i < stack_size; i++)
-                    stack.pop_back();
-
-                // Instead we get part of the thread stack
-                FrameStack temp_stack;
-                size_t nframes = (python_stack.size() > stack_size) ? python_stack.size() - stack_size : 0;
-                for (size_t i = 0; i < nframes; i++) {
-                    auto python_frame = python_stack.front();
-                    temp_stack.push_front(python_frame);
-                    python_stack.pop_front();
-                }
-                while (!temp_stack.empty()) {
-                    stack.push_front(temp_stack.front());
-                    temp_stack.pop_front();
+                // Get the "bottom" part of the Python synchronous Stack, that is to say the
+                // synchronous functions and coroutines called by the Task's outermost coroutine
+                // The number of Frames to push is the total number of Frames in the Python stack, from which we
+                // subtract the number of Frames in the "upper Python stack" (asyncio machinery + sync entrypoint)
+                // This gives us [outermost coroutine, ... , innermost coroutine, outermost sync function, ... ,
+                // innermost sync function]
+                size_t frames_to_push =
+                    (python_stack.size() > task_stack_size) ? python_stack.size() - task_stack_size : 0;
+                for (size_t i = 0; i < frames_to_push; i++) {
+                    const auto& python_frame = python_stack[frames_to_push - i - 1];
+                    stack.push_front(python_frame);
                 }
             }
 
@@ -317,8 +331,21 @@ ThreadInfo::unwind_tasks()
         }
 
         // Finish off with the remaining thread stack
-        for (auto p = python_stack.begin(); p != python_stack.end(); p++)
-            stack.push_back(*p);
+        // If we have seen an on-CPU Task, then upper_python_stack_size will be set and will include the sync entry
+        // point and the asyncio machinery Frames. Otherwise, we are in `select` (idle) and we should push all the
+        // Frames.
+
+        // There could be a race condition where relevant partial Python Thread Stack ends up being different from the
+        // one we saw in TaskInfo::unwind. This is extremely unlikely, I believe, but failing to account for it would
+        // cause an underflow, so let's be conservative.
+        size_t start_index = 0;
+        if (on_cpu_task_seen && python_stack.size() >= upper_python_stack_size) {
+            start_index = python_stack.size() - upper_python_stack_size;
+        }
+        for (size_t i = start_index; i < python_stack.size(); i++) {
+            const auto& python_frame = python_stack[i];
+            stack.push_back(python_frame);
+        }
 
         current_tasks.push_back(std::move(stack_info));
     }
```

Lines changed: 3 additions & 0 deletions
```diff
@@ -0,0 +1,3 @@
+fixes:
+  - |
+    profiling: This fix makes stack sampling more accurate for on-CPU asyncio Tasks.
```
