Looking for practical guidance on interpreting Pathway benchmark results for real-world cluster sizing #2
-
One more thing I wanted to dig into, since it ties directly to how I plan the cluster setup: when you evaluate these benchmarks internally, how do you usually separate the effects of Pathway’s own execution model from external bottlenecks such as Python GIL behavior at the Python/Rust boundary, connector throughput, or the underlying I/O layer? I’m asking because in some streaming engines, once you hit a certain throughput threshold, the engine isn’t the limiting factor anymore: it’s serialization, network shuffle, or the data source itself.
If you have any rough guidance on how to tell “Pathway is the bottleneck” apart from “the environment around it is the bottleneck,” that would help me a lot. I’m trying to avoid misreading benchmark numbers and blaming the wrong component when tuning. My end goal is a setup where I can clearly see which component is actually the limiting factor at any given moment.
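To make this concrete, here is roughly the experiment I had in mind for that isolation step: run the same downstream logic once from a cheap pre-generated local source and once from the real connector, then compare sustained throughput. Everything in the sketch below is my own assumption rather than anything from the benchmark repo: the paths and schema are placeholders, and I left the Kafka line commented out because I would take the exact connector arguments from the Pathway docs.

```python
import pathway as pw


class EventSchema(pw.Schema):
    key: str
    value: int


# Run A: feed the pipeline from a pre-generated local CSV directory so the
# data source is effectively free and cannot be the bottleneck.
events = pw.io.csv.read("./bench_input/", schema=EventSchema, mode="streaming")

# Run B (swapped in manually for a second run): the real connector, with
# everything downstream left untouched.
# events = pw.io.kafka.read(...)  # exact arguments per the Pathway docs

# Placeholder for the real downstream logic; kept identical across both runs.
filtered = events.filter(events.value > 0)

pw.io.csv.write(filtered, "./bench_output.csv")
pw.run()
```

If both runs level off at roughly the same throughput, I would read that as the engine being the limit; if the local run is much faster, I would start looking at the connector, network, or serialization side instead.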
Even a high-level description of how you reason about this internally would be very useful. Thanks again, I really appreciate the insight.
-
To add a bit more detail on what I’m actually trying to solve: A lot of the real-time systems I’ve worked with in the past (Flink, Spark Structured Streaming, Kafka Streams, and a few custom engines) tend to hit performance ceilings in completely different ways. Sometimes the engine saturates CPU linearly, sometimes GC becomes the cliff, sometimes the connector layer becomes the dominant cost, and sometimes the state backend or shuffle layer is the real bottleneck. The tricky part is that each framework “fails” differently, so you have to learn how to read the signals. What I’m trying to understand with Pathway is what those early warning signs usually look like.
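For context, the kind of earlier, non-CPU signal I usually end up exporting looks roughly like the sketch below. The metric name and the measure_lag_seconds() helper are placeholders I made up, not anything Pathway exposes as far as I know; the point is only the shape of the signal an HPA could consume through the custom metrics path.

```python
# Rough sketch: export end-to-end event lag with prometheus_client so that an
# HPA can scale on it instead of (or in addition to) raw CPU.
import time

from prometheus_client import Gauge, start_http_server

EVENT_LAG = Gauge(
    "pipeline_event_lag_seconds",
    "Age of the newest processed event relative to wall-clock time",
)


def measure_lag_seconds() -> float:
    """Placeholder: in a real setup this would compare the event time of the
    latest processed record against time.time()."""
    return 0.0


if __name__ == "__main__":
    start_http_server(9100)  # scraped by Prometheus, then fed to the HPA
    while True:
        EVENT_LAG.set(measure_lag_seconds())
        time.sleep(5)
```

A signal like this tends to move earlier than CPU when the failure mode is a sharp latency cliff, which is exactly the case where CPU-based scaling reacts too late.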
These differences matter a lot when building autoscaling logic. If the engine fails “softly” (gradual slowdown), a CPU-based HPA is often enough. If it fails “hard” (sharp latency cliffs), you need earlier custom metrics and more conservative thresholds.
So the heart of my question is: when Pathway starts approaching its limits, what does that typically look like in your experience? I’m trying to build the kind of mental model that lets me anticipate the bottleneck before it becomes a problem, instead of reacting after it’s already visible in the logs. Any insight along those lines, even rough intuition, would be extremely helpful. Thank you.
-
Hey @nholuongut, I'll start responding to the questions gradually. Please don't hesitate to ask for clarification if you need more details.
Don't hesitate to follow up on any of these points or to describe your case in more detail if you think it's necessary.
-
Hi Pathway team,
I’ve been going through the benchmarking repo to understand how Pathway behaves under different workloads compared to Spark, Flink, and Kafka Streams. The examples (PageRank iterative runs and the streaming wordcount) were helpful for getting a sense of the engine’s characteristics, but I’m trying to bridge the gap between these controlled benchmarks and what happens in an actual production cluster.
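For reference, when I say “the streaming wordcount”, the pipeline shape I have in mind is roughly the one below. This is my own reconstruction from the Pathway documentation rather than the repo’s actual benchmark code, and the input path and schema are placeholders.

```python
import pathway as pw


class WordSchema(pw.Schema):
    word: str


# Placeholder input: a directory of CSV files with a single "word" column,
# watched in streaming mode.
words = pw.io.csv.read("./words_input/", schema=WordSchema, mode="streaming")

# Keyed aggregation: count occurrences per word as new data arrives.
counts = words.groupby(words.word).reduce(
    words.word,
    count=pw.reducers.count(),
)

pw.io.csv.write(counts, "./word_counts.csv")
pw.run()
```

Having the shape written down helps me map the repo’s throughput numbers onto the keyed-aggregation workloads I actually run, so please correct me if the benchmark differs meaningfully from this.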
To be more specific, I’m trying to answer a few practical questions before I lock down the sizing strategy for a real-time setup:
When interpreting the benchmark results, is there a set of “anchor points” you usually rely on — for example, CPU saturation vs memory pressure vs I/O boundaries — to decide whether a workload is compute-bound or memory-bound in Pathway?
In your experience, do the micro-batch-like patterns in benchmark workloads translate cleanly into real deployments, or do real pipelines usually introduce variance (spiky ingestion, uneven key distribution, backpressure) that shifts the scaling behavior? The load-generator sketch after the last question below shows the kind of variance I have in mind.
For long-running jobs (like iterative graph workloads), is there a recommended way to map the benchmark numbers to actual pod sizes or node pool types in Kubernetes? I’m not looking for exact formulas, just how you normally reason about the translation.
Have you seen consistent differences in how Pathway scales horizontally vs vertically when compared to something like Flink? I’m trying to understand whether Pathway prefers “many moderate pods” or “fewer bigger pods” under real-world load.
Finally, do you have any internal heuristics for interpreting the latency/throughput trade-offs in this repo? For example, when is it worth spending more CPU to shave latency vs when throughput is the dominant factor?
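To illustrate what I mean by spiky ingestion and uneven key distribution a couple of questions above, the snippet below is the kind of synthetic load I would use to probe it: Zipf-skewed keys plus alternating quiet and bursty seconds, written as JSON Lines that a streaming file connector could pick up. Everything here is made up for illustration; none of it comes from the benchmark repo.

```python
# Hypothetical generator for skewed, bursty load: most events hit a handful of
# hot keys, and the arrival rate alternates between quiet and spiky seconds.
import json
import random
import time
from pathlib import Path

import numpy as np


def run_generator(out_dir: str, duration_s: int = 60) -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    rng = np.random.default_rng()
    with (out / "events.jsonl").open("a") as f:
        end = time.time() + duration_s
        while time.time() < end:
            # Mostly ~200 events/s, with occasional 5000-event bursts.
            rate = random.choice([200, 200, 200, 5000])
            for _ in range(rate):
                key = int(rng.zipf(a=1.3))  # heavy skew toward key 1
                f.write(json.dumps({"key": f"user_{key}", "ts": time.time()}) + "\n")
            f.flush()
            time.sleep(1)


if __name__ == "__main__":
    run_generator("./spiky_input", duration_s=60)
```

Running the same pipeline against this directory and against a smoothed, uniform-key version of it is how I would check whether the scaling behavior seen in the benchmarks carries over.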
The reason I’m digging into this is that I’m preparing a Pathway-based pipeline to run in a multi-pool Kubernetes cluster (compute-optimized vs memory-optimized nodes), and I’d like to start from patterns that are known to behave well instead of discovering limits the hard way.
Any insights on how you interpret these benchmarks when designing real deployments would be extremely helpful.
Thanks for the great work on the repo — it’s been very valuable for getting a feel for the engine.
Best Regards,
Nho Luong