Looking for practical guidance on interpreting Pathway benchmark results for real-world cluster sizing #2
-
One more thing I wanted to dig into, since it ties directly to how I plan the cluster setup: when you evaluate these benchmarks internally, how do you usually separate the effects of Pathway’s own execution model from external bottlenecks such as Python GIL behavior at the Python/Rust boundary, connector throughput, or the underlying I/O layer? I’m asking because in some streaming engines, once you hit a certain throughput threshold, the engine isn’t the limiting factor anymore: it’s serialization, network shuffle, or the data source itself.
If you have any rough guidance on how to tell “Pathway is the bottleneck” apart from “the environment around it is the bottleneck,” that would help me a lot. I’m trying to avoid misreading benchmark numbers and blaming the wrong component when tuning. My end goal is a setup where I can clearly see which component is actually the limiting factor at any given moment.
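To make this concrete, here is roughly the experiment I had in mind for that isolation step: run the same downstream logic once from a cheap pre-generated local source and once from the real connector, then compare sustained throughput. Everything in the sketch below is my own assumption rather than anything from the benchmark repo: the paths and schema are placeholders, and I left the Kafka line commented out because I would take the exact connector arguments from the Pathway docs.

```python
import pathway as pw


class EventSchema(pw.Schema):
    key: str
    value: int


# Run A: feed the pipeline from a pre-generated local CSV directory so the
# data source is effectively free and cannot be the bottleneck.
events = pw.io.csv.read("./bench_input/", schema=EventSchema, mode="streaming")

# Run B (swapped in manually for a second run): the real connector, with
# everything downstream left untouched.
# events = pw.io.kafka.read(...)  # exact arguments per the Pathway docs

# Placeholder for the real downstream logic; kept identical across both runs.
filtered = events.filter(events.value > 0)

pw.io.csv.write(filtered, "./bench_output.csv")
pw.run()
```

If both runs level off at roughly the same throughput, I would read that as the engine being the limit; if the local run is much faster, I would start looking at the connector, network, or serialization side instead.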
Even a high-level description of how you reason about this internally would be very useful. Thanks again, I really appreciate the insight.
-
To add a bit more detail on what I’m actually trying to solve: A lot of the real-time systems I’ve worked with in the past (Flink, Spark Structured Streaming, Kafka Streams, and a few custom engines) tend to hit performance ceilings in completely different ways. Sometimes the engine saturates CPU linearly, sometimes GC becomes the cliff, sometimes the connector layer becomes the dominant cost, and sometimes the state backend or shuffle layer is the real bottleneck. The tricky part is that each framework “fails” differently, so you have to learn how to read the signals. What I’m trying to understand with Pathway is what those early warning signs usually look like.
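For context, the kind of earlier, non-CPU signal I usually end up exporting looks roughly like the sketch below. The metric name and the measure_lag_seconds() helper are placeholders I made up, not anything Pathway exposes as far as I know; the point is only the shape of the signal an HPA could consume through the custom metrics path.

```python
# Rough sketch: export end-to-end event lag with prometheus_client so that an
# HPA can scale on it instead of (or in addition to) raw CPU.
import time

from prometheus_client import Gauge, start_http_server

EVENT_LAG = Gauge(
    "pipeline_event_lag_seconds",
    "Age of the newest processed event relative to wall-clock time",
)


def measure_lag_seconds() -> float:
    """Placeholder: in a real setup this would compare the event time of the
    latest processed record against time.time()."""
    return 0.0


if __name__ == "__main__":
    start_http_server(9100)  # scraped by Prometheus, then fed to the HPA
    while True:
        EVENT_LAG.set(measure_lag_seconds())
        time.sleep(5)
```

A signal like this tends to move earlier than CPU when the failure mode is a sharp latency cliff, which is exactly the case where CPU-based scaling reacts too late.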
These differences matter a lot when building autoscaling logic. If the engine fails “softly” (gradual slowdown), a CPU-based HPA is often enough. If it fails “hard” (sharp latency cliffs), you need earlier custom metrics and more conservative thresholds.
So the heart of my question is: when Pathway starts approaching its limits, what does that typically look like in your experience? I’m trying to build the kind of mental model that lets me anticipate the bottleneck before it becomes a problem, instead of reacting after it’s already visible in the logs. Any insight along those lines, even rough intuition, would be extremely helpful. Thank you.
-
Hey @nholuongut, I'll start responding to the questions gradually. Please don't hesitate to ask for clarification if you need more details.
Don't hesitate to follow up on any of these points or to describe your case in more detail if you think it's necessary.
-
Hi Pathway team,
I’ve been going through the benchmarking repo to understand how Pathway behaves under different workloads compared to Spark, Flink, and Kafka Streams. The examples (PageRank iterative runs and the streaming wordcount) were helpful for getting a sense of the engine’s characteristics, but I’m trying to bridge the gap between these controlled benchmarks and what happens in an actual production cluster.
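For reference, when I say “the streaming wordcount”, the pipeline shape I have in mind is roughly the one below. This is my own reconstruction from the Pathway documentation rather than the repo’s actual benchmark code, and the input path and schema are placeholders.

```python
import pathway as pw


class WordSchema(pw.Schema):
    word: str


# Placeholder input: a directory of CSV files with a single "word" column,
# watched in streaming mode.
words = pw.io.csv.read("./words_input/", schema=WordSchema, mode="streaming")

# Keyed aggregation: count occurrences per word as new data arrives.
counts = words.groupby(words.word).reduce(
    words.word,
    count=pw.reducers.count(),
)

pw.io.csv.write(counts, "./word_counts.csv")
pw.run()
```

Having the shape written down helps me map the repo’s throughput numbers onto the keyed-aggregation workloads I actually run, so please correct me if the benchmark differs meaningfully from this.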
To be more specific, I’m trying to answer a few practical questions before I lock down the sizing strategy for a real-time setup:
When interpreting the benchmark results, is there a set of “anchor points” you usually rely on — for example, CPU saturation vs memory pressure vs I/O boundaries — to decide whether a workload is compute-bound or memory-bound in Pathway?
In your experience, do the micro-batch-like patterns in benchmark workloads translate cleanly into real deployments, or do real pipelines usually introduce variance (spiky ingestion, uneven key distribution, backpressure) that shifts the scaling behavior? The load-generator sketch after the last question below shows the kind of variance I have in mind.
For long-running jobs (like iterative graph workloads), is there a recommended way to map the benchmark numbers to actual pod sizes or node pool types in Kubernetes? I’m not looking for exact formulas, just how you normally reason about the translation.
Have you seen consistent differences in how Pathway scales horizontally vs vertically when compared to something like Flink? I’m trying to understand whether Pathway prefers “many moderate pods” or “fewer bigger pods” under real-world load.
Finally, do you have any internal heuristics for interpreting the latency/throughput trade-offs in this repo? For example, when is it worth spending more CPU to shave latency vs when throughput is the dominant factor?
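To illustrate what I mean by spiky ingestion and uneven key distribution a couple of questions above, the snippet below is the kind of synthetic load I would use to probe it: Zipf-skewed keys plus alternating quiet and bursty seconds, written as JSON Lines that a streaming file connector could pick up. Everything here is made up for illustration; none of it comes from the benchmark repo.

```python
# Hypothetical generator for skewed, bursty load: most events hit a handful of
# hot keys, and the arrival rate alternates between quiet and spiky seconds.
import json
import random
import time
from pathlib import Path

import numpy as np


def run_generator(out_dir: str, duration_s: int = 60) -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    rng = np.random.default_rng()
    with (out / "events.jsonl").open("a") as f:
        end = time.time() + duration_s
        while time.time() < end:
            # Mostly ~200 events/s, with occasional 5000-event bursts.
            rate = random.choice([200, 200, 200, 5000])
            for _ in range(rate):
                key = int(rng.zipf(a=1.3))  # heavy skew toward key 1
                f.write(json.dumps({"key": f"user_{key}", "ts": time.time()}) + "\n")
            f.flush()
            time.sleep(1)


if __name__ == "__main__":
    run_generator("./spiky_input", duration_s=60)
```

Running the same pipeline against this directory and against a smoothed, uniform-key version of it is how I would check whether the scaling behavior seen in the benchmarks carries over.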
The reason I’m digging into this is that I’m preparing a Pathway-based pipeline to run in a multi-pool Kubernetes cluster (compute-optimized vs memory-optimized nodes), and I’d like to start from patterns that are known to behave well instead of discovering limits the hard way.
Any insights on how you interpret these benchmarks when designing real deployments would be extremely helpful.
Thanks for the great work on the repo — it’s been very valuable for getting a feel for the engine.
Best Regards,
Nho Luong