Hi, I am facing an issue when running metaflow_ray (JobSet) through Argo Workflows. The Template.retry_strategy() method only adds the retryStrategy payload when times > 0. This means that when total_retries is 0, no retryStrategy is set and {{retries}} is not resolved (it appears verbatim in the JobSet name in the error below), which breaks the JobSet template. Does anyone have any idea? Should the condition be >= 0?
metaflow/metaflow/plugins/argo/argo_workflows.py
Lines 4218 to 4225 in 9b98f32
def retry_strategy(self, times, minutes_between_retries):
    if times > 0:
        self.payload["retryStrategy"] = {
            "retryPolicy": "Always",
            "limit": times,
            "backoff": {"duration": "%sm" % minutes_between_retries},
        }
    return self
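To make the question concrete, here is a minimal sketch of the change I have in mind. The Template class below is a hypothetical, stripped-down stand-in (only payload and retry_strategy are modeled), not the real class from argo_workflows.py. Whether Argo Workflows actually resolves {{retries}} when limit is 0 is an assumption I have not verified.

```python
class Template:
    """Hypothetical stand-in for the Template builder; not the real Metaflow class."""

    def __init__(self, name):
        self.payload = {"name": name}

    def retry_strategy(self, times, minutes_between_retries):
        # Proposed change: use >= 0 so that retryStrategy is also emitted
        # when times == 0. Assumption: Argo then defines {{retries}}.
        if times >= 0:
            self.payload["retryStrategy"] = {
                "retryPolicy": "Always",
                "limit": times,
                "backoff": {"duration": "%sm" % minutes_between_retries},
            }
        return self

    def to_json(self):
        return self.payload


# With times=0 the payload now carries a retryStrategy with limit 0.
zero_retries = (
    Template("execute")
    .retry_strategy(times=0, minutes_between_retries=2)
    .to_json()
)
print(zero_retries)
```

With the current `times > 0` check, the same call would leave the payload without any retryStrategy key.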
JobSet version: 0.9.1
Argo Workflows: 3.5.8
metaflow_ray: 0.1.4
Metaflow: 3.18.9
Here is my error:
time="2025-10-07T05:54:16 UTC" level=info msg="kubectl create -f /tmp/manifest.yaml -o json"
The JobSet "js-82bf73{{retries}}" is invalid: metadata.name: Invalid value: "js-82bf73{{retries}}": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')
time="2025-10-07T05:54:17 UTC" level=info msg="sub-process exited" argo=true error="<nil>"
Error: exit status 1
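The rejection can be reproduced locally with the regex quoted in the error message. This is only a sketch: is_valid_jobset_name is a hypothetical helper name, and the "resolved" name assumes that {{retries}} would be substituted with 0 on the first attempt.

```python
import re

# Regex copied from the validation error above (lowercase RFC 1123 subdomain).
RFC1123_SUBDOMAIN = re.compile(
    r"[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*"
)


def is_valid_jobset_name(name):
    # Hypothetical helper: Kubernetes also limits subdomain names to 253 characters.
    return len(name) <= 253 and RFC1123_SUBDOMAIN.fullmatch(name) is not None


unresolved = "js-82bf73{{retries}}"
# Assumption: this is what the name would look like if the placeholder resolved to 0.
resolved = unresolved.replace("{{retries}}", "0")

print(unresolved, is_valid_jobset_name(unresolved))
print(resolved, is_valid_jobset_name(resolved))
```

The braces in the unresolved placeholder are what make the name fail validation, so the JobSet is never created.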
Here is my code:
from metaflow import FlowSpec, project, step, load_config, kubernetes, retry, pip, metaflow_ray


@load_config
@project(name="retryflow")
class RetryFlow(FlowSpec):
    def _do_ray_job(self):
        import ray
        import time
        from counter import Counter

        ray.init()
        memory = ray.cluster_resources().get("memory")
        print("memory: %sGB" % (round(int(memory) / (1024 * 1024 * 1024), 2)))
        c = Counter.remote()
        for _ in range(10):
            time.sleep(1)
            c.incr.remote(1)
            print(ray.get(c.get.remote()))

    @step
    def start(self):
        self.next(self.execute, num_parallel=2)

    @retry(times=0)
    @kubernetes
    @metaflow_ray
    @pip(libraries={"ray": "2.49.1", "metaflow-ray": "0.1.4"})
    @step
    def execute(self):
        self._do_ray_job()
        self.next(self.join)

    @step
    def join(self, inputs):
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    RetryFlow()