Issue when using Jobset with @retry(times=0) #2632

@zylim-ml

Hi, I am facing an issue when running metaflow_ray (JobSet) through Argo Workflows. The Template.retry_strategy() method only adds the retryStrategy payload when times > 0. This means that when total_retries is 0, {{retries}} becomes None, which can cause issues in the JobSet template. Does anyone have any idea how to handle this? Should the condition be >= 0 instead?

def retry_strategy(self, times, minutes_between_retries):
    if times > 0:
        self.payload["retryStrategy"] = {
            "retryPolicy": "Always",
            "limit": times,
            "backoff": {"duration": "%sm" % minutes_between_retries},
        }
    return self
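
To make the behaviour concrete, here is a minimal sketch of what I think is happening. The Template class below is a hypothetical stand-in that only models the payload dict (it is not Metaflow's actual class); the retry_strategy body is copied from the snippet above. With times=0 the payload ends up without a retryStrategy key, and my assumption is that Argo then leaves {{retries}} unsubstituted.

```python
class Template:
    """Hypothetical stand-in for the Argo template builder; only the payload is modeled."""

    def __init__(self, name):
        self.payload = {"name": name}

    def retry_strategy(self, times, minutes_between_retries):
        # Same condition as in the snippet above: nothing is added when times == 0.
        if times > 0:
            self.payload["retryStrategy"] = {
                "retryPolicy": "Always",
                "limit": times,
                "backoff": {"duration": "%sm" % minutes_between_retries},
            }
        return self


# "execute" and the 2-minute backoff are illustrative values only.
no_retry = Template("execute").retry_strategy(times=0, minutes_between_retries=2)
one_retry = Template("execute").retry_strategy(times=1, minutes_between_retries=2)

print("retryStrategy" in no_retry.payload)   # False
print("retryStrategy" in one_retry.payload)  # True
```

If that assumption holds, changing the condition to >= 0 would emit a retryStrategy with limit 0, but I have not verified how Argo treats that.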

JobSet version: 0.9.1
Argo Workflows: 3.5.8
metaflow_ray: 0.1.4
Metaflow: 3.18.9

Here is my error

time="2025-10-07T05:54:16 UTC" level=info msg="kubectl create -f /tmp/manifest.yaml -o json"
The JobSet "js-82bf73{{retries}}" is invalid: metadata.name: Invalid value: "js-82bf73{{retries}}": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')
time="2025-10-07T05:54:17 UTC" level=info msg="sub-process exited" argo=true error="<nil>"
Error: exit status 1
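
For reference, the name check from the error can be reproduced locally with the regex quoted in the message. This is only a sketch: it applies the regex and nothing else (no length limits), and the second name is a hypothetical example of what the JobSet name might look like if {{retries}} had been resolved to 0.

```python
import re

# Regex copied from the validation error message above.
RFC1123_SUBDOMAIN = re.compile(
    r"[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*"
)


def is_valid_name(name):
    """Return True if the whole name matches the RFC 1123 subdomain regex."""
    return RFC1123_SUBDOMAIN.fullmatch(name) is not None


# Name from the error: the placeholder was never substituted.
print(is_valid_name("js-82bf73{{retries}}"))  # False
# Hypothetical name with the placeholder resolved to 0.
print(is_valid_name("js-82bf730"))  # True
```

So the braces in the unresolved placeholder are what make the name invalid.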

Here is my code

from metaflow import FlowSpec, project, step, load_config, kubernetes, retry, pip, metaflow_ray

@load_config
@project(name="retryflow")
class RetryFlow(FlowSpec):

    def _do_ray_job(self):
        import ray
        import time
        from counter import Counter

        ray.init()

        memory = ray.cluster_resources().get("memory")
        print("memory: %sGB" % (round(int(memory) / (1024 * 1024 * 1024), 2)))

        c = Counter.remote()

        for _ in range(10):
            time.sleep(1)
            c.incr.remote(1)

        print(ray.get(c.get.remote()))

    @step
    def start(self):
        self.next(self.execute, num_parallel=2)
        
    @retry(times=0)
    @kubernetes
    @metaflow_ray
    @pip(libraries={"ray": "2.49.1", "metaflow-ray": "0.1.4"})
    @step
    def execute(self):
        self._do_ray_job()
        self.next(self.join)

    @step
    def join(self, inputs):
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    RetryFlow()
