[k8s] parameterize wait timeout in k8s pipe client #32689
Summary & Motivation
After setting timeouts in my k8s job spec, they aren't respected on the Dagster wait side: the k8s pipe client will always kill jobs at 24 hours, with no way to override or modify this behavior, because it inherits the default of the underlying wait API. Debatably, the timeout should either be hard-set to 0 or be overridable with a default of 0. The number of toggles we have for workload timeouts is already confusing, and this layer in Dagster adds to that confusion. I think everyone will agree that hard-killing a job after 24 hours, without explicitly notifying the user ahead of time or giving them a way to override it, is confusing, especially to new users trying out Pipes for the first time.
This change preserves the current default behavior but makes it overridable.
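For illustration, here is roughly how an asset might use the overridable timeout. This is a sketch: the `wait_timeout` argument is the proposed parameter, so its name and semantics are assumptions, not the shipped API.

```python
from dagster import AssetExecutionContext, asset
from dagster_k8s import PipesK8sClient


@asset
def k8s_pipes_asset(context: AssetExecutionContext, k8s_pipes_client: PipesK8sClient):
    return k8s_pipes_client.run(
        context=context,
        image="example-registry/pipes-job:latest",  # hypothetical image
        # Assumed new parameter: cap Dagster's wait at 6h instead of the
        # hardcoded 24h; the name and default are exactly what's up for
        # discussion in this PR.
        wait_timeout=6 * 60 * 60,
    ).get_materialize_result()
```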
How I Tested These Changes
TBD. I first want to gather feedback on what we'd like the default behavior to be. In fact, I think we should parameterize all Dagster-side wait timeouts (launch_timeout, terminate_timeout, and poll_rate), perhaps as a single TimeoutConfig parameter. On our local deployment, I inherited the client and overrode the run method with an extra parameter (see the sketch below); it seems to work well.
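That workaround looked roughly like the following minimal sketch. All names here (`wait_timeout`, the way the value is threaded through) are illustrative of our fork, not the upstream interface; in particular, how the effective timeout reaches the underlying k8s wait call is fork-specific and only noted in comments.

```python
from typing import Any, Optional

from dagster_k8s import PipesK8sClient

# The library's current hardcoded wait ceiling, per the description above.
DEFAULT_WAIT_TIMEOUT_SECONDS = 24 * 60 * 60


class TimeoutAwarePipesK8sClient(PipesK8sClient):
    """Sketch of the local workaround: inherit the pipes client and expose
    an overridable wait timeout instead of the fixed 24h default."""

    def __init__(self, *, wait_timeout: float = DEFAULT_WAIT_TIMEOUT_SECONDS, **kwargs: Any):
        super().__init__(**kwargs)
        self.wait_timeout = wait_timeout

    def run(self, *, wait_timeout: Optional[float] = None, **kwargs: Any):
        # Per-call override wins over the client-level setting. In our fork,
        # the effective value is forwarded to the underlying k8s wait call
        # (the one that currently defaults to 24h); upstream, run() itself
        # would grow an equivalent parameter.
        self._effective_wait_timeout = (
            wait_timeout if wait_timeout is not None else self.wait_timeout
        )
        return super().run(**kwargs)
```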
Changelog