checkpointer: add 5m grace period flag.

diegs · diegs · commit 2ab502e5d643 · 2018-02-15T16:31:13.000-08:00
When nodes reboot, as in the TestReboot e2e test case, the cluster can take a while to stabilize because of the dependency chain between the apiserver, flannel, the controller manager, and so on. If the controller manager was in the middle of an operation (e.g. rolling the apiserver) when the reboot occurred, we need to ensure that it becomes healthy again. That requires keeping the checkpointed apiserver up, so the checkpointer is now given a 5m grace period (--checkpoint-grace-period=5m).

The downside is that pods may run considerably longer than they ought to. However, this is a failure recovery scenario, and running an old pod is not a huge violation of k8s semantics (daemonsets strive for 1-at-a-time semantics but don't guarantee it). This should alleviate the flakes observed in #824.
diff --git a/pkg/asset/internal/templates.go b/pkg/asset/internal/templates.go
@@ -293,6 +293,7 @@ spec:
         - /checkpoint
         - --lock-file=/var/run/lock/pod-checkpointer.lock
         - --kubeconfig=/etc/checkpointer/kubeconfig
+        - --checkpoint-grace-period=5m
         env:
         - name: NODE_NAME
           valueFrom: