Konubinix' opinionated web of thoughts

K8s Auto Scaling and CI

Fleeting

k8s, auto scaling and continuous integration

We started using karpenter and keda to try to dynamically scale our testing cluster. We want to use only one replica per deployment and let karpenter provide extra compute resources when we roll out a new release for testing.

This led to some strange behavior.

When installing the version to be tested,

  1. we run argocd app get --hard-refresh and argocd app wait,
  2. then argocd performs a rollout of the deployment to be updated,
  3. during this rollout, a new pod is scheduled, but has no room in the current set of nodes,
  4. karpenter finds out that it needs to provision a new node and then does so,
  5. the rollout ends, therefore the old pod on the old node is terminated,
  6. argocd says that everything is ok,
  7. the tests start
  8. during the tests, karpenter starts its consolidation phase; it finds out that there is enough room without the new node, so it starts draining it,
  9. the rolled-out pod is then killed,
  10. some tests fail because of the downtime,
  11. a new pod is created on the old node
  12. the last tests pass
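The first step above roughly corresponds to commands like the following (a sketch; the application name and test script are illustrative, not taken from the actual pipeline):

```shell
#!/bin/sh
# Force ArgoCD to re-read the desired state (hard refresh),
# then wait until the application is synced and healthy.
argocd app get my-app --hard-refresh
argocd app wait my-app --sync --health

# At this point argocd reports that everything is ok and the
# tests start, but karpenter may still consolidate nodes
# underneath them.
./run-tests.sh
```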

It would be much nicer if the deployment whose pod is about to be stopped were rolled out, instead of the pod simply being killed.

We are in a kind of grey area of responsibility.

  1. karpenter says that it is not its purpose to have knowledge about the resources, like whether the pods in the drained node are part of a deployment that could be rolled out. It simply asks for the node to be drained.
  2. the k8s eviction process does the same: it kindly asks the pod to stop and then kills it if need be,
  3. they both suggest using a PodDisruptionBudget to indicate whether the pod should be left alone or not,

Actually, PodDisruptionBudget is kind of a false solution. Either we set minAvailable=1, and then the node can never be drained and the consolidation phase can never complete, or we set minAvailable=0 and accept the downtime.
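For reference, such a PodDisruptionBudget would look like this (a sketch; the name and labels are illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  # With a single replica, minAvailable: 1 means no voluntary
  # disruption is ever allowed: the node hosting the pod can
  # never be drained, so consolidation never completes.
  minAvailable: 1
  selector:
    matchLabels:
      app: my-app
```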

We could increase the replicas to at least 2, but that would double the resource usage even in the idle case, a useless waste of resources for testing.

Draining a node without downtime appears to be quite a common need, and there exist some workarounds to do this1. Also, some discussions have taken place on the karpenter side2, unfortunately without much follow-up so far.

The workaround ideas we have found so far, from the one I like best to the one I dislike most, are:

  1. add a PodDisruptionBudget with minAvailable=1 before running the tests and remove it afterwards. It still wastes a bit of resources during the tests, but only during that time,
    1. same, but with a custom operator that detects the following situation and rolls out the matching deployments:
      • a tainted node
      • containing the only pod
      • of a deployment with replica 1
      • and with a matching PodDisruptionBudget
      • that has minAvailable>0 or maxUnavailable<1
  2. disable karpenter and increase the node size so that everything fits into it: in the worst case, all the deployments would need a rollout at the same time, and therefore the number of pods would be at least twice as big.
    • either we provision a node twice as big as needed: a waste of resources,
    • or we provision a smaller one, still bigger than what we need, to leave room for rollouts. In that case we would have to wait for the rollouts to happen one after another.
  3. make consolidation wait for some time so that the tests can run,
  4. increase the replicas to 2 only during the tests, wasting the resources only during that time: not very practical, as this is managed with gitops and would require one commit before the tests and another after them.
  5. start from scratch before running the tests: it would take a long time to delete and then recreate all the resources,
  6. use a preStop hook that triggers a rollout: not practical, as we would need to give pods extra control over the cluster,
  7. delete all the resources after the tests: not practical, as keeping the testing cluster alive between tests is very useful for:
    1. finding out what happens,
    2. one shot tests,
    3. staying closer to what is done in production, with blue-green using gitops.
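Option 1 could be scripted in the CI pipeline roughly like this (a sketch; the names and the test script are illustrative):

```shell
#!/bin/sh
# Protect the single-replica deployment while the tests run:
# neither karpenter's consolidation nor a regular drain will
# evict a pod whose PodDisruptionBudget would be violated.
kubectl apply -f - <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-test-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: my-app
EOF

./run-tests.sh

# Remove the protection so that consolidation can proceed again.
kubectl delete pdb my-app-test-pdb
```

The extra resources are thus only held hostage for the duration of the tests, matching the trade-off described in option 1.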

I like option 1.1 a lot, because it does exactly what I want, but it requires some work. The second option is also nice, because the whole issue comes from the use of a tool that we may actually not need that badly, so keep it simple and remove it.