Intro

Imagine that you have a large computation to perform, and once it is done, you want Kubernetes to stop the Pods automatically. Simply put, we are talking about running Pods temporarily, until a Job completes.
This blog post is about the Job resource in Kubernetes, which is part of the batch API and can be thought of as another orchestration layer around Pod resources. We will examine Kubernetes Jobs and the problems that occur when you run them with sidecars.

The Setup

Let's imagine you have a Pod with two containers in it: one that runs the Job's workload and then terminates, and another that is not designed to ever exit on its own but provides some supporting functionality, such as log or metrics collection.
What options does Kubernetes provide for doing something like this?
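
To make this concrete, here is a minimal sketch of such a Job. The name, images, and commands are just stand-ins (busybox containers pretending to be a batch workload and a log-collecting sidecar), but the shape is the one we care about:

apiVersion: batch/v1
kind: Job
metadata:
  name: report-job                  # illustrative name
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: main                # does its work and then exits
          image: busybox
          command: ["sh", "-c", "echo 'crunching numbers...'; sleep 5"]
        - name: sidecar             # stand-in for a logging/metrics/proxy sidecar: never exits
          image: busybox
          command: ["sh", "-c", "while true; do sleep 1; done"]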

The Event

The problem arises when your Job Pods contain, alongside the main Job container, one or more sidecar containers that the main container uses, communicates with, or relies on in some fashion.
This setup is becoming more and more common, especially as tools like Envoy and Istio gain adoption. With it, the lack of first-class support for sidecar containers causes quite a few issues.

The Challenge

Jobs have been GA in Kubernetes for over four years, and there is a large GitHub issue, titled "Better support for sidecar containers in batch jobs", that has been open for over three years now.
Sidecar containers do not play well with Kubernetes Jobs. The problem manifests as soon as you run this kind of setup: the Job keeps running for as long as the sidecar is running.

Sidecar awareness would make managing this edge case much more pleasant: the sidecar would need to watch the main Job process and exit gracefully once the Job has run to completion.

The Root Cause

In Kubernetes, a Pod is not marked as complete until all of its containers have terminated. So the main container of your Job Pod finishes its work and exits, but the sidecar keeps running, and you end up with Job Pods, and Jobs, that never complete.
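
You can see this for yourself on a live cluster. The commands below assume the hypothetical report-job sketch from earlier; the job-name label is added to the Pod automatically by the Job controller:

# The Job never reports success, because its Pod never reaches the Completed state:
kubectl get job report-job
kubectl get pods -l job-name=report-job

# describe shows the main container as Terminated while the sidecar is still Running:
kubectl describe pod -l job-name=report-job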

The Fix

There are a variety of workarounds for this issue, and the most popular one relies on communication through a shared volume.


An example of this workaround, where the envoy sidecar watches the shared volume for a marker file and shuts itself down once the main Job container has terminated:

containers:
  - name: main
    image: <your job image>
    command: ["/bin/bash", "-c"]
    args:
      - |
        # On exit (success or failure), drop a marker file onto the shared volume
        trap "touch /tmp/pod/main-job-terminated" EXIT
        /your-batch-job/bin/main --config=/config/your-job-config.yaml
    volumeMounts:
      - mountPath: /tmp/pod
        name: tmp-pod
  - name: envoy
    image: <your envoy image>
    command: ["/bin/bash", "-c"]
    args:
      - |
        # Start envoy in the background and remember its PID
        /usr/local/bin/envoy --config-path=/your-batch-job/etc/envoy.json &
        CHILD_PID=$!
        # Poll for the marker file once a second and kill envoy as soon as it appears
        (while true; do if [[ -f "/tmp/pod/main-job-terminated" ]]; then kill $CHILD_PID; fi; sleep 1; done) &
        wait $CHILD_PID
        # If envoy was stopped because the main job finished, report success
        if [[ -f "/tmp/pod/main-job-terminated" ]]; then exit 0; fi
    volumeMounts:
      - mountPath: /tmp/pod
        name: tmp-pod
        readOnly: true
volumes:
  # Shared scratch volume used only for the termination marker file
  - name: tmp-pod
    emptyDir: {}

What this boils down to is that the main Job container and the sidecar container mount a shared volume. On completion, the main container writes a marker file to that volume. The sidecar, meanwhile, keeps polling for the existence of that file, and as soon as it appears it knows that the main Job container has finished and that it should shut down as well.
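
For completeness, this is roughly where that snippet sits inside a full Job manifest. Only the skeleton is sketched here; the name and backoffLimit value are illustrative, and the containers and volumes sections are the ones from the snippet above:

apiVersion: batch/v1
kind: Job
metadata:
  name: main-job-with-envoy        # illustrative name
spec:
  backoffLimit: 2                  # illustrative; how many failed Pods to tolerate before giving up
  template:
    spec:
      restartPolicy: Never         # Jobs only accept Never or OnFailure here
      # the containers: and volumes: sections from the snippet above go here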

Conclusion

We have seen that the restart policy is applied to the Pod, not to the Job. By design, Kubernetes' main role is to run a Pod to successful completion. The control plane cannot judge from the exit code alone whether a Job failure was expected and should be retried, or whether it was not and retries should stop. It is the user's responsibility to make that distinction based on the output Kubernetes provides.
We have also learnt that the OnFailure restart policy means the container will only be restarted if something goes wrong, while the Never policy means the container will not be restarted regardless of why it exited.
Finally, we've seen that you can protect your Kubernetes cluster from human error by leveraging Kalc's AI-first solution. This solution enables you to replicate your current cluster environment in AI. Our new tool, kubectl-val, allows your developers to run autonomous checks and config validations, helping you minimize outages in your Kubernetes cluster.
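In manifest terms, that distinction lives on the Pod template rather than on the Job itself, roughly like this:

spec:
  template:
    spec:
      # OnFailure: the kubelet restarts a failed container in place, inside the same Pod
      # Never:     a failed container is not restarted; the Job controller creates a replacement Pod (up to backoffLimit)
      restartPolicy: Never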