Blue Matador is a platform that monitors your AWS infrastructure and compute resources, learns your baselines, manages thresholds, and sends actionable alerts. Think of it as a check engine light for your public cloud infrastructure: it effortlessly keeps a pulse on everything in your cloud environment.

Handling issues in production is never easy, especially in a Kubernetes cluster. Blue Matador recently faced issues in their production Kubernetes cluster where nodes were running out of memory (OOM). They recovered quickly, and in this blog we'll look at what happened to the cluster, what the impact was, and how they recovered.

The Setup

Blue Matador connects to AWS accounts using read-only credentials, optionally installs agents on clients' servers, and integrates with notification providers such as Slack or PagerDuty. It takes about 15 to 20 minutes to get thousands of events monitored automatically.

Blue Matador monitors its own Kubernetes cluster with its own product, and recently they hit out-of-memory (OOM) issues twice in their production Kubernetes cluster. The first occurrence was at 5:12 pm on Saturday, June 15, 2019, when an alert was generated for one of the Kubernetes nodes reporting a SystemOOM event.
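A SystemOOM event is what the kubelet records when the node's kernel OOM killer fires. If you want to look for these yourself, something along these lines works (a minimal sketch; the node name is a placeholder):

      # List node-level OOM events across the cluster
      kubectl get events --all-namespaces --field-selector reason=SystemOOM

      # Inspect a specific node for memory pressure conditions and recent events
      kubectl describe node <node-name>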

The Event

At 5:16 pm on the same day, Blue Matador generated another warning about low EBS Burst Balance on the root volume of the node where the SystemOOM event had occurred. Although the Burst Balance warning arrived after the SystemOOM event, CloudWatch showed the Burst Balance dipping at 5:02 pm, because EBS metrics consistently lag by 10-15 minutes.
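For reference, the same Burst Balance data can be pulled straight from CloudWatch with the AWS CLI; a rough sketch, where the volume ID and time window are placeholders:

      # BurstBalance for a gp2 volume over a three-hour window (placeholder values)
      aws cloudwatch get-metric-statistics \
        --namespace AWS/EBS \
        --metric-name BurstBalance \
        --dimensions Name=VolumeId,Value=vol-0123456789abcdef0 \
        --start-time 2019-06-15T21:00:00Z \
        --end-time 2019-06-16T00:00:00Z \
        --period 300 \
        --statistics Average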

At 5:18 pm, one of Blue Matador's engineers ran a quick kubectl get pods to gauge the impact and was surprised to see that no application pods had been killed. He then used kubectl top nodes to check for memory issues; the node had already recovered, and memory usage at that point was about 60 percent.
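The triage itself is just standard kubectl; roughly (these are the generic forms of the commands, nothing specific to Blue Matador's cluster):

      # List pods to check for restarts or evictions
      kubectl get pods --all-namespaces

      # Check current memory and CPU usage per node (requires metrics-server or heapster)
      kubectl top nodes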

The second occurrence happened at 6:02 pm on Sunday, June 16, 2019, when Blue Matador generated an alert about a different node with a SystemOOM event. Everything looked fine: once again, none of the pods had been killed, and memory usage was below 70 percent.

At 6:06 pm, Blue Matador once again generated a warning about EBS Burst Balance, the second time the same issue had occurred. CloudWatch showed the same pattern as before, with EBS Burst Balance starting to decline about 2.5 hours before the issue occurred.

The Root Cause

The engineer logged into Datadog to look at memory consumption. He noticed high memory usage just before each SystemOOM event and quickly tracked it down to their fluentd-sumologic pods.

Memory usage dropped sharply right around the time the SystemOOM events occurred, so he concluded that these pods were the ones consuming all of the memory when SystemOOM happened. Kubernetes had done its job beautifully, killing the pods and reclaiming the memory without affecting the other pods.
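One way to confirm whether a container was terminated for memory is to check its last state; a hedged example, with a placeholder pod name:

      # Show the last terminated state of a pod's containers (pod name is a placeholder)
      kubectl describe pod fluentd-sumologic-xxxxx | grep -A 5 "Last State"

      # A memory-killed container reports something like:
      #   Last State:  Terminated
      #     Reason:    OOMKilled
      #     Exit Code: 137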

The Fix

On Monday, the engineer dedicated some time to the issue and found that:

  • The fluentd-sumologic containers were consuming a lot of memory
  • A lot of disk activity occurred just before each SystemOOM event (a quick way to confirm this is shown after this list)
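The disk-activity finding can be sanity-checked on the node itself; a minimal sketch using iostat (the interval and report count are arbitrary):

      # Extended per-device I/O statistics, sampled every 5 seconds, 3 reports
      iostat -x 5 3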

Some Googling led the engineer to a GitHub issue suggesting that a Ruby garbage-collection setting could be used to keep memory usage lower. He added the environment variable to the pod spec and deployed it:

      env:
        - name: RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR
          value: "0.9"

While looking at the fluentd-sumologic manifest, he also noticed that no resource requests or limits were specified. Using kubectl top pods | grep fluentd-sumologic, he got a good idea of how much to request.
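The post doesn't give the exact values they chose, but a resources block for the fluentd-sumologic container would look roughly like this (the numbers are illustrative, not Blue Matador's actual settings):

      resources:
        requests:
          memory: "512Mi"
          cpu: "100m"
        limits:
          memory: "1Gi"
          cpu: "250m"

With a memory limit in place, a runaway container gets OOM-killed on its own instead of exhausting the whole node.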

With the fix in place, memory usage was stable in their production setup, and he came away appreciating the importance of setting resource requests and limits for every pod they run.

Conclusion

Config issues in production environments are never easy, especially with Kubernetes clusters. Kubernetes node OOM can happen at any time, and the root cause often turns out to be something trivial.

Another way to prevent issues like this is to use the Kalc Kubernetes config validator. With Kalc, you can minimize Kubernetes issues by running autonomous checks and config validations and by predicting risks before they affect the production environment. The AI-first Kubernetes Guard navigates a model of possible actions and events that could lead to an outage, modeling the outcomes of changes and applying a growing knowledge base of config scenarios.