Moonlight is a professional community of software developers and designers where you can find and work with quality candidates based on their experience and location.

Moonlight hosts its applications on Kubernetes. A cluster groups several servers (nodes) together, and copies of each application run on those nodes as pods. The Kubernetes scheduler is responsible for placing pods onto nodes and dynamically decides which pods run on which nodes.
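
A minimal, illustrative Deployment makes the idea concrete (this is not Moonlight's actual configuration; the names and image are placeholders): the Deployment asks for three copies of an application, and the scheduler decides which node each resulting pod lands on.

    # Illustrative sketch only -- names and image are hypothetical.
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: example-api
    spec:
      replicas: 3                       # three copies (pods) of the application
      selector:
        matchLabels:
          app: example-api
      template:
        metadata:
          labels:
            app: example-api
        spec:
          containers:
          - name: api
            image: example/api:latest   # placeholder image
            # The scheduler, not this manifest, decides which node
            # each of the three pods runs on.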


The Setup

From January 18 to 22, 2019, Moonlight’s website suffered intermittent outages of its API. Moonlight runs its services on Google’s cloud platform using a managed Kubernetes cluster. The trouble started as occasional connectivity errors with a Redis database, which Moonlight’s API uses to authenticate users and establish valid sessions. The issue was eventually resolved, but here we’ll look at why the website went down and what caused it.

The Challenge

Moonlight’s developers noticed that their Kubernetes monitoring was reporting unresponsive nodes and pods. To get to the bottom of it, they contacted their service provider, Google, which was reporting networking service disruptions at the time. Google Cloud Support quickly escalated the case to its support engineering team, and Moonlight’s developers assumed the network disruption was the root cause.

The Event

On the morning of Tuesday, January 22, 2019, the Moonlight website went fully down, with zero external traffic reaching the managed Kubernetes cluster. Moonlight also found another person on Twitter reporting a similar issue, which led them to believe that Google Kubernetes Engine (GKE) was experiencing a network outage. Moonlight contacted Google again, the issue was quickly escalated to its support engineering team, and the team identified a pattern among the nodes in Moonlight’s managed Kubernetes cluster: CPU usage was sustained at 100%, which caused the virtual machines (VMs) to kernel panic and crash.

The Root Cause

The following cycle was the root cause of the managed cluster’s outage:

  • The Kubernetes scheduler assigned multiple pods with high CPU usage to the same node.
  • Those pods consumed 100% of the CPU resources on the shared node.
  • The node experienced a kernel panic, went down for a period of time, and became unresponsive to the Kubernetes scheduler.
  • The scheduler then reassigned the crashed pods to another node, repeating the same process and compounding the effects.

At first, only the Redis database pod was affected by the crashes. Later, all pods serving the website went offline, resulting in a complete outage, and exponential back-off rules led to longer and longer periods of downtime.
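
Moonlight’s published fix was anti-affinity (covered below), but the cycle above also shows why CPU requests and limits matter in general: the request tells the scheduler how much CPU to reserve when placing a pod, and the limit caps what the container can consume on a shared node. A minimal sketch, with hypothetical names and values:

    # Illustrative sketch only -- name, image, and values are hypothetical.
    apiVersion: v1
    kind: Pod
    metadata:
      name: redis-example
    spec:
      containers:
      - name: redis
        image: redis:5          # placeholder image/tag
        resources:
          requests:
            cpu: "500m"         # scheduler reserves half a core when placing this pod
          limits:
            cpu: "1"            # container is throttled at one core on the shared node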

The Fix

To solve the identified issue, Moonlight’s developers added anti-affinity rules to all of their major deployments, which automatically spread the pods across different nodes. This increased fault tolerance and improved performance.

Moonlight runs three server nodes for fault tolerance. Its web application Deployment runs three copies, and with one copy on each node the site can tolerate two node failures. Sometimes, however, the Kubernetes scheduler placed all three pods on the same node, creating a single point of failure, and other CPU-intensive workloads, such as server-side rendering, were scheduled onto that same node rather than onto separate nodes.
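
As a sketch of what such an anti-affinity rule can look like for a three-replica web deployment (names and image are hypothetical, and this is not Moonlight’s actual manifest), the required rule below tells the scheduler not to place two pods carrying the same app label onto the same node:

    # Illustrative sketch only -- names and image are hypothetical.
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: web
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: web
      template:
        metadata:
          labels:
            app: web
        spec:
          affinity:
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchLabels:
                    app: web
                topologyKey: kubernetes.io/hostname   # at most one replica per node
          containers:
          - name: web
            image: example/web:latest                 # placeholder image

A required rule refuses to schedule a replica at all if no separate node is available; preferredDuringSchedulingIgnoredDuringExecution is the softer variant that spreads pods when it can but still schedules them when it cannot.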

To fix these kinds of issues more generally, a Kubernetes cluster needs to handle sustained high CPU usage gracefully, spreading pods across the cluster’s nodes so that the available resources are used properly. Moonlight is also continuing to work with the Google Cloud support team to identify the root cause of the server kernel panics and fix them.

This kind of problem can also be prevented by validating your Kubernetes config with Kalc before pushing changes to the cluster. Kalc offers an AI-first Kubernetes Guard that verifies and measures changes in your Kubernetes environment before they put production at risk.

Conclusion

Anti-affinity rules play an important role in making web-facing Kubernetes services more fault-tolerant. If your Kubernetes deployments serve user-facing traffic, you should add anti-affinity rules to them.

Another way to prevent such issues is to use Kalc’s Kubernetes config validator. Kalc helps minimize Kubernetes issues by running autonomous checks and config validations and by predicting risks before they affect the production environment. The AI-first Kubernetes Guard navigates a model of possible actions and events that could lead to an outage, modeling the outcomes of changes and applying a growing knowledge base of config scenarios.

Meanwhile, Moonlight continues to work with the Google Cloud support team to identify the root cause of the node kernel panics and fix them.