Intro

Datadog is a monitoring service for cloud-based workloads that offers Kubernetes insights through metrics, traces, logs, dashboards, and more. They position themselves as a Cloud Native service provider.
Through their SaaS analytics platform, Dev and Ops teams can leverage metrics and events from their servers to avoid downtime, resolve performance problems, and keep development and deployment cycles running without a hitch.

The Setup

Datadog has been working with Kubernetes for about a year and a half, and they have been running production workloads on it for about a year now. In production, they run several thousand nodes managed by Kubernetes. To give you an idea, their biggest cluster is about 2,000 nodes, and a typical cluster is about 1,000 - 1,500 nodes.

The Challenge

As you can probably imagine, managing a cluster of this size is not a walk in the park, and they have had many outages (mostly caused by human intervention).

DNS Issues

There are a lot of ways to break DNS in Kubernetes, and Datadog has faced quite a few. You see, the way Kubernetes manages DNS is that the kubelet injects a resolv.conf file into each Pod, and in it you will typically see three or more search domains:

search <namespace>.svc.cluster.local svc.cluster.local cluster.local ec2.internal
options ndots:5

(The last search domain, ec2.internal, is inherited from the host.)

I want you to imagine that you're in a Pod and you want to look up www.google.com. Since this name has fewer than 5 dots (this is what the ndots option is for), the resolver is going to append each of the search domains to it, and only then try the name as-is, like so:

  1. www.google.com.<namespace>.svc.cluster.local
  2. www.google.com.svc.cluster.local
  3. www.google.com.cluster.local
  4. www.google.com.ec2.internal
  5. www.google.com

The DNS resolver is going to go through this whole list, fail on the first 4 tries, and finally resolve the correct IP address on the 5th attempt. This is inefficient: you are making 5 queries for something very simple. It's bad on the client side because resolution is much slower, and it's also bad on the CoreDNS server because it has to handle a lot of useless queries.
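To make the mechanism concrete, here is a minimal sketch of the client-side expansion logic described above, written in Python. It assumes the resolv.conf values shown earlier and is purely illustrative; it is not how glibc or musl actually implement the search path.

# Illustrative sketch of how a stub resolver expands a name using the
# search domains and the ndots option from resolv.conf.
SEARCH_DOMAINS = [
    "<namespace>.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "ec2.internal",
]
NDOTS = 5

def candidate_queries(name):
    """Return the names the resolver will try, in order."""
    if name.endswith("."):            # already fully qualified: no expansion
        return [name]
    if name.count(".") >= NDOTS:      # "enough" dots: try the name as-is first
        return [name] + [f"{name}.{d}" for d in SEARCH_DOMAINS]
    # fewer dots than ndots: walk the search list first, then the bare name
    return [f"{name}.{d}" for d in SEARCH_DOMAINS] + [name]

for query in candidate_queries("www.google.com"):
    print(query)                      # prints the five candidate queries

Running it prints the same five candidates listed above, which is exactly the traffic pattern that makes ndots:5 expensive for external names.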

There is a pretty interesting feature in CoreDNS called autopath which allows for server-side search path completion. The way it works is that when the first query in the search path arrives (www.google.com.<namespace>.svc.cluster.local) with autopath enabled, CoreDNS strips the "<namespace>.svc.cluster.local" suffix and tries to be clever about finding the proper answer itself.
In this case, it will answer with a CNAME pointing the expanded name at www.google.com and then add the IP address as an A record, like so:

www.google.com.<namespace>.svc.cluster.local.   CNAME   www.google.com.
www.google.com.                                 A       x.x.x.x

This is much better: it's a single DNS query instead of 5, which makes resolution much faster.
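For reference, autopath is enabled in the CoreDNS Corefile by combining the autopath plugin with the kubernetes plugin. The snippet below is a minimal sketch assuming the default cluster.local zone, not Datadog's actual configuration; note that autopath needs "pods verified" so CoreDNS can map the client IP back to a Pod and its namespace.

.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods verified              # required so autopath knows the querying Pod's namespace
        fallthrough in-addr.arpa ip6.arpa
    }
    autopath @kubernetes           # server-side search path completion
    forward . /etc/resolv.conf     # send everything else to the upstream resolver
    cache 30
    prometheus :9153
}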

The Event

DNS Failure For Some Applications

One day some teams started complaining that DNS was broken and that DNS queries were timing out. Drilling into the CoreDNS metrics, they saw that the number of SERVFAIL errors from the upstream resolver was indeed very high. After investigating the issue, they found out that they were being rate limited by the upstream.
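As a rough illustration, the volume of SERVFAIL responses can be tracked with a Prometheus query along these lines (exact metric names vary by CoreDNS version; older releases expose coredns_dns_response_rcode_count_total instead, and the forward plugin keeps its own per-upstream counters):

sum(rate(coredns_dns_responses_total{rcode="SERVFAIL"}[5m]))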

The Root Cause

The number of queries had increased, and since autopath was enabled, autopath was trying to be clever by turning each of those lookups into an upstream query, and doing so without caching the responses.
So at one point they simply hit the upstream rate limit, and nothing was working anymore: CoreDNS could no longer get answers, so any resolution outside the cluster was failing.

Cause Summary

  • Autopath was enabled
  • A sudden surge in upstream queries
  • Autopath prevents caching
  • Rate limited by the upstream

The Fix

The Ops team fixed the issue by switching off autopath. With autopath disabled, DNS resolution works without errors, but the number of requests increases 3-5x, which in itself can result in higher latency and more load on the CoreDNS servers. An easier solution would be to throw money at the problem and simply run a more powerful CoreDNS deployment to eliminate the resolution failures.
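Concretely, switching autopath off amounts to removing (or commenting out) the autopath line from a Corefile like the sketch shown earlier and leaning on the cache plugin to absorb part of the extra query volume; again, this is an assumed example, not Datadog's actual configuration.

.:53 {
    errors
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods verified
        fallthrough in-addr.arpa ip6.arpa
    }
    # autopath @kubernetes         # disabled: clients walk the search path themselves
    forward . /etc/resolv.conf
    cache 30                       # caching soaks up some of the 3-5x extra queries
    prometheus :9153
}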

A third option is to use the Kalc Kubernetes simulator, which can check whether a failure condition is reachable from your current cluster setup. Entirely model-based, it does not disrupt cluster operation and produces a report with the full root-cause event chain.

You will get predicted failure results that instantly tell you whether a proposed change will lead to a DNS resolution error, helping you mitigate this and other failure modes that are lurking in your cluster, waiting to catch you flat-footed.

Conclusion

We approached this problem by first trying to understand how DNS resolution happens inside a Kubernetes cluster. We then reviewed a perfect-storm scenario in which a big enough cluster, receiving a lot of DNS queries, sees CoreDNS fail for some applications due to upstream rate limiting. Finally, we saw that with Kalc's cluster config validator, devs can test the effect of each change on an AI-replicated Kubernetes environment to help mitigate against this edge case.