Anatomy of a 15-Minute Outage: The Trap of DNS Negative Caching

When you are rapidly building and deploying scalable infrastructure, a lot of things happen instantly. Edge networks update in milliseconds. Serverless functions deploy in the blink of an eye. But recently, while upgrading the infrastructure for Sampradnya Systems, we were reminded that the internet's underlying plumbing doesn't always move at the speed of thought.

A routine, instantaneous swap of a Cloudflare Worker took our site offline for a specific subset of users for exactly 15 minutes.

The culprit? A lesser-known DNS mechanism called Negative Caching. Here is what happened, the mechanics of why it occurred, and how to avoid it in your own deployments.

The Setup and The Slip

Our architecture heavily leverages edge computing. We needed to transition traffic for our root domain from an older Cloudflare Worker to a newly deployed application. The process seemed simple enough:

  1. Delete the old Worker application.
  2. Spin up the new application.
  3. Re-link the custom domain.

In total, this action took maybe ten seconds. For 99% of the world, the transition was seamless. But shortly after, we noticed something strange. While our preview URLs worked perfectly, and most global DNS checkers showed green checkmarks, some local clients were consistently getting hit with a dead page:

DNS_PROBE_FINISHED_NXDOMAIN

Flushing local DNS caches didn't help. The site wasn't down, but for these specific clients it practically didn't exist.

What is Negative Caching?

When a browser wants to visit a website, it asks a DNS resolver (usually run by your Internet Service Provider) for the IP address. If the resolver has the answer cached, it returns it instantly. If not, it asks the authoritative nameservers, gets the IP, saves it in its cache for a set Time To Live (TTL), and passes it back to the browser.
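The happy path above can be sketched in a few lines of Python. This is a toy model with an explicit clock, not a real resolver; the domain and TTL are placeholders:

```python
# Toy stub resolver illustrating positive DNS caching: an answer is cached
# and reused until its TTL expires, with no upstream query in between.
class CachingResolver:
    def __init__(self, authoritative):
        self.authoritative = authoritative  # name -> (ip, ttl_seconds)
        self.cache = {}                     # name -> (ip, expires_at)

    def resolve(self, name, now):
        hit = self.cache.get(name)
        if hit and now < hit[1]:
            return hit[0]                    # served from cache
        ip, ttl = self.authoritative[name]   # "ask the authoritative nameserver"
        self.cache[name] = (ip, now + ttl)   # remember it for TTL seconds
        return ip

resolver = CachingResolver({"example.com": ("192.0.2.1", 60)})
print(resolver.resolve("example.com", now=0))   # miss: fetched upstream
print(resolver.resolve("example.com", now=30))  # hit: answered from cache
```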

But what happens when a domain doesn't exist?

If you query a name that doesn't exist, the authoritative server returns an NXDOMAIN (Non-Existent Domain) response. Under RFC 2308, DNS resolvers are instructed to cache these failures too. This is called Negative Caching. It exists to protect authoritative nameservers, all the way up to the roots, from being hammered by millions of repeated queries for names that will never resolve.
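Extending the toy resolver sketch to cache failures shows the trap directly. Again, this is an illustrative model, not real resolver code; the 900-second negative TTL is an assumed value standing in for the zone's SOA-derived limit:

```python
NEG_TTL = 900  # assumed negative-cache TTL (15 minutes), as if set by the SOA

class NegCachingResolver:
    """Toy resolver that also caches NXDOMAIN, per RFC 2308 (a sketch)."""
    def __init__(self, zone):
        self.zone = zone        # name -> ip; a missing key means NXDOMAIN
        self.neg_cache = {}     # name -> time when the negative entry expires

    def resolve(self, name, now):
        if name in self.neg_cache and now < self.neg_cache[name]:
            raise LookupError("NXDOMAIN (served from negative cache)")
        ip = self.zone.get(name)
        if ip is None:
            self.neg_cache[name] = now + NEG_TTL  # cache the *failure* too
            raise LookupError("NXDOMAIN (from authoritative)")
        return ip

zone = {}                                  # record deleted: the gap
r = NegCachingResolver(zone)
try:
    r.resolve("example.com", now=0)        # a query lands inside the gap
except LookupError:
    pass
zone["example.com"] = "192.0.2.1"          # record restored seconds later
try:
    r.resolve("example.com", now=10)       # still NXDOMAIN: failure is cached
except LookupError as e:
    print(e)
print(r.resolve("example.com", now=NEG_TTL + 1))  # finally resolves
```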

The 15-Minute Ghost Town

Here is exactly how the trap snapped shut on us:

  1. We deleted the old Cloudflare Worker, which temporarily removed the internal DNS record routing our domain.
  2. In that tiny, 10-second window before we linked the new Worker, a few clients happened to query our domain.
  3. Cloudflare's authoritative nameservers correctly told those clients' resolvers: "That domain has no records right now." (NXDOMAIN).
  4. The Trap: Those ISPs cached that failure.

Even though we fixed the routing 10 seconds later, those specific ISPs refused to check again. They essentially said, "We already checked, and it doesn't exist. We won't check again until the negative cache expires."

Unlike positive DNS records, where you can often set a short TTL (say, 60 seconds), the TTL for negative caching is dictated by your domain's SOA (Start of Authority) record: per RFC 2308, resolvers cache a failure for the lesser of the SOA record's own TTL and its MINIMUM field. This often defaults to somewhere between 15 minutes and an hour.
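You can read the negative TTL straight out of a dig-style SOA answer. The record below is made up for illustration; the RDATA field order (MNAME RNAME SERIAL REFRESH RETRY EXPIRE MINIMUM) is standard:

```python
# Parse a dig-style SOA answer line and report the negative-caching TTL.
soa_line = ("example.com. 3600 IN SOA ns1.example.com. hostmaster.example.com. "
            "2024010101 7200 900 1209600 900")

def negative_ttl(soa_line):
    fields = soa_line.split()
    record_ttl = int(fields[1])   # TTL of the SOA record itself
    minimum = int(fields[-1])     # the SOA MINIMUM field
    # RFC 2308: negative answers are cached for the lesser of the two
    return min(record_ttl, minimum)

print(negative_ttl(soa_line))  # 900 seconds = 15 minutes
```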

For 15 minutes, those clients were stranded in a ghost town entirely created by their own ISPs.

The Engineering Takeaway

As we scale out products like VarityMail, controlling downtime down to the second is critical. This incident was a great reminder that when dealing with DNS, you can never assume instantaneous propagation applies to failures.

How to avoid this: Never leave a gap, not even for a second. When swapping edge infrastructure, deploy the new service alongside the old one. Update your routing rules to point to the new service before you delete the old one. Overlapping your infrastructure ensures that authoritative nameservers never have a reason to issue an NXDOMAIN response.
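The two orderings can be compared with a small simulation. Everything here is illustrative (a toy zone, fake timestamps, an assumed 900-second negative TTL), not Cloudflare's API:

```python
NEG_TTL = 900  # assumed negative-cache TTL from the zone's SOA

def simulate(steps, query_times):
    """Replay zone mutations and client queries; return query times that saw NXDOMAIN."""
    zone, neg_until, failures = {"app": "old-ip"}, -1, []
    events = sorted([(t, "step", fn) for t, fn in steps] +
                    [(t, "query", None) for t in query_times])
    for t, kind, fn in events:
        if kind == "step":
            fn(zone)                       # apply the deployment action
        elif t < neg_until or "app" not in zone:
            failures.append(t)             # this client got NXDOMAIN
            if "app" not in zone:
                neg_until = t + NEG_TTL    # resolver caches the failure
    return failures

# Gap strategy: delete at t=0, re-create at t=10; a query lands at t=5.
gap = simulate([(0, lambda z: z.pop("app")),
                (10, lambda z: z.__setitem__("app", "new-ip"))],
               query_times=[5, 60])
print(gap)      # [5, 60] -- the t=60 query still sees the cached NXDOMAIN

# Overlap strategy: repoint in one step; no empty-zone window ever exists.
overlap = simulate([(5, lambda z: z.__setitem__("app", "new-ip"))],
                   query_times=[5, 60])
print(overlap)  # []
```

A single query inside a ten-second gap is enough to strand that resolver's clients for the full negative TTL; the overlapping swap never gives the authoritative server a reason to answer NXDOMAIN at all.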

Building robust software isn't just about what happens when things go right; it's about deeply understanding the protocols that dictate what happens when things briefly go wrong.