Kubernetes Controllers: The Cache-Staleness Bug at Scale

Kubernetes production adoption reached 82% in 2025, up from 66% two years earlier (CNCF Annual Survey 2025, 2026). So more platform teams now ship custom controllers. And one class of bug keeps recurring. The informer cache lies to your reconcile loop. A status you just wrote gets silently overwritten by a stale read. This article walks through the step-based reconciler pattern we run in production. Then it covers three real incidents that pattern exposed. Finally, it shows how the Kubernetes v1.36 cache-staleness fix addresses the root cause.

Key takeaways

Structure reconcilers as ordered, independently testable steps, each reporting its own status condition, rather than as one monolithic Reconcile function.

Informer caches lag the API server by tens of milliseconds at scale; a stale read inside a MergeFrom patch can clobber a condition you set 49ms earlier.

Fix status patching with an uncached reader inside RetryOnConflict; Kubernetes v1.36 (April 2026) now lets controllers detect and skip behind-cache reconciles.

Even a logging change is a production change: a duplicate log key once dropped every log line from the busiest controller in an environment.

Everything below comes from operating a major European automotive manufacturer’s internal Kubernetes platform, running a fleet of custom controllers on managed GKE. Names and identifiers are generic on purpose. The failure modes are exact.

What is the step-based reconciler pattern?

The step-based reconciler pattern splits a controller’s reconcile loop into an ordered sequence of small, composable steps. Each step reports its own status condition, instead of one monolithic function doing everything. In practice, this keeps provisioning logic table-testable and makes partial progress visible. Across our controller suite, it turned a 400-line reconcile body into a readable chain. It is the same reconciliation discipline behind the Kubernetes as a Service control plane we operate, where dozens of resource types share one contract.

A monolithic Reconcile becomes unreadable fast. Picture one custom resource that has to provision networking, IAM, namespaces, quotas, and DNS. Have you ever tried to unit-test a function like that? So we organize each controller as controller.go plus steps/creation/ and steps/deletion/ packages. Every step embeds a shared BaseStep. It holds the controller-runtime client, an uncached reader, the scheme, a logger, and an event recorder. Each step also exposes explicit Ready and Reconciling conditions.

An Environment resource, for example, runs a chain like this: EnsureFinalizer, ResolveClusterConfigRef, ComputeTargetNamespaceName, EnsureNamespace, ApplyResourceQuotas, ApplyLimitRanges, ApplyServiceAccounts, ApplyRoleBindings, ApplyNetworkPolicies, DeleteFinalizer. Each step is small, ordered, and independently testable. As a result, partial-failure states show up as per-step conditions. You can read them straight from kubectl get print columns.
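The chain above can be sketched with the standard library alone. This is a minimal illustration, not our production interface: the Step and Condition types, RunChain, and errQuota are hypothetical stand-ins, and the real steps embed BaseStep and talk to the API server.

```go
package main

import (
	"errors"
	"fmt"
)

// Condition is a simplified stand-in for a Kubernetes status condition.
type Condition struct {
	Type   string
	Status string // "True" or "False"
	Reason string
}

// Step is a hypothetical step contract: a name plus one unit of work.
type Step struct {
	Name string
	Run  func() error
}

// errQuota is an illustrative failure used in the demo below.
var errQuota = errors.New("quota exceeded")

// RunChain executes steps in order and records one condition per step.
// It stops at the first failure, so later steps never run on a broken base.
func RunChain(steps []Step) ([]Condition, error) {
	conds := make([]Condition, 0, len(steps))
	for _, s := range steps {
		if err := s.Run(); err != nil {
			conds = append(conds, Condition{Type: s.Name, Status: "False", Reason: err.Error()})
			return conds, fmt.Errorf("step %s: %w", s.Name, err)
		}
		conds = append(conds, Condition{Type: s.Name, Status: "True", Reason: "Succeeded"})
	}
	return conds, nil
}

func main() {
	ok := func() error { return nil }
	conds, err := RunChain([]Step{
		{Name: "EnsureFinalizer", Run: ok},
		{Name: "EnsureNamespace", Run: func() error { return errQuota }},
		{Name: "ApplyResourceQuotas", Run: ok},
	})
	for _, c := range conds {
		fmt.Printf("%s=%s (%s)\n", c.Type, c.Status, c.Reason)
	}
	fmt.Println("error:", err)
}
```

Because each step is a plain value with a Run function, a table test can feed in a failing step at any position and assert on exactly which conditions were recorded.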

The single most valuable primitive we built was not the step interface itself. It was the shared status-patching helper underneath it. Every step calls one function to update conditions. So that function centralizes the conflict-safe read-then-patch dance in exactly one place. When we later found a cache bug, we fixed it once, not in fifteen call sites.

The shared conflict-safe status patch

A status helper re-fetches the latest object, mutates it, and patches it with client.MergeFrom inside a retry.RetryOnConflict loop. So it avoids two reconciling steps racing each other’s condition writes. And the generation-aware comparison means unchanged state does not re-patch. That keeps the API server and your audit log quiet.

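// UpdateStatusFields re-reads obj from the API server, applies mutate to the
// fresh copy, and patches only the resulting status diff.
func (r *StepRuntime) UpdateStatusFields(ctx context.Context, obj client.Object,
    mutate func(client.Object) error) error {
    return retry.RetryOnConflict(retry.DefaultRetry, func() error {
        // Read through the UNCACHED reader, not the informer cache.
        fresh := obj.DeepCopyObject().(client.Object)
        if err := r.UncachedClient.Get(ctx, client.ObjectKeyFromObject(obj), fresh); err != nil {
            return err
        }
        // Snapshot the freshly read state as the patch base before mutating.
        base := fresh.DeepCopyObject().(client.Object)
        if err := mutate(fresh); err != nil {
            return err // may be ErrNoOp; caller short-circuits
        }
        // MergeFrom sends only the diff against base. Note: the retry loop
        // only sees conflicts if the patch carries a resourceVersion, e.g. via
        // client.MergeFromWithOptions(base, client.MergeFromWithOptimisticLock{}).
        return r.Client.Status().Patch(ctx, fresh, client.MergeFrom(base))
    })
}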

Two small details carry weight. The ErrNoOp sentinel lets a step’s mutate function say “nothing changed”, distinct from a real error. So callers short-circuit cleanly. And event correlation only emits a Kubernetes event when the visible condition state actually changed, not on every generation-only refresh. Without that guard, a controller reconciling every 20 seconds floods the event stream with duplicates.
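The sentinel idea can be shown in isolation. The sketch below is an assumption-laden simplification: Status, setReady, and updateStatus are hypothetical names, and the boolean return stands in for "a patch was sent, so an event may be emitted".

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNoOp mirrors the sentinel described above: "nothing changed".
var ErrNoOp = errors.New("no-op: status unchanged")

// Status is a hypothetical, simplified status with a single Ready field.
type Status struct{ Ready string }

// setReady returns a mutate function that reports ErrNoOp when the
// status already holds the wanted value.
func setReady(want string) func(*Status) error {
	return func(s *Status) error {
		if s.Ready == want {
			return ErrNoOp
		}
		s.Ready = want
		return nil
	}
}

// updateStatus applies mutate and reports whether a patch would be sent.
// ErrNoOp is swallowed as "nothing to do"; real errors propagate.
func updateStatus(s *Status, mutate func(*Status) error) (bool, error) {
	err := mutate(s)
	if errors.Is(err, ErrNoOp) {
		return false, nil
	}
	if err != nil {
		return false, err
	}
	return true, nil
}

func main() {
	s := &Status{Ready: "False"}
	patched, err := updateStatus(s, setReady("True"))
	fmt.Println("first call patched:", patched, "err:", err)
	patched, err = updateStatus(s, setReady("True"))
	fmt.Println("second call patched:", patched, "err:", err)
}
```

Using errors.Is rather than a direct comparison keeps the short-circuit working even if a step wraps the sentinel with extra context.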

Gotcha: The uncached read in the snippet above was not the original design. It was a live-incident fix. The first version read through the informer cache, and that is exactly where the next section starts.

Why does a fresh status condition get overwritten?

A fresh status condition gets overwritten when a step reads the object through the informer cache instead of the API server. The cache can lag by tens of milliseconds within a single reconcile. That stale read then rides along in a MergeFrom patch and clobbers a condition another step set moments earlier (controller-runtime #741, 2021).

Here is the incident, exactly as it bit us. A routine release introduced new condition types on the StandardCluster resource, replacing an older readiness condition with a more specific one. A fresh reconcile did three things in sequence. First, MarkReconciling patched Ready=False at T=0. Second, the next step, seeing no prior value for its new condition type, re-read the object to update status. Third, the informer cache still returned the pre-restart object carrying a stale Ready=True.

That stale Ready=True traveled inside the same MergeFrom patch. So it overwrote the Ready=False written 49 milliseconds earlier. Net effect: clusters never converged to Ready. The maddening part? Running the controller locally fixed it every time. A local run used a fresh client that bypassed the shared cache. It looked non-deterministic. It was not. It was a read-your-own-write violation hiding in the cache.
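One way to see how a stale value can ride along is to replay the sequence in memory. The sketch below assumes conditions live in a plain list, which a JSON merge patch replaces wholesale rather than merging element by element. The NetworkReady condition name and all helper functions are hypothetical.

```go
package main

import (
	"fmt"
	"reflect"
)

// Condition is a simplified status condition.
type Condition struct{ Type, Status string }

func clone(c []Condition) []Condition { return append([]Condition(nil), c...) }

// mergePatch models merge-patch semantics for a list field: if the mutated
// list differs from base, the WHOLE mutated list is sent. nil means no change.
func mergePatch(base, mutated []Condition) []Condition {
	if reflect.DeepEqual(base, mutated) {
		return nil
	}
	return clone(mutated)
}

// addCondition plays the second step: read, append a new condition, patch.
func addCondition(server *[]Condition, read func() []Condition, c Condition) {
	base := read()
	mutated := append(clone(base), c)
	if patch := mergePatch(base, mutated); patch != nil {
		*server = patch // the list is replaced, stale entries included
	}
}

func statusOf(conds []Condition, typ string) string {
	for _, c := range conds {
		if c.Type == typ {
			return c.Status
		}
	}
	return ""
}

// scenario replays the incident and returns the final Ready status on the
// server. fresh=true reads from the server instead of the lagging cache.
func scenario(fresh bool) string {
	server := []Condition{{Type: "Ready", Status: "True"}}
	cache := clone(server)     // informer snapshot taken before the write
	server[0].Status = "False" // MarkReconciling at T=0; cache lags behind
	read := func() []Condition { return clone(cache) }
	if fresh {
		read = func() []Condition { return clone(server) }
	}
	addCondition(&server, read, Condition{Type: "NetworkReady", Status: "False"})
	return statusOf(server, "Ready")
}

func main() {
	fmt.Println("cached read, final Ready:", scenario(false))
	fmt.Println("uncached read, final Ready:", scenario(true))
}
```

The step never touches Ready on purpose. It only carries the whole list, and the list it carries is whatever the read returned.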

Gotcha: If a bug disappears when you run the controller locally but persists in-cluster, suspect informer-cache staleness before you suspect a logic error. Local runs often bypass the shared cache path that production hits.

Most articles explain the reconcile loop as a clean read-diff-write cycle and stop there. They never connect the informer cache to a correctness bug. The cache is usually framed as a pure performance optimization. But at scale it becomes a correctness hazard, much like a stale eBPF program on GKE Dataplane V2 can keep enforcing a NetworkPolicy you already deleted. “Read” and “write” no longer target the same view of the world. The upstream root cause is a DeltaFIFO backlog plus read-write mutex contention in the informer cache. Under load, that produces stale reads (kubernetes/kubernetes #130767, 2025).

The fix in our code was to read through an uncached reader inside the retry loop, exactly as the earlier snippet shows. Upstream, the fix arrived in Kubernetes v1.36 in April 2026. Controllers can now check the cache’s resourceVersion. Then they skip reconciliation when the cache is behind their own writes. That closes the read-your-own-write gap at the framework level (kubernetes.io blog, 2026).
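The skip-when-behind idea can be illustrated without the real API. The sketch below is not the v1.36 or controller-runtime interface: writeTracker and its methods are hypothetical, and treating resourceVersion as a monotonically increasing integer is an assumption of this example.

```go
package main

import (
	"fmt"
	"strconv"
)

// writeTracker is a hypothetical helper (not a controller-runtime type).
// It remembers the resourceVersion returned by the controller's own last
// write for each object key.
type writeTracker struct {
	lastWritten map[string]uint64
}

func newWriteTracker() *writeTracker {
	return &writeTracker{lastWritten: make(map[string]uint64)}
}

// RecordWrite stores the resourceVersion the API server returned for a write.
func (t *writeTracker) RecordWrite(key, resourceVersion string) error {
	rv, err := strconv.ParseUint(resourceVersion, 10, 64)
	if err != nil {
		return fmt.Errorf("parse resourceVersion %q: %w", resourceVersion, err)
	}
	if rv > t.lastWritten[key] {
		t.lastWritten[key] = rv
	}
	return nil
}

// CacheIsBehind reports whether the cached object is older than our own
// last write, in which case the reconcile should be skipped and requeued.
func (t *writeTracker) CacheIsBehind(key, cachedResourceVersion string) (bool, error) {
	cached, err := strconv.ParseUint(cachedResourceVersion, 10, 64)
	if err != nil {
		return false, fmt.Errorf("parse resourceVersion %q: %w", cachedResourceVersion, err)
	}
	return cached < t.lastWritten[key], nil
}

func main() {
	t := newWriteTracker()
	if err := t.RecordWrite("default/cluster-a", "105"); err != nil {
		fmt.Println("record failed:", err)
		return
	}
	behind, err := t.CacheIsBehind("default/cluster-a", "100")
	fmt.Println("skip reconcile:", behind, "err:", err)
}
```

An object the controller has never written has no recorded version, so the check reports "not behind" and the reconcile proceeds as usual.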

When should OperationResultUpdated requeue?

OperationResultUpdated should almost never mean “requeue and come back later.” It means something changed, and you must still fall through to the readiness check before deciding to requeue. Treating Updated as terminal produced an infinite reconcile loop, firing roughly every 20 seconds, across seven step files in our service-exposure controller.

The shape of the bug is worth memorizing, because it is easy to reintroduce. A step’s mutate function toggled an annotation on every reconcile, for example a fast-reconcile-interval hint derived from current readiness. So the annotation changed each pass. As a result, controllerutil.CreateOrPatch returned OperationResultUpdated almost every time. Several steps then treated Updated as “done for now, requeue”. So they never reached the check for whether the underlying cloud resource is actually ready.

switch result {
case controllerutil.OperationResultCreated,
    controllerutil.OperationResultUpdated,
    controllerutil.OperationResultNone:
    // All three fall through to the SAME readiness check.
    if !cloudResourceReady(obj) {
        return ctrl.Result{RequeueAfter: 20 * time.Second}, nil
    }
    return ctrl.Result{}, nil // ready: stop requeuing
}

We fixed this in seven sibling files. We collapsed the Created, Updated, and unchanged cases into one readiness check. Then we requeue only when the resource is still not ready. If you review a new step that wraps CreateOrPatch, look for this anti-pattern: a case OperationResultUpdated branch that returns ctrl.Result{RequeueAfter: ...}, nil with no readiness check ahead of it. That single branch is a self-perpetuating loop waiting to happen.

How do you handle read-after-create NotFound races?

Handle a read-after-create NotFound by retrying the transient error a few times, not by widening a cache window. The object exists on the API server but is not yet visible to your very next Get. This is an API-server-side visibility gap, the mirror image of client-side cache staleness, so the fixes point in opposite directions.

We hit this when a workspace-creation step patched owner references onto namespaces right after creating them. Occasionally the next Get returned NotFound, even though the namespace clearly existed. The instinct is to assume a longer cache-consistency wait would help. But the gap here is eventual server-side visibility, not a stale client cache. So a longer wait does nothing. Instead, the correct pattern is a short bounded retry on the transient NotFound. Treat it as expected, not as a step failure.
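A bounded retry of this kind might look like the sketch below. It uses only the standard library: ErrNotFound stands in for a check such as apierrors.IsNotFound, and getWithRetry and flakyGet are hypothetical helpers, not functions from our codebase.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrNotFound is a stand-in for a Kubernetes NotFound API error.
var ErrNotFound = errors.New("not found")

// getWithRetry calls get up to attempts times. Only NotFound is treated as
// transient; success or any other error returns immediately.
func getWithRetry(get func() error, attempts int, delay time.Duration) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		err = get()
		if err == nil || !errors.Is(err, ErrNotFound) {
			return err
		}
		if i < attempts-1 {
			time.Sleep(delay) // brief pause before the next attempt
		}
	}
	return fmt.Errorf("still not found after %d attempts: %w", attempts, err)
}

// flakyGet simulates an object that becomes visible after n failed reads.
func flakyGet(n int) func() error {
	calls := 0
	return func() error {
		calls++
		if calls <= n {
			return ErrNotFound
		}
		return nil
	}
}

func main() {
	err := getWithRetry(flakyGet(2), 5, 10*time.Millisecond)
	fmt.Println("visible after two misses, err:", err)
	err = getWithRetry(flakyGet(10), 3, 10*time.Millisecond)
	fmt.Println("never visible within budget, err:", err)
}
```

Keeping the attempt count small matters: the retry is meant to absorb a brief visibility gap, so a NotFound that outlives the budget should surface as a real failure.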

Gotcha: A stale-cache bug and a read-after-create bug feel identical from a stack trace. One is fixed by reading past the cache; the other by retrying the read. Diagnose which side of the client-server boundary you are on before you reach for a fix.

How do you scale controllers with environment-based sharding?

Environment-based sharding partitions reconciliation by a business dimension: deployment environment. It uses controller-runtime predicates, rather than splitting the CRD, API, or codebase. One controller instance reconciling every environment couples their fate. So a reconcile storm in dev can degrade every other environment. Sharding decouples them, while keeping a single schema.

The mechanism is a release topology, not a code fork. So we deploy one Helm chart as several releases of the same version. The same fleet-shaped coordination shows up when you promote add-ons across many clusters, which we handle with Sveltos-driven fleet add-on promotion. A common release runs environment-agnostic singletons like the platform API and tenant controller. One release per environment runs only the environment-scoped controllers. Each gets a CLI argument declaring which single environment it owns.

Each environment-scoped controller composes its existing predicates with one extra environment-matching predicate. That predicate admits an object into the work queue only when its environment matches the shard’s. And it derives that environment from the existing source of truth, like a label or namespace. Filtering happens before objects enter the queue. So out-of-scope objects never trigger a reconcile, and the existing logic stays untouched.
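The composition can be sketched with plain functions. In controller-runtime the equivalent building blocks live in the predicate package; here Object, Predicate, And, EnvironmentMatches, and the EnvLabel key are all hypothetical simplifications, and deriving the environment from a label is just one of the options mentioned above.

```go
package main

import "fmt"

// EnvLabel is a hypothetical label key carrying the environment name.
const EnvLabel = "platform.example.com/environment"

// Object is a minimal stand-in for a watched Kubernetes object.
type Object struct {
	Name   string
	Labels map[string]string
}

// Predicate decides whether an object may enter the work queue.
type Predicate func(Object) bool

// And admits an object only when every predicate admits it.
func And(ps ...Predicate) Predicate {
	return func(o Object) bool {
		for _, p := range ps {
			if !p(o) {
				return false
			}
		}
		return true
	}
}

// EnvironmentMatches admits objects whose environment equals shardEnv.
// An empty shardEnv means non-sharded mode and admits everything.
func EnvironmentMatches(shardEnv string) Predicate {
	return func(o Object) bool {
		if shardEnv == "" {
			return true
		}
		return o.Labels[EnvLabel] == shardEnv
	}
}

func main() {
	// existing stands in for whatever predicates the controller already had.
	existing := Predicate(func(o Object) bool { return o.Name != "" })
	admit := And(existing, EnvironmentMatches("dev"))

	objs := []Object{
		{Name: "team-a", Labels: map[string]string{EnvLabel: "dev"}},
		{Name: "team-b", Labels: map[string]string{EnvLabel: "prod"}},
	}
	for _, o := range objs {
		fmt.Printf("%s admitted: %v\n", o.Name, admit(o))
	}
}
```

The environment predicate is appended to the existing ones rather than woven into them, which is why the reconcile logic itself needs no changes.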

The design’s real cost is not code. It is an operational invariant that lives outside the type system: the union of environment releases must cover every environment exactly once, no overlap and no gaps. The type checker cannot enforce this. Your release automation must. We also give each shard an independent leader-election identifier so releases never contend for the same lock. A missing environment argument falls back to non-sharded mode, which makes rollout incremental and reversible.
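Release automation can check the exactly-once invariant before rollout. The sketch below is illustrative only: CheckCoverage and the release names are hypothetical, and the map is assumed to contain only the environment-scoped releases, keyed by release name with the environment each one was given.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// CheckCoverage verifies that every environment is owned by exactly one
// release. releaseEnv maps a release name to the environment it owns.
func CheckCoverage(environments []string, releaseEnv map[string]string) error {
	owners := make(map[string][]string)
	for release, env := range releaseEnv {
		owners[env] = append(owners[env], release)
	}
	var problems []string
	for _, env := range environments {
		names := owners[env]
		sort.Strings(names) // deterministic error text
		switch {
		case len(names) == 0:
			problems = append(problems, fmt.Sprintf("gap: no release owns %q", env))
		case len(names) > 1:
			problems = append(problems,
				fmt.Sprintf("overlap: %q owned by %s", env, strings.Join(names, ", ")))
		}
	}
	if len(problems) > 0 {
		return fmt.Errorf("shard coverage violated: %s", strings.Join(problems, "; "))
	}
	return nil
}

func main() {
	envs := []string{"dev", "staging", "prod"}
	releases := map[string]string{
		"controllers-dev":   "dev",
		"controllers-dev-2": "dev", // overlap
		"controllers-prod":  "prod",
		// staging is missing: gap
	}
	fmt.Println(CheckCoverage(envs, releases))
}
```

Running a check like this in the deployment pipeline moves the invariant from tribal knowledge into something that fails a build.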

Why is even a logging change a production change?

A logging change is a production change. Why? Logs flow through parsers and agents that fail in ways your unit tests never see. In March 2026, two upgrades on our control plane both traced back to the same mistake: code duplicating behavior the framework already provided. And one of them silently deleted logs from the busiest controller in an environment.

The first incident was an apimachinery upgrade. A dependency bump removed a generated ProtoMessage() method from core meta types. Our generated proto file embedded those Kubernetes types directly. So the first time the gRPC server tried to marshal any response, the protobuf runtime panicked at lazy descriptor init. That took down the entire process. Because the panic fires at file-descriptor init rather than per field, one bad request crashed everything. The fix moved the HTTP gateway from endpoint-based registration to in-process registration. That bypassed the binary codec entirely, by reusing an existing JSON marshaler.

The second incident is the one I retell most. We added a reconcileID correlation field to all eight controllers with a .WithValues("controller", ..., "reconcileID", uuid.New()) call at the top of every Reconcile. Two days later, every log line from the busiest controller in one environment had silently vanished from Cloud Logging.

The root cause was a duplicate JSON key. Would your test suite have caught that? Ours did not. Controller-runtime already injects controller, name, namespace, and reconcileID into the context logger automatically. So our added call re-added the same keys. That produced JSON objects with duplicate keys. GKE’s Fluent Bit log agent parses those logs via msgpack. And msgpack’s Go implementation cannot hash a duplicate-key object. So it dropped the entire line, rather than erroring loudly. One environment escaped: its busiest controller used a logger variant that replaced keys instead of appending them. The fix was simple. We removed the redundant call, because the framework already carried every field we were trying to add.
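A test can catch this class of bug by scanning an emitted log line for repeated top-level keys. The sketch below uses only encoding/json from the standard library. DuplicateKeys is a hypothetical helper, and the sample line is invented for illustration, not copied from our logs.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// DuplicateKeys returns the top-level keys that appear more than once in a
// JSON object. json.Unmarshal would accept such input silently and keep the
// last value, so the token stream is walked by hand instead.
func DuplicateKeys(line string) ([]string, error) {
	dec := json.NewDecoder(strings.NewReader(line))
	tok, err := dec.Token()
	if err != nil {
		return nil, err
	}
	if d, ok := tok.(json.Delim); !ok || d != '{' {
		return nil, errors.New("log line is not a JSON object")
	}
	seen := make(map[string]bool)
	var dups []string
	for dec.More() {
		keyTok, err := dec.Token()
		if err != nil {
			return nil, err
		}
		key, ok := keyTok.(string)
		if !ok {
			return nil, fmt.Errorf("unexpected key token %v", keyTok)
		}
		if seen[key] {
			dups = append(dups, key)
		}
		seen[key] = true
		// Skip the value, whatever its shape, without interpreting it.
		var skip json.RawMessage
		if err := dec.Decode(&skip); err != nil {
			return nil, err
		}
	}
	return dups, nil
}

func main() {
	line := `{"level":"info","controller":"env","reconcileID":"a1","msg":"done","controller":"env","reconcileID":"b2"}`
	dups, err := DuplicateKeys(line)
	fmt.Println("duplicate keys:", dups, "err:", err)
}
```

Pointing a check like this at one captured log line per controller would have flagged the repeated controller and reconcileID keys before the change shipped.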

Gotcha: Before you add fields to a structured logger, check what your framework already injects. Duplicate keys are valid JSON to your program and poison to your log pipeline.

What keeps controllers reliable at scale

Custom controllers fail in ways the tutorials rarely mention. And almost all of those failures live at one boundary: between your reconcile logic and the framework’s caching. So structure reconcilers as ordered, independently testable steps. That keeps each unit of provisioning logic small and partial progress visible. Then centralize status patching in one conflict-safe helper. Read through an uncached reader in that path, so a stale cache read cannot overwrite a condition you just set.

Watch the framework, not just your code. OperationResultUpdated is not a signal to requeue. It is a signal to re-check readiness. A read-after-create NotFound wants a retry, not a wider cache window. And a logging change can be as destructive as a schema change. So if you run controllers at scale, adopt the Kubernetes v1.36 cache-staleness detection early. And treat every framework upgrade as the production change it is.

Frequently asked questions

What is the informer cache and why does it go stale?

The informer cache is a local, in-memory replica of API-server objects that controller-runtime keeps to avoid hammering the API server on every read. It goes stale under load when the delta queue feeding it backs up and mutex contention delays updates, so a read can return an object version older than one you just wrote yourself.

Should every controller use an uncached reader?

No. Use the cache for high-volume reads where a slightly stale view is acceptable, which is most reads. Reach for an uncached reader specifically in the read-then-patch path for status, where reading your own recent write matters for correctness. On Kubernetes v1.36 and later, built-in cache-staleness detection reduces how often you need it.

How is step-based reconciliation different from a state machine?

A state machine models one resource moving between named states. Step-based reconciliation runs an ordered chain of idempotent steps every reconcile, each re-checking its own precondition and reporting a condition. Steps are not transitions; they are independently testable units that always run in order. That suits the level-triggered nature of Kubernetes controllers better than edge-triggered transitions.

Does Kubernetes v1.36 remove the need for uncached reads?

Not entirely, but it helps significantly. The v1.36 mitigation lets a controller compare the cache's resourceVersion against its own last write and skip a reconcile when the cache is behind, closing the most common read-your-own-write gap. Uncached reads remain useful for status patches and for clusters not yet on v1.36.