Key takeaways

  • Standing admin rights on shared clusters are a liability. Unit 42 found 99% of cloud identities are over-permissive, and 66% of social-engineering attacks in 2025 went straight for privileged accounts.
  • The fix is a PrivilegedAccessRequest CRD: a time-boxed elevation naming one human, one role, one namespace, with automatic expiry and a hard 48-hour cap.
  • Two bugs from a real build are worth stealing: a cluster-scoped grant that should have been namespace-scoped, and a wire-format mismatch that created access already expired on arrival.
  • Full zero standing privileges stays aspirational. The 2026 pragmatic target is JIT everywhere it matters, with zero standing privileges reserved for the highest-risk segments.

On a mutualized Kubernetes cluster, one over-permissive credential is not one team’s problem. It is a shared blast radius across every tenant on that cluster, so the stakes scale with the number of teams you host. This is a reference architecture for killing standing privileges on shared clusters without turning every incident response into a platform-team ticket queue. It comes from a real production build: a large European automotive manufacturer’s multi-tenant Kubernetes platform, where hundreds of application teams share mutualized clusters. It is the kind of estate we have run for years, as chronicled in our decade with Kubernetes. We keep names and identifiers generic on purpose. The pattern is what travels.

Why do shared clusters make standing admin rights untenable?

Shared clusters multiply the cost of every over-permissive account. Unit 42’s 2025 research found 99% of cloud users, roles, and service accounts are over-permissive (Palo Alto Networks Unit 42, 2025). The same team reported 66% of social-engineering attacks hit privileged accounts (Unit 42 Global Incident Response Report, 2025). On mutualized clusters, that math compounds fast.

So why does sharing raise the stakes so much? Because the walls between tenants are thinner than they look. Standing privilege also has a long tail. GitGuardian found 64% of secrets that were valid in 2022 were still active four years later (GitGuardian, 2026). A credential granted for a one-off incident rarely gets revoked on schedule. It lingers. And on a shared cluster it lingers with reach into namespaces its owner has long forgotten.

The usual culprit is RBAC sprawl. In practice that means wildcard verbs, unbounded ClusterRoleBindings, copy-paste RoleBindings, and workloads leaning on the default service account (vcluster.com, 2025). Each one is small on its own. Together, on a cluster shared by hundreds of teams, they become the attack surface.

Security note: Kubernetes RBAC bindings combine with OR logic, not AND logic. If a user is a subject in two bindings, they get the union of both permission sets, never the intersection. On a multi-tenant cluster, a stray ClusterRoleBinding can silently widen access that a carefully scoped RoleBinding should contain.
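
The union behaviour is easy to see in a toy model. The sketch below is not the Kubernetes authorizer: it reduces permissions to bare verbs, and the `Binding` class and `effective_verbs` helper are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional, Set


@dataclass(frozen=True)
class Binding:
    subject: str
    namespace: Optional[str]  # None models a ClusterRoleBinding (every namespace)
    verbs: FrozenSet[str]


def effective_verbs(bindings: Iterable[Binding], subject: str, namespace: str) -> Set[str]:
    """Return the union of verbs granted to `subject` in `namespace`."""
    allowed: Set[str] = set()
    for binding in bindings:
        if binding.subject != subject:
            continue
        if binding.namespace is None or binding.namespace == namespace:
            allowed |= binding.verbs  # OR logic: permissions only ever accumulate
    return allowed


bindings = [
    Binding("jordan", "team-a", frozenset({"get", "list"})),  # scoped RoleBinding
    Binding("jordan", None, frozenset({"get", "delete"})),    # stray ClusterRoleBinding
]
```

In this model the stray cluster-wide binding adds `delete` inside `team-a` and also grants `get` and `delete` in namespaces the scoped binding never mentioned.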

What is a PrivilegedAccessRequest, and how does time-boxed elevation work?

A PrivilegedAccessRequest (PAR) is a custom resource that represents a single time-boxed elevation. It names one human, one role, one target scope, and a duration. A controller creates the grant when the window opens. Then it revokes the grant when the window closes. So there is no manual cleanup and no lingering credential, which counters the 64% stale-secret problem above.

Picture a Friday-night incident. Checkout is down, and one engineer needs real production access right now. Instead of paging the platform team for a one-off RoleBinding, she files a PAR for 45 minutes. The controller grants it, she fixes the outage, and the access disappears on its own before she logs off. No ticket. No leftover admin rights waiting to be abused.

The design keeps break-glass access separate from steady-state RBAC on purpose. Everyday least-privilege stays static: development gets editor, production gets viewer, by convention. Two policy invariants make it safe to hand this to tenants. First, the request names a single user email, never a group. Second, an admission webhook hard-caps the duration at 48 hours.

Here is a generic PAR, sanitized from the production design:

apiVersion: platform.example.io/v1alpha1
kind: PrivilegedAccessRequest
metadata:
  name: par-incident-4821
  namespace: team-checkout-services
spec:
  environmentRef:
    name: checkout-prod
    namespace: team-checkout-services
  user: jordan@example.com          # exactly one human, never a group
  role: tenant-admin                # tenant-admin | tenant-editor | tenant-viewer
  duration: 45m                     # Go duration string, hard-capped at 48h
status:
  phase: Ready
  expiryTime: "2026-07-04T14:12:00Z"
  remoteRoleBindingRef:
    kind: RoleBinding               # namespace-scoped, see the next section
    namespace: gke-checkout-prod
    name: par-jordan-tenant-admin
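
The two policy invariants (one user email, 48-hour cap) could be checked roughly as follows. This is a minimal Python sketch, not the production webhook: `parse_go_duration` handles only the `h`, `m`, and `s` units of Go duration strings, `validate_par` is a hypothetical helper, and the email address is a placeholder. An email-shaped string cannot prove the subject is a person rather than a group alias, so a real webhook would presumably resolve the identity against the directory.

```python
import re
from datetime import timedelta

MAX_DURATION = timedelta(hours=48)
_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}
_TOKEN = re.compile(r"(\d+(?:\.\d+)?)([hms])")
_EMAIL = re.compile(r"[^@\s,;]+@[^@\s,;]+")


def parse_go_duration(text: str) -> timedelta:
    """Parse a subset of Go duration strings such as "45m" or "4h0m0s"."""
    pos, total = 0, 0.0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"invalid duration: {text!r}")
        total += float(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        pos = match.end()
    if pos == 0:
        raise ValueError("empty duration")
    return timedelta(seconds=total)


def validate_par(spec: dict) -> list:
    """Return a list of human-readable violations; empty means admitted."""
    errors = []
    user = spec.get("user", "")
    # Invariant 1: exactly one email-shaped subject, never a list of subjects.
    if not isinstance(user, str) or not _EMAIL.fullmatch(user):
        errors.append("spec.user must be a single user email")
    # Invariant 2: positive duration, hard-capped at 48 hours.
    try:
        duration = parse_go_duration(str(spec.get("duration", "")))
    except ValueError as exc:
        errors.append(f"spec.duration: {exc}")
    else:
        if duration <= timedelta(0):
            errors.append("spec.duration must be positive")
        elif duration > MAX_DURATION:
            errors.append("spec.duration exceeds the 48h cap")
    return errors
```

Returning every violation at once, rather than failing on the first, lets a responder fix a rejected request in a single round trip.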

The controller does the boring, safety-critical parts. It reconciles the grant idempotently, so a repeated reconcile never double-creates a binding. It also requeues on a tight schedule: the minimum of “time until expiry” and a short safety interval. As a result, it cannot miss a revocation window, even if one intermediate reconcile gets skipped. After the window closes, it garbage-collects the PAR object roughly an hour later. The audit trail lives in logs, not in the Kubernetes object. So expired requests never pile up as cluster clutter.
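
The requeue rule is small enough to write out. A minimal sketch, assuming a five-minute safety interval (the article does not state the real value) and a hypothetical `requeue_after` helper:

```python
from datetime import datetime, timedelta, timezone

SAFETY_INTERVAL = timedelta(minutes=5)  # assumed value, for illustration only


def requeue_after(now: datetime, expiry: datetime,
                  safety_interval: timedelta = SAFETY_INTERVAL) -> timedelta:
    """Delay until the next reconcile: min(time until expiry, safety interval)."""
    remaining = expiry - now
    if remaining <= timedelta(0):
        return timedelta(0)  # window already closed: reconcile (and revoke) now
    return min(remaining, safety_interval)


expiry = datetime(2026, 7, 4, 14, 12, tzinfo=timezone.utc)
early = requeue_after(datetime(2026, 7, 4, 13, 27, tzinfo=timezone.utc), expiry)
late = requeue_after(datetime(2026, 7, 4, 14, 10, tzinfo=timezone.utc), expiry)
```

Early in the window the safety interval wins, so a skipped reconcile costs at most one interval. Near the end the remaining time wins, so the next reconcile lands on the expiry itself.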

Duration is mutable, and that detail matters. Extending an active request recomputes the expiry from the original creation time, not from the moment of the edit. It also emits an AccessExtended event. The webhook still enforces the 48-hour ceiling on the recomputed window. In our experience, this is the difference between a clean extension and a responder who accidentally resets their own clock mid-incident.
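
A short sketch of the extension arithmetic, using a hypothetical `extended_expiry` helper. The only inputs are the original creation time and the new duration. The time of the edit never enters the calculation.

```python
from datetime import datetime, timedelta, timezone

MAX_DURATION = timedelta(hours=48)


def extended_expiry(created_at: datetime, new_duration: timedelta) -> datetime:
    """Recompute expiry from creation time, enforcing the 48h ceiling."""
    if new_duration > MAX_DURATION:
        raise ValueError("extension exceeds the 48h cap")
    return created_at + new_duration


created_at = datetime(2026, 7, 4, 13, 27, tzinfo=timezone.utc)
original = extended_expiry(created_at, timedelta(minutes=45))  # 14:12 UTC
extended = extended_expiry(created_at, timedelta(hours=2))     # 15:27 UTC
```

Editing the duration to two hours at 14:00 therefore yields 15:27, not 16:00, and no sequence of edits can push the window past 48 hours from creation.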

Gotcha: Order of operations inside the reconciler is not cosmetic. The lifecycle step that provisions the binding must run before the expiry check, because the expiry early-exit depends on the binding reference already being set. Reverse them, and every request loops forever on an unset status, never actually provisioning access. This class of ordering bug passes unit tests and fails only under real reconciliation.
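
One way to picture the ordering bug is the toy reconciler below. It is an interpretation, not the production code: the `Par` class, the phase names other than `Ready`, and both reconcile functions are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Par:
    expiry: datetime
    phase: str = "Pending"
    binding_ref: Optional[str] = None


def provision(par: Par) -> None:
    """Lifecycle step: create the binding once and record its reference."""
    if par.phase == "Pending":
        par.binding_ref = "par-jordan-tenant-admin"  # stands in for the RBAC API call
        par.phase = "Ready"


def expiry_check(par: Par, now: datetime) -> bool:
    """Return True to end this reconcile early. Relies on binding_ref being set."""
    if par.binding_ref is None:
        return True            # status not populated yet: requeue
    if now >= par.expiry:
        par.phase = "Expired"  # a real controller would delete the binding here
        return True
    return False


def reconcile(par: Par, now: datetime) -> None:
    provision(par)             # first, so binding_ref exists for the check below
    expiry_check(par, now)


def reconcile_reversed(par: Par, now: datetime) -> None:
    if expiry_check(par, now):  # exits on every pass: binding_ref is never set
        return
    provision(par)
```

With `reconcile`, a fresh request reaches `Ready` on the first pass. With `reconcile_reversed`, it stays `Pending` however many times it is requeued, which is why only a test that drives the full loop would catch it.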

RoleBinding vs ClusterRoleBinding: why does scope quietly break tenancy?

Scope is the single most important field in a JIT grant, and it stays invisible until someone audits it. The design called for a namespace-scoped RoleBinding. Instead, the first implementation shipped a cluster-wide ClusterRoleBinding. Both compile. Both pass a happy-path test where the user does exactly what they intended in exactly one namespace. But only the ClusterRoleBinding hands that user rights across every namespace on a shared cluster.

Ever shipped a binding that worked perfectly in test and still opened a door you never meant to open? A reviewer caught this one during code review, not a test, and that is the point. A ClusterRoleBinding granting tenant-admin looks almost identical to a RoleBinding granting the same role. The difference is one kind field and one missing namespace. On a single-tenant cluster the mistake is harmless. On a mutualized cluster it becomes a cross-tenant escalation. And no application team would ever notice, because their own workflow keeps working perfectly.

The contrast, written out, is deliberately unremarkable:

# WRONG on a shared cluster: grants the role cluster-wide, across every tenant
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: par-jordan-tenant-admin
roleRef:
  kind: ClusterRole
  name: tenant-admin
  apiGroup: rbac.authorization.k8s.io
subjects:
  - kind: User
    name: jordan@example.com
---
# RIGHT: same role, confined to one tenant's namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: par-jordan-tenant-admin
  namespace: gke-checkout-prod        # the scope that makes it safe
roleRef:
  kind: ClusterRole                   # a ClusterRole referenced by a RoleBinding
  name: tenant-admin                  # applies only within this namespace
  apiGroup: rbac.authorization.k8s.io
subjects:
  - kind: User
    name: jordan@example.com

The resolved design ships the namespace-scoped form. The controller records a remoteRoleBindingRef in the PAR status. That reference names the exact RoleBinding, its namespace, and its deterministic name. So an auditor can confirm scope from the object itself, rather than trusting the code. If you take one review checklist item from this article, make it this: on any shared cluster, a JIT grant that produces a ClusterRoleBinding is a bug until proven otherwise.
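
That checklist item can also be automated. The following is a minimal sketch of a scope check over a binding manifest already loaded as a dictionary. `check_jit_binding` is a hypothetical helper, not part of any Kubernetes client library.

```python
def check_jit_binding(manifest: dict, expected_namespace: str) -> list:
    """Return scope problems for a JIT grant; an empty list means it is tenant-scoped."""
    problems = []
    kind = manifest.get("kind")
    if kind != "RoleBinding":
        problems.append(f"kind is {kind!r}, expected 'RoleBinding'")
    namespace = manifest.get("metadata", {}).get("namespace")
    if namespace != expected_namespace:
        problems.append(
            f"namespace is {namespace!r}, expected {expected_namespace!r}"
        )
    return problems


wrong = {
    "kind": "ClusterRoleBinding",
    "metadata": {"name": "par-jordan-tenant-admin"},
}
right = {
    "kind": "RoleBinding",
    "metadata": {"name": "par-jordan-tenant-admin", "namespace": "gke-checkout-prod"},
}
```

Run against the controller's rendered output in CI, a check like this would have flagged the cluster-wide form on both counts before a reviewer had to.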

The wire-format bug: when seconds masqueraded as a duration

Contract mismatches between layers are the bugs that survive review, because each layer looks correct on its own. In this build, the CRD and kubectl layer expressed duration as a Go duration string, like "45m" or "4h0m0s". The API server’s wire contract expressed the same field as plain integer seconds. Both forms were valid. They just did not agree, and nothing in either layer flagged the clash.

The failure was subtle and total. Under the hood, the protobuf Duration message carries int64 nanoseconds. And the API server’s JSON encoder does not call Go’s custom duration marshaler the way the standard library does. So a request meaning “300 seconds” arrived as the integer 300. The server then read it as 300 nanoseconds. Every affected PrivilegedAccessRequest landed already expired, its window closing roughly three hundred billionths of a second after creation. The object looked healthy. The access never existed.

Gotcha: A grant that is born expired is worse than a grant that fails loudly. The object reports created, the status looks plausible, and mid-incident the responder assumes they hold access they do not have. So the team added an explicit seconds-to-nanoseconds conversion at the API-server boundary. They kept it separate from the CRD’s own duration-string form, so the two contracts never leak into each other again.
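
The arithmetic behind the bug and the fix fits in a few lines. This sketch works in integer nanoseconds throughout. `wire_seconds_to_nanos` and `is_expired` are hypothetical names, not the build's actual functions.

```python
NANOS_PER_SECOND = 1_000_000_000


def wire_seconds_to_nanos(seconds: int) -> int:
    """Convert the wire contract's integer seconds into nanoseconds, explicitly."""
    if isinstance(seconds, bool) or not isinstance(seconds, int):
        raise TypeError("wire duration must be an integer number of seconds")
    if seconds <= 0:
        raise ValueError("wire duration must be positive")
    return seconds * NANOS_PER_SECOND


def is_expired(created_ns: int, duration_ns: int, now_ns: int) -> bool:
    return now_ns >= created_ns + duration_ns


created_ns = 1_000 * NANOS_PER_SECOND
one_ms_later = created_ns + 1_000_000

# Bug: the integer 300 (seconds) is consumed as if it were nanoseconds.
buggy = is_expired(created_ns, 300, one_ms_later)
# Fix: convert at the boundary, then hand nanoseconds to the rest of the system.
fixed = is_expired(created_ns, wire_seconds_to_nanos(300), one_ms_later)
```

One millisecond after creation the buggy grant is already expired, while the converted one still has almost its full five minutes.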

A smaller cousin shipped alongside it. The Expiry print column used a date type. But a kubectl regression failed to parse valid RFC3339 timestamps under that type. It rendered <invalid> even though the stored value was always correct. Switching the print column to a string type fixed the display. The stored data was never wrong. Still, for a security control, an operator who cannot trust the expiry column in front of them is a real problem.

How should an admission webhook authorize elevation requests?

The admission webhook enforces authorization along two independent tracks before it allows a request. It admits transverse, platform-wide roles immediately, with no tenant check, because they legitimately span the whole platform. It admits per-asset roles only after it confirms the caller actually owns the tenant they want access to. This split is the load-bearing security boundary of the whole design.

The per-asset check closes a genuine escalation path. Each per-asset role encodes an asset identifier from the corporate CMDB. So the webhook parses that identifier from the caller’s group, resolves the target namespace to its tenant record, and requires the two to match. For example, a user holding the admin role for asset A cannot open a PrivilegedAccessRequest against a namespace owned by asset B. The first version of the parser skipped this ownership check entirely. The team found and closed that cross-tenant gap the same day they wired up the webhook.

What sits on the other side of that check? A tenant-inventory CRD synced from the CMDB. It carries the authoritative asset identifier, criticality, and ownership for each tenant. The webhook does not trust the request’s claimed scope. Instead, it trusts the inventory, and the inventory mirrors the organizational source of truth.

Security note: Transverse roles are powerful by design. So keep the transverse allowlist short, explicit, and hardcoded rather than pattern-matched. In this build, the webhook treats one asset’s operator and superuser roles as transverse only for that one specific asset identifier. It rejects any other asset’s identically named roles, rather than silently promoting them to platform-wide. A pattern match here would reopen the exact gap the ownership check closes.
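
The two tracks can be sketched as a single decision function. Everything concrete here is an assumption: the `asset-<id>-<role>` group naming, the asset identifiers, the restriction of the per-asset track to the `admin` role, and the use of a plain dictionary in place of the tenant-inventory CRD.

```python
import re

# Track 1: exact, hardcoded group names. No pattern matching on purpose.
TRANSVERSE_GROUPS = frozenset({"asset-P001-operator", "asset-P001-superuser"})

# Track 2: per-asset admin groups, with the asset identifier in the name.
_PER_ASSET_ADMIN = re.compile(r"asset-([A-Za-z0-9]+)-admin")


def is_authorized(caller_groups, target_namespace: str, inventory: dict) -> bool:
    """Decide whether the caller may open a request against target_namespace."""
    groups = set(caller_groups)
    if TRANSVERSE_GROUPS & groups:
        return True  # platform-wide role: admitted with no tenant check
    owner_asset = inventory.get(target_namespace)  # trust the inventory, not the request
    if owner_asset is None:
        return False  # unknown tenant: nothing to own
    for group in groups:
        match = _PER_ASSET_ADMIN.fullmatch(group)
        if match and match.group(1) == owner_asset:
            return True  # caller's asset matches the namespace's tenant record
    return False


inventory = {"team-checkout-services": "A123", "team-billing": "B456"}
```

Because the transverse track compares whole group names, `asset-B456-operator` is simply an unrecognized group here and falls through to a rejection instead of being promoted to platform-wide.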

What role do criticality gates play on mutualized clusters?

Criticality gates decide whether a lower-criticality workload can run on a higher-criticality cluster. By default the admission webhook enforces strict equality: a tenant’s criticality must match its cluster’s criticality exactly. That default is safe but too rigid for mutualized clusters. There, a cluster rated critical legitimately hosts tenants whose own workloads are merely standard.

The relaxation is opt-in and owner-scoped. An allow-below-criticality: "true" annotation on the cluster loosens the check from strict equality to “tenant criticality is at or below cluster criticality.” The annotation lives on the cluster, not on the tenant requesting access. So the cluster owner decides the policy, not the team asking to land a workload. The webhook validates its value and accepts only true or false, which stops a silent typo from quietly changing enforcement.

A second fix corrected where the config reads criticality from. For mutualized clusters, a cluster can override its own criticality via annotation. In that case, the derived config prefers the override over the value in the tenant inventory, and falls back to the inventory only when the annotation is absent. Enforcement stays entirely in the admission webhook. The API server does no criticality validation of its own. So a single, auditable place decides this policy.

Gotcha: “Below criticality” is numerically counterintuitive. Levels are numbered in rank order, most critical first: strategic ranks above critical, critical above standard, so a higher number means less critical. When you write the comparison, a tenant is allowed when its criticality number is greater than or equal to the cluster’s. Get the inequality backwards and you either block every valid workload or, worse, permit the ones you meant to keep out.
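
Writing the gate out makes the direction of the inequality hard to get wrong. In this sketch the numeric values (1, 2, 3) are assumed, the annotation key is used exactly as quoted above without any prefix, and treating a missing annotation as strict equality follows the default described earlier.

```python
# Assumed numbering: lower number = more critical.
CRITICALITY = {"strategic": 1, "critical": 2, "standard": 3}
ALLOW_BELOW = "allow-below-criticality"


def parse_allow_below(cluster_annotations: dict) -> bool:
    """Accept only the literal strings "true" or "false"; absent means strict."""
    raw = cluster_annotations.get(ALLOW_BELOW)
    if raw is None:
        return False
    if raw not in ("true", "false"):
        raise ValueError(f"{ALLOW_BELOW} must be 'true' or 'false', got {raw!r}")
    return raw == "true"


def tenant_admitted(tenant_level: str, cluster_level: str,
                    cluster_annotations: dict) -> bool:
    tenant = CRITICALITY[tenant_level]
    cluster = CRITICALITY[cluster_level]
    if parse_allow_below(cluster_annotations):
        return tenant >= cluster  # at or below the cluster's criticality
    return tenant == cluster      # default: strict equality
```

Rejecting `"True"` or `"yes"` outright, instead of treating them as false, is what keeps a typo from silently changing enforcement.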

Why make the CMDB the source of truth for RBAC?

Access should mirror the organization, not a pile of hand-edited bindings. A controller continuously syncs the corporate CMDB (ownership, group membership, and asset metadata) into tenant records and Kubernetes RBAC. So when a product lead changes or a team member joins, the binding follows automatically. Nobody edits a ClusterRoleBinding by hand, and every managed binding stays auditable by label. It is the same control-plane sync discipline that powers a full Kubernetes-as-a-Service control plane, applied to privilege instead of infrastructure.

Access splits into two tracks that meet at the same role vocabulary. The application-team track governs rights inside a tenant’s own namespaces. Each team’s lead delegates those rights through directory group membership. The directory-group track governs platform-wide, transverse roles, declared once and matched into every tenant. Both resolve to federated identity subjects rather than raw usernames. As a result, a single group-membership change on the directory side propagates into Kubernetes RBAC with no code change.

A subtle safeguard protects the fields everything else keys off. The controller captures identity fields like asset identifier, criticality, and value stream once at tenant-record creation, then freezes them. If a later CMDB fetch diverges, the controller does not overwrite them. Instead, it sets a change-flag label, so a human decides whether to cut over. In our experience, this is what stops a routine CMDB edit from silently re-shaping RBAC and network policy across dozens of live tenants mid-flight.
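
The freeze-and-flag behaviour might look like the sketch below. The field names, the `cmdb-change-pending` label, and the `sync_tenant` helper are hypothetical. The article does not say how the flag is cleared, so this sketch only ever sets it.

```python
FROZEN_FIELDS = ("asset_id", "criticality", "value_stream")
CHANGE_FLAG_LABEL = "cmdb-change-pending"  # hypothetical label name


def sync_tenant(record, cmdb: dict, labels: dict):
    """Return (record, labels) after one sync pass, without mutating the inputs."""
    if record is None:
        # First sight of this tenant: capture the identity fields once.
        return {field: cmdb[field] for field in FROZEN_FIELDS}, dict(labels)
    new_labels = dict(labels)
    diverged = [f for f in FROZEN_FIELDS if cmdb.get(f) != record[f]]
    if diverged:
        # Never overwrite frozen fields: flag the record for a human decision.
        new_labels[CHANGE_FLAG_LABEL] = "true"
    return dict(record), new_labels


cmdb_v1 = {"asset_id": "A123", "criticality": "critical", "value_stream": "checkout"}
cmdb_v2 = {"asset_id": "A123", "criticality": "standard", "value_stream": "checkout"}

record, labels = sync_tenant(None, cmdb_v1, {})
record_after, labels_after = sync_tenant(record, cmdb_v2, labels)
```

After the second pass the record still says `critical`, and the label is the only visible change. Everything keyed off criticality keeps its current shape until someone approves the cutover.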

Reference architecture and build-vs-buy

Assembled, the pattern is compact. A PrivilegedAccessRequest CRD models time-boxed elevation. An admission webhook enforces the single-user rule, the 48-hour cap, and the two-track ownership check. A controller creates and revokes a namespace-scoped RoleBinding on a deterministic schedule. And a CMDB sync keeps ownership and criticality authoritative. This is the neutral reference the vendor literature rarely draws: RBAC’s OR-logic reality, short-lived access, and an approval flow, connected into one multi-tenant design. It is the security half of the broader case for Kubernetes-driven platform engineering.

So should you build this or buy it? It comes down to how bespoke your tenancy model is. Commercial platforms such as Teleport, Apono, P0 Security, and CyberArk cover request, approval, and short-lived access well. For a standard estate, they are the faster path. The custom CRD approach earns its keep when access must cross-check against an internal CMDB and a criticality model that no off-the-shelf product knows about. That coupling, not the JIT mechanics, is what usually forces a build.

Set expectations honestly on the destination. Full zero standing privileges remains aspirational for most estates. The 2026 pragmatic target is JIT everywhere it matters, with zero standing privileges reserved for the highest-risk segments (Teleport, 2026). This maps cleanly onto audit drivers on both sides of the Atlantic. In Europe, NIS2 and ISO 27001 expect demonstrable access minimization and review. In North America, SOC 2 expects the same for privileged access. Time-boxed grants with a clean audit trail satisfy those controls far better than a spreadsheet of standing admins.

Closing

Standing privilege is the quiet default that shared clusters can least afford. Unit 42 puts 99% of cloud identities in the over-permissioned camp, and privileged accounts draw two-thirds of social-engineering attacks (Palo Alto Networks Unit 42, 2025). So the cost of leaving admin rights lying around only grows as more teams share the same control plane. A PrivilegedAccessRequest CRD, a two-track ownership webhook, and a CMDB-driven sync turn that default around without slowing anyone down.

The lessons that travel are the unglamorous ones. Scope your grants to the namespace. Validate your wire contracts across every layer. Check ownership at admission, not by inference. And freeze the identity fields the rest of the platform depends on. None of these show up in a demo. Yet all of them decide whether your JIT system is genuinely safe on a cluster shared by hundreds of teams. So aim for JIT everywhere it matters, and reserve zero standing privileges for the segments where the blast radius runs highest.

Frequently asked questions

What is just-in-time privileged access in Kubernetes?

Just-in-time privileged access grants elevated Kubernetes rights only for a bounded window, then revokes them automatically. Instead of standing admin `RoleBinding`s, a responder requests a time-boxed elevation naming one user, one role, and one namespace. It directly targets over-permissioning, which Unit 42 found affects 99% of cloud identities (Palo Alto Networks Unit 42, 2025).

Should a JIT grant use a RoleBinding or a ClusterRoleBinding?

On a shared cluster, use a namespace-scoped `RoleBinding`. A `ClusterRoleBinding` grants the role across every namespace, which becomes a cross-tenant escalation on mutualized clusters. The two forms look nearly identical in YAML, differing only in the `kind` field and a missing `namespace`, so treat any JIT grant that produces a `ClusterRoleBinding` as a bug until proven otherwise.

How long should just-in-time access last?

Long enough for the task, short enough to limit exposure, with a hard ceiling set by policy. This build caps every request at 48 hours and defaults to minutes for routine work. Automatic expiry matters because teams rarely revoke grants by hand: GitGuardian found 64% of 2022 secrets still active four years later (GitGuardian, 2026).

Does just-in-time access replace RBAC?

No. It complements static RBAC rather than replacing it. Steady-state least-privilege still governs everyday work through normal `RoleBinding`s synced from your source of truth. JIT handles the exceptions: incident response and rare elevated tasks. Keeping the two separate keeps the audit trail and expiry logic isolated from your everyday namespace guardrails.