Deterministic IPAM for Multi-Tenant Kubernetes on GKE

Your VPC does not run out of IPs because you have too many pods. It runs out because GKE, at its default settings, hands each node a whole /24 whether the node schedules 8 pods or 108. On a shared, multi-tenant platform that structural rounding compounds fast: every tenant cluster carves fresh ranges from the same finite address space, and the first team to provision a big cluster can starve everyone behind them. This is a playbook for the math nobody put in one place, the tenant blast radius it creates, and a deterministic allocation pattern that makes exhaustion a planning problem instead of an incident.

Key takeaways

GKE sizes each node’s pod CIDR by rounding up, so the default 110 max-pods burns a /24 per node (Google Cloud GKE networking docs, 2026).

Over 95% of GKE clusters run 30 pods per node or fewer, yet each node still consumes a /24 (Google Cloud Blog, 2025).

Multi-tenant platforms hit exhaustion first: shared VPC space, per-tenant ranges, no shared pod CIDR.

Fix it without re-architecting: discontiguous multi-Pod CIDR, Class E space, and right-sized max-pods.

Why does GKE burn a /24 on every node?

GKE does not allocate pod IPs on demand. It reserves a fixed pod CIDR per node, sized by rounding your max-pods setting up to the next power of two and then doubling it. At the default 110 max-pods, that math lands on a /24: 256 addresses per node, regardless of real density (Google Cloud GKE networking docs, 2026).
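
The rounding rule is easy to sanity-check in a few lines. Here is a minimal sketch in Python, assuming the rule exactly as described above (round max-pods up to a power of two, then double it). The name `per_node_prefix` is a hypothetical helper for illustration, not a GKE API.

```python
def per_node_prefix(max_pods: int) -> int:
    """Return the prefix length of the pod CIDR reserved for one node.

    Assumes the rule described in the text: round max-pods up to the
    next power of two, then double it.
    """
    if max_pods < 1:
        raise ValueError("max_pods must be positive")
    # Smallest power of two that is >= max_pods.
    rounded = 1 << (max_pods - 1).bit_length()
    block_size = 2 * rounded
    # An IPv4 block of 2**k addresses has prefix length 32 - k.
    return 32 - (block_size.bit_length() - 1)


if __name__ == "__main__":
    for max_pods in (110, 64, 32, 16):
        prefix = per_node_prefix(max_pods)
        print(f"max-pods {max_pods} -> /{prefix} ({2 ** (32 - prefix)} addresses)")
```

For the default of 110 this returns 24, and for 32 it returns 26, which is where the per-node block sizes in this article come from.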

GKE’s default of 110 pods per node forces the IP allocator to round each node up to a /24, or 256 addresses. A /20 pod range therefore holds just 16 nodes before the cluster cannot scale, per Google Cloud’s GKE networking best-practices documentation (2026).

The waste is not a tail case; it is the norm. Google’s own telemetry shows more than 95% of GKE clusters run 30 pods per node or fewer, yet each of those nodes still claims a full /24 (Google Cloud Blog, 2025). So you pay for 256 addresses and use a couple dozen. How many pods does your busiest node actually run? For most fleets the honest answer sits well under 40. Multiply that gap across every node and the shared VPC drains for reasons that have nothing to do with real traffic.

Here is the worked math. First, the per-node CIDR follows the power-of-two rounding. Then nodes-per-range is simple division. In our platform work the /20 row is the one that ambushes teams: it looks generous until you divide by 256.

max-pods per node	per-node CIDR (addresses)	nodes in /20	nodes in /18	nodes in /16
110 (default)	/24 (256)	16	64	256
64	/25 (128)	32	128	512
32	/26 (64)	64	256	1024
16	/27 (32)	128	512	2048
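
If you want to regenerate that table for your own range sizes, the division is a one-liner on top of Python's standard `ipaddress` module. This is a sketch, not a GKE tool: `nodes_in_range` is a hypothetical helper, and the `10.0.0.0` base address is a placeholder, since only the prefix length matters.

```python
import ipaddress


def nodes_in_range(pod_range: str, per_node_prefix: int) -> int:
    """Count how many per-node blocks of the given prefix fit in a pod range."""
    network = ipaddress.ip_network(pod_range)
    if per_node_prefix < network.prefixlen:
        return 0  # the per-node block is larger than the whole range
    return 2 ** (per_node_prefix - network.prefixlen)


if __name__ == "__main__":
    # Placeholder base address; swap in your own ranges.
    for prefix in (24, 25, 26, 27):
        row = [nodes_in_range(f"10.0.0.0/{size}", prefix) for size in (20, 18, 16)]
        print(f"/{prefix} per node: {row}")
```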

A 100-node cluster at defaults needs at least 25,600 pod IPs. A /18 gives you only 16,384, so scaling silently stalls at 64 nodes. A /17 seats 128 nodes, which leaves little room to grow, so plan on a /16 to seat the fleet with headroom (Google Cloud Community, 2025). The lever most teams reach for last, dropping max-pods, is also the cheapest. Moving from 110 to 32 pods per node quadruples your node ceiling in the same CIDR, without touching the VPC. Pod ranges are not the only claimant on a shared VPC either: Private Service Connect endpoints across multiple gateways draw from the same address space, so plan them into the same map.

Why do multi-tenant platforms hit IP exhaustion first?

Multi-tenant platforms exhaust address space earlier than single-team clusters because the waste is additive across tenants, not amortized. Every tenant cluster carves its own primary subnet plus separate pod and service secondary ranges from one shared VPC. Then the per-node /24 rounding applies inside each of those ranges, so structural overhead stacks tenant on tenant.

Say you onboard a tenant that needs three clusters in each of two regions. That is six clusters, so six primary subnets, six pod ranges, and six service ranges are carved before a single workload runs. Now picture ten tenants doing the same, and you can watch a /16 evaporate on paper.
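
To put numbers on that multiplication, here is a small sketch. It assumes, purely for illustration, that every cluster gets its own /20 pod range. `TenantShape`, `ranges_required`, and `pod_addresses_required` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantShape:
    clusters_per_region: int
    regions: int

    @property
    def clusters(self) -> int:
        return self.clusters_per_region * self.regions


def ranges_required(tenants: int, shape: TenantShape) -> dict[str, int]:
    """Each cluster needs one primary subnet, one pod range, one service range."""
    clusters = tenants * shape.clusters
    return {"primary": clusters, "pods": clusters, "services": clusters}


def pod_addresses_required(tenants: int, shape: TenantShape, pod_prefix: int) -> int:
    """Total pod addresses if every cluster gets its own range of pod_prefix."""
    return tenants * shape.clusters * 2 ** (32 - pod_prefix)
```

With ten tenants of that shape and the assumed /20 per cluster, the pod ranges alone come to 245,760 addresses, against 65,536 in a /16.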

The least-privilege rule makes this non-negotiable: never share a pod range across tenants. Why does that matter so much? A shared pod CIDR means one tenant scaling a batch job can exhaust another tenant’s production range. Then IP starvation surfaces as un-schedulable pods with no obvious owner. Isolation is correct, but isolation multiplies the fixed cost. The same isolation discipline governs east-west traffic, where a stale eBPF program can silently break GKE Dataplane V2 NetworkPolicy enforcement between tenants.

Gotcha: Discontiguous does not mean disorganized. Handing each tenant an ad-hoc range from wherever there is space today creates an audit nightmare and near-certain future overlap. You need isolation and a plan, which is where deterministic allocation earns its keep.

The pattern we see most often on European estates is RFC1918 exhaustion inherited from decades of mergers and acquisitions. In our experience the private space, roughly 17.9 million addresses across all three RFC1918 blocks, was carved up by a dozen legacy networks long before Kubernetes arrived (Google Cloud Blog, 2025). So by the time a platform team wants contiguous /16s for GKE, the map is already a patchwork. North American teams hit a different wall: sprawling multi-account AWS estates where nobody owns the global CIDR plan.

What is deterministic SubnetPool IPAM?

Deterministic SubnetPool IPAM replaces on-demand CIDR splitting with pre-planned allocation from architect-declared pools. Instead of an algorithm carving a big block at provision time, architects pre-compute every valid subnet as an explicit catalog entry. A cluster claims a whole entry atomically. The same request against the same catalog state always yields the same subnet, so reconciliation is idempotent and every allocation traces to one auditable entry.

Sharing a pod range across tenants lets one team’s scale event starve another’s production workload, so multi-tenant best practice mandates a separate subnet and secondary range per cluster. Deterministic pools enforce that isolation while keeping the entire CIDR map planned and auditable rather than algorithmically derived at provision time.

The core object is a subnet-pool catalog resource, keyed by region, access type, criticality, and cluster type. Each pool holds an ordered list of entries. An entry is a primary CIDR plus named pods and services secondary ranges, and an optional /28 master range for the private control plane.

apiVersion: platform.example.io/v1
kind: SubnetPool
metadata:
  name: europe-west1-internal-standard
spec:
  region: europe-west1
  accessType: Internal        # Internal | External
  criticality: Standard       # Standard | Critical | Strategical
  reclaimPolicy: Quarantine   # Retain | Delete | Quarantine (7d TTL)
  entries:
    - primary: 10.4.0.0/22
      pods:   10.128.0.0/17    # right-sized secondary range
      services: 10.4.4.0/22
      masterCidr: 10.4.8.0/28  # optional, validated /28
    - primary: 10.4.12.0/22
      pods:   10.128.128.0/17
      services: 10.4.16.0/22

That reclaimPolicy line is not decoration. We once watched a reused CIDR resurrect a stale DNS record, and the 7-day quarantine window is what kept it from becoming an outage.
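
A catalog like this is only trustworthy if its entries are checked before anything claims them. The sketch below shows one way such a check could look, using Python's standard `ipaddress` module. It is not the platform's actual admission logic. It assumes entries arrive as plain dictionaries with the field names from the manifest above, and `validate_entries` is a hypothetical name.

```python
import ipaddress
from itertools import combinations


def validate_entries(entries: list[dict[str, str]]) -> list[str]:
    """Return a list of human-readable problems; empty means the pool is clean."""
    problems = []
    labelled = []  # (label, network) for every range declared in the pool
    for index, entry in enumerate(entries):
        for field, cidr in entry.items():
            # ip_network raises ValueError on a misaligned or malformed CIDR.
            network = ipaddress.ip_network(cidr)
            if field == "masterCidr" and network.prefixlen != 28:
                problems.append(f"entry {index}: masterCidr {cidr} is not a /28")
            labelled.append((f"entry {index} {field}", network))
    # Pairwise comparison catches overlap within and across entries.
    for (label_a, net_a), (label_b, net_b) in combinations(labelled, 2):
        if net_a.overlaps(net_b):
            problems.append(f"{label_a} overlaps {label_b}")
    return problems
```

Run against the two entries in the manifest above, it reports no problems. Add a third entry whose primary is `10.4.2.0/24` and it flags the overlap with the first entry's primary.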

How does deterministic allocation stay race-free?

When a cluster needs a subnet, the controller selects a matching pool, picks the first free entry under a retry-on-conflict loop, then records the claim as a separate allocation object. That loop is a fix we shipped after an early incident in which two clusters reconciled at the same instant and grabbed overlapping ranges. Before materializing the subnet, the controller validates the chosen CIDR against existing managed and raw subnets in the same VPC. Because pools pre-declare the whole universe, overlap detection between managed allocations is a set-membership check, not interval math. A controller that claims ranges must also tolerate stale caches, which make Kubernetes controllers act on outdated state; otherwise two reconciles can still disagree about what is free.
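
The claim loop is easier to reason about in miniature. What follows is a sketch of the general retry-on-conflict technique, not our controller's code. `PoolStore` is a hypothetical in-memory stand-in for an API server that rejects writes based on a stale version, in the way Kubernetes does with `resourceVersion`.

```python
from dataclasses import dataclass, field


class Conflict(Exception):
    """Raised when a write is based on a stale view of the store."""


@dataclass
class PoolStore:
    """In-memory stand-in for an API server with optimistic concurrency."""
    entries: list[str]
    claims: dict[str, str] = field(default_factory=dict)  # entry -> cluster
    version: int = 0

    def read(self) -> tuple[int, dict[str, str]]:
        return self.version, dict(self.claims)

    def write_claim(self, seen_version: int, entry: str, cluster: str) -> None:
        # Reject the write if anything changed since the caller last read.
        if seen_version != self.version or entry in self.claims:
            raise Conflict(entry)
        self.claims[entry] = cluster
        self.version += 1


def claim_first_free(store: PoolStore, cluster: str, max_attempts: int = 5) -> str:
    """Claim the first unclaimed entry, re-reading state after every conflict."""
    for _ in range(max_attempts):
        version, claims = store.read()
        # Idempotency: a cluster that already holds an entry gets it back.
        for entry, owner in claims.items():
            if owner == cluster:
                return entry
        free = [entry for entry in store.entries if entry not in claims]
        if not free:
            raise RuntimeError("pool exhausted")
        try:
            store.write_claim(version, free[0], cluster)
            return free[0]
        except Conflict:
            continue  # someone else won the race; re-read and try again
    raise RuntimeError("gave up after repeated conflicts")
```

The important property is that a loser of the race never keeps its stale pick: it re-reads, sees the winner's claim, and moves to the next free entry.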

How are pools bootstrapped per environment?

A bootstrap step seeds every new environment with the full cross-product of pools, region by access type by criticality by cluster type, from a catalog of per-region CIDR entries. Reconciliation is additive-only: it never rewrites existing allocations and rejects intra-pool overlap at admission. That master CIDR plan, authored once up front, is the artifact your capacity forecasting reads from.
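
As a rough illustration of the bootstrap, the cross-product and the additive-only rule can be sketched as below. The naming scheme (dimension values joined with hyphens) and both function names are assumptions made for this example, and the real reconciler also handles admission checks that are left out here.

```python
from itertools import product


def desired_pool_names(regions, access_types, criticalities, cluster_types):
    """Build one pool name per combination of the four key dimensions."""
    return [
        "-".join(part.lower() for part in combo)
        for combo in product(regions, access_types, criticalities, cluster_types)
    ]


def reconcile_additive(
    existing: dict[str, list[str]], desired: dict[str, list[str]]
) -> dict[str, list[str]]:
    """Add missing pools and append new entries; never touch what already exists."""
    result = {name: list(entries) for name, entries in existing.items()}
    for name, entries in desired.items():
        current = result.setdefault(name, [])
        for entry in entries:
            if entry not in current:
                current.append(entry)  # existing entries keep their order
    return result
```

Because existing entries are never reordered or removed, a re-run of the bootstrap cannot change which entry an earlier claim points at.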

What about non-primary regions and BYO subnets?

A single catch-all pool covers every region outside your primary footprint, selected by pool name rather than region label, so you extend coverage without duplicating the plan. Our first catalog only covered two European regions. The day a team asked for a third, we added a catch-all pool instead of hand-editing CIDRs. Bring-your-own subnets are the trickier edge case. When a tenant supplies an external subnet, the platform cannot derive the control-plane master range. So it requires an explicit master CIDR block and secondary range name on the spec.
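
The bring-your-own rule boils down to two required fields. A minimal sketch of that check follows. The field names `masterCidr` and `secondaryRangeName` and the function name are assumptions for illustration, not the platform's real schema.

```python
import ipaddress


def validate_byo_spec(spec: dict[str, str]) -> list[str]:
    """Check a bring-your-own subnet spec for the fields the platform cannot derive."""
    problems = []
    master = spec.get("masterCidr")
    if not master:
        problems.append("masterCidr is required for a bring-your-own subnet")
    else:
        try:
            if ipaddress.ip_network(master).prefixlen != 28:
                problems.append("masterCidr must be a /28")
        except ValueError as error:
            problems.append(f"masterCidr is not a valid CIDR: {error}")
    if not spec.get("secondaryRangeName"):
        problems.append("secondaryRangeName is required for a bring-your-own subnet")
    return problems
```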

How do you fix exhaustion without re-architecting?

You can reclaim address space on a live platform without rebuilding the VPC, using three moves in increasing order of blast radius. First, right-size max-pods. Then add discontiguous pod CIDR. Only reach for non-RFC1918 space when the private map is genuinely full. None of these requires cluster re-creation on modern GKE.

Right-size max-pods first. It is the highest-leverage, lowest-risk change. Do most of your workloads really pack 110 pods onto a node? Almost none do. The table above shows the ceiling multiplying as density drops, so set max-pods to your real p99 density plus margin, not the default. On GKE Standard the value is fixed when a node pool is created, so on a live cluster you apply it by adding node pools with the lower value and migrating workloads off the old ones.

Add discontiguous multi-Pod CIDR. GKE lets you attach additional, non-adjacent pod ranges to a running cluster, and a subnet supports up to 170 secondary ranges (Google Cloud docs, 2026). Say your only free space is three scattered /22s. You do not need them contiguous. Attach all three as pod ranges, and the cluster stitches capacity from whatever fragments the legacy map left you, without re-creation.
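
To see what three scattered /22s actually buy you, count per-node blocks range by range, since a node's block cannot straddle two ranges. This is a sketch using Python's `ipaddress` module. The three CIDRs in the note after it are made-up placeholders, and `node_capacity` is a hypothetical helper.

```python
import ipaddress


def node_capacity(pod_ranges: list[str], per_node_prefix: int) -> int:
    """Sum the per-node blocks available across non-adjacent pod ranges."""
    total = 0
    for cidr in pod_ranges:
        network = ipaddress.ip_network(cidr)
        # A range smaller than one per-node block contributes nothing.
        if network.prefixlen <= per_node_prefix:
            total += 2 ** (per_node_prefix - network.prefixlen)
    return total
```

For example, `node_capacity(["10.20.4.0/22", "10.37.128.0/22", "172.18.64.0/22"], 24)` gives 12 nodes at the default /24 per node, and 48 nodes if max-pods is lowered so that each node takes a /26.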

Use Class E as an escape hatch. When RFC1918 is exhausted, the Class E range (240.0.0.0/4) offers roughly 268.4 million usable addresses, against roughly 17.9 million across all of RFC1918 (Google Cloud Blog, 2025). GKE supports it for pod ranges. Treat it as internal-only space and test your on-prem and firewall paths, since some legacy gear still refuses to route it.

Gotcha: Class E is not a free lunch. Older network appliances, some OS stacks, and certain SaaS allow-lists still drop 240.0.0.0/4 as reserved. Validate the full path before you commit a tenant to it, or you will trade IP exhaustion for silent connectivity failures.
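
The address counts quoted for RFC1918 and Class E are raw block sizes, and you can reproduce them with the standard library. A quick check, with `total_addresses` as a hypothetical helper name:

```python
import ipaddress

RFC1918_BLOCKS = ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
CLASS_E = "240.0.0.0/4"


def total_addresses(cidrs) -> int:
    """Add up the raw address count of each CIDR block."""
    return sum(ipaddress.ip_network(cidr).num_addresses for cidr in cidrs)


if __name__ == "__main__":
    print(f"RFC1918: {total_addresses(RFC1918_BLOCKS):,}")
    print(f"Class E: {total_addresses([CLASS_E]):,}")
```

The totals are 17,891,328 for RFC1918 and 268,435,456 for Class E, about fifteen times as much space.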

What changes when you do this on AWS?

The IP math is EKS-flavored rather than GKE-flavored, but the discipline is identical: plan the global CIDR space before tenants consume it. So on multi-account, multi-VPC AWS estates, use AWS IPAM to allocate and track pools centrally instead of letting each account grab ranges independently. The failure mode, uncoordinated overlap and silent exhaustion, is the same one deterministic pools solve on GCP.

The AWS-specific rule of thumb is headroom: keep 20 to 30% of every subnet free so autoscaling and rolling deployments have burst room. Deterministic pools and AWS IPAM are the same idea wearing different logos. Both replace “grab a range when you need one” with “claim a pre-planned entry from a governed catalog.” As a result, overlap becomes a membership check instead of a post-incident forensics exercise. So if you run both clouds, author one CIDR plan and project it into each provider’s tool rather than maintaining two mental models.

How do you plan capacity before the VPC runs dry?

Capacity planning starts from the node ceiling, not the pod count. Because each node consumes a fixed CIDR, your real constraint is nodes-per-range. For example, a 100-node default cluster needs 25,600 pod IPs: a /18 is too small, a /17 fits with only 28 node slots to spare, and a /16 seats it with comfortable margin (Google Cloud Community, 2025). So size backward from peak node count, then reserve headroom on top.
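
Sizing backward from peak nodes can be written down as a small function. This is a sketch under stated assumptions. `headroom` is the fraction of the range you want left free at peak, the result is rounded to the next CIDR boundary, and `pod_range_prefix` is a hypothetical name.

```python
import math


def pod_range_prefix(peak_nodes: int, per_node_prefix: int, headroom: float = 0.0) -> int:
    """Return the longest prefix whose range seats peak_nodes plus headroom.

    headroom is the fraction of the range to keep free (0.25 keeps 25% free).
    """
    if peak_nodes < 1:
        raise ValueError("peak_nodes must be positive")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    blocks_needed = math.ceil(peak_nodes / (1 - headroom))
    # CIDR sizes are powers of two, so round the block count up to one.
    blocks = 1 << (blocks_needed - 1).bit_length()
    return per_node_prefix - (blocks.bit_length() - 1)
```

For 100 nodes at a /24 each, `pod_range_prefix(100, 24)` returns 17, and asking for 25% headroom with `pod_range_prefix(100, 24, headroom=0.25)` returns 16. Drop to a /26 per node and the same fleet with the same headroom fits in a /18.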

Run this checklist before you cut the first tenant subnet:

Set max-pods to real density. Base it on measured p99 pods per node, not 110. This single number sets every downstream range size.
Size pod ranges from peak nodes. Multiply peak node count by the per-node CIDR from the table, then add a growth buffer.
One subnet and secondary range per tenant cluster. Never share a pod range across tenants; isolation caps each tenant’s blast radius.
Author a master CIDR plan up front. Pre-declare pools per region and environment so allocation is deterministic and overlap is a membership check.
Keep 20 to 30% subnet headroom. Give autoscaling and rolling deployments room to burst without hitting the wall.
Pre-clear a Class E or secondary block. Validate the routing path for an escape-hatch range before you need it, not during an outage.

The plan is the product

IP exhaustion feels like an infrastructure limit, but on a multi-tenant Kubernetes platform it is almost always a planning gap. GKE’s per-node rounding is predictable, the tenant multiplication is predictable, and the fixes, right-sized max-pods, discontiguous CIDR, and Class E, are all documented and available today. So what is usually missing? A single authoritative CIDR plan that every allocation reads from.

Deterministic SubnetPool IPAM turns that plan into an enforced contract: architects pre-declare the address universe, controllers claim whole entries atomically, and overlap becomes a set-membership check instead of an outage. This SubnetPool controller is one component of a larger pattern, the same one behind building Kubernetes as a Service with a custom control plane. Do the math once, per region and per environment, before the first tenant provisions. The teams that survive their own growth are the ones who treated address space as a designed resource, not a runtime accident.

Frequently asked questions

Why does GKE assign a /24 per node by default?

GKE reserves each node's pod CIDR ahead of time, sized by rounding max-pods up to the next power of two and doubling it. The default 110 max-pods rounds to a /24, or 256 addresses, no matter how few pods actually schedule. Lowering max-pods shrinks the per-node block directly.

Can I add IP capacity to a running GKE cluster?

Yes. GKE supports discontiguous multi-Pod CIDR, letting you attach additional non-adjacent pod ranges to a live cluster, and a single subnet holds up to 170 secondary ranges. You avoid re-creation entirely and stitch capacity from whatever fragments your existing address map still has free.

Is it safe to share a pod range across tenants?

No. A shared pod range means one tenant scaling a workload can exhaust the addresses another tenant's production pods depend on, and the failure appears as un-schedulable pods with no clear owner. Multi-tenant best practice mandates a separate subnet and secondary range per cluster, which deterministic pools enforce by construction.

When should I use Class E address space?

Reach for Class E (240.0.0.0/4) only when RFC1918 is genuinely exhausted. It offers roughly 268.4 million addresses versus 17.9 million across RFC1918. Treat it as internal-only and validate on-prem routing and firewall paths first, since some legacy gear still drops the range as reserved.