Private connectivity stopped being a niche requirement and became the default. Google Cloud reported that Private Service Connect (PSC) traffic grew 4x in 2025 and now fronts more than 40 published Google services (Google Cloud, “What’s new in cloud networking at Next ’26”, 2026). At the same time, 82% of organizations now run Kubernetes in production (CNCF Annual Survey 2025, 2026). Put those two curves together and you get a hard question most platform teams eventually face: how do you let dozens of internal teams expose their Kubernetes services privately, across projects and VPCs, without every team learning the plumbing? This article is the answer we landed on after building it.

Key takeaways

  • Route all inbound traffic through a small set of purpose-scoped gateways (apps, admin, apigee, generic), never through ad-hoc public IPs.
  • Use PSC producer/consumer wiring behind a single ServiceExposure abstraction so teams request exposure declaratively.
  • Standardize per-service DNS and TLS conventions up front. PSC traffic grew 4x in 2025 (Google Cloud, 2026), so the naming scheme has to scale.
  • Automate certificates with cert-manager and an ACME DNS-01 CNAME delegation zone to keep IAM footprints tiny.

Everything below comes from running this pattern on a major European automotive manufacturer’s internal cloud platform. Names and identifiers are generic on purpose. The architecture is what matters.

Why does private service exposure get hard at enterprise scale?

Private exposure gets hard because the number of moving parts multiplies with every team. Consider one service: it needs a VIP, a forwarding rule, a PSC NAT subnet, a service attachment, per-consumer endpoints, DNS records, and TLS. Do that by hand across 40+ services and you have a full-time job plus a growing pile of drift. Sound familiar?

The failure mode we saw first was inconsistency. For example, one team terminated TLS at the gateway while another passed it straight through. One published an internal A-record, another leaked a public IP into a spec. Every exposure looked slightly different, which made audits slow and incident response slower. So the fix was not more documentation. Instead, it was a single declarative abstraction that hides PSC’s complexity while enforcing the rules teams kept getting wrong. That abstraction is one custom resource in a broader Kubernetes-as-a-Service control plane.

But the compliance pressure is just as real, and it is regional. Enterprises across Europe and North America increasingly treat private connectivity as a data-residency and governance control, not a nice-to-have. In practice, regulators and internal security teams want proof that internal traffic never traverses the public internet. PSC gives you that proof at the network layer, but only if you make it the paved road.

How does Private Service Connect actually work?

PSC connects a service producer to a service consumer privately, using a NAT layer instead of VPC peering. First, the producer publishes a ServiceAttachment backed by a NAT subnet. Then the consumer creates an endpoint that maps a private IP to that attachment. As a result, traffic never touches a public IP, and the two VPCs need no peering.

That producer/consumer split is the whole trick. In practice, managed services already lean on it: Databricks, Confluent, ClickHouse, and OpenShift Dedicated all offer PSC connectivity into customer VPCs (Google Cloud docs, 2026). But what almost nobody documents is how to apply the same model to your own internal Kubernetes services, fronted by Gateway API. That integration story is the gap this article fills.
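To make the producer half concrete, here is a minimal sketch of a PSC NAT subnet and service attachment written as Config Connector resources. It assumes Config Connector is the provisioning layer; every name, the CIDR range, the project IDs, and the region are hypothetical, and the forwarding-rule reference format is illustrative rather than copied from a real environment.

```yaml
# Sketch only: names, CIDR, projects, and region are made up for illustration.
apiVersion: compute.cnrm.cloud.google.com/v1beta1
kind: ComputeSubnetwork
metadata:
  name: psc-nat-checkout
spec:
  region: europe-west1
  purpose: PRIVATE_SERVICE_CONNECT    # marks this subnet as a PSC NAT range
  ipCidrRange: 10.100.0.0/28
  networkRef:
    name: producer-vpc
---
apiVersion: compute.cnrm.cloud.google.com/v1beta1
kind: ComputeServiceAttachment
metadata:
  name: checkout-apps
spec:
  projectRef:
    external: projects/example-producer
  location: europe-west1
  targetServiceRef:
    # Forwarding rule of the gateway's internal load balancer (illustrative path).
    external: projects/example-producer/regions/europe-west1/forwardingRules/checkout-apps-ilb
  natSubnets:
    - name: psc-nat-checkout
  connectionPreference: ACCEPT_MANUAL # only listed consumer projects may connect
  enableProxyProtocol: false
  consumerAcceptLists:
    - projectRef:
        external: projects/example-consumer
      connectionLimit: 10
```

The consumer half is the mirror image: a reserved internal address plus a forwarding rule in the consumer VPC whose target is this service attachment.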

Here is the core abstraction we settled on. A single CRD derives region, VPC, and subnets from the referenced cluster. It allocates the gateway VIP through a managed address, never a raw IP, then discovers the resulting forwarding rule and provisions the PSC NAT subnet plus service attachment.

The forwarding-rule discovery bit hides a real trap. In production we found the cloud-created forwarding rule can take a full default reconcile interval, ten minutes, to show up. So the controller temporarily drops the address reconcile interval to 10 seconds while it waits, then restores the 600-second default once the self-link appears. Small hack, big drop in reconcile latency.

apiVersion: platform.example.io/v1alpha1
kind: ServiceExposure
metadata:
  name: checkout-apps
spec:
  gateway:
    purpose: apps            # apps | admin | apigee | generic
    implementation: envoy-gateway
    vip:
      allocation: Auto       # Auto lets the cloud pick; Desired reserves a fixed IP
  pscProducer:
    nat:
      mode: Managed          # Managed | BringYourOwn
      rangeSelector:
        matchNames: [psc-nat-pool]
  managedConsumers:
    - name: exchange-internal
  tls:
    mode: Terminate
    certificate:
      strategy: cert-manager

Note: Consumer endpoints are auto-provisioned only for pre-approved exchange zones where the platform already holds IAM rights. Anything else is skipped and surfaced through a ConsumersSkipped condition rather than failing silently. That deliberate friction point stops ungoverned cross-boundary connectivity before it starts.
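The temporary interval drop described in the forwarding-rule discussion above maps naturally onto Config Connector's per-resource reconcile annotation. This is a minimal sketch, assuming the gateway VIP is managed as a Config Connector ComputeAddress; the resource name, region, and subnet are hypothetical, and the controller logic that flips the value is not shown.

```yaml
apiVersion: compute.cnrm.cloud.google.com/v1beta1
kind: ComputeAddress
metadata:
  name: checkout-apps-vip
  annotations:
    # Set to "10" while waiting for the forwarding rule's self-link,
    # then put back to "600" (the default) once it has been discovered.
    cnrm.cloud.google.com/reconcile-interval-in-seconds: "10"
spec:
  location: europe-west1
  addressType: INTERNAL      # private VIP for the gateway's internal load balancer
  subnetworkRef:
    name: gateway-subnet
```

Keeping the override on the single resource being waited on, rather than lowering the interval globally, limits the extra API calls to the short window where they actually help.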

Private Service Connect traffic grew fourfold during 2025 and now fronts more than 40 published Google services (Google Cloud, “What’s new in cloud networking at Next ’26”, 2026). Major data platforms including Databricks, Confluent, and ClickHouse now standardize on PSC for private VPC ingress (Google Cloud docs, 2026).

Why route everything through purpose-scoped gateways?

Purpose-scoped gateways give you one enforcement point per traffic class instead of one per service. We run four: apps for public-facing application traffic, admin for internal-only tooling such as an Argo CD UI, apigee for API traffic that needs policy applied first, and generic for everything the managed model cannot express. Each gets its own VIP and a different backend.

Why not just let each team pick its own gateway? Because the traffic class, not the team, should decide the policy. So an admin UI never rides the same VIP as public app traffic, and API calls always hit Apigee policy before the cluster. In practice, that split alone prevents a whole category of “why is this endpoint reachable from the internet” incidents.

Envoy Gateway is the implementation underneath all four. One year after GA, it saw sharp Helm pull growth and Fortune 500 adoption (CNCF blog, 2025), which is exactly the maturity signal you want before betting a platform on it. It implements Gateway API cleanly, so the same Gateway and route abstractions cover HTTP, HTTPS, and TCP without bespoke glue.

The naming is boringly deterministic, which is the point. Each gateway is named <purpose>-<hash(exposureName)>, and the Gateway plus EnvoyProxy resources are created in the remote workload cluster, not the management cluster. Making generic an explicit enum value rather than a free-form label was a deliberate safety call: a typo’d purpose string should never silently opt a service into a different code path. Small decision, large blast-radius reduction.
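One way to get that safety is to put the enum in the CRD schema itself, so the API server rejects an unknown purpose before any controller code runs. The fragment below is a sketch of that idea, not the platform's actual CRD; only the field name and the four values come from the manifest shown earlier.

```yaml
# Fragment of a CustomResourceDefinition openAPIV3Schema (not a complete CRD).
properties:
  spec:
    type: object
    properties:
      gateway:
        type: object
        required: [purpose]
        properties:
          purpose:
            type: string
            # A typo such as "genric" fails schema validation at admission.
            enum: [apps, admin, apigee, generic]
```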

How do you name and resolve services with per-service DNS?

Every exposed service gets a fully qualified name from a fixed template, so resolution is predictable and auditable. The convention encodes purpose, cluster identity, tenancy mode, value stream, and environment. Get this right once and DNS becomes self-documenting: you can read any hostname and know exactly which gateway, cluster, and tenant it belongs to.

The template looks like this:

*.<purpose>.<clusterHash>.<shared|dedicated>.<vsNN>.<env>.svc.platform.example.io.

The shared or dedicated segment reflects whether the tenant is mutualized (shared) or not (dedicated). The vsNN value-stream code is formatted by a single shared helper reused by both the DNS-record step and the gateway listener hostname, so the two can never drift apart. Consumer A-records for managed consumers are then published automatically.

We centralized these zones deliberately. Instead of each workload project owning its own internal Cloud DNS, records now live in per-environment common zones in the management project. The old per-VPC approach meant IAM and cleanup logic had to be replicated everywhere, and it drifted constantly. So one shared zone per environment, written by one controller, killed an entire class of “why is this record missing” tickets.

One incident here still stings. We added a new managed-consumer profile to the internal table but forgot the matching branch in the DNS-record step. The result was an empty zone reference that Config Connector rejected with “must validate one and only one schema (oneOf)”. Four separate places had to move together: the CRD enum, the proto enum, the validator allowlist, and the profile table. Miss one and the reconcile fails in a way the error message barely explains. So we now treat profile changes as an atomic, four-file edit.
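For context on that error, Config Connector resource references generally accept exactly one of name or external, which is where the oneOf wording comes from. Here is a minimal sketch of a consumer A-record in a common zone; the zone name, cluster hash, value-stream code, and IP address are hypothetical.

```yaml
apiVersion: dns.cnrm.cloud.google.com/v1beta1
kind: DNSRecordSet
metadata:
  name: checkout-apps-exchange-internal
spec:
  # Follows the hostname template above; hash and vs code are made up.
  name: "*.apps.ab12cd.shared.vs01.dev.svc.platform.example.io."
  type: A
  ttl: 300
  managedZoneRef:
    # Exactly one of `name` or `external` must be set. An empty reference,
    # as in the incident above, matches neither schema and is rejected.
    name: common-dev-zone
  rrdatas:
    - 10.20.0.5
```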

Managed vs bring-your-own NAT: which should you pick?

Pick Managed NAT unless a network boundary forces your hand. In Managed mode the PSC NAT subnet is derived automatically from the cluster’s subnet pool, so teams never touch IP planning. BringYourOwn mode, by contrast, takes caller-supplied subnets. It exists for cases where a separate network team owns the address space and refuses delegation.

So which mode should you reach for by default? Managed, almost always. The two modes are mutually exclusive, enforced with a validation rule so a spec can never set both. That matters more than it sounds. A NAT misconfiguration does not fail loudly; instead it fails as intermittent connection resets weeks later. Making the modes exclusive at admission time turns a debugging nightmare into a rejected apply.
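A rule like that can live in the CRD schema as a CEL validation. The fragment below is a sketch under that assumption; the article does not say which mechanism the platform uses, and the exact field layout under nat is inferred from the two manifests in this article.

```yaml
# Fragment of the CRD schema for spec.pscProducer.nat (illustrative, not a full CRD).
nat:
  type: object
  required: [mode]
  properties:
    mode:
      type: string
      enum: [Managed, BringYourOwn]
    rangeSelector:              # used by Managed mode
      type: object
      properties:
        matchNames:
          type: array
          items:
            type: string
    subnets:                    # used by BringYourOwn mode
      type: array
      items:
        type: string
  x-kubernetes-validations:
    - rule: "!(has(self.rangeSelector) && has(self.subnets))"
      message: "rangeSelector (Managed) and subnets (BringYourOwn) are mutually exclusive"
    - rule: "self.mode != 'BringYourOwn' || (has(self.subnets) && size(self.subnets) > 0)"
      message: "BringYourOwn requires at least one subnet"
    - rule: "self.mode != 'Managed' || !has(self.subnets)"
      message: "Managed mode derives its subnet and must not list subnets"
```

With rules like these, a spec that mixes the two modes is rejected at apply time instead of surfacing later as connection resets.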

BringYourOwn earns its keep in messy topologies. We hit a case where a non-production environment had no routes to another VPC, a deliberate network-team boundary. So its NAT subnets had to be deleted and recreated under a different VPC with identical names. Because subnet self-links encode the project and region but not the VPC name, the profile table kept working unchanged once the subnets came back. That is a happy accident of the self-link format, and worth knowing before you design around it.

Gotcha: Removing a consumer from a spec did not originally clean up its forwarding rule, DNS record, and address. The creation pipeline only touched consumers currently present, with no diff against the previous spec. Orphaned cloud resources piled up quietly. The fix was an explicit cleanup step that diffs current spec against discovered resources and deletes anything no longer referenced. If you build this pattern, add that diff step from day one.

How do generic gateways handle TCP and TLS passthrough?

Generic gateways handle the protocols the managed HTTP model cannot: raw TCP, and TLS that terminates at the backend. A generic gateway drops all managed-consumer DNS and forwarding-rule steps and instead exposes customListeners, each with a name, port, protocol, and optional TLS. The two are mutually exclusive; a generic gateway never mixes managed consumers with custom listeners.

The use case that drove this was concrete: an event broker speaking MQTTS, where the client and broker negotiate TLS end to end. Terminating at the gateway would break that handshake. So custom listeners support a Passthrough TLS mode, mode-only with no certificate references, alongside normal Terminate. For example, an MQTTS listener on port 8883 just forwards bytes and lets the broker own the certificate.

spec:
  gateway:
    purpose: generic
    implementation: envoy-gateway
  customListeners:
    - name: mqtts
      port: 8883
      protocol: TCP          # TCP listeners forbid a hostname
      tls:
        mode: Passthrough    # broker terminates TLS itself, no certificateRefs
  pscProducer:               # optional for generic TCP gateways
    nat:
      mode: BringYourOwn
      subnets: [projects/example/regions/europe-west1/subnetworks/broker-nat]

Two design notes make this safe. First, TCP listeners forbid a hostname while HTTP and HTTPS require one, validated at admission. Second, pscProducer became optional for generic gateways, since a TCP gateway may be reachable only via its own internal load-balancer VIP. As a result, every controller step that touches the producer needed a nil guard.
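For readers who want to see what sits underneath, here is one possible rendering of the mqtts custom listener into Gateway API resources. It is a sketch, not the controller's actual output: it assumes passthrough is modeled as a TLS listener with a TLSRoute, the gateway class name, hash suffix, namespace, and backend Service are hypothetical, and the TLSRoute API version depends on which Gateway API CRDs are installed.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: generic-3f9a1c               # <purpose>-<hash(exposureName)>, hash made up
  namespace: envoy-gateway-system
spec:
  gatewayClassName: envoy-gateway    # hypothetical class name
  listeners:
    - name: mqtts
      port: 8883
      protocol: TLS                  # Gateway API expresses passthrough on a TLS listener
      tls:
        mode: Passthrough            # no certificateRefs: the broker owns the certificate
      allowedRoutes:
        kinds:
          - kind: TLSRoute
---
apiVersion: gateway.networking.k8s.io/v1alpha2
kind: TLSRoute
metadata:
  name: mqtts-broker
  namespace: envoy-gateway-system    # same namespace as the Gateway to keep the sketch short
spec:
  parentRefs:
    - name: generic-3f9a1c
      sectionName: mqtts             # bind to the listener above
  rules:
    - backendRefs:
        - name: event-broker         # hypothetical broker Service
          port: 8883
```

A plain TCP listener with a TCPRoute would also forward the encrypted bytes untouched; the TLS variant is shown because it leaves room for SNI-based routing later.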

Gotcha: The first generic-gateway exposure reconciled forever. One consumer step checked consumer status before checking whether any consumers existed at all, so with zero consumers it never settled. The fix was the same early-return-when-empty guard the other steps already had. Whenever you add a new resource kind to a step-based reconciler, audit every step for the empty-input case.

How do you automate TLS with ACME DNS-01 CNAME delegation?

You automate it by giving cert-manager write access to exactly one DNS zone and pointing every other zone's ACME challenge record at it with a CNAME. The controller creates the cert-manager Certificate in the Envoy Gateway namespace, together with the cluster-scoped ClusterIssuer it references, only when the TLS strategy is set to cert-manager, and deletes the certificate if the strategy changes.

The delegation pattern is the clever bit. Alongside each consumer’s normal DNS record, the controller also writes a CNAME in every consumer zone:

_acme-challenge.<hash>.<shared|dedicated>.<env>.<consumer-zone>.
  CNAME _acme-challenge.<hash>.<shared|dedicated>.<env>.acme.platform.example.io.

So cert-manager only ever writes the real ACME TXT record into the single shared acme.platform.example.io zone. The CNAME then makes that record resolve correctly from every consumer zone too. The payoff is IAM scope: cert-manager’s write footprint stays fixed at one zone no matter how many consumer zones you add. Most teams grow the certificate identity’s permissions linearly with services. Instead, delegation keeps it flat.
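On the cert-manager side, delegation only works if the DNS-01 solver is told to follow CNAMEs. Here is a minimal sketch of an issuer configured that way, assuming the Cloud DNS solver with workload-identity credentials; the ACME directory URL, email, project, and zone resource name are placeholders.

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: acme-delegated
spec:
  acme:
    server: https://acme.example.invalid/directory   # placeholder ACME directory URL
    email: platform-team@example.io
    privateKeySecretRef:
      name: acme-delegated-account-key
    solvers:
      - dns01:
          # Follow the _acme-challenge CNAME and write the TXT record
          # in the zone it points to, not in the consumer zone.
          cnameStrategy: Follow
          cloudDNS:
            project: example-mgmt-project
            hostedZoneName: acme-platform-example-io  # the single shared ACME zone
```

Without cnameStrategy: Follow, the solver would try to write the TXT record at the original name in the consumer zone, which is exactly the access this design avoids granting.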

Gotcha: Granting roles/dns.admin at the project level does not let you write zone-level IAM. That role includes dns.managedZones.getIamPolicy but not setIamPolicy, so scoping a binding to a single zone fails with a 403 even though the account is already a project-wide DNS admin. The fix is a small custom role with both permissions. Do not fall back to project-wide admin just to make the error go away.
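The custom role itself is tiny. Here is a sketch of a role definition file in the format accepted by gcloud iam roles create; the title and description are made up, and the two permissions are the ones named above.

```yaml
# zone-iam-manager.yaml: definition file for a custom IAM role (hypothetical name)
title: DNS Zone IAM Manager
description: Read and set IAM policy on individual Cloud DNS managed zones.
stage: GA
includedPermissions:
  - dns.managedZones.getIamPolicy
  - dns.managedZones.setIamPolicy
```

It can be created with gcloud iam roles create ROLE_ID --project=PROJECT_ID --file=zone-iam-manager.yaml and then granted to the identity that manages zone-level bindings.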

Envoy Gateway reached one year past GA with sharp Helm-chart pull growth and adoption by Fortune 500 companies (CNCF blog, 2025). That production maturity is what lets you anchor certificate automation and multi-protocol routing on a single Gateway API implementation.

What silent failures should you watch for?

The dangerous failures here are the silent ones, and admission-control policy is a prime source. A webhook that never fires, a role that lacks one permission, an orphaned resource nobody deletes: none throw a loud error, so they cost hours before anyone suspects them. The missing permission and the orphaned resources are covered in the gotchas above; the webhook that never fired hurt us most.

Ever chased a bug for hours only to find the code never ran? We tried to have a Kyverno ClusterPolicy mutate a cert-manager-issued TLS secret to append an intermediate chain. It failed with no PolicyReport, no error, and no mutation. The write just succeeded, unchanged.

The root cause was a global namespaceSelector on Kyverno’s mutating webhook, set once in a ConfigMap rather than per policy. So if the target namespace lacks the required label, the API server never invokes Kyverno at all. Zero admission trace. We lost real hours to this, because every instinct says “check the policy” when the actual problem is that the policy was never consulted.
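The mechanism is plain Kubernetes admission behavior, which the sketch below illustrates with a generic webhook registration. It is not Kyverno's generated configuration: the webhook name, service, and label key are hypothetical, and only the namespaceSelector behavior is the point.

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: example-policy-mutating-webhook   # hypothetical
webhooks:
  - name: mutate.policy.example.io
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Fail
    clientConfig:
      service:
        name: policy-engine
        namespace: policy-system
        path: /mutate
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["secrets"]
    # Requests from namespaces without this label never reach the webhook.
    namespaceSelector:
      matchLabels:
        policy.example.io/enforce: "true"
---
apiVersion: v1
kind: Namespace
metadata:
  name: envoy-gateway-system
  labels: {}   # label missing, so Secret writes here skip the webhook entirely
```

When a mutation silently does nothing, comparing the target namespace's labels against the webhook's namespaceSelector is a quick first check before debugging the policy itself.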

But there is a deeper lesson too. Mutating a cert-manager secret after issuance is fragile even when it works, because cert-manager rewrites tls.crt on renewal and silently drops your appended chain. Instead, the durable fix is to get the full chain at issuance, using ACME preferredChain or an issuer-level CA bundle, rather than patching the secret afterward. Fix the source, not the symptom.

Bringing it together

Private service exposure at enterprise scale is not one hard problem. It is a dozen small ones, each easy to get subtly wrong. The pattern that worked for us collapses those dozen into a single declarative request: a team says “expose my service,” and the platform handles PSC NAT, service attachments, gateway wiring, DNS, and TLS behind purpose-scoped gateways.

With PSC traffic up 4x in a year (Google Cloud, 2026) and Kubernetes at 82% production adoption (CNCF, 2026), this stops being optional. Start with the abstraction, not the plumbing. Make the safe path the only path: purpose-scoped gateways, deterministic DNS, one delegated ACME zone, and explicit conditions for anything the platform declines to do. The gotchas in this article are the ones that cost us time. Steal the fixes.

Frequently asked questions

Do I need PSC if my services already run in one VPC?

Not for same-VPC traffic. PSC earns its place when services must be consumed across projects or VPCs without peering, which is the norm at enterprise scale. It also gives you a clean audit story: traffic uses private IPs end to end, which regulators increasingly expect for internal connectivity.

Why Envoy Gateway instead of the native GKE Gateway?

Envoy Gateway implements Gateway API portably and supports HTTP, HTTPS, and TCP listeners under one abstraction, which matters for generic gateways. One year past GA it saw sharp adoption, including Fortune 500 users. The native controller works well too; the deciding factor is portability across cluster types and protocols.

How does the ACME CNAME delegation reduce risk?

It caps the certificate identity's DNS write access at a single zone. Every consumer zone delegates its ACME challenge record via CNAME to that one shared zone, where cert-manager writes the TXT record. Add a hundred consumer zones and the IAM footprint stays flat. That is the difference between least-privilege and permission sprawl over time.

What is the biggest operational trap in this pattern?

Silent failures. A missing namespace label skips a Kyverno webhook with no trace, a wrong DNS role fails zone IAM with a misleading 403, and orphaned consumer resources linger after a spec edit. None of these throw loud errors. Build diff-based cleanup and explicit skipped conditions so the system tells you what it chose not to do.