How to Build Kubernetes as a Service with Custom Controllers

Key takeaways

Kubernetes as a Service turns a management cluster into a universal control plane: developers declare infrastructure as custom resources, controllers reconcile the real cloud.

The platform is modelled as cloud-agnostic KRM CRDs — LandingZone, KubernetesCluster, ServiceExposure — each backed by a custom Go controller on controller-runtime.

Crossplane plus the hyperscaler operators (Config Connector, ACK, Azure Service Operator) translate those resources into real cloud APIs, while continuous reconciliation corrects drift automatically.

Multi-tenancy (isolated tenants), aggregated RBAC, and federated identity let teams self-serve without the platform team losing governance.

Operationally, the control plane scales through per-environment controller sharding and an API-first surface (REST, gRPC, SDKs, and a Kubernetes API proxy).

In a previous article we argued why Terraform falls short for platform engineering and why Kubernetes wins as the engine of a self-service platform. That post answered why. This one answers how: the concrete architecture behind a production Kubernetes as a Service platform — the one pattern that lets product teams create and operate managed clusters on their own, while security, cost, and reliability stay centrally governed.

Everything below is drawn from building real control planes. Names and identifiers are generic on purpose; the architecture is what matters.

What “Kubernetes as a Service” actually means

Kubernetes as a Service is a self-service platform where a central management cluster is the control plane for all infrastructure. A developer declares a cluster, a database, or an exposed service as a Kubernetes custom resource; a controller reconciles that declaration into real cloud resources. No tickets, no manual apply, governance enforced in one place.

The shift is subtle but decisive. Instead of treating Kubernetes as a place to run containers, you treat it as the place to describe and reconcile everything else. The Kubernetes Resource Model (KRM) becomes the contract between the people who consume infrastructure and the platform that provisions it. That is the same idea we explored across our decade-long Kubernetes journey — pushed to its logical conclusion.

The management cluster: one control plane to rule them all

The management cluster is a dedicated Kubernetes cluster that owns no application workloads. Its only job is to host the platform API, authentication, custom controllers, and the resources that describe every downstream workload cluster. It is the single point where requests arrive, identities are verified, and reconciliation happens.

Centralising on one control plane pays off in four ways:

One API surface — teams reach the platform through kubectl, a CLI, an API, or an internal portal, all hitting the same endpoint.
One reconciliation engine — every resource is continuously driven toward its declared state by controllers running in the same place.
One policy boundary — RBAC, admission policies (for example Kyverno), and tenancy rules live together, not scattered across clusters.
One lifecycle — upgrades, add-on rollout, and drift correction for many clusters are orchestrated centrally.

Workload clusters — typically managed Kubernetes such as GKE — are outputs of the control plane, registered back into a multi-cluster manager once they become ready.

Modelling the platform as custom resources

The platform’s public API is a small set of cloud-agnostic Custom Resource Definitions. Each CRD is a deliberate abstraction: it exposes the intent a team should control and hides the dozens of provider-specific primitives underneath. Two resources form the backbone of the model — LandingZone and KubernetesCluster.

`LandingZone`: an insulated multi-tenant boundary

A LandingZone is the blueprint for a tenant. It provisions a secure, insulated, compliant boundary for a project, and its controller reconciles that boundary on whichever cloud is targeted, hiding provider differences while enforcing organisational policy across three concerns at once:

Resource containment — the top-level isolation unit for the target cloud: a GCP Project, an AWS Account, or an Azure Resource Group.
Networking topology — VPCs, isolated subnets, firewall rules, and routing matrices.
Identity & security guardrails — IAM roles, policy attachments, and principal-group bindings.

The spec declares intent, never a vendor SKU. The same manifest targets any provider by changing one field:

apiVersion: platform.example.com/v1alpha1
kind: LandingZone
metadata:
  name: payments-dev
  namespace: proj-1234-tenant-mgmt
spec:
  provider: gcp                    # gcp | aws | azure
  env: dev
  resourceContainer:
    displayName: payments-dev      # → GCP Project / AWS Account / Azure Resource Group
  networking:
    cidrBlock: 10.24.0.0/16
    subnets:
      - name: nodes
        cidr: 10.24.0.0/20
      - name: pods
        cidr: 10.24.16.0/20
    firewall:
      denyAllIngress: true
  identity:
    guardrails:
      - policy: enforce-cmek
      - policy: restrict-public-ip
    roleBindings:
      - principalGroup: payments-platform-admins
        role: platform-admin

The controller reconciles that intent into the right provider objects (a Project plus VPC on GCP, an Account plus VPC on AWS, a Resource Group plus VNet on Azure) and re-applies the guardrails whenever the live configuration drifts from policy.
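Before any provider object is rendered, a controller like this would plausibly validate the networking block of the spec. The sketch below is one way to do that with Go's standard library. The `Subnet` type and `validateSubnets` function are illustrative names, not part of the platform described here.

```go
package main

import (
	"fmt"
	"net/netip"
)

// Subnet mirrors one entry of spec.networking.subnets (hypothetical Go type).
type Subnet struct {
	Name string
	CIDR string
}

// validateSubnets checks that every subnet lies inside the landing zone's
// cidrBlock and that no two subnets overlap.
func validateSubnets(block string, subnets []Subnet) error {
	outer, err := netip.ParsePrefix(block)
	if err != nil {
		return fmt.Errorf("cidrBlock %q: %w", block, err)
	}
	outer = outer.Masked()
	parsed := make([]netip.Prefix, 0, len(subnets))
	for _, s := range subnets {
		p, err := netip.ParsePrefix(s.CIDR)
		if err != nil {
			return fmt.Errorf("subnet %q: %w", s.Name, err)
		}
		p = p.Masked()
		// Contained means: at least as specific as the block, and the
		// subnet's base address falls inside the block.
		if p.Bits() < outer.Bits() || !outer.Contains(p.Addr()) {
			return fmt.Errorf("subnet %q (%s) is outside cidrBlock %s", s.Name, p, outer)
		}
		for i, q := range parsed {
			if p.Overlaps(q) {
				return fmt.Errorf("subnet %q overlaps subnet %q", s.Name, subnets[i].Name)
			}
		}
		parsed = append(parsed, p)
	}
	return nil
}

func main() {
	err := validateSubnets("10.24.0.0/16", []Subnet{
		{Name: "nodes", CIDR: "10.24.0.0/20"},
		{Name: "pods", CIDR: "10.24.16.0/20"},
	})
	fmt.Println("valid:", err == nil)
}
```

Rejecting a bad layout at this stage keeps the failure in the resource's status, instead of surfacing later as an opaque provider error.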

`KubernetesCluster`: managed clusters as declarative intent

A KubernetesCluster is the headline abstraction — a whole managed cluster reduced to the handful of decisions that matter, with everything else defaulted to a secure baseline. Crucially, the spec is runtime-agnostic: it describes what the team wants, not which managed engine backs it.

apiVersion: platform.example.com/v1alpha1
kind: KubernetesCluster
metadata:
  name: cluster-1
  namespace: proj-1234-dev
spec:
  landingZoneRef:
    name: payments-dev           # inherits provider, network, identity
  exposure: internal             # immutable: security boundary
  location: europe-west1         # immutable: physical placement
  version:
    channel: REGULAR
  gatewayApi:
    channel: standard

The control plane reads this intent and provisions the matching managed engine behind the scenes: GKE when the referenced LandingZone targets Google, EKS on AWS, AKS on Azure. The team never writes a provider-shaped cluster manifest; they declare a KubernetesCluster, and the composer picks the engine.
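The engine selection can be pictured as a small lookup from the provider inherited through `landingZoneRef` to the provider-native cluster resource. This is a minimal sketch: `downstreamKind` and `engineFor` are hypothetical names. The API groups and kinds are those published by Config Connector, ACK, and Azure Service Operator, with versions left out on purpose.

```go
package main

import "fmt"

// downstreamKind identifies the provider-native cluster resource to render.
type downstreamKind struct {
	Operator string // infrastructure operator that owns the resource
	Group    string // API group of the operator's CRD
	Kind     string
}

// engineFor maps the provider inherited from the LandingZone to the managed
// Kubernetes engine that backs a KubernetesCluster.
func engineFor(provider string) (downstreamKind, error) {
	switch provider {
	case "gcp": // GKE
		return downstreamKind{Operator: "Config Connector", Group: "container.cnrm.cloud.google.com", Kind: "ContainerCluster"}, nil
	case "aws": // EKS
		return downstreamKind{Operator: "ACK", Group: "eks.services.k8s.aws", Kind: "Cluster"}, nil
	case "azure": // AKS
		return downstreamKind{Operator: "Azure Service Operator", Group: "containerservice.azure.com", Kind: "ManagedCluster"}, nil
	default:
		return downstreamKind{}, fmt.Errorf("unsupported provider %q", provider)
	}
}

func main() {
	k, err := engineFor("gcp")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s renders %s in group %s\n", k.Operator, k.Kind, k.Group)
}
```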

Two design rules keep the abstraction safe. Immutable fields (exposure, location, network) are rejected on update by an admission webhook, because changing them would silently recreate or endanger the cluster. Secure defaults — hardened dataplane networking, shielded nodes, sane maintenance windows — are applied automatically, so a minimal spec still produces a hardened cluster. The user makes five decisions; the controller makes five hundred.
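Both rules fit in a few lines of Go. The sketch below assumes a simplified `ClusterSpec` struct. The `ShieldedNodes` field and the function names are illustrative, and a real webhook would plug the same checks into controller-runtime's admission interfaces.

```go
package main

import (
	"errors"
	"fmt"
)

// ClusterSpec is a simplified, hypothetical view of KubernetesCluster.spec.
type ClusterSpec struct {
	Exposure       string
	Location       string
	VersionChannel string
	ShieldedNodes  *bool // nil means "not set by the user"
}

// applyDefaults fills every unset field with the secure baseline.
func applyDefaults(s *ClusterSpec) {
	if s.VersionChannel == "" {
		s.VersionChannel = "REGULAR"
	}
	if s.ShieldedNodes == nil {
		on := true
		s.ShieldedNodes = &on
	}
}

// validateUpdate rejects changes to immutable fields and reports all
// violations at once. It returns nil when the update is acceptable.
func validateUpdate(oldSpec, newSpec ClusterSpec) error {
	var errs []error
	if oldSpec.Exposure != newSpec.Exposure {
		errs = append(errs, fmt.Errorf("spec.exposure is immutable (%q -> %q)", oldSpec.Exposure, newSpec.Exposure))
	}
	if oldSpec.Location != newSpec.Location {
		errs = append(errs, fmt.Errorf("spec.location is immutable (%q -> %q)", oldSpec.Location, newSpec.Location))
	}
	return errors.Join(errs...)
}

func main() {
	current := ClusterSpec{Exposure: "internal", Location: "europe-west1"}
	applyDefaults(&current)

	proposed := current
	proposed.Location = "us-east1"
	fmt.Println(validateUpdate(current, proposed))
}
```

Running the defaulting step before validation means the stored object always carries the full baseline, so later comparisons never have to guess what an empty field meant.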

The reconciliation loop: how a controller provisions the cloud

The engine behind every CRD is a custom controller running a reconciliation loop on controller-runtime. The loop watches a resource, compares desired state (spec) against observed state (status), and creates or updates the downstream objects that map to real cloud APIs. Then it runs again — forever — correcting drift.

At its core a reconciler is one idempotent function:

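import (
    "context"
    "time"

    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"

    platformv1alpha1 "example.com/platform/api/v1alpha1" // hypothetical module path
)

// KubernetesClusterReconciler embeds the controller-runtime client.
type KubernetesClusterReconciler struct{ client.Client }

func (r *KubernetesClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    var cluster platformv1alpha1.KubernetesCluster
    if err := r.Get(ctx, req.NamespacedName, &cluster); err != nil {
        // The object was deleted after the event was queued: nothing to do.
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    // 1. Render downstream provider objects (Config Connector / ACK / ASO / Crossplane).
    // 2. Wait for readiness; mirror observed state into .status.
    // 3. On ready: mint a kubeconfig secret, register the cluster with the
    //    multi-cluster add-on manager, and trigger add-on delivery.

    // clusterReady is a helper (not shown) that inspects the mirrored status.
    if !clusterReady(&cluster) {
        return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
    }
    return ctrl.Result{}, r.Status().Update(ctx, &cluster)
}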

The controller never calls a cloud SDK to build a cluster directly. Instead it renders infrastructure-operator resources and lets those controllers talk to the cloud. This layering is what makes the platform declarative end to end: the reconcile function is pure Kubernetes bookkeeping, and the actual API calls are delegated to purpose-built providers.
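Idempotency is the property that makes this safe to run forever. The toy model below replaces the downstream operator resources with an in-memory map to show it in isolation. The `cloud` type and the `reconcile` function are illustrative stand-ins, not the platform's code.

```go
package main

import "fmt"

// cloud stands in for the downstream operator resources: name -> config.
type cloud map[string]string

// reconcile drives observed toward desired and reports how many changes it
// made. Calling it again without any drift makes zero changes.
func reconcile(desired map[string]string, observed cloud) int {
	changes := 0
	for name, want := range desired {
		if got, ok := observed[name]; !ok || got != want {
			observed[name] = want // create a missing resource or repair a drifted one
			changes++
		}
	}
	for name := range observed {
		if _, ok := desired[name]; !ok {
			delete(observed, name) // remove resources that are no longer declared
			changes++
		}
	}
	return changes
}

func main() {
	desired := map[string]string{"cluster": "channel=REGULAR", "nodepool": "size=3"}
	observed := cloud{}

	fmt.Println("first pass:", reconcile(desired, observed))
	fmt.Println("second pass:", reconcile(desired, observed))

	observed["nodepool"] = "size=5" // someone edits the cloud by hand
	fmt.Println("after drift:", reconcile(desired, observed))
}
```

The first pass makes two changes, the second makes none, and the pass after the manual edit makes exactly one. That is drift correction in miniature.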

The infrastructure controller ecosystem

Under a LandingZone or KubernetesCluster sits a layer of infrastructure controllers that turn Kubernetes objects into real cloud API calls. There are two families, and a serious multi-cloud platform understands both.

Cloud-specific operators (the hyperscaler-native layer)

Each hyperscaler ships its own operator that maps its services one-to-one to CRDs. These are faithful, high-fidelity representations of a single provider’s API — ideal when the composer needs a direct, provider-native primitive:

Google Cloud: Config Connector (KCC) maps GCP services (Projects, subnets, IAM policies, GKE clusters, BigQuery) to CRDs, reconciled by the cnrm controllers.
AWS: AWS Controllers for Kubernetes (ACK) is a family of per-service controllers (EKS, EC2 for VPC networking, IAM, S3, RDS) that manage AWS resources directly from Kubernetes.
Azure: Azure Service Operator (ASO) exposes Azure resources (Resource Groups, VNets, AKS, role assignments) as CRDs reconciled against Azure Resource Manager.

Each is authoritative for its own cloud, but each speaks only its own dialect. Build directly on them and your abstractions fragment along provider lines.

Provider-independent engines: Crossplane

Crossplane is the unifying layer. Its open-source provider ecosystem (provider-gcp, provider-aws, provider-azure, and the newer provider families that split each cloud into per-service packages) wraps the same underlying cloud APIs the native operators expose. On top of that, Compositions and CompositeResourceDefinitions (XRDs) let you fold those fragmented, per-cloud managed resources into a single, uniform, high-level abstraction.

This is exactly the mechanism behind our LandingZone and KubernetesCluster models: one composite claim fans out into a Project + VPC + IAM binding on GCP, or an Account + VPC + role on AWS, without the consumer ever seeing the provider seam. The platform controller orchestrates these engines — native operators for direct fidelity, Crossplane for cross-cloud composition — rather than replacing them. That separation keeps each layer testable and lets you onboard a new cloud by adding a provider, not by rewriting reconcile logic.

Multi-tenancy and RBAC without losing control

Multi-tenancy is enforced by the LandingZone boundary described above (the resource container, network topology, and identity guardrails that insulate each project), combined with an in-cluster tenant abstraction (for example a Capsule Tenant) that scopes namespaces, quotas, and network policies. On top of that boundary, RBAC decides who can do what, and the clever part is that it extends itself as the platform grows.

The control plane defines three base roles — platform-admin, platform-editor, platform-viewer — as aggregated ClusterRoles. Every time a new service API is added, it ships its own role labelled to aggregate into the base ones, and Kubernetes merges the rules automatically:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: platform-admin
aggregationRule:
  clusterRoleSelectors:
    - matchLabels:
        rbac.platform.example.com/aggregate-to-platform-admin: "true"
rules: []   # filled automatically by the ClusterRole aggregation controller
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: gke-platform-admin
  labels:
    rbac.platform.example.com/aggregate-to-platform-admin: "true"
rules:
  - apiGroups: ["platform.example.com"]
    resources: ["kubernetesclusters"]
    verbs: ["create", "get", "update", "delete", "list"]

Deploy a new service and its permissions flow into the existing personas — no central role edited by hand. Identities themselves come from your OIDC provider (Okta, Entra ID, Keycloak) and are federated into the cloud through Workload and Workforce Identity Federation, then mapped to least-privilege bindings. A project owner administers their own tenant; the platform team never becomes the bottleneck.
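To make the merge concrete, here is a small model of what the aggregation controller computes. It is a sketch only: Kubernetes does this inside the cluster, and the `Rule`, `ClusterRole`, and `aggregate` names below are simplified stand-ins for the real RBAC types.

```go
package main

import (
	"fmt"
	"sort"
)

// Rule and ClusterRole are simplified stand-ins for the RBAC API types.
type Rule struct {
	APIGroup string
	Resource string
	Verbs    []string
}

type ClusterRole struct {
	Name   string
	Labels map[string]string
	Rules  []Rule
}

// aggregate returns the rules a base role would receive: the union of the
// rules of every role that carries label=true, taken in name order.
func aggregate(label string, roles []ClusterRole) []Rule {
	sorted := append([]ClusterRole(nil), roles...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Name < sorted[j].Name })

	var out []Rule
	for _, r := range sorted {
		if r.Labels[label] == "true" {
			out = append(out, r.Rules...)
		}
	}
	return out
}

func main() {
	const label = "rbac.platform.example.com/aggregate-to-platform-admin"
	roles := []ClusterRole{
		{
			Name:   "gke-platform-admin",
			Labels: map[string]string{label: "true"},
			Rules: []Rule{{
				APIGroup: "platform.example.com",
				Resource: "kubernetesclusters",
				Verbs:    []string{"create", "get", "update", "delete", "list"},
			}},
		},
		{Name: "unrelated", Rules: []Rule{{Resource: "pods", Verbs: []string{"get"}}}},
	}
	for _, r := range aggregate(label, roles) {
		fmt.Println(r.APIGroup, r.Resource, r.Verbs)
	}
}
```

Only the labelled role contributes, which is why a new service can extend platform-admin just by shipping a correctly labelled ClusterRole.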

Add-ons as a service

A bare cluster is not a useful cluster. Every workload cluster needs security agents, policy engines, observability, cost optimisation, secret management, and ingress before a team can ship. Delivering that consistently across dozens of clusters is a first-class platform concern, not an afterthought.

The pattern is GitOps-driven multi-cluster delivery. When the controller registers a new cluster with an add-on manager such as Sveltos, the manager matches the cluster against a set of ClusterProfiles and rolls out the required Helm charts and manifests — endpoint security, policy enforcement, external secrets, autoscaling, backup, and gateways — in dependency order.
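"In dependency order" amounts to a topological sort over the add-on graph. The sketch below uses Kahn's algorithm. The add-on names and the `rolloutOrder` function are illustrative, and an add-on manager such as Sveltos resolves this ordering itself; the code only shows the idea.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// rolloutOrder returns the add-ons so that each one comes after all of its
// dependencies (Kahn's algorithm). deps maps an add-on to what it depends on.
func rolloutOrder(deps map[string][]string) ([]string, error) {
	remaining := make(map[string]int, len(deps)) // unmet dependency count
	dependents := make(map[string][]string)
	for name, ds := range deps {
		remaining[name] = len(ds)
		for _, d := range ds {
			if _, ok := deps[d]; !ok {
				return nil, fmt.Errorf("%q depends on unknown add-on %q", name, d)
			}
			dependents[d] = append(dependents[d], name)
		}
	}

	var ready []string
	for name, n := range remaining {
		if n == 0 {
			ready = append(ready, name)
		}
	}
	sort.Strings(ready) // keep the result deterministic

	order := make([]string, 0, len(deps))
	for len(ready) > 0 {
		next := ready[0]
		ready = ready[1:]
		order = append(order, next)
		for _, dep := range dependents[next] {
			remaining[dep]--
			if remaining[dep] == 0 {
				ready = append(ready, dep)
			}
		}
		sort.Strings(ready)
	}
	if len(order) != len(deps) {
		return nil, errors.New("dependency cycle between add-ons")
	}
	return order, nil
}

func main() {
	order, err := rolloutOrder(map[string][]string{
		"cert-manager":     nil,
		"policy-engine":    nil,
		"external-secrets": {"cert-manager"},
		"gateway":          {"cert-manager", "policy-engine"},
	})
	fmt.Println(order, err)
}
```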

The subtle challenge is visibility. Add-on state lives on the management cluster as controller objects that ordinary users can’t read. A production platform closes that gap by surfacing add-on status through the platform API: a read-only endpoint reads the underlying summaries on demand and returns a clean model — provisioned, provisioning, failed, waiting-for-dependencies — so a UI can render live health and even a dependency graph. The platform doesn’t just install add-ons; it makes their state a supported, self-service API.
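The translation from controller objects to that clean model could look like the sketch below. The `rawSummary` shape is an assumption made for illustration, since the real fields depend on the add-on manager in use. Only the four public phases come from the description above.

```go
package main

import "fmt"

// Phase is the public add-on state exposed by the platform API.
type Phase string

const (
	Provisioned  Phase = "provisioned"
	Provisioning Phase = "provisioning"
	Failed       Phase = "failed"
	Waiting      Phase = "waiting-for-dependencies"
)

// rawSummary is a hypothetical, flattened view of one add-on's controller state.
type rawSummary struct {
	Name      string
	Deployed  bool
	Error     string
	DependsOn []string
}

// publicPhases reduces raw controller state to the four public phases.
func publicPhases(summaries []rawSummary) map[string]Phase {
	healthy := make(map[string]bool, len(summaries))
	for _, s := range summaries {
		healthy[s.Name] = s.Deployed && s.Error == ""
	}
	out := make(map[string]Phase, len(summaries))
	for _, s := range summaries {
		switch {
		case s.Error != "":
			out[s.Name] = Failed
		case s.Deployed:
			out[s.Name] = Provisioned
		case blocked(s, healthy):
			out[s.Name] = Waiting
		default:
			out[s.Name] = Provisioning
		}
	}
	return out
}

// blocked reports whether any dependency of s is not yet healthy.
func blocked(s rawSummary, healthy map[string]bool) bool {
	for _, d := range s.DependsOn {
		if !healthy[d] {
			return true
		}
	}
	return false
}

func main() {
	phases := publicPhases([]rawSummary{
		{Name: "cert-manager", Deployed: true},
		{Name: "gateway", DependsOn: []string{"cert-manager"}},
		{Name: "external-secrets", Error: "chart pull failed"},
		{Name: "backup", DependsOn: []string{"external-secrets"}},
	})
	for _, name := range []string{"cert-manager", "gateway", "external-secrets", "backup"} {
		fmt.Printf("%s: %s\n", name, phases[name])
	}
}
```

Because each entry also carries its dependencies, the same data is enough for a UI to draw the dependency graph next to the health status.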

Exposing the platform as an API, not just YAML

CRDs are a great internal contract, but not every consumer wants to write YAML. Because a CRD is an API definition, you can generate a typed API surface on top of it: derive protobuf from the resources, then expose REST and gRPC endpoints and ship SDKs in Go, Python, JavaScript, and Java. Teams call the platform like any other product API.

For consuming live cluster state, a powerful addition is a transparent Kubernetes API proxy hosted by the control plane. The user authenticates once to the platform; the platform mirrors a workload cluster’s API server so standard tools work by changing only the server URL:

kubectl --server=https://platform.example.com/tenants/t1/landingzones/lz1/kubernetesclusters/c1/proxy \
  get pods -n t1-lz1-dev

Critically, this is not an open pass-through. A mandatory, Kubernetes-aware policy engine classifies every request (group, version, resource, subresource, verb, namespace) and decides allow or deny before a single byte is forwarded. It runs deny-by-default, with a compiled-in deny floor — secrets, token minting, exec, attach, port-forward — that no misconfiguration can override. Compatibility comes from mirroring the API path; safety comes from the policy engine in front of it. Users get native kubectl ergonomics without ever holding a workload cluster’s credentials.
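A stripped-down version of such an engine fits in one file. The sketch below classifies a request from its HTTP method and API path, applies a hard-coded deny floor, and only then consults an allow-list. All type and function names are illustrative, and the path parser is deliberately simplified: it does not handle watches or non-resource URLs.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// request is the classified form of one proxied call.
type request struct {
	Group, Version, Namespace, Resource, Name, Subresource, Verb string
}

// rule is one allow-list entry; a real engine would also scope by namespace.
type rule struct {
	Group, Resource, Verb string
}

// classify parses paths shaped like /api/v1/namespaces/ns/pods/web-0/exec
// or /apis/apps/v1/namespaces/ns/deployments.
func classify(method, path string) (request, bool) {
	parts := strings.Split(strings.Trim(path, "/"), "/")
	var r request
	switch {
	case len(parts) >= 2 && parts[0] == "api": // core group
		r.Version, parts = parts[1], parts[2:]
	case len(parts) >= 3 && parts[0] == "apis": // named groups
		r.Group, r.Version, parts = parts[1], parts[2], parts[3:]
	default:
		return request{}, false
	}
	if len(parts) >= 3 && parts[0] == "namespaces" {
		r.Namespace, parts = parts[1], parts[2:]
	}
	if len(parts) == 0 || len(parts) > 3 {
		return request{}, false
	}
	r.Resource = parts[0]
	if len(parts) > 1 {
		r.Name = parts[1]
	}
	if len(parts) > 2 {
		r.Subresource = parts[2]
	}
	switch method {
	case http.MethodGet:
		r.Verb = "list"
		if r.Name != "" {
			r.Verb = "get"
		}
	case http.MethodPost:
		r.Verb = "create"
	case http.MethodPut:
		r.Verb = "update"
	case http.MethodPatch:
		r.Verb = "patch"
	case http.MethodDelete:
		r.Verb = "delete"
	default:
		return request{}, false
	}
	return r, true
}

// denyFloor is compiled in: no allow-list entry can override it.
func denyFloor(r request) bool {
	switch r.Subresource {
	case "exec", "attach", "portforward", "token":
		return true
	}
	return r.Resource == "secrets"
}

// decide is deny-by-default: unparseable, floored, or unlisted requests fail.
func decide(allow map[rule]bool, method, path string) bool {
	r, ok := classify(method, path)
	if !ok || denyFloor(r) {
		return false
	}
	return allow[rule{Group: r.Group, Resource: r.Resource, Verb: r.Verb}]
}

func main() {
	allow := map[rule]bool{{Resource: "pods", Verb: "list"}: true}
	fmt.Println(decide(allow, "GET", "/api/v1/namespaces/t1-lz1-dev/pods"))
	fmt.Println(decide(allow, "POST", "/api/v1/namespaces/t1-lz1-dev/pods/web-0/exec"))
}
```

Checking the deny floor before the allow-list is the point of the design: even an allow-list entry for secrets cannot open that path.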

Scaling the control plane with controller sharding

A single controller reconciling every environment at once couples their fate together: a reconcile storm or bad rollout in dev can degrade the int and ope environments too. As the platform grows, that shared blast radius becomes the main operational risk.

The fix is horizontal partitioning by environment. The same controller binary is deployed as separate releases, each pinned to one environment through a startup flag, and each admits only its own resources using a controller-runtime predicate:

func EnvironmentPredicate(assigned string) predicate.Predicate {
    return predicate.NewPredicateFuncs(func(o client.Object) bool {
        return landingZoneEnvOf(o) == assigned
    })
}

Filtering at the predicate layer keeps out-of-scope objects out of the work queue entirely, so reconcile logic stays untouched. Environment-agnostic components — the API server, the tenant controller — run once in a common release, while each environment gets its own controllers, its own resource limits, and its own independent lifecycle. One environment can be upgraded, paused, or rolled back without touching the others. Leader election is scoped per shard so releases never contend for the same lease.
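The per-shard wiring can be sketched with the standard library alone. The `--environment` flag name and the lease naming scheme below are assumptions. In a real binary the derived ID would be passed to the controller-runtime manager through its `LeaderElectionID` option.

```go
package main

import (
	"flag"
	"fmt"
)

// shardConfig is what one release of the controller binary runs with.
type shardConfig struct {
	Environment      string
	LeaderElectionID string
}

// parseShard reads the startup flag and derives a lease name unique to the
// shard, so releases for different environments never compete for one lease.
func parseShard(args []string) (shardConfig, error) {
	fs := flag.NewFlagSet("controller", flag.ContinueOnError)
	env := fs.String("environment", "", "environment this release reconciles (hypothetical flag)")
	if err := fs.Parse(args); err != nil {
		return shardConfig{}, err
	}
	if *env == "" {
		return shardConfig{}, fmt.Errorf("--environment is required")
	}
	return shardConfig{
		Environment:      *env,
		LeaderElectionID: "platform-controller-" + *env,
	}, nil
}

// admits applies the same rule as the environment predicate: only objects
// whose landing zone belongs to this shard's environment are reconciled.
func (c shardConfig) admits(objectEnv string) bool {
	return objectEnv == c.Environment
}

func main() {
	cfg, err := parseShard([]string{"--environment=dev"})
	if err != nil {
		panic(err)
	}
	fmt.Println(cfg.LeaderElectionID, cfg.admits("dev"), cfg.admits("int"))
}
```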

What the pattern buys you

Put together, these pieces deliver the promise of platform engineering: autonomy for teams, control for the platform. Product teams request a cluster, an exposed service, or a cloud resource with a single declarative call and get it in minutes. The platform team enforces security baselines, cost policy, and reliability standards centrally, and corrects drift automatically through reconciliation rather than through review meetings.

The engineering discipline that makes it work is worth restating: model a small, opinionated CRD surface; back each CRD with an idempotent reconciler; delegate cloud calls to hyperscaler operators (KCC, ACK, ASO) and Crossplane; make RBAC self-extending; deliver add-ons via GitOps and surface their state through the API; and shard controllers so scale never means shared failure. None of these are exotic — they are the boring, durable choices that keep a control plane running for years.

Conclusion: how Edixos can support you

At Edixos, building control planes like this is our core craft. A cloud platform is more than a stack of tools — it must be built to last, to evolve, and to fit your organisation’s real constraints. Our work goes well beyond deploying clusters:

We write custom Kubernetes controllers that model your provisioning workflows and business rules as reconciliation loops.
We integrate composition engines like Crossplane and Kro to turn Kubernetes into a multi-cloud provisioning engine.
We work with existing providers such as Config Connector (KCC) to expose cloud services natively through CRDs.
We build complete, API-first platforms — REST, gRPC, and multi-language SDKs — with multi-tenancy, federated identity, and self-extending RBAC baked in.

This is exactly the kind of platform we described in why Kubernetes beats Terraform for platform engineering, made concrete. If your ambition is a robust, self-service Kubernetes platform that aligns innovation, agility, and governance, we have the building blocks and the field experience to make it real.

Let’s talk about your Kubernetes as a Service platform

Are you looking to give your teams true self-service infrastructure while keeping security and cost under central control? Let’s discuss your challenges and see how a custom control plane can bring you speed, reliability, and scale.

Frequently asked questions

What is Kubernetes as a Service in a platform-engineering context?

It's a self-service platform where a central management cluster acts as a control plane. Developers declare a cluster, database, or service as a Kubernetes custom resource, and a controller provisions and reconciles the real cloud infrastructure — no tickets, no manual Terraform runs, governance enforced centrally.

Why use a management cluster instead of provisioning clusters directly?

A management cluster gives you one API surface, one reconciliation engine, and one policy boundary for every workload cluster. It centralises authentication, RBAC, drift correction, and add-on delivery, so product teams self-serve while the platform team enforces security and standards from a single place.

How do custom controllers provision cloud resources?

A controller runs a reconciliation loop built on controller-runtime. It watches a custom resource, compares desired versus observed state, and creates downstream objects — Config Connector or Crossplane resources — that map to real cloud APIs. The loop re-runs continuously, correcting drift until reality matches the declared spec.

What is the difference between Crossplane and the cloud-specific operators?

Cloud operators — Config Connector (Google), ACK (AWS), and Azure Service Operator — map one hyperscaler's services one-to-one to CRDs. Crossplane is provider-independent: its providers wrap those same APIs, and its compositions bundle many resources into one abstraction like LandingZone or KubernetesCluster across clouds.

How is multi-tenancy enforced on a shared control plane?

A LandingZone custom resource provisions an insulated tenant boundary — resource container, network topology, and identity guardrails — per project. Aggregated ClusterRoles extend platform-admin, editor, and viewer roles automatically as new service APIs are added, and federated identities from your OIDC provider are mapped to least-privilege bindings.

Can the platform be consumed without writing YAML?

Yes. Because CRDs are API definitions, you can generate protobuf and expose REST, gRPC, and multi-language SDKs on top of them. Teams call the platform API from a CLI, an internal portal, or standard Kubernetes clients through an API proxy — YAML becomes optional, not mandatory.