# 0001 - Cluster Provisioning and Self-Service Infrastructure Portal

## Status

Accepted

## Date

2026-03-30

## Context
The platform currently provisions Kubernetes clusters and surrounding infrastructure through Gitea Actions pipelines that invoke Terraform. While functional, this approach requires contributors to interact directly with CI pipelines to create clusters, modify component selections, or inspect infrastructure state — a workflow that is cumbersome and error-prone.
The desired end-state is a self-service portal where operators can:
- Log in and create new clusters
- View live state of existing clusters
- Select and configure components to deploy on a cluster (e.g. ingress, cert-manager, monitoring stacks, storage)
- Decommission clusters

All of this without touching Gitea, pipeline YAML, or a CLI.
Two primary technologies have been evaluated: Crossplane and Terraform (with a UI layer). The Gitea Actions baseline is retained as the status-quo option for comparison.
## Decision Drivers
- Operator experience: provisioning must be approachable without deep CLI or GitOps knowledge
- State visibility: live cluster and component state must be queryable at any time
- Extensibility: new cluster types or components should be addable without re-architecting the control plane
- GitOps alignment: changes should ideally be auditable and version-controlled
- Operational overhead: the solution itself must not be harder to maintain than the problem it solves
- Drift detection: the system should reconcile actual state toward desired state automatically
## Considered Options
- Option A — Crossplane with a self-service UI (e.g. Backstage or a custom portal)
- Option B — Terraform with a UI layer (e.g. Terraform Enterprise / OpenTofu + Atlantis / custom)
- Option C — Keep Gitea Actions + Terraform (status quo)
## Decision Outcome
Chosen option: Option A — Crossplane with a self-service UI, because it provides Kubernetes-native control-plane semantics, continuous reconciliation, and a clean API surface that a self-service portal can talk to directly. It eliminates the need for a separate pipeline engine for day-2 operations and naturally exposes cluster and component state via standard Kubernetes resources.
### Positive Consequences
- Cluster provisioning becomes a kubectl/API call; the web UI simply creates Crossplane Composite Resources (XRs)
- Continuous reconciliation means drift is automatically corrected without manual pipeline re-runs
- Component selection maps cleanly to Crossplane Compositions — each optional component is a composable building block
- Live state is available via Kubernetes API at all times (no separate state backend to query)
- One control-plane cluster can manage many downstream clusters
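
To illustrate the first consequence above: assuming an XRD that exposes a namespaced `Cluster` claim in a made-up `example.org` API group (the group, kind, and fields are illustrative, not the project's actual definitions), the portal's "create cluster" action reduces to applying a manifest like:

```yaml
# Hypothetical claim; assumes an XRD defining a namespaced `Cluster`
# claim kind in an illustrative `example.org` API group.
apiVersion: example.org/v1alpha1
kind: Cluster
metadata:
  name: team-a-dev
  namespace: team-a
spec:
  nodeCount: 3
  components:
    - ingress-nginx
    - cert-manager
```

Crossplane then reconciles this claim into the actual cloud resources; deleting the claim decommissions the cluster.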
### Negative Consequences
- Crossplane has a steeper initial learning curve than Terraform for operators unfamiliar with Kubernetes CRDs
- Existing Terraform modules cannot be reused directly (though Crossplane's Terraform provider exists as a bridge)
- The control-plane cluster becomes a critical piece of infrastructure that must be highly available
- Debugging failed Compositions requires understanding Crossplane's event and condition model
## Pros and Cons of the Options
### Option A — Crossplane + Self-Service UI
Crossplane runs as a Kubernetes operator and exposes infrastructure abstractions as CRDs. A portal (Backstage, a custom React app, or similar) calls the Kubernetes API to create/update/delete Composite Resources, while Crossplane reconciles them toward the desired state in the cloud provider.
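
As a minimal sketch of that portal-to-API interaction (the `example.org` API group and `Cluster` claim kind are assumptions, not existing resources), a portal backend only needs to build a claim manifest and hand it to any Kubernetes client, or pipe it to `kubectl apply -f -`:

```python
# Hypothetical portal-backend handler for "create cluster".
# The API group/version and claim kind are illustrative assumptions.
import json


def build_cluster_claim(name: str, namespace: str, components: list[str]) -> dict:
    """Return the Crossplane claim manifest the portal would apply."""
    return {
        "apiVersion": "example.org/v1alpha1",  # assumed XRD group/version
        "kind": "Cluster",                     # assumed claim kind
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"components": components},
    }


if __name__ == "__main__":
    claim = build_cluster_claim("team-a-dev", "team-a",
                                ["ingress-nginx", "cert-manager"])
    # Kubernetes accepts JSON manifests as well as YAML, so this can be
    # piped straight to `kubectl apply -f -` or POSTed via a client library.
    print(json.dumps(claim, indent=2))
```

Deletion and updates follow the same shape: the portal issues standard Kubernetes API calls against the claim, and Crossplane handles the rest.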
Pros:
- Native Kubernetes — no additional state backend; state lives in etcd
- Continuous reconciliation: drift is auto-corrected without a manual trigger
- Composable: XRDs and Compositions let you model "cluster + components" as a single declarative unit
- RBAC is standard Kubernetes RBAC — no second auth system
- Clean API surface for the web portal (just the Kubernetes API)
- Active CNCF project with a growing ecosystem (Upbound, provider-helm, provider-kubernetes)
Cons:
- Requires a long-lived management cluster; that cluster must be robust
- CRD/Composition authoring has a learning curve
- Existing Terraform infrastructure code cannot be lifted directly; migration effort is required
- Smaller community than Terraform; fewer battle-tested providers for niche infrastructure
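
The composability pro above rests on XRDs. As a hedged sketch (the group name, kinds, and schema are assumptions, not the project's actual definitions), a cluster-plus-components API could be defined like this:

```yaml
# Illustrative XRD sketch; names and schema are assumptions.
apiVersion: apiextensions.crossplane.io/v1
kind: CompositeResourceDefinition
metadata:
  name: xclusters.example.org
spec:
  group: example.org
  names:
    kind: XCluster
    plural: xclusters
  claimNames:
    kind: Cluster
    plural: clusters
  versions:
    - name: v1alpha1
      served: true
      referenceable: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                nodeCount:
                  type: integer
                components:
                  type: array
                  items:
                    type: string
```

A matching Composition would then map each entry in `spec.components` to managed resources (e.g. via provider-helm releases), which is what makes "cluster + components" a single declarative unit.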
### Option B — Terraform + UI Layer
Terraform (or OpenTofu) manages infrastructure state in a remote backend. A UI layer (Terraform Enterprise, Spacelift, Atlantis, or a custom app) triggers plan/apply runs and surfaces state.
Pros:
- Mature ecosystem — wide provider coverage, extensive community modules
- Existing Terraform modules can be reused immediately
- Remote state backends (S3 + DynamoDB, Terraform Cloud) are well understood
- Spacelift/Atlantis provide a reasonable self-service workflow with minimal custom code
Cons:
- Provisioning is imperative and pipeline-driven — apply runs must be triggered; there is no continuous reconciliation
- Drift detection requires explicitly scheduled plan runs; it is not real-time
- A self-service portal requires either a third-party SaaS (Terraform Cloud/Spacelift cost) or significant custom engineering on top of Atlantis
- Component selection (e.g. enabling cert-manager) means managing additional Terraform workspaces or module flags, which complicates the portal data model
- Two state systems (Gitea Actions + the Terraform state backend) continue to coexist
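
To make the module-flag con concrete, component selection under this option typically looks like the following hedged sketch (the variable name and module path are illustrative):

```hcl
# Illustrative sketch: each optional component becomes a boolean flag
# the portal must track, plus a conditionally-instantiated module.
variable "enable_cert_manager" {
  type    = bool
  default = false
}

module "cert_manager" {
  source = "./modules/cert-manager" # assumed local module path
  count  = var.enable_cert_manager ? 1 : 0
}
```

Every new component adds another flag/module pair, and the portal must mirror all of them in its own data model and trigger a plan/apply run on each change.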
### Option C — Keep Gitea Actions + Terraform (Status Quo)
Cluster and component provisioning is driven by Gitea Action workflows that call Terraform. Operators trigger pipelines by committing to a repo or manually running a workflow.
Pros:
- No migration cost — already in place
- The team is familiar with the current setup
Cons:
- No self-service UI — every change requires pipeline interaction
- No live state visibility outside of Terraform state files and pipeline logs
- Gitea Actions is not designed as a provisioning control plane; workarounds accumulate over time
- Operator experience is poor; routine tasks carry high friction
- Does not scale well as the number of clusters grows