Talos OS Upgrade
This runbook covers upgrading Talos Linux across the management cluster using talosctl. Talos upgrades are performed per-node and are separate from Kubernetes version upgrades.
Prerequisites
- talosctl installed locally (via brew install siderolabs/tap/talosctl)
- Access to the cluster's talosconfig (located in terraform/environments/hetzner-mgmt-cluster/1-bootstrap/talosconfig)
- Node IPs (see Cluster Topology)
Cluster Topology
| Node | Role | Internal IP | Public IP |
|---|---|---|---|
| talos-j1w-iy2 | controlplane | 10.0.0.10 | 167.235.51.134 |
| talos-qtb-3xf | worker-1 | 10.0.0.20 | 78.47.74.196 |
| talos-h6s-ylk | worker-2 | 10.0.0.21 | 116.203.85.92 |
| talos-47q-dw8 | worker-3 | 10.0.0.22 | 91.98.161.213 |
Workers are only reachable on port 50000 via their internal IPs, routed through the control plane endpoint.
Upgrade Rules
- One minor version at a time. You cannot skip minor versions (e.g., 1.11 to 1.13 requires going through 1.12).
- Control plane first, then workers. The Kubernetes version skew policy requires the control plane to be at or ahead of workers. This applies even for Talos-only upgrades where the K8s version doesn't change.
- One node at a time. Never pass multiple node IPs to a single upgrade command — it upgrades them in parallel and can break etcd quorum or violate PDBs.
- Talos upgrades do not change the Kubernetes version. Use talosctl upgrade-k8s separately for that.
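
The one-minor-at-a-time rule can be turned into a small planning helper. This is a minimal sketch, not part of the original runbook: upgrade_path is a hypothetical function name, it only handles upgrades within one major version, and it does not choose patch releases, so picking the patch version for each minor is still up to the operator.

```shell
# Print each minor version to step through, one per line, ending at the target.
# Hypothetical helper; usage: upgrade_path <current major.minor> <target major.minor>
upgrade_path() (
  cur_major=${1%%.*}
  cur_minor=${1#*.}; cur_minor=${cur_minor%%.*}   # tolerate a trailing patch number
  tgt_major=${2%%.*}
  tgt_minor=${2#*.}; tgt_minor=${tgt_minor%%.*}
  # Only same-major paths are handled in this sketch.
  if [ "$cur_major" != "$tgt_major" ]; then
    echo "major versions differ: $1 vs $2" >&2
    exit 1
  fi
  minor=$((cur_minor + 1))
  while [ "$minor" -le "$tgt_minor" ]; do
    echo "$cur_major.$minor"
    minor=$((minor + 1))
  done
)
```

Calling upgrade_path 1.11 1.13 prints 1.12 and then 1.13, which mirrors the stepping order in the example above.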
Pre-Upgrade Checklist
- Check the release notes for the target version on GitHub. Look for breaking changes, deprecated fields, and kernel version jumps.
- Verify Cilium compatibility with the new kernel version — Talos kernel jumps can be significant between minors.
- Update your local talosctl to match or exceed the target version.
- Verify current cluster state.
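
The last two checklist items can be scripted roughly as follows. This is a sketch under assumptions: preflight is a hypothetical function name, the paths and the control plane IP are taken from this runbook, and it is meant to be run from the 1-bootstrap directory. It only prints versions and health output; comparing the client version against the target is left to the reader.

```shell
# Values copied from this runbook; adjust if the topology changes.
TALOSCONFIG_PATH=./talosconfig
CONTROL_PLANE=167.235.51.134

# Hypothetical helper: print client version, node version, health, and node list.
preflight() {
  # Local client version; should match or exceed the target Talos version.
  talosctl version --client || return 1
  # Talos version currently running on the control plane node.
  talosctl --talosconfig "$TALOSCONFIG_PATH" --endpoints "$CONTROL_PLANE" \
    --nodes "$CONTROL_PLANE" version || return 1
  # Cluster health checks as seen from the control plane node.
  talosctl --talosconfig "$TALOSCONFIG_PATH" --endpoints "$CONTROL_PLANE" \
    --nodes "$CONTROL_PLANE" health || return 1
  # Every Kubernetes node should be Ready before starting.
  kubectl --kubeconfig ./kubeconfig get nodes -o wide
}
```

Stopping at the first failing command is deliberate here: an unhealthy cluster is a reason to postpone the upgrade rather than continue.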
Upgrade Procedure
All commands assume you are in terraform/environments/hetzner-mgmt-cluster/1-bootstrap/.
Step 1: Upgrade the control plane
talosctl --talosconfig ./talosconfig --endpoints 167.235.51.134 \
--nodes 167.235.51.134 upgrade \
--image ghcr.io/siderolabs/installer:v<VERSION> --preserve
The --preserve flag keeps the EPHEMERAL partition intact, which matters here because that partition holds the etcd data of the only control plane node. It is deprecated (removed in 1.18) but still recommended for control plane nodes on current versions.
Wait for the node to come back and verify:
talosctl --talosconfig ./talosconfig --endpoints 167.235.51.134 \
--nodes 167.235.51.134 version
talosctl --talosconfig ./talosconfig --endpoints 167.235.51.134 \
--nodes 167.235.51.134 etcd status
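
Waiting for the node to come back can be automated with a simple polling loop. This is a minimal sketch: wait_for_node is a hypothetical helper, and the default of 60 attempts spaced 5 seconds apart is an assumption chosen to cover the 2-5 minute upgrade cycle. It only confirms that the Talos API answers, so the reported version should still be checked by hand.

```shell
# Poll the Talos API on a node until it answers or the attempts run out.
# Hypothetical helper; usage: wait_for_node <node-ip> [attempts] [delay-seconds]
wait_for_node() (
  node=$1
  attempts=${2:-60}
  delay=${3:-5}
  i=1
  while [ "$i" -le "$attempts" ]; do
    if talosctl --talosconfig ./talosconfig --endpoints 167.235.51.134 \
      --nodes "$node" version >/dev/null 2>&1; then
      echo "node $node is answering after $i attempt(s)"
      exit 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "node $node did not answer after $attempts attempts" >&2
  exit 1
)
```

For the control plane this would be called as wait_for_node 167.235.51.134, followed by the version and etcd status commands above.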
Step 2: Upgrade workers one at a time
talosctl --talosconfig ./talosconfig --endpoints 167.235.51.134 \
--nodes 10.0.0.20 upgrade \
--image ghcr.io/siderolabs/installer:v<VERSION>
Wait for the node to rejoin, then repeat for 10.0.0.21 and 10.0.0.22.
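
The repeat-per-worker step could be scripted as a strictly sequential loop. This is a sketch under several assumptions: upgrade_workers is a hypothetical helper, the Kubernetes node names are assumed to match the Node column of the Cluster Topology table, and the 10 minute timeout is arbitrary. Depending on the talosctl version, the upgrade command may return before the reboot has finished, in which case the Ready check can pass too early, so treat this as a starting point rather than a finished tool.

```shell
# Upgrade the workers strictly one at a time, stopping at the first failure.
# Hypothetical helper; usage: upgrade_workers <version without the leading v>
upgrade_workers() (
  version=$1
  # "name:internal-ip" pairs copied from the Cluster Topology table.
  for entry in talos-qtb-3xf:10.0.0.20 talos-h6s-ylk:10.0.0.21 talos-47q-dw8:10.0.0.22; do
    name=${entry%%:*}
    ip=${entry#*:}
    echo "upgrading $name ($ip)"
    talosctl --talosconfig ./talosconfig --endpoints 167.235.51.134 \
      --nodes "$ip" upgrade \
      --image "ghcr.io/siderolabs/installer:v$version" || exit 1
    # Block until Kubernetes reports the node Ready before touching the next one.
    kubectl --kubeconfig ./kubeconfig wait --for=condition=Ready \
      "node/$name" --timeout=10m || exit 1
  done
)
```

Each iteration passes exactly one IP to --nodes, which keeps the loop consistent with the one-node-at-a-time rule.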
Step 3: Verify
talosctl --talosconfig ./talosconfig --endpoints 167.235.51.134 \
--nodes 167.235.51.134,10.0.0.20,10.0.0.21,10.0.0.22 version
kubectl --kubeconfig ./kubeconfig get nodes -o wide
kubectl --kubeconfig ./kubeconfig get pods -A | grep -v "Running\|Completed"
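
To make the node check easier to read at a glance, the node list can be filtered down to anything that is not fully back in service. This is a small sketch with a hypothetical function name, not_ready_nodes; it assumes the default column layout of kubectl get nodes, where the second column is STATUS.

```shell
# Print the name and status of every node whose STATUS is not exactly "Ready".
# A node that is still cordoned shows "Ready,SchedulingDisabled" and is listed too.
not_ready_nodes() {
  kubectl --kubeconfig ./kubeconfig get nodes --no-headers |
    awk '$2 != "Ready" { print $1, $2 }'
}
```

Empty output means every node is Ready and schedulable.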
What Happens During a Node Upgrade
- Node cordons itself (no new pods scheduled)
- Node drains existing workloads (respects PDBs and grace periods)
- Services stop, filesystems unmount
- New image is written to disk
- Node reboots via kexec into the new version
- Node rejoins the cluster and uncordons itself
The entire cycle per node takes 2-5 minutes. During the control plane upgrade, the Kubernetes API is unavailable but running workloads continue unaffected.
Rollback
If a node fails to boot the new version, the bootloader automatically reverts to the previous image. You can also roll back manually with talosctl rollback.
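
A manual rollback might look like the sketch below. talosctl rollback is the Talos subcommand for returning a node to its previous installation; the wrapper name rollback_node is hypothetical, and the endpoint and talosconfig path are the ones used elsewhere in this runbook.

```shell
# Ask a single node to boot back into its previous Talos installation.
# Hypothetical helper; usage: rollback_node <node-ip>
rollback_node() {
  if [ -z "$1" ]; then
    echo "usage: rollback_node <node-ip>" >&2
    return 1
  fi
  talosctl --talosconfig ./talosconfig --endpoints 167.235.51.134 \
    --nodes "$1" rollback
}
```

As with upgrades, it is safer to roll back one node at a time and confirm it has rejoined before moving on.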
Single Control Plane Considerations
This cluster runs a single control plane node. During its upgrade:
- Kubernetes API is unavailable (2-5 minutes)
- Workloads on workers continue running
- MetalLB continues announcing the floating IP
- Cilium dataplane continues forwarding traffic
- Flux CD reconciliation pauses and resumes automatically after reboot
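
The API outage can be observed from a workstation by polling the API server's readiness endpoint until it responds again. This is a sketch only: wait_for_api is a hypothetical helper, and the attempt count, delay, and 5 second request timeout are assumptions rather than values from this cluster's configuration.

```shell
# Poll the Kubernetes API readiness endpoint until it responds or attempts run out.
# Hypothetical helper; usage: wait_for_api [attempts] [delay-seconds]
wait_for_api() (
  attempts=${1:-60}
  delay=${2:-5}
  i=1
  while [ "$i" -le "$attempts" ]; do
    # --request-timeout keeps a single poll from hanging while the API is down.
    if kubectl --kubeconfig ./kubeconfig --request-timeout=5s \
      get --raw /readyz >/dev/null 2>&1; then
      echo "Kubernetes API is ready after $i attempt(s)"
      exit 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "Kubernetes API not ready after $attempts attempts" >&2
  exit 1
)
```

Running this right after starting the control plane upgrade gives a rough measure of how long the API was actually unavailable.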
Upgrade History
| Date | From | To | Notes |
|---|---|---|---|
| 2026-05-16 | v1.11.5 | v1.13.2 | Stepped through v1.11.6 and v1.12.7. No issues. |