
Talos OS Upgrade

This runbook covers upgrading Talos Linux across the management cluster using talosctl. Talos upgrades are performed per-node and are separate from Kubernetes version upgrades.

Prerequisites

  • talosctl installed locally (via brew install siderolabs/tap/talosctl)
  • Access to the cluster's talosconfig (located in terraform/environments/hetzner-mgmt-cluster/1-bootstrap/talosconfig)
  • Node IPs (see Cluster Topology)

Cluster Topology

Node           Role          Internal IP  Public IP
talos-j1w-iy2  controlplane  10.0.0.10    167.235.51.134
talos-qtb-3xf  worker-1      10.0.0.20    78.47.74.196
talos-h6s-ylk  worker-2      10.0.0.21    116.203.85.92
talos-47q-dw8  worker-3      10.0.0.22    91.98.161.213

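Workers expose the Talos API (port 50000) only on their internal IPs. Every talosctl command therefore uses the control plane's public IP as --endpoints and selects the target with --nodes. For a worker, that target is its internal IP, and the control plane proxies the request to it.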
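The connection flags repeat in every command in this runbook. The following is a minimal bash sketch that keeps them in one place. It assumes the file is sourced from terraform/environments/hetzner-mgmt-cluster/1-bootstrap/ so that ./talosconfig resolves. talos, CONTROL_PLANE, and WORKERS are hypothetical names, not part of talosctl:

```shell
# Hypothetical helpers for this cluster; source from the 1-bootstrap directory.
CONTROL_PLANE=167.235.51.134             # control plane public IP, used as --endpoints
WORKERS=(10.0.0.20 10.0.0.21 10.0.0.22)  # worker internal IPs, reached via the endpoint

# Run talosctl against this cluster; --nodes, the subcommand and its flags pass through.
talos() {
  talosctl --talosconfig ./talosconfig --endpoints "$CONTROL_PLANE" "$@"
}
```

For example, talos --nodes "${WORKERS[0]}" version queries worker-1 (10.0.0.20) through the control plane.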

Upgrade Rules

  1. One minor version at a time. You cannot skip minor versions (e.g., 1.11 to 1.13 requires going through 1.12).
  2. Control plane first, then workers. The Kubernetes version skew policy requires the control plane to be at or ahead of workers. Follow the same order for Talos-only upgrades, even when the Kubernetes version does not change.
  3. One node at a time. Never pass multiple node IPs to a single upgrade command. This upgrades them in parallel, which can break etcd quorum or violate PDBs.
  4. Talos upgrades do not change the Kubernetes version. Use talosctl upgrade-k8s separately for that.
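Rule 1 can be checked before each hop. The following is a minimal bash sketch that assumes versions are written as v<major>.<minor>.<patch>. check_minor_step is a hypothetical helper: it only compares version strings and does not query the cluster.

```shell
# Hypothetical guard for rule 1: accept a patch bump or exactly one minor step.
# Usage: check_minor_step <current> <target>, e.g. check_minor_step v1.11.6 v1.12.7
check_minor_step() {
  local cur_major cur_minor tgt_major tgt_minor f
  IFS=. read -r cur_major cur_minor _ <<< "${1#v}"
  IFS=. read -r tgt_major tgt_minor _ <<< "${2#v}"
  # Reject anything that does not parse as <major>.<minor>[.<patch>].
  for f in "$cur_major" "$cur_minor" "$tgt_major" "$tgt_minor"; do
    case "$f" in
      ''|*[!0-9]*) echo "cannot parse '$1' or '$2'" >&2; return 1 ;;
    esac
  done
  if [ "$cur_major" != "$tgt_major" ]; then
    echo "$1 -> $2 changes the major version; check the release notes" >&2
    return 1
  fi
  if [ "$tgt_minor" -lt "$cur_minor" ] || [ "$tgt_minor" -gt $((cur_minor + 1)) ]; then
    echo "$1 -> $2 skips or reverses a minor version; step through each minor" >&2
    return 1
  fi
  return 0
}
```

Checked against the upgrade history at the end of this runbook, check_minor_step v1.11.5 v1.13.2 fails. Each hop that was actually taken passes: v1.11.5 to v1.11.6, v1.11.6 to v1.12.7, and v1.12.7 to v1.13.2.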

Pre-Upgrade Checklist

  1. Check the release notes for the target version on GitHub. Look for breaking changes, deprecated fields, and kernel version jumps.
  2. Verify Cilium compatibility with the new kernel version. The kernel can change significantly between Talos minor releases.
  3. Update your local talosctl to match or exceed the target version:
    brew upgrade talosctl
    
  4. Verify current cluster state:
    talosctl --talosconfig ./talosconfig --endpoints 167.235.51.134 \
      --nodes 167.235.51.134,10.0.0.20,10.0.0.21,10.0.0.22 version
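For checklist item 3, a version comparison helper saves you from comparing tags by eye. The following is a minimal sketch that assumes a sort supporting -V (GNU coreutils does). version_ge is a hypothetical name. Read the client tag by hand from the Client section of the talosctl version output.

```shell
# Hypothetical helper: succeed if version $1 >= version $2 (a leading "v" is ignored).
# Assumes `sort -V` (version sort) is available, as in GNU coreutils.
version_ge() {
  local a="${1#v}" b="${2#v}"
  # After a version sort the smaller value comes first; if that is $b, then $1 >= $2.
  [ "$(printf '%s\n%s\n' "$a" "$b" | sort -V | head -n 1)" = "$b" ]
}
```

For example, version_ge v1.13.2 v1.13.2 succeeds and version_ge v1.12.7 v1.13.2 fails.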
    

Upgrade Procedure

All commands assume you are in terraform/environments/hetzner-mgmt-cluster/1-bootstrap/.

Step 1: Upgrade the control plane

talosctl --talosconfig ./talosconfig --endpoints 167.235.51.134 \
  --nodes 167.235.51.134 upgrade \
  --image ghcr.io/siderolabs/installer:v<VERSION> --preserve

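The --preserve flag keeps the EPHEMERAL partition (mounted at /var, which holds the etcd data) intact. This matters here because the single control plane node runs the cluster's only etcd member. The flag is deprecated (removed in 1.18) but still recommended for control plane nodes on current versions.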

Wait for the node to come back, then confirm that it reports the target version and that etcd is healthy:

talosctl --talosconfig ./talosconfig --endpoints 167.235.51.134 \
  --nodes 167.235.51.134 version

talosctl --talosconfig ./talosconfig --endpoints 167.235.51.134 \
  --nodes 167.235.51.134 etcd status
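A polling helper can retry the checks above until the node answers, instead of you re-running them by hand. The following is a minimal bash sketch. wait_for is a hypothetical name, and the 600-second timeout and 10-second interval in the usage line are arbitrary choices, not talosctl features.

```shell
# Hypothetical polling helper: retry a command until it succeeds or the deadline passes.
# Usage: wait_for <timeout_seconds> <interval_seconds> <command> [args...]
wait_for() {
  local timeout="$1" interval="$2"
  shift 2
  local deadline=$(( $(date +%s) + timeout ))
  until "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "timed out after ${timeout}s waiting for: $*" >&2
      return 1
    fi
    sleep "$interval"
  done
}
```

For example: wait_for 600 10 talosctl --talosconfig ./talosconfig --endpoints 167.235.51.134 --nodes 167.235.51.134 etcd status. A successful call only proves the node is reachable. You still need to confirm that the Tag in the version output matches the target.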

Step 2: Upgrade workers one at a time

talosctl --talosconfig ./talosconfig --endpoints 167.235.51.134 \
  --nodes 10.0.0.20 upgrade \
  --image ghcr.io/siderolabs/installer:v<VERSION>

Wait for the node to rejoin the cluster and report Ready in kubectl --kubeconfig ./kubeconfig get nodes, then repeat for 10.0.0.21 and 10.0.0.22.
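Step 2 can be written as a loop that upgrades each worker and then blocks until Kubernetes reports it Ready before moving on. This is a sketch, and it makes several assumptions:
  • It runs from the 1-bootstrap directory.
  • Your talosctl's upgrade command waits for the node to come back before returning. Recent releases do this by default through --wait; confirm with talosctl upgrade --help.
  • The Kubernetes node names match the Node column in Cluster Topology.
upgrade_workers and run_or_print are hypothetical names, and the 10-minute timeout is an arbitrary choice.

```shell
# Hypothetical helper: run a command, or only print it when DRY_RUN=1.
run_or_print() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: upgrade the workers strictly one at a time.
# Usage: upgrade_workers <VERSION>   (without the leading "v", e.g. 1.13.2)
upgrade_workers() {
  local version="$1" pair ip name
  if [ -z "$version" ]; then
    echo "usage: upgrade_workers <VERSION>" >&2
    return 2
  fi
  # internal-IP:node-name pairs in upgrade order, copied from Cluster Topology
  for pair in 10.0.0.20:talos-qtb-3xf 10.0.0.21:talos-h6s-ylk 10.0.0.22:talos-47q-dw8; do
    ip="${pair%%:*}"
    name="${pair#*:}"
    run_or_print talosctl --talosconfig ./talosconfig --endpoints 167.235.51.134 \
      --nodes "$ip" upgrade --image "ghcr.io/siderolabs/installer:v${version}" || return 1
    # Do not touch the next worker until this one is Ready again.
    run_or_print kubectl --kubeconfig ./kubeconfig wait --for=condition=Ready \
      "node/${name}" --timeout=10m || return 1
  done
}
```

First run DRY_RUN=1 upgrade_workers 1.13.2 to review the six commands it would execute. Once the control plane upgrade from Step 1 is done, run it again without DRY_RUN. The loop stops at the first failure, so a broken node is never followed by another upgrade.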

Step 3: Verify

talosctl --talosconfig ./talosconfig --endpoints 167.235.51.134 \
  --nodes 167.235.51.134,10.0.0.20,10.0.0.21,10.0.0.22 version

kubectl --kubeconfig ./kubeconfig get nodes -o wide

kubectl --kubeconfig ./kubeconfig get pods -A | grep -Ev "Running|Completed"
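The grep in the last command only inspects STATUS, so a pod that is Running with 0/1 containers ready slips through. The following is a minimal sketch of a stricter filter. It assumes the default column layout of kubectl get pods -A (NAMESPACE, NAME, READY, STATUS, ...). unhealthy_pods is a hypothetical name.

```shell
# Hypothetical filter for `kubectl get pods -A --no-headers` output. Prints pods whose
# STATUS is not Running/Completed, and Running pods with fewer containers ready than
# expected (READY x/y with x != y).
unhealthy_pods() {
  awk '{
    split($3, ready, "/")   # column 3 is READY, e.g. "1/2"
    status = $4             # column 4 is STATUS
    if (status == "Completed") next
    if (status != "Running" || ready[1] != ready[2]) print
  }'
}
```

Usage: kubectl --kubeconfig ./kubeconfig get pods -A --no-headers | unhealthy_pods. Empty output means every pod is either Completed or Running with all containers ready.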

What Happens During a Node Upgrade

  1. Node cordons itself (no new pods scheduled)
  2. Node drains existing workloads (respects PDBs and grace periods)
  3. Services stop, filesystems unmount
  4. New image is written to disk
  5. Node reboots via kexec into the new version
  6. Node rejoins the cluster and uncordons itself

The full cycle takes 2-5 minutes per node. Because this cluster has only one control plane node, the Kubernetes API is unavailable while that node upgrades. Workloads already running on the workers keep running.

Rollback

If a node fails to boot the new version, the bootloader automatically reverts to the previous image. You can also manually roll back:

talosctl --talosconfig ./talosconfig --endpoints 167.235.51.134 \
  --nodes <NODE_IP> rollback

Single Control Plane Considerations

This cluster runs a single control plane node. During its upgrade:

  • Kubernetes API is unavailable (2-5 minutes)
  • Workloads on workers continue running
  • MetalLB continues announcing the floating IP
  • Cilium dataplane continues forwarding traffic
  • Flux CD reconciliation pauses and resumes automatically after reboot

Upgrade History

Date        From     To       Notes
2026-05-16  v1.11.5  v1.13.2  Stepped through v1.11.6 and v1.12.7. No issues.