Hetzner Cloud + MetalLB: Networking Challenges and Solutions

Overview

This document explains the networking challenges encountered when deploying MetalLB on Hetzner Cloud and the architectural decisions made to ensure reliable LoadBalancer functionality.

The Problem: Hetzner Cloud Networking Limitations

Floating IP Assignment Constraints

Hetzner Cloud has specific limitations regarding floating IP assignment and Layer 2 networking:

  1. Single Server Assignment: Floating IPs can only be assigned to one server at a time through the Hetzner Cloud API
  2. ARP/Layer 2 Limitations: Hetzner Cloud's network infrastructure has restrictions on ARP announcements from multiple servers
  3. BGP Not Available: Hetzner Cloud doesn't support BGP announcements for customer networks

MetalLB Mode Considerations

MetalLB offers two main modes for announcing service IPs:

Layer 2 Mode (L2)

  • How it works: Uses ARP/NDP announcements to claim IP addresses
  • Pros: Simple setup, no external dependencies
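  • Cons: Only one node announces each IP, so all traffic for that IP enters through that node (a bottleneck), and failover depends on another node taking over the announcement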

BGP Mode

  • How it works: Uses BGP protocol to announce routes
  • Pros: True load balancing, high availability
  • Cons: Requires BGP-capable network infrastructure (not available on Hetzner Cloud)
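To make the "one node announcing per IP" property of L2 mode concrete, here is a minimal sketch of a deterministic election. It is an illustration only, not MetalLB's actual code: the function name `elect_announcer`, the node names, and the hashing scheme are assumptions chosen for the example.

```python
import hashlib


def elect_announcer(service_ip: str, healthy_nodes: list[str]) -> str:
    """Pick exactly one node to announce service_ip.

    Every node that runs this with the same inputs reaches the same
    answer, so no coordination is needed to agree on the announcer.
    """
    if not healthy_nodes:
        raise ValueError("no healthy nodes available to announce the IP")
    # Rank nodes by a hash of (node name, IP) and take the smallest.
    return min(
        healthy_nodes,
        key=lambda node: hashlib.sha256(
            f"{node}#{service_ip}".encode()
        ).hexdigest(),
    )


# Hypothetical node names, for illustration only.
nodes = ["control-plane-1", "worker-1", "worker-2"]
winner = elect_announcer("116.202.177.24", nodes)
print(f"{winner} announces 116.202.177.24")
```

If the elected node drops out of the healthy list, rerunning the election over the remaining nodes yields a new announcer. On Hetzner Cloud that new node would also need the floating IP reassigned to it before it could receive traffic.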

Our Solution: Single Control Plane with L2 Mode

Architecture Decision

Given Hetzner Cloud's limitations, we implemented the following architecture:

┌─────────────────────────────────────────────────────────────────┐
│                    Hetzner Cloud Network                        │
│                                                                 │
│  ┌─────────────────┐    ┌──────────────────────────────────┐    │
│  │ Control Plane   │    │        Worker Nodes              │    │
│  │ (Single Node)   │    │                                  │    │
│  │                 │    │  ┌─────────┐ ┌─────────┐ ┌─────┐ │    │
│  │ Primary IP:     │    │  │Worker-1 │ │Worker-2 │ │ ... │ │    │
│  │ 46.224.x.x      │    │  └─────────┘ └─────────┘ └─────┘ │    │
│  │                 │    │                                  │    │
│  │ Floating IP:    │    │                                  │    │
│  │ 116.202.177.24  │◄───┼──── MetalLB L2 Advertisement     │    │
│  │                 │    │                                  │    │
│  └─────────────────┘    └──────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────┘

Key Configuration Details

1. Floating IP Assignment

# Terraform configuration
resource "hcloud_floating_ip_assignment" "cluster_ip_assignment" {
  floating_ip_id = hcloud_floating_ip.cluster_ip.id
  server_id      = hcloud_server.control_plane[0].id  # Single control plane
}

2. MetalLB L2Advertisement Configuration

apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: l2-advertisement
  namespace: metallb-system
spec:
  ipAddressPools:
    - ip-address-pool

3. IP Address Pool Configuration

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: ip-address-pool
  namespace: metallb-system
spec:
  addresses:
    - 116.202.177.24/32  # Single floating IP
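As a quick sanity check on what a /32 pool means, the sketch below expands a CIDR block into its individual addresses using Python's standard `ipaddress` module. The helper name `pool_addresses` is an assumption for illustration; the second CIDR is a documentation range used only for contrast.

```python
import ipaddress


def pool_addresses(cidr: str) -> list[str]:
    """Return every address contained in a CIDR block as strings."""
    network = ipaddress.ip_network(cidr, strict=True)
    return [str(address) for address in network]


# The pool from the configuration above holds exactly one address.
print(pool_addresses("116.202.177.24/32"))

# For contrast, a /30 (documentation range) holds four addresses.
print(pool_addresses("192.0.2.0/30"))
```

With a single address in the pool, only one LoadBalancer service can receive an external IP at a time unless services are explicitly configured to share it.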

Why Single Control Plane?

Technical Constraints

  1. Hetzner Floating IP Limitation:

     • Floating IPs can only be assigned to one server via Hetzner API
     • Multiple servers cannot claim the same floating IP simultaneously

  2. ARP Announcement Issues:

     • Hetzner's network may filter or block ARP announcements from "unauthorized" sources
     • Only the server with the assigned floating IP can reliably announce it

  3. MetalLB L2 Mode Behavior:

     • In L2 mode, MetalLB elects a single node to announce each IP
     • This works well with Hetzner's single-server floating IP model

Avoided Issues

By using a single control plane approach, we avoid:

  • Split-brain scenarios where multiple nodes try to claim the same IP
  • ARP conflicts that could cause network instability
  • Unreliable failover due to Hetzner's network filtering
  • Complex BGP setup that isn't supported on Hetzner Cloud

Network Interface Configuration

Talos Cluster Patch Considerations

We experimented with various network configurations in the Talos cluster patch:

# Initially tried explicit interface configuration
machine:
  network:
    interfaces:
      - interface: enp1s0
        addresses:
          - PRIMARY_IP/26      # Primary server IP
          - FLOATING_IP/32     # Floating IP
        routes:
          - network: 0.0.0.0/0
            gateway: GATEWAY_IP

Result: This caused network connectivity issues and DNS resolution problems.

Final Configuration

We simplified to let Hetzner Cloud handle the primary networking:

cluster:
  controlPlane:
    endpoint: https://PRIMARY_IP:6443  # Use primary IP for k8s API
  network:
    dnsDomain: cluster.local
    podSubnets:
      - 10.244.0.0/16
    serviceSubnets:
      - 10.96.0.0/12
# No machine network configuration - let Hetzner/DHCP handle it

Firewall Configuration

Required Ports

The Hetzner Cloud firewall must allow:

# LoadBalancer services (HTTP/HTTPS)
rule {
  direction  = "in"
  protocol   = "tcp"
  port       = "80"
  source_ips = ["0.0.0.0/0"]
}

rule {
  direction  = "in"
  protocol   = "tcp"
  port       = "443"
  source_ips = ["0.0.0.0/0"]
}

# Add other application-specific ports as needed

Limitations and Trade-offs

Current Limitations

  1. Single Point of Failure: The control plane is a single point of failure for LoadBalancer services
  2. No High Availability: LoadBalancer IPs cannot fail over to other nodes automatically
  3. Scaling Constraints: All LoadBalancer traffic flows through one node

Potential Future Solutions

  1. Multiple Floating IPs:

     • Assign different floating IPs to different services
     • Use multiple control planes, each with its own floating IP

  2. External Load Balancer:

     • Use Hetzner Cloud Load Balancer service
     • Route traffic to NodePort services

  3. Custom CNI Solutions:

     • Implement custom networking with proper ARP handling
     • Requires deep understanding of Hetzner's network behavior

Testing and Validation

Connectivity Tests

  1. Internal Connectivity:

    # From inside cluster
    kubectl run test --image=curlimages/curl --rm -i --restart=Never -- \
      curl -v http://FLOATING_IP:PORT
    

  2. External Connectivity:

    # From external network
    curl -v http://FLOATING_IP:PORT --connect-timeout 5
    

Expected Results

  • External, before the firewall fix: connection timeout (the firewall drops the traffic)
  • External, after the firewall fix: connection refused (service not ready yet) or a successful connection
  • Internal: succeeds as long as MetalLB is announcing the IP correctly
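The timeout, refused, and success outcomes listed above can also be told apart programmatically. The following is a minimal sketch using only the standard `socket` module; the function name `probe` and its return strings are assumptions made for this example.

```python
import socket


def probe(host: str, port: int, timeout: float = 5.0) -> str:
    """Classify a TCP connection attempt.

    Returns "connected", "timeout", "refused", or "error: <detail>".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except socket.timeout:
        # No answer at all: typical when a firewall silently drops packets.
        return "timeout"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on the port yet.
        return "refused"
    except OSError as exc:
        # Anything else, e.g. DNS failure or an unreachable network.
        return f"error: {exc}"
```

Calling `probe("FLOATING_IP", PORT)` from an external machine (with the placeholders replaced) should report "timeout" while the firewall is blocking, and "refused" or "connected" once the rule is in place.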

Best Practices

Deployment Order

  1. Deploy Kubernetes cluster with single control plane
  2. Assign floating IP to control plane server via Terraform
  3. Deploy MetalLB with L2 mode configuration
  4. Configure Hetzner Cloud firewall rules
  5. Deploy applications with LoadBalancer services

Monitoring

Monitor the following for network health:

  • MetalLB speaker logs for ARP announcements
  • Service external IP assignment status
  • Connectivity from both internal and external sources
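For the external IP assignment check, one option is to scan the JSON that `kubectl get svc -A -o json` produces and flag LoadBalancer services that have no ingress address yet. This is a sketch under assumptions: the helper name `pending_load_balancers` and the sample data are hypothetical, while the `status.loadBalancer.ingress` field is part of the Kubernetes Service API.

```python
def pending_load_balancers(service_list: dict) -> list[str]:
    """Return "namespace/name" for LoadBalancer services without an external IP."""
    pending = []
    for svc in service_list.get("items", []):
        if svc.get("spec", {}).get("type") != "LoadBalancer":
            continue
        ingress = svc.get("status", {}).get("loadBalancer", {}).get("ingress") or []
        has_address = any(
            entry.get("ip") or entry.get("hostname") for entry in ingress
        )
        if not has_address:
            meta = svc.get("metadata", {})
            pending.append(
                f"{meta.get('namespace', 'default')}/{meta.get('name', '?')}"
            )
    return pending


# Hypothetical sample shaped like `kubectl get svc -A -o json` output.
SAMPLE = {
    "items": [
        {
            "metadata": {"namespace": "web", "name": "ingress"},
            "spec": {"type": "LoadBalancer"},
            "status": {"loadBalancer": {"ingress": [{"ip": "116.202.177.24"}]}},
        },
        {
            "metadata": {"namespace": "web", "name": "api"},
            "spec": {"type": "LoadBalancer"},
            "status": {"loadBalancer": {}},
        },
    ]
}
print(pending_load_balancers(SAMPLE))
```

In practice the dictionary would come from `json.load` on the kubectl output rather than from inline sample data.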

Security Considerations

  • Minimize exposed ports in Hetzner Cloud firewall
  • Use TLS/HTTPS for all external services
  • Consider network policies within the cluster
  • Regularly audit firewall rules

GitOps Architecture with Flux

Component Separation Strategy

Based on recent deployment challenges, we implemented a two-phase component architecture to resolve CRD dependency issues:

Phase 1: MetalLB Installation Component

components/metallb-install/
├── namespace.yaml          # metallb-system namespace
├── helmrepository.yaml     # MetalLB Helm repository
├── helmrelease.yaml        # MetalLB Helm chart installation
└── kustomization.yaml      # Kustomize configuration

Phase 2: MetalLB Configuration Component

components/metallb-config/
├── ipaddresspool.yaml      # IP address pool configuration
├── l2advertisement.yaml    # L2 advertisement settings
└── kustomization.yaml      # Kustomize configuration

Flux Kustomization Structure

The cluster-level Flux Kustomizations ensure proper deployment ordering:

# cluster/hetzner-mgmt/metallb/kustomization.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: metallb-install
  namespace: flux-system
spec:
  interval: 10m0s
  path: ./components/metallb-install
  prune: true
  sourceRef:
    kind: GitRepository
    name: bootstrap-cluster
  healthChecks:
    - apiVersion: helm.toolkit.fluxcd.io/v2beta2
      kind: HelmRelease
      name: metallb
      namespace: metallb-system
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: metallb-config
  namespace: flux-system
spec:
  interval: 10m0s
  path: ./components/metallb-config
  prune: true
  sourceRef:
    kind: GitRepository
    name: bootstrap-cluster
  dependsOn:
    - name: metallb-install
  postBuild:
    substituteFrom:
      - kind: ConfigMap
        name: cluster-config
        optional: false

Key Benefits of This Architecture

  1. CRD Dependency Resolution: Installation phase creates MetalLB CRDs before configuration phase attempts to use them
  2. Clean Component Separation: Each phase has distinct responsibilities and can be managed independently
  3. Flux Health Checks: The metallb-install Kustomization only reports ready once the MetalLB HelmRelease is ready
  4. ConfigMap Substitution: Dynamic IP configuration through Flux's postBuild.substituteFrom
  5. Dependency Ordering: dependsOn ensures installation completes before configuration
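The effect of dependsOn can be pictured as a topological ordering over the Kustomizations. The sketch below uses Python's standard `graphlib` to derive that order from the two names used above. It only models the ordering: Flux itself does not compute a static plan but makes each Kustomization wait until its dependencies report ready. The helper name `apply_order` is an assumption.

```python
from graphlib import CycleError, TopologicalSorter

# Each key depends on the names in its list (mirrors spec.dependsOn).
DEPENDS_ON = {
    "metallb-install": [],
    "metallb-config": ["metallb-install"],
}


def apply_order(depends_on: dict[str, list[str]]) -> list[str]:
    """Return an order in which every item comes after its dependencies."""
    return list(TopologicalSorter(depends_on).static_order())


print(apply_order(DEPENDS_ON))

# A circular dependency cannot be ordered and is rejected.
try:
    apply_order({"a": ["b"], "b": ["a"]})
except CycleError:
    print("circular dependsOn detected")
```

Because metallb-config lists metallb-install as a dependency, the installation (and with it the CRDs) always comes first in the resulting order.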

Resolved Issues

CRD Validation Errors

Problem: Terraform planning failed with "the server could not find the requested resource" for MetalLB CRDs.

Solution:

  • Separated installation from configuration components
  • Used Flux Kustomizations with proper dependsOn relationships
  • Moved from kubernetes_manifest to kubectl apply via null_resource in Terraform

API Version Compatibility

Problem: Mismatched API versions between different MetalLB versions and Flux components.

Solution:

  • Inspected actual CRDs using kubectl get crds
  • Updated to correct API versions: HelmRepository v1beta2, HelmRelease v2beta2
  • Used Helm chart deployment for consistent API versions

Template Variable Substitution

Problem: Static IP configuration in YAML files caused Git conflicts with dynamic infrastructure.

Solution:

  • Used template variables like ${metallb_floating_ip} in configuration files
  • Leveraged Flux's postBuild.substituteFrom with ConfigMap containing dynamic values
  • Maintained git-trackable configuration while supporting dynamic IP assignment

Deployment Automation

Terraform Integration

The 2-components Terraform configuration creates the ConfigMap that holds the cluster values:

# Create ConfigMap with cluster configuration
resource "kubernetes_config_map" "cluster_config" {
  metadata {
    name      = "cluster-config"
    namespace = "flux-system"
  }

  data = {
    metallb_floating_ip = data.terraform_remote_state.bootstrap.outputs.floating_ip
    cluster_name        = var.cluster_name
    # Add other dynamic values as needed
  }
}

Template-Based Configuration

IP address pools use template variables for dynamic substitution:

# components/metallb-config/ipaddresspool.yaml
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: ip-address-pool
  namespace: metallb-system
spec:
  addresses:
    - ${metallb_floating_ip}/32  # Substituted by Flux from ConfigMap
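To see what the substitution step does to this manifest, here is a rough local approximation using Python's `string.Template`, which also understands the `${name}` syntax. It is a sketch for previewing the result, not Flux's implementation: Flux's own substitution supports additional syntax that `string.Template` does not, and the helper name `render` is an assumption.

```python
from string import Template

MANIFEST = """\
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: ip-address-pool
  namespace: metallb-system
spec:
  addresses:
    - ${metallb_floating_ip}/32
"""


def render(manifest: str, values: dict[str, str]) -> str:
    """Replace ${name} placeholders; raises KeyError if a value is missing."""
    return Template(manifest).substitute(values)


# The value would normally come from the cluster-config ConfigMap.
print(render(MANIFEST, {"metallb_floating_ip": "116.202.177.24"}))
```

Failing loudly on a missing key is a deliberate choice in this sketch, since an unsubstituted placeholder would otherwise produce an invalid address in the pool.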

Conclusion

While Hetzner Cloud's networking limitations require a single control plane architecture for MetalLB L2 mode, this solution provides:

  • Reliable LoadBalancer functionality
  • Predictable network behavior
  • Simple configuration and maintenance
  • Cost-effective solution for many use cases
  • Automated GitOps deployment with proper dependency management
  • Template-based dynamic configuration avoiding Git conflicts
  • Two-phase component architecture resolving CRD timing issues

For high-availability requirements, consider using Hetzner Cloud Load Balancer service or implementing multiple floating IPs with service-specific assignment patterns.