netboot/OPERATIONS.md
Torbjørn Lindahl 3f191d8f93 Add NVMe storage auto-setup, sops secrets, fix SSH permissions
- setup-node-storage service auto-partitions NVMe for containerd/longhorn
- Root password encrypted with sops/age, decrypted during build
- Fix SSH host key permissions (0600) so sshd actually starts
- Disable SSH socket activation for reliable boot
- Add OPERATIONS.md with runbook
- Makefile tracks source dependencies
2026-02-06 00:58:38 +01:00

Netboot Operations Guide

This document covers day-to-day operations for the netboot K3s cluster system.

Quick Reference

# Build new image (15-30 min, requires sudo)
cd /home/lindahl/git/netboot
sudo ./build-image.sh
make deploy

# Rebuild initramfs only (faster, ~2 min)
sudo ./rebuild-initramfs.sh
make deploy

# SSH to a node
ssh root@192.168.100.51

# Check node storage
ssh root@192.168.100.51 "lsblk && df -h /var/lib/containerd /var/lib/longhorn"

Architecture Overview

┌─────────────────┐     HTTP (8800)      ┌──────────────────┐
│  Phoenix NAS    │◄────────────────────►│   K3s Nodes      │
│  192.168.100.1  │                      │  192.168.100.5x  │
├─────────────────┤                      ├──────────────────┤
│ /srv/netboot/   │                      │ RAM (overlay)    │
│   http/         │                      │   └─ / (root)    │
│     vmlinuz     │                      │ NVMe (persistent)│
│     initrd-netboot.img                 │   ├─ containerd  │
│     filesystem.squashfs                │   └─ longhorn    │
│     boot.ipxe   │                      └──────────────────┘
└─────────────────┘

Boot sequence:

  1. Node PXE boots → loads iPXE
  2. iPXE fetches boot.ipxe from phoenix
  3. Downloads kernel + initramfs
  4. Initramfs downloads squashfs root over HTTP
  5. Mounts squashfs read-only with tmpfs overlay
  6. setup-node-storage.service partitions/mounts local NVMe
  7. System starts, K3s joins cluster
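Steps 4 and 5 can be pictured with the sketch below. It is not the contents of initramfs/scripts/netboot: the function name `netboot_root_plan`, the scratch directory /run/netboot and the /root mount point are assumptions made for illustration, while the URL and the 2G overlay size are taken from this guide. The function only prints the commands, so it is safe to run anywhere.

```shell
# Hypothetical sketch of boot steps 4-5: print (do not run) the commands an
# initramfs script could use to fetch the squashfs and build the overlay root.
netboot_root_plan() {
    url=${1:-http://192.168.100.1:8800/filesystem.squashfs}
    base=${2:-/run/netboot}   # assumed scratch directory (lives in RAM)
    newroot=${3:-/root}       # assumed mount point for the final root

    echo "mkdir -p $base/ro $base/rw"
    echo "wget -O $base/filesystem.squashfs $url"
    # Read-only lower layer: the downloaded squashfs image
    echo "mount -t squashfs -o loop,ro $base/filesystem.squashfs $base/ro"
    # Writable upper layer: a tmpfs, so all changes are lost on reboot
    echo "mount -t tmpfs -o size=2G tmpfs $base/rw"
    echo "mkdir -p $base/rw/upper $base/rw/work"
    echo "mount -t overlay -o lowerdir=$base/ro,upperdir=$base/rw/upper,workdir=$base/rw/work overlay $newroot"
}

netboot_root_plan
```

Because the upper layer is a tmpfs, anything written to / disappears at reboot; only the NVMe mounts set up in step 6 persist.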

Building Images

Full Build

Builds everything from scratch: debootstrap, packages, initramfs, squashfs.

cd /home/lindahl/git/netboot
sudo ./build-image.sh
make deploy

Time: 15-30 minutes

When to use: Package changes, kernel updates, major configuration changes

Initramfs-Only Rebuild

Faster rebuild when only changing boot/network logic.

sudo ./rebuild-initramfs.sh
make deploy

Time: ~2 minutes

When to use: Changes to initramfs/ scripts or hooks

Verify Build

Check that all components are present and valid:

./verify-image.sh
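The contents of verify-image.sh are not reproduced here. As an illustration of the kind of check such a script could make, the sketch below tests whether a file starts with the SquashFS magic bytes `hsqs`. The function name `is_squashfs` is hypothetical.

```shell
# Return 0 if FILE begins with the 4-byte SquashFS magic "hsqs".
# NUL bytes are stripped so binary files do not trigger shell warnings.
is_squashfs() {
    magic=$(dd if="$1" bs=4 count=1 2>/dev/null | tr -d '\000')
    [ "$magic" = "hsqs" ]
}

# Example (run from the repository root, assuming the image is in http/):
#   is_squashfs http/filesystem.squashfs && echo "squashfs looks valid"
```

A truncated or failed build often leaves an empty or partial file, which this check catches before the image is deployed.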

Secret Management

Secrets are encrypted with sops using age encryption. The age key lives on phoenix.

Encrypted Files

| File | Contents |
|------|----------|
| secrets/netboot.sops.yaml | Root password hash for console login |

Viewing Secrets

# From any machine with SSH access to phoenix
cat secrets/netboot.sops.yaml | ssh phoenix "sops -d --input-type yaml --output-type yaml /dev/stdin"

Updating Root Password

  1. Generate new password hash:

    ssh phoenix "echo 'newpassword' | openssl passwd -6 -stdin"
    
  2. Update the encrypted file:

    ssh phoenix "cd /path/to/netboot && sops secrets/netboot.sops.yaml"
    # Edit root_password_hash value, save
    

    Or recreate entirely:

    NEW_HASH=$(ssh phoenix "echo 'newpassword' | openssl passwd -6 -stdin")
    ssh phoenix "echo 'root_password_hash: \"$NEW_HASH\"' | sops --input-type yaml --output-type yaml -e --age age1gausnystsln7fpenw7arw7x79xe22z697jnauj38npy0usayqqxqc7td2y /dev/stdin" > secrets/netboot.sops.yaml
    
  3. Rebuild and deploy:

    sudo ./build-image.sh
    make deploy
    
  4. Reboot nodes to pick up new password
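Before encrypting a new value it can help to confirm that it really is a SHA-512 crypt hash (the format `openssl passwd -6` produces) and not, say, the plaintext password pasted by mistake. The sketch below is one way to do that; the function name `is_sha512_crypt` is hypothetical and not part of the repository.

```shell
# Return 0 if the argument looks like a SHA-512 crypt hash:
#   $6$[rounds=N$]<salt, 1-16 chars>$<86 hash chars>
is_sha512_crypt() {
    printf '%s\n' "$1" |
        grep -Eq '^\$6\$(rounds=[0-9]+\$)?[./0-9A-Za-z]{1,16}\$[./0-9A-Za-z]{86}$'
}

# Example, using NEW_HASH from the step above:
#   is_sha512_crypt "$NEW_HASH" || echo "refusing to encrypt: not a \$6\$ hash" >&2
```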

Adding New Secrets

Edit .sops.yaml to add new file patterns, then create encrypted files on phoenix:

ssh phoenix "sops secrets/newfile.sops.yaml"

Node Storage Setup

Local NVMe is automatically partitioned on first boot by setup-node-storage.service.

Partition Layout

| Partition | Size | Label | Mount Point | Purpose |
|-----------|------|-------|-------------|---------|
| nvme0n1p1 | 75GB | containerd | /var/lib/containerd | Container images |
| nvme0n1p2 | Remaining | longhorn | /var/lib/longhorn | Distributed storage |

Automatic Behavior

| Drive State | Action |
|-------------|--------|
| No partition table | Auto-format (no prompt) |
| Has our labels (containerd/longhorn) | Mount silently |
| Has unknown partitions | Prompt on tty1, 120s timeout, skip if no response |
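The decision table can be expressed as a small function. This is a sketch of the logic only, not the code in files/setup-node-storage. The function name `storage_action`, its arguments, and treating "no partition table" as "zero partitions" are all assumptions.

```shell
# Hypothetical decision helper mirroring the table above.
#   $1 = number of partitions found on the drive
#   $2 = filesystem labels found on it, space-separated on one line
# Prints one of: format | mount | prompt
storage_action() {
    part_count=$1
    labels=" $2 "
    if [ "$part_count" -eq 0 ]; then
        echo format                  # empty drive: auto-format, no prompt
        return
    fi
    case $labels in
        *" containerd "*) ;;         # first expected label present
        *) echo prompt; return ;;    # unknown layout: ask on tty1
    esac
    case $labels in
        *" longhorn "*) echo mount ;;  # both labels present: mount silently
        *) echo prompt ;;
    esac
}

# Example on a node (lsblk lists the disk itself first, hence tail):
#   storage_action "$(lsblk -nro NAME /dev/nvme0n1 | tail -n +2 | wc -l)" \
#                  "$(lsblk -nro LABEL /dev/nvme0n1 | xargs)"
```

Requiring both labels before mounting keeps a half-provisioned drive from being used silently.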

Manual Intervention

If a node has an unknown drive and you want to format it:

  1. Connect to physical console (tty1)
  2. Reboot the node
  3. Press ENTER when prompted (within 120 seconds)
  4. Wait 5 seconds (abort window)
  5. Drive is formatted and mounted

Checking Storage Status

# On node
journalctl -u setup-node-storage
cat /var/lib/containerd/.netboot-storage  # marker file with metadata
lsblk /dev/nvme0n1
df -h /var/lib/containerd /var/lib/longhorn
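For a quick yes/no answer instead of reading `df` output, the sketch below looks up both mount points in the kernel mount table. The function names `is_mounted` and `storage_status` are hypothetical; the optional file argument exists only so the logic can be exercised against a sample table.

```shell
# Return 0 if PATH ($1) appears as a mount point in the mounts table ($2,
# default /proc/mounts).
is_mounted() {
    awk -v p="$1" '$2 == p { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# Print one status line per persistent storage mount point.
storage_status() {
    mounts=${1:-/proc/mounts}
    for dir in /var/lib/containerd /var/lib/longhorn; do
        if is_mounted "$dir" "$mounts"; then
            echo "$dir: mounted"
        else
            echo "$dir: NOT mounted"
        fi
    done
}
```

On a node, calling `storage_status` with no arguments reads /proc/mounts directly.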

SSH Access

Authorized Keys

Keys are baked into the image at build time. Current keys:

| Key | Source |
|-----|--------|
| ssh-ed25519 AAAAC3...y1J | lindahl@lindahl-Legion-5-Pro-16ACH6H |
| ssh-ed25519 AAAA...0tX | lindahl@phoenix.home |

To add or remove keys, edit build-image.sh (around lines 164-167).

Console Access

Root password is set for physical console login only. SSH remains pubkey-only.

# Physical console or IPMI
login: root
Password: <from secrets/netboot.sops.yaml>

Troubleshooting

Node Won't Boot

  1. Check phoenix HTTP server:

    ssh phoenix "curl -I http://localhost:8800/boot.ipxe"
    ssh phoenix "ls -lh /srv/netboot/http/"
    
  2. Check nginx is running:

    ssh phoenix "systemctl status nginx"
    
  3. Verify image integrity:

    ./verify-image.sh
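The HTTP checks above can be rolled into one loop over the four files shown in the architecture diagram. This is a minimal sketch; the function name `check_boot_artifacts` is hypothetical, and it assumes `curl` is available on the machine running it.

```shell
# Send a HEAD request for each boot artifact and report the result.
# Returns non-zero if any file is missing or the server is unreachable.
check_boot_artifacts() {
    base=${1:-http://192.168.100.1:8800}
    rc=0
    for f in boot.ipxe vmlinuz initrd-netboot.img filesystem.squashfs; do
        # -f: treat HTTP errors as failure, -s: quiet, -I: HEAD request only
        if curl -fsI "$base/$f" >/dev/null; then
            echo "ok      $f"
        else
            echo "MISSING $f"
            rc=1
        fi
    done
    return $rc
}

# Example:
#   check_boot_artifacts http://192.168.100.1:8800
```

Running it from a machine on the node network also confirms that port 8800 is reachable from outside phoenix, which the localhost check cannot show.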
    

Node Boots But No Network

  1. Check if initramfs has network driver:

    lsinitramfs http/initrd-netboot.img | grep -E "r8169|r8125"
    
  2. Check kernel cmdline includes ip=dhcp:

    cat http/boot.ipxe
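Rather than reading the script by eye, the `ip=dhcp` check can be automated. The sketch below is illustrative; the function name `check_ip_dhcp` is hypothetical and it makes no assumption about the rest of boot.ipxe.

```shell
# Report whether an iPXE script passes ip=dhcp as a whole kernel argument.
check_ip_dhcp() {
    script=${1:-http/boot.ipxe}
    if grep -Eq '(^|[[:space:]])ip=dhcp([[:space:]]|$)' "$script"; then
        echo "ip=dhcp present"
    else
        echo "ip=dhcp missing"
        return 1
    fi
}
```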
    

Storage Not Mounting

  1. Check service status:

    ssh root@node "systemctl status setup-node-storage"
    ssh root@node "journalctl -u setup-node-storage"
    
  2. Check if NVMe exists:

    ssh root@node "lsblk"
    
  3. Check labels:

    ssh root@node "blkid -L containerd; blkid -L longhorn"
    

Overlay Filling Up

The root overlay is only 2GB. If it fills:

# Check what's using space
ssh root@node "du -sh /var/* | sort -h"

# Temporary files should go to NVMe or tmpfs mounts
# /tmp, /var/tmp, /var/log are separate tmpfs
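To notice a filling overlay before it becomes a problem, the output of `df -P` can be filtered for filesystems above a threshold. This is a sketch only; the function name `warn_if_full` and the default threshold of 80% are assumptions.

```shell
# Read `df -P` output on stdin and print a warning for every filesystem
# whose usage is at or above the given percentage (default 80).
warn_if_full() {
    awk -v limit="${1:-80}" 'NR > 1 {
        use = $5; sub(/%/, "", use)          # column 5 is "Capacity", e.g. 90%
        if (use + 0 >= limit + 0)
            printf "WARNING: %s at %s%%\n", $6, use
    }'
}

# Example:
#   ssh root@node "df -P /" | warn_if_full 80
```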

File Reference

| File | Purpose |
|------|---------|
| build-image.sh | Main build script |
| rebuild-initramfs.sh | Quick initramfs rebuild |
| verify-image.sh | Validate built image |
| Makefile | Build/deploy automation |
| initramfs/ | Custom initramfs config for mkinitramfs |
| initramfs/scripts/netboot | HTTP root download and overlay mount |
| files/setup-node-storage | NVMe partitioning script |
| files/setup-node-storage.service | Systemd unit for storage setup |
| secrets/netboot.sops.yaml | Encrypted root password |
| .sops.yaml | Sops encryption config |
| http/boot.ipxe | iPXE boot configuration |

Network Configuration

IP Address Layout

| Range | Purpose |
|-------|---------|
| .1 | phoenix (gateway, DHCP, HTTP) |
| .2-.19 | Reserved (future infrastructure) |
| .20-.29 | Infrastructure devices |
| .50-.59 | Static K3s nodes |
| .60-.100 | Dynamic DHCP pool |
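The layout can be encoded as a lookup, which is handy when deciding where an address seen in the lease file belongs. This sketch is illustrative: the function name `ip_purpose` is hypothetical, and addresses the table does not mention (.30-.49 and above .100) are reported as "unassigned".

```shell
# Map an address in 192.168.100.0/24 (or just its last octet) to its purpose.
ip_purpose() {
    octet=${1##*.}      # accepts "51" as well as "192.168.100.51"
    case $octet in
        ''|*[!0-9]*) echo invalid; return 1 ;;
    esac
    if   [ "$octet" -eq 1 ]; then echo "phoenix"
    elif [ "$octet" -ge 2 ]  && [ "$octet" -le 19 ];  then echo "reserved"
    elif [ "$octet" -ge 20 ] && [ "$octet" -le 29 ];  then echo "infrastructure"
    elif [ "$octet" -ge 50 ] && [ "$octet" -le 59 ];  then echo "static k3s node"
    elif [ "$octet" -ge 60 ] && [ "$octet" -le 100 ]; then echo "dhcp pool"
    else echo "unassigned"
    fi
}
```

A new node that shows up in "dhcp pool" has booted without a reservation and still needs a static assignment.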

Static Assignments

| Host | IP | MAC | Role |
|------|----|-----|------|
| phoenix | 192.168.100.1 | - | NAS, HTTP server, DHCP |
| usw-flex-2 | 192.168.100.21 | 94:2a:6f:4c:fc:72 | Managed switch |
| k3s-node-01 | 192.168.100.51 | 78:55:36:04:e7:c8 | K3s worker |
| k3s-node-02 | 192.168.100.52 | 78:55:36:04:e7:1d | K3s worker |

HTTP server: http://192.168.100.1:8800/

DHCP Reservations

Static IP assignments are configured in /etc/dnsmasq.d/pxe-netboot.conf on phoenix:

dhcp-range=192.168.100.60,192.168.100.100,12h

# Static DHCP reservations for K3s nodes
dhcp-host=78:55:36:04:e7:c8,192.168.100.51,k3s-node-01
dhcp-host=78:55:36:04:e7:1d,192.168.100.52,k3s-node-02

# Infrastructure
dhcp-host=94:2a:6f:4c:fc:72,192.168.100.21,usw-flex-2

To add a new node:

  1. Boot the node once to get its MAC (check leases):

    ssh phoenix "cat /var/lib/misc/dnsmasq.leases"
    
  2. Add reservation:

    ssh phoenix "echo 'dhcp-host=XX:XX:XX:XX:XX:XX,192.168.100.5X,k3s-node-0X' | sudo tee -a /etc/dnsmasq.d/pxe-netboot.conf"
    
  3. Restart dnsmasq:

    ssh phoenix "sudo systemctl restart dnsmasq"
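A typo in the MAC or IP of a reservation is easy to make and hard to spot later. The sketch below builds the `dhcp-host` line after basic validation. The function name `make_reservation` is hypothetical; the accepted octet range follows the static K3s node range (.50-.59) from the address layout.

```shell
# Build a dnsmasq dhcp-host line for a new K3s node.
#   $1 = MAC address, $2 = last IP octet (50-59), $3 = hostname
make_reservation() {
    mac=$(printf '%s' "$1" | tr 'A-F' 'a-f')     # normalise to lower case
    if ! printf '%s\n' "$mac" | grep -Eq '^([0-9a-f]{2}:){5}[0-9a-f]{2}$'; then
        echo "invalid MAC: $1" >&2
        return 1
    fi
    case $2 in
        5[0-9]) ;;
        *) echo "octet $2 is outside the static node range .50-.59" >&2
           return 1 ;;
    esac
    printf 'dhcp-host=%s,192.168.100.%s,%s\n' "$mac" "$2" "$3"
}

# Example (appends the validated line on phoenix):
#   make_reservation 78:55:36:04:E7:C8 51 k3s-node-01 |
#       ssh phoenix "sudo tee -a /etc/dnsmasq.d/pxe-netboot.conf"
```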
    

To change the boot server IP, edit http/boot.ipxe and initramfs/scripts/netboot.