# Netboot Operations Guide

This document covers day-to-day operations for the netboot K3s cluster system.

## Quick Reference

```bash
# Build new image (15-30 min, requires sudo)
cd /home/lindahl/git/netboot
sudo ./build-image.sh
make deploy

# Rebuild initramfs only (faster, ~2 min)
sudo ./rebuild-initramfs.sh
make deploy

# SSH to a node
ssh root@192.168.100.51

# Check node storage
ssh root@192.168.100.51 "lsblk && df -h /var/lib/containerd /var/lib/longhorn"
```

## Architecture Overview

```
┌─────────────────┐     HTTP (8800)      ┌──────────────────┐
│   Phoenix NAS   │◄────────────────────►│    K3s Nodes     │
│  192.168.100.1  │                      │  192.168.100.5x  │
├─────────────────┤                      ├──────────────────┤
│ /srv/netboot/   │                      │ RAM (overlay)    │
│   http/         │                      │  └─ / (root)     │
│     vmlinuz     │                      │ NVMe (persistent)│
│     initrd-netboot.img                 │  ├─ containerd   │
│     filesystem.squashfs                │  └─ longhorn     │
│     boot.ipxe   │                      └──────────────────┘
└─────────────────┘
```

**Boot sequence:**

1. Node PXE boots → loads iPXE
2. iPXE fetches `boot.ipxe` from phoenix
3. Downloads kernel + initramfs
4. Initramfs downloads squashfs root over HTTP
5. Mounts squashfs read-only with tmpfs overlay
6. `setup-node-storage.service` partitions/mounts local NVMe
7. System starts, K3s joins cluster

## Building Images

### Full Build

Builds everything from scratch: debootstrap, packages, initramfs, squashfs.

```bash
cd /home/lindahl/git/netboot
sudo ./build-image.sh
make deploy
```

**Time:** 15-30 minutes

**When to use:** Package changes, kernel updates, major configuration changes

### Initramfs-Only Rebuild

Faster rebuild when only changing boot/network logic.

```bash
sudo ./rebuild-initramfs.sh
make deploy
```

**Time:** ~2 minutes

**When to use:** Changes to `initramfs/` scripts or hooks

### Verify Build

Check that all components are present and valid:

```bash
./verify-image.sh
```

## Secret Management

Secrets are encrypted with [sops](https://github.com/getsops/sops) using age encryption. The age key lives on phoenix.
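Because the age key lives only on phoenix, it can be useful to confirm locally that a file under `secrets/` really is encrypted before committing it. The following is a minimal sketch, not part of the repository: the function name `is_sops_encrypted` is hypothetical, and the check assumes the usual shape of sops-encrypted YAML (a top-level `sops:` metadata key and values stored as `ENC[...]` blobs).

```bash
# Sketch: succeed only if a YAML file looks sops-encrypted.
# Assumption: sops-encrypted YAML has a top-level "sops:" metadata key
# and stores values as ENC[...] blobs. The function name is hypothetical.
is_sops_encrypted() {
    file="$1"

    # Return 2 when the file does not exist, to distinguish it from "plaintext".
    [ -f "$file" ] || return 2

    # Both markers must be present; grep -q exits 1 when nothing matches.
    grep -q '^sops:' "$file" && grep -q 'ENC\[' "$file"
}
```

For example, `is_sops_encrypted secrets/netboot.sops.yaml || echo "not encrypted"` would flag a plaintext file. This is a heuristic only; it does not verify that the file can actually be decrypted with the key on phoenix.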
### Encrypted Files

| File | Contents |
|------|----------|
| `secrets/netboot.sops.yaml` | Root password hash for console login |

### Viewing Secrets

```bash
# From any machine with SSH access to phoenix
cat secrets/netboot.sops.yaml | ssh phoenix "sops -d --input-type yaml --output-type yaml /dev/stdin"
```

### Updating Root Password

1. Generate new password hash:

   ```bash
   ssh phoenix "echo 'newpassword' | openssl passwd -6 -stdin"
   ```

2. Update the encrypted file:

   ```bash
   ssh phoenix "cd /path/to/netboot && sops secrets/netboot.sops.yaml"
   # Edit root_password_hash value, save
   ```

   Or recreate entirely:

   ```bash
   NEW_HASH=$(ssh phoenix "echo 'newpassword' | openssl passwd -6 -stdin")
   ssh phoenix "echo 'root_password_hash: \"$NEW_HASH\"' | sops --input-type yaml --output-type yaml -e --age age1gausnystsln7fpenw7arw7x79xe22z697jnauj38npy0usayqqxqc7td2y /dev/stdin" > secrets/netboot.sops.yaml
   ```

3. Rebuild and deploy:

   ```bash
   sudo ./build-image.sh
   make deploy
   ```

4. Reboot nodes to pick up new password

### Adding New Secrets

Edit `.sops.yaml` to add new file patterns, then create encrypted files on phoenix:

```bash
ssh phoenix "sops secrets/newfile.sops.yaml"
```

## Node Storage Setup

Local NVMe is automatically partitioned on first boot by `setup-node-storage.service`.

### Partition Layout

| Partition | Size | Label | Mount Point | Purpose |
|-----------|------|-------|-------------|---------|
| nvme0n1p1 | 75GB | containerd | /var/lib/containerd | Container images |
| nvme0n1p2 | Remaining | longhorn | /var/lib/longhorn | Distributed storage |

### Automatic Behavior

| Drive State | Action |
|-------------|--------|
| No partition table | Auto-format (no prompt) |
| Has our labels (containerd/longhorn) | Mount silently |
| Has unknown partitions | Prompt on tty1, 120s timeout, skip if no response |

### Manual Intervention

If a node has an unknown drive and you want to format it:

1. Connect to physical console (tty1)
2. Reboot the node
3. Press ENTER when prompted (within 120 seconds)
4. Wait 5 seconds (abort window)
5. Drive is formatted and mounted

### Checking Storage Status

```bash
# On node
journalctl -u setup-node-storage
cat /var/lib/containerd/.netboot-storage  # marker file with metadata
lsblk /dev/nvme0n1
df -h /var/lib/containerd /var/lib/longhorn
```

## SSH Access

### Authorized Keys

Keys are baked into the image at build time. Current keys:

| Key | Source |
|-----|--------|
| `ssh-ed25519 AAAAC3...y1J` | lindahl@lindahl-Legion-5-Pro-16ACH6H |
| `ssh-ed25519 AAAA...0tX` | lindahl@phoenix.home |

To add/remove keys, edit `build-image.sh` around lines 164-167.

### Console Access

Root password is set for physical console login only. SSH remains pubkey-only.

```bash
# Physical console or IPMI
login: root
Password:
```

## Troubleshooting

### Node Won't Boot

1. Check phoenix HTTP server:

   ```bash
   ssh phoenix "curl -I http://localhost:8800/boot.ipxe"
   ssh phoenix "ls -lh /srv/netboot/http/"
   ```

2. Check nginx is running:

   ```bash
   ssh phoenix "systemctl status nginx"
   ```

3. Verify image integrity:

   ```bash
   ./verify-image.sh
   ```

### Node Boots But No Network

1. Check if initramfs has network driver:

   ```bash
   lsinitramfs http/initrd-netboot.img | grep -E "r8169|r8125"
   ```

2. Check kernel cmdline includes `ip=dhcp`:

   ```bash
   cat http/boot.ipxe
   ```

### Storage Not Mounting

1. Check service status:

   ```bash
   ssh root@node "systemctl status setup-node-storage"
   ssh root@node "journalctl -u setup-node-storage"
   ```

2. Check if NVMe exists:

   ```bash
   ssh root@node "lsblk"
   ```

3. Check labels:

   ```bash
   ssh root@node "blkid -L containerd && blkid -L longhorn"
   ```

### Overlay Filling Up

The root overlay is only 2GB.
If it fills:

```bash
# Check what's using space
ssh root@node "du -sh /var/* | sort -h"

# Temporary files should go to NVMe or tmpfs mounts
# /tmp, /var/tmp, /var/log are separate tmpfs
```

## File Reference

| File | Purpose |
|------|---------|
| `build-image.sh` | Main build script |
| `rebuild-initramfs.sh` | Quick initramfs rebuild |
| `verify-image.sh` | Validate built image |
| `Makefile` | Build/deploy automation |
| `initramfs/` | Custom initramfs config for mkinitramfs |
| `initramfs/scripts/netboot` | HTTP root download and overlay mount |
| `files/setup-node-storage` | NVMe partitioning script |
| `files/setup-node-storage.service` | Systemd unit for storage setup |
| `secrets/netboot.sops.yaml` | Encrypted root password |
| `.sops.yaml` | Sops encryption config |
| `http/boot.ipxe` | iPXE boot configuration |

## Network Configuration

### IP Address Layout

| Range | Purpose |
|-------|---------|
| .1 | phoenix (gateway, DHCP, HTTP) |
| .2-.19 | Reserved (future infrastructure) |
| .20-.29 | Infrastructure devices |
| .50-.59 | Static K3s nodes |
| .60-.100 | Dynamic DHCP pool |

### Static Assignments

| Host | IP | MAC | Role |
|------|-----|-----|------|
| phoenix | 192.168.100.1 | - | NAS, HTTP server, DHCP |
| usw-flex-2 | 192.168.100.21 | 94:2a:6f:4c:fc:72 | Managed switch |
| k3s-node-01 | 192.168.100.51 | 78:55:36:04:e7:c8 | K3s worker |
| k3s-node-02 | 192.168.100.52 | 78:55:36:04:e7:1d | K3s worker |

HTTP server: `http://192.168.100.1:8800/`

### DHCP Reservations

Static IP assignments are configured in `/etc/dnsmasq.d/pxe-netboot.conf` on phoenix:

```
dhcp-range=192.168.100.60,192.168.100.100,12h

# Static DHCP reservations for K3s nodes
dhcp-host=78:55:36:04:e7:c8,192.168.100.51,k3s-node-01
dhcp-host=78:55:36:04:e7:1d,192.168.100.52,k3s-node-02

# Infrastructure
dhcp-host=94:2a:6f:4c:fc:72,192.168.100.21,usw-flex-2
```

To add a new node:

1. Boot the node once to get its MAC (check leases):

   ```bash
   ssh phoenix "cat /var/lib/misc/dnsmasq.leases"
   ```

2. Add reservation:

   ```bash
   ssh phoenix "sudo tee -a /etc/dnsmasq.d/pxe-netboot.conf << EOF
   dhcp-host=XX:XX:XX:XX:XX:XX,192.168.100.5X,k3s-node-0X
   EOF"
   ```

3. Restart dnsmasq:

   ```bash
   ssh phoenix "sudo systemctl restart dnsmasq"
   ```

To change the boot server IP, edit `http/boot.ipxe` and `initramfs/scripts/netboot`.
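The boot server IP change can be scripted rather than edited by hand. The sketch below is an illustration, not a script from this repository: the function name `change_boot_server_ip` is hypothetical, and it assumes GNU sed (for `-i` and the `\b` word-boundary extension) and that both files contain the address as a literal string.

```bash
# Sketch: replace one IP address with another in the given files.
# Assumptions: GNU sed (-i and \b are GNU extensions); the function name
# is hypothetical and not part of the repository.
change_boot_server_ip() {
    old_ip="$1"
    new_ip="$2"
    shift 2

    # Escape dots so sed matches them literally rather than as "any character".
    old_re=$(printf '%s' "$old_ip" | sed 's/\./\\./g')

    for f in "$@"; do
        if [ ! -f "$f" ]; then
            echo "missing file: $f" >&2
            return 1
        fi
        # \b keeps 192.168.100.1 from also matching inside 192.168.100.10.
        sed -i "s/\\b${old_re}\\b/${new_ip}/g" "$f"
    done
}
```

Run from the repository root, a call might look like `change_boot_server_ip 192.168.100.1 192.168.100.2 http/boot.ipxe initramfs/scripts/netboot`. Reviewing the result with `git diff` before rebuilding is advisable, since `initramfs/scripts/netboot` is baked into the initramfs and only takes effect after `sudo ./rebuild-initramfs.sh` and `make deploy`.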