Netboot Operations Guide
This document covers day-to-day operations for the netboot K3s cluster system.
Quick Reference
# Build new image (15-30 min, requires sudo)
cd /home/lindahl/git/netboot
sudo ./build-image.sh
make deploy
# Rebuild initramfs only (faster, ~2 min)
sudo ./rebuild-initramfs.sh
make deploy
# SSH to a node
ssh root@192.168.100.51
# Check node storage
ssh root@192.168.100.51 "lsblk && df -h /var/lib/containerd /var/lib/longhorn"
Architecture Overview
┌─────────────────────────┐      HTTP (8800)     ┌───────────────────┐
│ Phoenix NAS             │◄────────────────────►│ K3s Nodes         │
│ 192.168.100.1           │                      │ 192.168.100.5x    │
├─────────────────────────┤                      ├───────────────────┤
│ /srv/netboot/           │                      │ RAM (overlay)     │
│   http/                 │                      │   └─ / (root)     │
│     vmlinuz             │                      │ NVMe (persistent) │
│     initrd-netboot.img  │                      │   ├─ containerd   │
│     filesystem.squashfs │                      │   └─ longhorn     │
│     boot.ipxe           │                      └───────────────────┘
└─────────────────────────┘
Boot sequence:
- Node PXE boots → loads iPXE
- iPXE fetches boot.ipxe from phoenix
- Downloads kernel + initramfs
- Initramfs downloads squashfs root over HTTP
- Mounts squashfs read-only with tmpfs overlay
- setup-node-storage.service partitions/mounts local NVMe
- System starts, K3s joins cluster
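The squashfs-plus-overlay step of the boot sequence can be sketched as a dry run. This is an illustration only, not the contents of initramfs/scripts/netboot: the function name print_overlay_mounts and every path under /run/netboot are assumptions, and the 2G tmpfs size simply mirrors the overlay size noted under Troubleshooting. The function only prints the commands, so it is safe to run anywhere.

```shell
#!/bin/sh
# Dry-run sketch: print the mount commands for a read-only squashfs root
# with a writable tmpfs overlay on top. All paths are hypothetical.
print_overlay_mounts() (
    squashfs="${1:-/run/netboot/filesystem.squashfs}"  # downloaded image (assumed path)
    target="${2:-/root}"                               # merged root mount point (assumed)
    lower=/run/netboot/lower                           # read-only squashfs mount (assumed)
    rw=/run/netboot/rw                                 # tmpfs holding upper/work dirs (assumed)

    echo "mkdir -p $lower $rw"
    echo "mount -t squashfs -o ro,loop $squashfs $lower"
    echo "mount -t tmpfs -o size=2G tmpfs $rw"
    echo "mkdir -p $rw/upper $rw/work"
    echo "mount -t overlay -o lowerdir=$lower,upperdir=$rw/upper,workdir=$rw/work overlay $target"
)
```

Calling print_overlay_mounts with no arguments prints the five commands using the default paths. Writes to / land in the tmpfs upper directory, which is why the root filesystem lives in RAM and is lost on reboot.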
Building Images
Full Build
Builds everything from scratch: debootstrap, packages, initramfs, squashfs.
cd /home/lindahl/git/netboot
sudo ./build-image.sh
make deploy
Time: 15-30 minutes
When to use: Package changes, kernel updates, major configuration changes
Initramfs-Only Rebuild
Faster rebuild when only changing boot/network logic.
sudo ./rebuild-initramfs.sh
make deploy
Time: ~2 minutes
When to use: Changes to initramfs/ scripts or hooks
Verify Build
Check that all components are present and valid:
./verify-image.sh
Secret Management
Secrets are encrypted with sops using age encryption. The age key lives on phoenix.
Encrypted Files
| File | Contents |
|---|---|
| secrets/netboot.sops.yaml | Root password hash for console login |
Viewing Secrets
# From any machine with SSH access to phoenix
cat secrets/netboot.sops.yaml | ssh phoenix "sops -d --input-type yaml --output-type yaml /dev/stdin"
Updating Root Password
- Generate new password hash:
    ssh phoenix "echo 'newpassword' | openssl passwd -6 -stdin"
- Update the encrypted file:
    ssh phoenix "cd /path/to/netboot && sops secrets/netboot.sops.yaml"
    # Edit root_password_hash value, save
  Or recreate entirely:
    NEW_HASH=$(ssh phoenix "echo 'newpassword' | openssl passwd -6 -stdin")
    ssh phoenix "echo 'root_password_hash: \"$NEW_HASH\"' | sops --input-type yaml --output-type yaml -e --age age1gausnystsln7fpenw7arw7x79xe22z697jnauj38npy0usayqqxqc7td2y /dev/stdin" > secrets/netboot.sops.yaml
- Rebuild and deploy:
    sudo ./build-image.sh
    make deploy
- Reboot nodes to pick up new password
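Before writing a new hash into the encrypted file, it can help to confirm that the value really looks like SHA-512 crypt output, which is what openssl passwd -6 produces ($6$, a salt, then an 86-character hash). The helper below is a sketch under that assumption. The name is_sha512_crypt is hypothetical and it is not part of the build scripts.

```shell
#!/bin/sh
# Sketch: succeed only if the argument looks like a SHA-512 crypt hash,
# i.e. $6$<salt of 1-16 chars>$<86 chars>, as printed by `openssl passwd -6`.
# Hashes carrying an explicit "rounds=" field are deliberately not accepted.
is_sha512_crypt() {
    printf '%s\n' "$1" | grep -Eq '^\$6\$[./0-9A-Za-z]{1,16}\$[./0-9A-Za-z]{86}$'
}
```

For example, is_sha512_crypt "$NEW_HASH" || echo "unexpected hash format" >&2 would catch an empty or truncated value before it is encrypted and baked into an image.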
Adding New Secrets
Edit .sops.yaml to add new file patterns, then create encrypted files on phoenix:
ssh phoenix "sops secrets/newfile.sops.yaml"
Node Storage Setup
Local NVMe is automatically partitioned on first boot by setup-node-storage.service.
Partition Layout
| Partition | Size | Label | Mount Point | Purpose |
|---|---|---|---|---|
| nvme0n1p1 | 75GB | containerd | /var/lib/containerd | Container images |
| nvme0n1p2 | Remaining | longhorn | /var/lib/longhorn | Distributed storage |
Automatic Behavior
| Drive State | Action |
|---|---|
| No partition table | Auto-format (no prompt) |
| Has our labels (containerd/longhorn) | Mount silently |
| Has unknown partitions | Prompt on tty1, 120s timeout, skip if no response |
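The decision table above can be sketched as a small pure function. This is not the logic in files/setup-node-storage. The function name, the argument convention (partition count followed by labels), and the rule that both labels must be present before mounting silently are all assumptions made for illustration.

```shell
#!/bin/sh
# Sketch of the drive-state decision table.
# usage: classify_drive <partition-count> [label ...]
# Prints one of: auto-format | mount | prompt
classify_drive() (
    count="$1"      # number of partitions found on the drive
    shift           # remaining arguments are the labels that were found
    has_containerd=no
    has_longhorn=no
    for label in "$@"; do
        case "$label" in
            containerd) has_containerd=yes ;;
            longhorn)   has_longhorn=yes ;;
        esac
    done

    if [ "$count" -eq 0 ]; then
        echo "auto-format"      # no partition table: format without asking
    elif [ "$has_containerd" = yes ] && [ "$has_longhorn" = yes ]; then
        echo "mount"            # our own layout: mount silently (assumes both labels are required)
    else
        echo "prompt"           # unknown contents: ask on tty1, skip on timeout
    fi
)
```

For instance, classify_drive 2 containerd longhorn prints mount, while classify_drive 1 data prints prompt. Keeping the decision separate from the destructive formatting step makes it easy to test without touching a disk.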
Manual Intervention
If a node has an unknown drive and you want to format it:
- Connect to physical console (tty1)
- Reboot the node
- Press ENTER when prompted (within 120 seconds)
- Wait 5 seconds (abort window)
- Drive is formatted and mounted
Checking Storage Status
# On node
journalctl -u setup-node-storage
cat /var/lib/containerd/.netboot-storage # marker file with metadata
lsblk /dev/nvme0n1
df -h /var/lib/containerd /var/lib/longhorn
SSH Access
Authorized Keys
Keys are baked into the image at build time. Current keys:
| Key | Source |
|---|---|
| ssh-ed25519 AAAAC3...y1J | lindahl@lindahl-Legion-5-Pro-16ACH6H |
| ssh-ed25519 AAAA...0tX | lindahl@phoenix.home |
To add or remove keys, edit build-image.sh around lines 164-167.
Console Access
Root password is set for physical console login only. SSH remains pubkey-only.
# Physical console or IPMI
login: root
Password: <from secrets/netboot.sops.yaml>
Troubleshooting
Node Won't Boot
- Check phoenix HTTP server:
    ssh phoenix "curl -I http://localhost:8800/boot.ipxe"
    ssh phoenix "ls -lh /srv/netboot/http/"
- Check nginx is running:
    ssh phoenix "systemctl status nginx"
- Verify image integrity:
    ./verify-image.sh
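As a quick complement to the checks above, the following sketch confirms that the four files shown in the architecture diagram exist and are non-empty in the HTTP root. It is not a replacement for verify-image.sh, whose contents are not shown here. The function name check_http_root is hypothetical.

```shell
#!/bin/sh
# Sketch: report which boot artifacts are present and non-empty.
# usage: check_http_root [dir]   (defaults to /srv/netboot/http)
check_http_root() (
    dir="${1:-/srv/netboot/http}"
    status=0
    for f in vmlinuz initrd-netboot.img filesystem.squashfs boot.ipxe; do
        if [ -s "$dir/$f" ]; then
            echo "ok       $f"
        else
            echo "MISSING  $f"
            status=1
        fi
    done
    # The function body is a subshell, so this only sets the function's
    # return status; it does not exit the calling shell.
    exit "$status"
)
```

The function returns non-zero if any file is missing or empty, so it can be chained as check_http_root && echo "HTTP root looks complete" once the definition is available on phoenix.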
Node Boots But No Network
- Check if initramfs has network driver:
    lsinitramfs http/initrd-netboot.img | grep -E "r8169|r8125"
- Check kernel cmdline includes ip=dhcp:
    cat http/boot.ipxe
Storage Not Mounting
- Check service status:
    ssh root@node "systemctl status setup-node-storage"
    ssh root@node "journalctl -u setup-node-storage"
- Check if NVMe exists:
    ssh root@node "lsblk"
- Check labels:
    ssh root@node "blkid -L containerd && blkid -L longhorn"
Overlay Filling Up
The root overlay is only 2GB. If it fills:
# Check what's using space
ssh root@node "du -sh /var/* | sort -h"
# Temporary files should go to NVMe or tmpfs mounts
# /tmp, /var/tmp, /var/log are separate tmpfs
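To spot a filling overlay before it causes failures, the usage percentage can be pulled out of df -P output. The sketch below assumes the standard POSIX df -P column layout (capacity in the fifth column). The function names and the 90% default threshold are illustrative choices, not values taken from the image.

```shell
#!/bin/sh
# Sketch: extract the use% (as a bare number) from `df -P` output on stdin.
use_percent() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed when the filesystem holding <path> is at or above <threshold> percent.
# usage: nearly_full [path] [threshold]   (defaults: / and 90, both assumptions)
nearly_full() (
    used=$(df -P "${1:-/}" | use_percent)
    [ "$used" -ge "${2:-90}" ]
)
```

From a workstation, ssh root@192.168.100.51 'df -P /' | use_percent prints the root overlay usage of the first node as a number that can be compared or logged.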
File Reference
| File | Purpose |
|---|---|
| build-image.sh | Main build script |
| rebuild-initramfs.sh | Quick initramfs rebuild |
| verify-image.sh | Validate built image |
| Makefile | Build/deploy automation |
| initramfs/ | Custom initramfs config for mkinitramfs |
| initramfs/scripts/netboot | HTTP root download and overlay mount |
| files/setup-node-storage | NVMe partitioning script |
| files/setup-node-storage.service | Systemd unit for storage setup |
| secrets/netboot.sops.yaml | Encrypted root password |
| .sops.yaml | Sops encryption config |
| http/boot.ipxe | iPXE boot configuration |
Network Configuration
IP Address Layout
| Range | Purpose |
|---|---|
| .1 | phoenix (gateway, DHCP, HTTP) |
| .2-.19 | Reserved (future infrastructure) |
| .20-.29 | Infrastructure devices |
| .50-.59 | Static K3s nodes |
| .60-.100 | Dynamic DHCP pool |
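The address plan above maps directly onto a lookup by last octet. The helper below is a sketch for sanity-checking an address against the table. The function name ip_role and the short role names it prints are made up for illustration, and anything outside the listed ranges is reported as unassigned.

```shell
#!/bin/sh
# Sketch: classify an address (or bare last octet) in 192.168.100.0/24
# according to the IP layout table.
ip_role() (
    octet="${1##*.}"    # accept "192.168.100.51" or just "51"
    case "$octet" in
        ''|*[!0-9]*) echo "invalid"; exit 1 ;;
    esac
    if   [ "$octet" -eq 1 ]; then echo "phoenix"
    elif [ "$octet" -ge 2 ]  && [ "$octet" -le 19 ];  then echo "reserved"
    elif [ "$octet" -ge 20 ] && [ "$octet" -le 29 ];  then echo "infrastructure"
    elif [ "$octet" -ge 50 ] && [ "$octet" -le 59 ];  then echo "k3s-static"
    elif [ "$octet" -ge 60 ] && [ "$octet" -le 100 ]; then echo "dhcp-pool"
    else echo "unassigned"
    fi
)
```

For example, ip_role 192.168.100.51 prints k3s-static, and ip_role 192.168.100.40 prints unassigned because .30-.49 does not appear in the table.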
Static Assignments
| Host | IP | MAC | Role |
|---|---|---|---|
| phoenix | 192.168.100.1 | - | NAS, HTTP server, DHCP |
| usw-flex-2 | 192.168.100.21 | 94:2a:6f:4c:fc:72 | Managed switch |
| k3s-node-01 | 192.168.100.51 | 78:55:36:04:e7:c8 | K3s worker |
| k3s-node-02 | 192.168.100.52 | 78:55:36:04:e7:1d | K3s worker |
HTTP server: http://192.168.100.1:8800/
DHCP Reservations
Static IP assignments are configured in /etc/dnsmasq.d/pxe-netboot.conf on phoenix:
dhcp-range=192.168.100.60,192.168.100.100,12h
# Static DHCP reservations for K3s nodes
dhcp-host=78:55:36:04:e7:c8,192.168.100.51,k3s-node-01
dhcp-host=78:55:36:04:e7:1d,192.168.100.52,k3s-node-02
# Infrastructure
dhcp-host=94:2a:6f:4c:fc:72,192.168.100.21,usw-flex-2
To add a new node:
- Boot the node once to get its MAC (check leases):
    ssh phoenix "cat /var/lib/misc/dnsmasq.leases"
- Add reservation:
    ssh phoenix "echo 'dhcp-host=XX:XX:XX:XX:XX:XX,192.168.100.5X,k3s-node-0X' | sudo tee -a /etc/dnsmasq.d/pxe-netboot.conf"
- Restart dnsmasq:
    ssh phoenix "sudo systemctl restart dnsmasq"
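A typo in a reservation line is easy to make and only shows up when the node gets the wrong address. The sketch below builds the dhcp-host line and rejects malformed MACs or addresses outside the static K3s range (.50-.59). The function name dhcp_host_line is hypothetical, and lower-casing the MAC is only done to match the existing entries.

```shell
#!/bin/sh
# Sketch: validate inputs and print a dnsmasq dhcp-host reservation line.
# usage: dhcp_host_line <mac> <ip> <hostname>
dhcp_host_line() (
    mac=$(printf '%s' "$1" | tr 'A-F' 'a-f')    # normalise to match existing entries
    ip="$2"
    name="$3"
    if ! printf '%s\n' "$mac" | grep -Eq '^([0-9a-f]{2}:){5}[0-9a-f]{2}$'; then
        echo "invalid MAC: $1" >&2
        exit 1      # subshell body: this only returns from the function
    fi
    case "$ip" in
        192.168.100.5[0-9]) ;;                  # static K3s node range
        *) echo "IP outside .50-.59: $ip" >&2; exit 1 ;;
    esac
    echo "dhcp-host=$mac,$ip,$name"
)
```

The printed line can then be appended on phoenix with sudo tee -a /etc/dnsmasq.d/pxe-netboot.conf, followed by the dnsmasq restart.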
To change the boot server IP, edit http/boot.ipxe and initramfs/scripts/netboot.