# Netboot Operations Guide

This document covers day-to-day operations for the netboot K3s cluster system.

## Quick Reference

```bash
# Build new image (15-30 min, requires sudo)
cd /home/lindahl/git/netboot
sudo ./build-image.sh
make deploy

# Rebuild initramfs only (faster, ~2 min)
sudo ./rebuild-initramfs.sh
make deploy

# SSH to a node
ssh root@192.168.100.51

# Check node storage
ssh root@192.168.100.51 "lsblk && df -h /var/lib/containerd /var/lib/longhorn"
```

## Architecture Overview

```
┌───────────────────────────┐     HTTP (8800)      ┌─────────────────────┐
│  Phoenix NAS              │◄────────────────────►│  K3s Nodes          │
│  192.168.100.1            │                      │  192.168.100.5x     │
├───────────────────────────┤                      ├─────────────────────┤
│  /srv/netboot/            │                      │  RAM (overlay)      │
│    http/                  │                      │   └─ / (root)       │
│      vmlinuz              │                      │  NVMe (persistent)  │
│      initrd-netboot.img   │                      │   ├─ containerd     │
│      filesystem.squashfs  │                      │   └─ longhorn       │
│      boot.ipxe            │                      └─────────────────────┘
└───────────────────────────┘
```

**Boot sequence:**
1. Node PXE boots → loads iPXE
2. iPXE fetches `boot.ipxe` from phoenix
3. Downloads kernel + initramfs
4. Initramfs downloads squashfs root over HTTP
5. Mounts squashfs read-only with tmpfs overlay
6. `setup-node-storage.service` partitions/mounts local NVMe
7. System starts, K3s joins cluster

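Steps 4 and 5 of the boot sequence can be sketched as a dry run that prints the commands instead of running them. This is not the contents of `initramfs/scripts/netboot`: the `/run/netboot` paths, the use of `wget`, and the function name `overlay_root_plan` are assumptions for illustration. The 2G tmpfs size mirrors the overlay size mentioned under Troubleshooting.

```bash
# Hypothetical sketch: print (do not run) the commands an initramfs script
# could use to build an overlay root. All paths under /run/netboot are assumed.
overlay_root_plan() {
    url="$1"      # squashfs URL, e.g. http://192.168.100.1:8800/filesystem.squashfs
    target="$2"   # directory where the assembled root gets mounted
    base=/run/netboot
    cat <<EOF
mkdir -p $base/ro $base/rw
mount -t tmpfs -o size=2G tmpfs $base/rw
mkdir -p $base/rw/upper $base/rw/work
wget -O $base/filesystem.squashfs $url
mount -t squashfs -o loop,ro $base/filesystem.squashfs $base/ro
mount -t overlay overlay -o lowerdir=$base/ro,upperdir=$base/rw/upper,workdir=$base/rw/work $target
EOF
}
```

Calling `overlay_root_plan http://192.168.100.1:8800/filesystem.squashfs /root` prints six commands. The read-only squashfs becomes the lower layer and every write lands in the tmpfs upper layer, which is why changes on a node do not survive a reboot.
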
## Building Images

### Full Build

Builds everything from scratch: debootstrap, packages, initramfs, squashfs.

```bash
cd /home/lindahl/git/netboot
sudo ./build-image.sh
make deploy
```

**Time:** 15-30 minutes
**When to use:** Package changes, kernel updates, major configuration changes

### Initramfs-Only Rebuild

Faster rebuild when only changing boot/network logic.

```bash
sudo ./rebuild-initramfs.sh
make deploy
```

**Time:** ~2 minutes
**When to use:** Changes to `initramfs/` scripts or hooks

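The "when to use" guidance can be turned into a small decision helper. This is a sketch only: `rebuild_kind` is a hypothetical function, and the rule that Markdown files never need a rebuild is an assumption, not something the build scripts enforce.

```bash
# Hypothetical helper: read changed paths on stdin (one per line) and print
# which rebuild the guidance above suggests: none, initramfs, or full.
rebuild_kind() {
    kind=none
    while IFS= read -r path; do
        case "$path" in
            initramfs/*)
                # Only upgrade from "none"; never downgrade a "full" verdict.
                [ "$kind" = none ] && kind=initramfs ;;
            *.md)
                ;;  # assumption: documentation changes need no rebuild
            *)
                kind=full ;;
        esac
    done
    echo "$kind"
}
```

One way to use it is `git diff --name-only HEAD~1 | rebuild_kind`, then run `rebuild-initramfs.sh` or `build-image.sh` according to the answer.
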
### Verify Build

Check that all components are present and valid:

```bash
./verify-image.sh
```

## Secret Management

Secrets are encrypted with [sops](https://github.com/getsops/sops) using age encryption. The age private key lives on phoenix, so decryption and editing run there over SSH.

### Encrypted Files

| File | Contents |
|------|----------|
| `secrets/netboot.sops.yaml` | Root password hash for console login |

### Viewing Secrets

```bash
# From any machine with SSH access to phoenix
cat secrets/netboot.sops.yaml | ssh phoenix "sops -d --input-type yaml --output-type yaml /dev/stdin"
```

### Updating Root Password

1. Generate new password hash:
```bash
ssh phoenix "echo 'newpassword' | openssl passwd -6 -stdin"
```

2. Update the encrypted file:
```bash
ssh phoenix "cd /path/to/netboot && sops secrets/netboot.sops.yaml"
# Edit root_password_hash value, save
```

Or recreate entirely:
```bash
NEW_HASH=$(ssh phoenix "echo 'newpassword' | openssl passwd -6 -stdin")
ssh phoenix "echo 'root_password_hash: \"$NEW_HASH\"' | sops --input-type yaml --output-type yaml -e --age age1gausnystsln7fpenw7arw7x79xe22z697jnauj38npy0usayqqxqc7td2y /dev/stdin" > secrets/netboot.sops.yaml
```

3. Rebuild and deploy:
```bash
sudo ./build-image.sh
make deploy
```

4. Reboot nodes to pick up new password

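Before encrypting a new value, it can help to confirm that what `openssl passwd -6` returned really is a SHA-512 crypt hash and not an error message or the plain password. The check below is a sketch; `is_sha512_crypt` is a hypothetical name and is not part of the build scripts.

```bash
# Return 0 if the argument looks like a SHA-512 crypt hash:
#   $6$[rounds=N$]<salt of 1-16 chars>$<86-char digest>
is_sha512_crypt() {
    printf '%s\n' "$1" |
        grep -Eq '^\$6\$(rounds=[0-9]+\$)?[./A-Za-z0-9]{1,16}\$[./A-Za-z0-9]{86}$'
}
```

For example, `is_sha512_crypt "$NEW_HASH" || echo "unexpected hash format" >&2` guards the "recreate entirely" path before anything is written to `secrets/netboot.sops.yaml`.
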
### Adding New Secrets

Edit `.sops.yaml` to add new file patterns, then create encrypted files on phoenix:

```bash
ssh phoenix "sops secrets/newfile.sops.yaml"
```

## Node Storage Setup

Local NVMe is automatically partitioned on first boot by `setup-node-storage.service`.

### Partition Layout

| Partition | Size | Label | Mount Point | Purpose |
|-----------|------|-------|-------------|---------|
| nvme0n1p1 | 75 GB | containerd | /var/lib/containerd | Container images |
| nvme0n1p2 | Remaining | longhorn | /var/lib/longhorn | Distributed storage |

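For orientation, the layout above could be produced by commands like the ones this dry run prints. It is a sketch under stated assumptions: the real `files/setup-node-storage` script may use different tools, and `sgdisk`, ext4, and the function name `partition_plan` are all guesses made for illustration. Nothing here touches a disk; the function only prints text.

```bash
# Hypothetical dry run: print commands that would create the two-partition
# layout. sgdisk and ext4 are assumptions; "+75G" in sgdisk means 75 GiB.
partition_plan() {
    dev="${1:-/dev/nvme0n1}"
    cat <<EOF
sgdisk -n 1:0:+75G -c 1:containerd $dev
sgdisk -n 2:0:0 -c 2:longhorn $dev
mkfs.ext4 -L containerd ${dev}p1
mkfs.ext4 -L longhorn ${dev}p2
mount LABEL=containerd /var/lib/containerd
mount LABEL=longhorn /var/lib/longhorn
EOF
}
```

Mounting by filesystem label rather than by device name keeps the mount step independent of how the kernel enumerates the drive.
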
### Automatic Behavior

| Drive State | Action |
|-------------|--------|
| No partition table | Auto-format (no prompt) |
| Has our labels (containerd/longhorn) | Mount silently |
| Has unknown partitions | Prompt on tty1, 120s timeout, skip if no response |

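The decision table can be read as a three-way branch. The function below is a sketch of that logic only, not the actual service script: `storage_action` is a hypothetical name, and it takes its inputs as arguments so the logic can be exercised without a real drive.

```bash
# Hypothetical sketch of the decision table.
#   $1     "yes" if the drive has a partition table, "no" otherwise
#   stdin  filesystem labels found on the drive, one per line
# Prints one of: format, mount, prompt
storage_action() {
    has_table="$1"
    labels="$(cat)"
    if [ "$has_table" = no ]; then
        echo format          # blank drive: safe to format without asking
        return
    fi
    if printf '%s\n' "$labels" | grep -qx containerd &&
       printf '%s\n' "$labels" | grep -qx longhorn; then
        echo mount           # both of our labels present: reuse the data
    else
        echo prompt          # unknown contents: ask on the console first
    fi
}
```

On a node, the labels could be fed in with something like `lsblk -no LABEL /dev/nvme0n1 | storage_action yes`.
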
### Manual Intervention

If a node has an unknown drive and you want to format it:

1. Connect to physical console (tty1)
2. Reboot the node
3. Press ENTER when prompted (within 120 seconds)
4. Wait 5 seconds (abort window)
5. Drive is formatted and mounted

### Checking Storage Status

```bash
# On node
journalctl -u setup-node-storage
cat /var/lib/containerd/.netboot-storage  # marker file with metadata
lsblk /dev/nvme0n1
df -h /var/lib/containerd /var/lib/longhorn
```

## SSH Access

### Authorized Keys

Keys are baked into the image at build time. Current keys:

| Key | Source |
|-----|--------|
| `ssh-ed25519 AAAAC3...y1J` | lindahl@lindahl-Legion-5-Pro-16ACH6H |
| `ssh-ed25519 AAAA...0tX` | lindahl@phoenix.home |

To add or remove keys, edit `build-image.sh` (around lines 164-167).

### Console Access

Root password is set for physical console login only. SSH remains pubkey-only.

```
# Physical console or IPMI
login: root
Password: <from secrets/netboot.sops.yaml>
```

## Troubleshooting

### Node Won't Boot

1. Check phoenix HTTP server:
```bash
ssh phoenix "curl -I http://localhost:8800/boot.ipxe"
ssh phoenix "ls -lh /srv/netboot/http/"
```

2. Check nginx is running:
```bash
ssh phoenix "systemctl status nginx"
```

3. Verify image integrity:
```bash
./verify-image.sh
```

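The first check only probes `boot.ipxe`. A node needs all four files from the architecture diagram, so a loop over them can narrow down which download fails. This is a sketch: `netboot_urls` and `check_netboot_http` are hypothetical helpers, and it is unknown whether `verify-image.sh` already performs an equivalent check.

```bash
# Hypothetical helper: print the URLs a booting node fetches.
# File names come from the architecture diagram; the base URL is overridable.
netboot_urls() {
    base="${1:-http://192.168.100.1:8800}"
    for f in boot.ipxe vmlinuz initrd-netboot.img filesystem.squashfs; do
        echo "$base/$f"
    done
}

# Send a HEAD request to each URL; print ok/FAIL and return non-zero on failure.
check_netboot_http() {
    rc=0
    for url in $(netboot_urls "$1"); do
        if curl -fsI --max-time 5 "$url" >/dev/null; then
            echo "ok   $url"
        else
            echo "FAIL $url"
            rc=1
        fi
    done
    return $rc
}
```

Run `check_netboot_http` from a machine on the node network to test the path a node actually uses, or `check_netboot_http http://localhost:8800` on phoenix itself.
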
### Node Boots But No Network

1. Check if initramfs has network driver:
```bash
lsinitramfs http/initrd-netboot.img | grep -E "r8169|r8125"
```

2. Check kernel cmdline includes `ip=dhcp`:
```bash
cat http/boot.ipxe
```

### Storage Not Mounting

1. Check service status:
```bash
ssh root@node "systemctl status setup-node-storage"
ssh root@node "journalctl -u setup-node-storage"
```

2. Check if NVMe exists:
```bash
ssh root@node "lsblk"
```

3. Check labels:
```bash
ssh root@node "blkid -L containerd && blkid -L longhorn"
```

### Overlay Filling Up

The root overlay is only 2 GB. If it fills up:

```bash
# Check what's using space
ssh root@node "du -sh /var/* | sort -h"

# Temporary files should go to NVMe or tmpfs mounts
# /tmp, /var/tmp, /var/log are separate tmpfs
```

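To catch the overlay before it fills, the usage percentage can be read from `df` and compared with a threshold. The helpers below are a sketch; the names `fs_usage_pct` and `overlay_ok` and the 80% default are arbitrary choices for illustration.

```bash
# Print the usage percentage (digits only) of the filesystem holding a path.
# df -P forces the single-line POSIX format, so field 5 is always "NN%".
fs_usage_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical check: succeed while the root overlay is below the threshold.
overlay_ok() {
    pct="$(fs_usage_pct /)"
    [ "$pct" -lt "${1:-80}" ]
}
```

Run on a node, `overlay_ok 90 || echo "root overlay above 90%"` gives a quick warning that could be wired into whatever monitoring is in place.
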
## File Reference

| File | Purpose |
|------|---------|
| `build-image.sh` | Main build script |
| `rebuild-initramfs.sh` | Quick initramfs rebuild |
| `verify-image.sh` | Validate built image |
| `Makefile` | Build/deploy automation |
| `initramfs/` | Custom initramfs config for mkinitramfs |
| `initramfs/scripts/netboot` | HTTP root download and overlay mount |
| `files/setup-node-storage` | NVMe partitioning script |
| `files/setup-node-storage.service` | Systemd unit for storage setup |
| `secrets/netboot.sops.yaml` | Encrypted root password |
| `.sops.yaml` | Sops encryption config |
| `http/boot.ipxe` | iPXE boot configuration |

## Network Configuration

### IP Address Layout

| Range | Purpose |
|-------|---------|
| .1 | phoenix (gateway, DHCP, HTTP) |
| .2-.19 | Reserved (future infrastructure) |
| .20-.29 | Infrastructure devices |
| .50-.59 | Static K3s nodes |
| .60-.100 | Dynamic DHCP pool |

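The layout can be expressed as a lookup on the last octet, which is handy when reading lease files or logs. `ip_role` is a hypothetical helper; octets the table does not list are reported as `unassigned`, which is an interpretation rather than a documented policy.

```bash
# Map an address (or a bare last octet) on 192.168.100.0/24 to its role
# according to the table above.
ip_role() {
    octet="${1##*.}"                      # keep only what follows the last dot
    case "$octet" in
        ''|*[!0-9]*) echo invalid; return 1 ;;
    esac
    if   [ "$octet" -eq 1 ]; then echo phoenix
    elif [ "$octet" -ge 2 ]  && [ "$octet" -le 19 ];  then echo reserved
    elif [ "$octet" -ge 20 ] && [ "$octet" -le 29 ];  then echo infrastructure
    elif [ "$octet" -ge 50 ] && [ "$octet" -le 59 ];  then echo static-k3s
    elif [ "$octet" -ge 60 ] && [ "$octet" -le 100 ]; then echo dhcp-pool
    else echo unassigned                  # assumption: not covered by the table
    fi
}
```

A node that shows up as `dhcp-pool` instead of `static-k3s` most likely has no DHCP reservation yet.
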
### Static Assignments

| Host | IP | MAC | Role |
|------|-----|-----|------|
| phoenix | 192.168.100.1 | - | NAS, HTTP server, DHCP |
| usw-flex-2 | 192.168.100.21 | 94:2a:6f:4c:fc:72 | Managed switch |
| k3s-node-01 | 192.168.100.51 | 78:55:36:04:e7:c8 | K3s worker |
| k3s-node-02 | 192.168.100.52 | 78:55:36:04:e7:1d | K3s worker |

HTTP server: `http://192.168.100.1:8800/`

### DHCP Reservations

Static IP assignments are configured in `/etc/dnsmasq.d/pxe-netboot.conf` on phoenix:

```
dhcp-range=192.168.100.60,192.168.100.100,12h

# Static DHCP reservations for K3s nodes
dhcp-host=78:55:36:04:e7:c8,192.168.100.51,k3s-node-01
dhcp-host=78:55:36:04:e7:1d,192.168.100.52,k3s-node-02

# Infrastructure
dhcp-host=94:2a:6f:4c:fc:72,192.168.100.21,usw-flex-2
```

To add a new node:

1. Boot the node once to get its MAC (check leases):
```bash
ssh phoenix "cat /var/lib/misc/dnsmasq.leases"
```

2. Add reservation:
```bash
ssh phoenix "sudo tee -a /etc/dnsmasq.d/pxe-netboot.conf << EOF
dhcp-host=XX:XX:XX:XX:XX:XX,192.168.100.5X,k3s-node-0X
EOF"
```

3. Restart dnsmasq:
```bash
ssh phoenix "sudo systemctl restart dnsmasq"
```

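A typo in the reservation line is easy to make and only shows up when the node gets the wrong address. The helper below is a sketch that validates the MAC and keeps the address inside the static K3s range before printing the line; `dhcp_host_line` is a hypothetical name, not an existing script in the repository.

```bash
# Hypothetical helper: validate inputs and print a dnsmasq dhcp-host line.
#   $1 MAC address   $2 last octet (50-59)   $3 hostname
dhcp_host_line() {
    mac="$(printf '%s' "$1" | tr 'A-F' 'a-f')"   # normalise to lowercase
    octet="$2"
    name="$3"
    printf '%s\n' "$mac" | grep -Eq '^([0-9a-f]{2}:){5}[0-9a-f]{2}$' || {
        echo "bad MAC: $1" >&2
        return 1
    }
    case "$octet" in
        5[0-9]) ;;                               # static K3s range .50-.59
        *) echo "octet $octet is outside .50-.59" >&2; return 1 ;;
    esac
    echo "dhcp-host=$mac,192.168.100.$octet,$name"
}
```

The output can replace the hand-written heredoc in step 2, for example `dhcp_host_line 78:55:36:04:e7:c8 51 k3s-node-01 | ssh phoenix "sudo tee -a /etc/dnsmasq.d/pxe-netboot.conf"`.
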
To change the boot server IP, edit `http/boot.ipxe` and `initramfs/scripts/netboot`.