
# Operating on Ruflo
This is the operational companion to *Ruflo — A Swarm Orchestrator Next to Paperclip*. That post explains the architecture and the rationale. This one covers the day-to-day: connecting, installing tools, bumping images, reading logs, recovering from breakage, and a worked swarm-run cookbook.
## What “Healthy” Looks Like
Ruflo is healthy when:
- The `ruflo` Deployment is `2/2 Running` in the `ruflo-system` namespace (both `ruflo` and `ruflo-shell` containers Ready).
- The web UI loads at `https://ruflo.cluster.derio.net` after Authentik SSO.
- SSH to `agent@192.168.55.222` succeeds.
- All four ExternalSecrets (`ruflo-llm`, `ruflo-resend`, `ruflo-shell-alerts`, plus any optional add-ons) show `SecretSynced=True`.
- The three PVCs are `Bound` (`ruflo-data` 5Gi, `ruflo-shell-home` 10Gi, `ruflo-workspace` 20Gi).
- The `ruflo-db` Bitnami postgresql StatefulSet is `1/1 Running` (parked but green — see the building post for the RVF deviation).
```
kubectl get pods,pvc,externalsecret,svc -n ruflo-system
```

Expected: one `ruflo-…` Deployment pod (2/2), one `ruflo-db-postgresql-0` StatefulSet pod (2/2), three Bound PVCs, four synced ExternalSecrets, two Services (ClusterIP for ruvocal, LoadBalancer at 192.168.55.222 for SSH+Mosh).
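The readiness half of that checklist is easy to script. A minimal sketch (the `ruflo_health` function name is illustrative, and the `KUBECTL` override exists only so the logic can be exercised without a live cluster; the pod label is the same `app.kubernetes.io/name=ruflo` selector used later in Troubleshooting):

```shell
# Quick smoke test for the "healthy" checklist above.
# KUBECTL is overridable so the function can be dry-run without a cluster.
KUBECTL="${KUBECTL:-kubectl}"
NS=ruflo-system

ruflo_health() {
  local ready
  # Both containers in the Deployment pod must report Ready.
  ready=$("$KUBECTL" get pods -n "$NS" -l app.kubernetes.io/name=ruflo \
    -o jsonpath='{.items[0].status.containerStatuses[*].ready}')
  if [ "$ready" = "true true" ]; then
    echo "OK: ruflo pod 2/2 Ready"
  else
    echo "FAIL: ruflo pod not 2/2 (got: ${ready:-nothing})"
    return 1
  fi
}
```

Extending it to PVCs and ExternalSecrets is the same pattern with different `jsonpath` expressions.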
## Connecting
### Web UI
Open https://ruflo.cluster.derio.net. Authentik forward-auth handles the SSO redirect. After login you land on ruvocal’s chat surface. The session cookie is shared with every other Authentik-fronted service on the cluster, so you sign in once.
### SSH
Add to `~/.ssh/config`:

```
Host ruflo
  HostName 192.168.55.222
  User agent
  Port 22
```

Then:

```
ssh ruflo
```

### Mosh
Mosh works over a UDP port range allocated on the Service (60016–60031). You can wrap it in a shell function or just call it directly:
```
mosh --ssh="ssh -i ~/.ssh/<your-key>" \
  --server="mosh-server new -p 60016:60031" \
  agent@192.168.55.222
```

Sixteen ports is plenty of headroom; `MOSH_SERVER_NETWORK_TMOUT` reaps stuck sessions so the range doesn’t bleed.
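A shell-function wrapper keeps the flags in one place, as suggested above. A minimal sketch (the `ruflo_mosh` name, the `RUFLO_SSH_KEY` variable, and the `DRY_RUN` switch are all illustrative, not part of the deployment):

```shell
# Hypothetical convenience wrapper for the mosh invocation above.
# RUFLO_SSH_KEY is an illustrative variable, not something the cluster sets.
ruflo_mosh() {
  local key="${RUFLO_SSH_KEY:-$HOME/.ssh/id_ed25519}"
  # DRY_RUN=1 prints the command instead of executing it.
  ${DRY_RUN:+echo} mosh --ssh="ssh -i $key" \
    --server="mosh-server new -p 60016:60031" \
    agent@192.168.55.222
}
```

Drop it in `~/.bashrc` (or wherever your functions live) and `ruflo_mosh` replaces the long form.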
### Authorised-Keys Bootstrap
Authorised keys live in a SOPS-encrypted Secret `ruflo-shell-ssh-keys` (under `secrets/ruflo/`). The pod boots whether the Secret exists or not — sshd just rejects key-based logins until the bootstrap is applied. To rotate or seed keys:
```
# Edit secrets/ruflo/ruflo-shell-ssh-keys.yaml (SOPS will round-trip the encryption)
sops secrets/ruflo/ruflo-shell-ssh-keys.yaml
# Apply
sops -d secrets/ruflo/ruflo-shell-ssh-keys.yaml | kubectl apply -f -
```

If the pod was running before the bootstrap landed, the `cont-init.d/30-authorized-keys` hook only fires at boot — so the new keys won’t be live until you trigger a re-copy:
```
kubectl exec -n ruflo-system deploy/ruflo -c ruflo-shell -- \
  bash -c 'cp /etc/ssh-keys/authorized_keys "${AGENT_HOME:-/home/agent}/.ssh/authorized_keys" \
    && chmod 600 "${AGENT_HOME:-/home/agent}/.ssh/authorized_keys"'
```

Or just `kubectl rollout restart deploy/ruflo -n ruflo-system` and let `cont-init.d` re-fire on the new pod. (The same hook bites every shell sidecar — there’s a frank-gotchas entry for it.)
## Adding and Removing Tools
The shell sidecar’s tool inventory is declarative — `apps/ruflo/manifests/configmap-shell-inventory.yaml`:
```yaml
data:
  inventory.yaml: |
    mise:
      - python@3.12
      - node@20
      - rust@stable
    npm-global:
      - "claude-flow@alpha"
      - "@openai/codex"
    pipx:
      - black
      - ruff
    cargo:
      - ripgrep
      - eza
    removed:
      mise: []
      npm-global: []
      pipx: []
      cargo: []
```

Edit, commit, push. ArgoCD syncs the ConfigMap. The boot-time reconciler picks up the new declaration on the next pod restart and installs/removes accordingly. To trigger immediately without a restart:

```
ssh ruflo -- ruflo-shell-reconcile
```

### The `removed:` Arrays
Removing a tool from the upper arrays does NOT uninstall it — that’s intentional, so removing a declaration doesn’t surprise an in-flight session. To actively uninstall, add the tool to the matching `removed:` list:
```yaml
removed:
  cargo: [eza]   # forces `cargo uninstall eza` on next reconcile
```

Once reconcile runs and reports the removal, you can drop the entry from `removed:` (or leave it as a record).
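The reconcile rule — install whatever is declared, uninstall only what is explicitly in `removed:` — can be sketched in a few lines of shell. This is illustrative only (the function name is made up; the real reconciler ships in `agent-shell-base` and is more involved):

```shell
# Illustrative sketch of the reconcile rule for one manager (cargo here).
# Takes three space-separated lists: declared, removed, currently installed.
reconcile_cargo() {
  local declared="$1" removed="$2" installed="$3"
  local tool
  for tool in $declared; do
    case " $installed " in
      *" $tool "*) echo "present: $tool" ;;   # already there: no-op
      *)           echo "install: $tool" ;;   # declared but missing
    esac
  done
  # Dropping a tool from the declared list alone does nothing;
  # only removed: entries trigger an uninstall.
  for tool in $removed; do
    case " $installed " in
      *" $tool "*) echo "uninstall: $tool" ;;
    esac
  done
}
```

Note what the sketch makes obvious: a tool that is neither declared nor in `removed:` is simply left alone, which is why interactive installs survive reconciles.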
### Interactive Installs (Layer-3 Escape Hatch)
For discovery work, just install on the pod — `mise install`, `npm i -g`, `pipx install`, `cargo install`. All four managers persist their state under `${AGENT_HOME}` (i.e. `/home/agent/{.local/share/mise, .cargo/bin, .local/pipx, .local/share/mise/installs/node/20.../lib}`), which is mounted from the `ruflo-shell-home` PVC. Tools survive pod bounces.
```
ssh ruflo -- cargo install fd-find
kubectl rollout restart deploy/ruflo -n ruflo-system
ssh ruflo -- which fd   # /home/agent/.cargo/bin/fd — survived the bounce
```

### When to Promote to the Inventory
Promote an interactive install when:
- You want the tool to survive a PV migration (the inventory ConfigMap is the source of truth — interactive installs are in PV state).
- You want the next operator (or a freshly recreated PVC) to inherit the tool.
- You want the boot-time reconcile and the Telegram-on-failure alert path to cover it.
Otherwise leave it interactive. Discovery week is meant to lean on this.
### The `mise`-Activation Workaround (Pending Upstream Fix)
`install-inventory.sh` in the current agent-shell-base has a known bug: after `mise install <tool>`, it doesn’t run `mise use --global <tool>`, so subsequent steps that resolve `npm` / `python` fall through to the system installs and fail (`EACCES` on `/usr/lib/node_modules/`, missing `pyyaml`, etc.). Workaround on the live pod:
```
ssh ruflo -- 'mise use --global node@20 rust@stable python@3.12'
ssh ruflo -- 'mise exec -- pip install pyyaml'
ssh ruflo -- ruflo-shell-reconcile   # re-run after activation
```

The fix belongs in `agent-shell-base`. Once it ships, this workaround goes away.
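The eventual fix amounts to activating each tool immediately after installing it. A sketch of the shape (not the actual `agent-shell-base` patch; the `MISE` override exists only so the loop can be tested without mise on the box):

```shell
# Sketch of the fix: run `mise use --global` right after each install,
# so later steps resolve node/python from mise shims, not the system.
MISE="${MISE:-mise}"

install_mise_tools() {
  local tool
  for tool in "$@"; do
    "$MISE" install "$tool" || return 1
    # The step the current install-inventory.sh is missing:
    "$MISE" use --global "$tool" || return 1
  done
}
```

With this in place, the manual `mise use --global …` step above becomes unnecessary.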
### Reading the Install Log
Every reconcile run writes a per-tool log to `/var/log/cont-init.d/install-inventory.log` and a one-line MOTD summary to `/run/motd.dynamic`. The next SSH login sees the summary on the banner:
```
✓ ruflo-shell: 7 installed, 0 already present, 0 removed @ 2026-05-03T14:22:11Z
```

A `failed=N` count flips the banner to a warning glyph and triggers a Telegram alert via the `ruflo-shell-alerts` ExternalSecret. The alert contains the tool, the manager, the exit code, and the last 40 lines of the install log. Recovery flow:
```
ssh ruflo
sudo less /var/log/cont-init.d/install-inventory.log   # full log
ruflo-shell-reconcile                                  # re-try after fixing
```

If the Telegram alert path is silent on a known failure, check: (a) the `ruflo-shell-alerts` Secret exists and is `SecretSynced=True`; (b) the `notify-telegram.sh` helper is on PATH on the pod; (c) `FRANK_C2_TELEGRAM_BOT_TOKEN` and `FRANK_C2_TELEGRAM_CHAT_ID` exist in Infisical.
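For orientation, a helper like `notify-telegram.sh` is essentially a thin wrapper over the Telegram Bot API’s `sendMessage` method. A sketch in that shape (the real helper ships in agent-shell-base and may differ; the `CURL` override is only so the function can be exercised without network):

```shell
# Sketch of a Telegram alert helper in the shape of notify-telegram.sh.
# Real Bot API endpoint: https://api.telegram.org/bot<token>/sendMessage
CURL="${CURL:-curl}"

notify_telegram() {
  local text="$1"
  # Fail loudly if the Infisical-synced env vars are absent.
  : "${FRANK_C2_TELEGRAM_BOT_TOKEN:?missing bot token}"
  : "${FRANK_C2_TELEGRAM_CHAT_ID:?missing chat id}"
  "$CURL" -sS -X POST \
    "https://api.telegram.org/bot${FRANK_C2_TELEGRAM_BOT_TOKEN}/sendMessage" \
    -d "chat_id=${FRANK_C2_TELEGRAM_CHAT_ID}" \
    --data-urlencode "text=${text}"
}
```

The `:?` expansions are why check (c) above matters: with either variable missing, the helper exits before it ever reaches the network.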
## Bumping Images
### `ruflo-shell` (Layer-1 changes)
Bump the `ruflo-shell` image when you want to change something baked at build time — i.e. anything that lives outside `${AGENT_HOME}`. Examples: a new tool that should live in `/usr/local/bin`, an s6 service unit, a `cont-init.d` hook fix, an MOTD template change.
The image is built by the `derio-net/agent-images` matrix CI. Workflow:
1. Land the change in the `agent-images` repo (PR against `main`).
2. CI builds and pushes `ghcr.io/derio-net/ruflo-shell:<short-sha>`.
3. The lockstep bumper opens a PR in `frank` updating `apps/ruflo/manifests/deployment.yaml` to the new SHA.
4. Merge → ArgoCD syncs → Deployment rolls.
### `ruflo-server` (upstream ruvocal)
The `ruflo-server` image is a thin wrapper around `ruvnet/ruflo` at a pinned upstream SHA. To bump:
1. Edit `agent-images/ruflo-server/Dockerfile` — change the `RUFLO_UPSTREAM_SHA=…` build arg.
2. CI rebuilds; new tag `ghcr.io/derio-net/ruflo-server:<short-sha>`.
3. Lockstep bumper PR in `frank`.
Read the upstream changelog before bumping — ruvocal has had a few “the data layer is now …” surprises (Mongo → RVF/Postgres). If `DATABASE_URL` starts being honored at a new SHA, drop the parked `ruflo-db` and migrate state out of the RVF JSON file before flipping the image.
### When to Add to the Inventory Instead
If the change is “a new CLI tool the operator wants on the shell,” prefer the inventory ConfigMap over rebuilding ruflo-shell. Inventory edits are PR-and-sync; image bumps are PR-build-PR-sync. Bake into the image only when:
- The tool needs to live in `/usr/local/bin` (root-owned, system-wide).
- The tool has heavy dependencies you don’t want to install on every pod re-create.
- The tool participates in the s6/`cont-init.d` lifecycle.
Everything else: inventory.
## Backup and Recovery
Three Longhorn-backed PVCs:
| PVC | Size | Holds |
|---|---|---|
| `ruflo-data` | 5Gi | RVF JSON store (`/app/db/ruvocal.rvf.json` + indices) |
| `ruflo-shell-home` | 10Gi | mise installs, cargo bin, pipx, claude-flow CLI state, dotfiles |
| `ruflo-workspace` | 20Gi | shared between containers; project checkouts, scratch space |
Plus the ruflo-db StatefulSet’s PVC (20Gi, parked).
Cluster-wide recurring backup policy applies (see Operating on Storage & Backups). Schedule, retention, and offsite (Cloudflare R2) target are inherited from the cluster default.
To restore a single PVC:
```
# 1. Scale Deployment to 0
kubectl scale deploy/ruflo -n ruflo-system --replicas=0
# 2. Delete the PVC
kubectl delete pvc ruflo-data -n ruflo-system
# 3. In Longhorn UI (192.168.55.201) → Volumes → ruflo-data backup → Restore
#    Restore as PVC named `ruflo-data` in namespace `ruflo-system`.
# 4. Scale back up
kubectl scale deploy/ruflo -n ruflo-system --replicas=1
```

The `ruflo-data` PVC is the one to back up religiously — it holds every hive, run, and conversation. The other two are reproducible from declarative state (image + inventory ConfigMap + git).
## Swarm-Run Cookbook
A worked example of running claude-flow orchestrate against the live ruvocal. Minimum pieces:
- SSH to the shell.
- Confirm the CLI talks to ruvocal.
- Define a hive.
- Kick a run.
- Watch it from the web UI.
```
ssh ruflo

# 1. Sanity-check the CLI and the gateway path.
claude-flow --version   # should print: ruflo vX.Y.Z
claude-flow status      # reaches into ruvocal at http://localhost:3000

# 2. Verify zero-direct-key egress.
env | grep -E '^(OPENAI|OPENROUTER|LITELLM)_' | sort
# expected:
#   OPENAI_API_KEY=<RUFLO_LITELLM_KEY value, set on the container env>
#   OPENAI_BASE_URL=http://litellm.litellm.svc:4000
#   OPENROUTER_API_KEY=<openrouter key, for direct-OpenRouter code paths>
#   LITELLM_BASE_URL=http://litellm.litellm.svc:4000

# 3. Define a hive (claude-flow's name for a swarm config) and kick a run.
mkdir -p /workspace/swarms/hello-swarm && cd /workspace/swarms/hello-swarm
cat > hive.yaml <<'YAML'
name: hello-swarm
goal: "Refactor README.md and open a PR with conventional-commit message."
agents:
  - role: writer
    model: openrouter/x-ai/grok-4-fast
  - role: reviewer
    model: openrouter/anthropic/claude-sonnet-4.6
budget:
  max_tokens: 200000
  max_minutes: 15
YAML

claude-flow orchestrate --hive hive.yaml --workdir /workspace/projects/<repo>
```

The run shows up in the ruvocal web UI under the hive name. Tail the run log either through the UI or in the CLI:

```
claude-flow runs ls
claude-flow runs tail <run-id>
```

If the run hangs without producing tokens, the LiteLLM gateway is the most likely culprit — check `kubectl logs -n litellm deploy/litellm | tail -50` for 401/429/quota errors.
## Common Operations
### Restarting Ruflo
The Deployment uses `strategy: Recreate` (every container’s PVC is RWO). Rolling updates would deadlock — the new pod can’t mount any of the three volumes while the old pod holds them.
```
kubectl rollout restart deploy/ruflo -n ruflo-system
kubectl rollout status deploy/ruflo -n ruflo-system --timeout=120s
```

Expect ~30–60s of downtime while the new pod attaches the three PVCs and the s6 init in `ruflo-shell` finishes its `cont-init.d` chain. The web UI bounces; SSH connections drop.
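The manifest fragment behind this behaviour looks roughly like the following (a paraphrase for illustration; `apps/ruflo/manifests/deployment.yaml` in `frank` is the authoritative source):

```yaml
# Sketch of the relevant Deployment fields, not the full manifest.
spec:
  replicas: 1
  strategy:
    type: Recreate   # RWO PVCs: old pod must release volumes before the new one mounts
```

`Recreate` is what turns every restart into a brief outage instead of a deadlocked rollout.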
### Forcing a Reconcile Without a Restart
```
ssh ruflo -- ruflo-shell-reconcile
```

That re-reads the inventory and installs/removes against the current state on the `ruflo-shell-home` PVC. Useful after a ConfigMap edit when you don’t want to wait for a pod bounce.
### Manually Driving the Database Tier (Parked Postgres)
The `ruflo-db` StatefulSet is parked but green — kept around in case a future re-vendor flips ruvocal’s data layer back to Postgres. Treat it as inert. If you need to confirm it’s alive:
```
kubectl exec -n ruflo-system ruflo-db-postgresql-0 -c postgresql -- \
  psql -U ruflo -d ruflo -c '\dt'
```

If you’re satisfied that the data layer will never flip back, the cleanup is: delete `apps/ruflo-db/`, delete its Application CR, drop the StatefulSet’s PVC. Out of scope for this layer.
## Troubleshooting
### Pod Stuck 0/2 or 1/2
Both containers must reach Ready. Common causes:
- `ruflo` container `Pending` or `CrashLoopBackOff` — most likely the LiteLLM virtual key is wrong (401 from upstream LiteLLM at boot) or the `ruflo-data` PVC is unwritable. Check:

  ```
  kubectl logs -n ruflo-system deploy/ruflo -c ruflo --previous | tail -30
  kubectl describe pvc ruflo-data -n ruflo-system
  ```

- `ruflo-shell` container `Init:Error` — the s6-overlay v3 init refuses to start as non-pid-1. If you see `s6-overlay-suexec: fatal: can only run as pid 1`, check that `shareProcessNamespace` is not set on the Pod spec (it’s incompatible with agent-shell-base — see the building post and the gotchas file).
- `ruflo-shell` container `Running` but sshd not answering — `cont-init.d/30-authorized-keys` short-circuited because the SOPS Secret wasn’t applied yet. Apply the Secret and follow the recovery in Connecting → Authorised-Keys Bootstrap.
### 502 from the Web UI
Traefik returns 502 when ruvocal’s `/api/v2/feature-flags` readiness probe is failing. The pod is up but the kubelet has marked it `NotReady`, so the Service has no endpoints.
```
kubectl get endpoints -n ruflo-system ruflo
kubectl describe pod -n ruflo-system -l app.kubernetes.io/name=ruflo
```

If the probe is failing, it’s almost always upstream — LiteLLM is down, OpenRouter is rate-limiting, or the LiteLLM virtual key has been revoked.
### 401 Loop on Model Calls (After a Working Boot)
The LiteLLM virtual key (`RUFLO_LITELLM_KEY`) was revoked or rotated in Infisical. The pod has the cached value; ESO re-syncs every 5 minutes. Force a re-sync:
```
kubectl annotate externalsecret ruflo-llm -n ruflo-system \
  force-sync=$(date +%s) --overwrite
kubectl rollout restart deploy/ruflo -n ruflo-system
```

### Telegram Alerts Stop
Check the alert ExternalSecret:
```
kubectl get externalsecret ruflo-shell-alerts -n ruflo-system
kubectl describe externalsecret ruflo-shell-alerts -n ruflo-system
```

The alert helper resolves `FRANK_C2_TELEGRAM_BOT_TOKEN` / `FRANK_C2_TELEGRAM_CHAT_ID` at send time. If those rotate in Infisical, ESO syncs the new values within 5 minutes.
## Gotchas
- `shareProcessNamespace: true` is incompatible with the shell sidecar. s6-overlay v3 must be pid 1 in its container’s namespace. Cross-container debugging goes through `kubectl exec -c <other>` instead.
- `/` is the wrong probe path. ruvocal SSR-renders the model list at request time, so probes against `/` are full upstream-dependency checks. Use `/api/v2/feature-flags` (already configured).
- `OPENAI_API_KEY` must be a LiteLLM virtual key, not the OpenRouter key. LiteLLM authenticates against its own key store. Symptom of a wrong key: 401 on every model-list call, 500 on `/`.
- The data layer is RVF, not Postgres. Mounting a PVC at `/app/db` is essential — without it, every restart starts from a fresh empty `ruvocal.rvf.json` and every hive vanishes.
- `mise install` doesn’t activate. Until the upstream `agent-shell-base` fix lands, manual `mise use --global …` after first reconcile is required (see workaround above).
- The `cont-init.d/30-authorized-keys` hook only fires at pod boot. Rotating SSH keys mid-life requires either a `kubectl exec` re-copy or a pod bounce.
## References
- Building Post: Ruflo — architecture and rationale
- Building Post: Agent Images and the VK-Local Sidecar — the agent-shell-base lineage
- Operating on Paperclip — the org-chart counterpart
- Operating on Storage & Backups — Longhorn snapshot policy
- ruvnet/ruflo — upstream
