Operating on Ruflo

This is the operational companion to Ruflo — A Swarm Orchestrator Next to Paperclip. That post explains the architecture and the rationale. This one covers the day-to-day: connecting, installing tools, bumping images, reading logs, recovering from breakage, and a worked swarm-run cookbook.

What “Healthy” Looks Like

Ruflo is healthy when:

  • The ruflo Deployment is 2/2 Running in the ruflo-system namespace (both ruflo and ruflo-shell containers Ready).
  • The web UI loads at https://ruflo.cluster.derio.net after Authentik SSO.
  • SSH to agent@192.168.55.222 succeeds.
  • The ExternalSecrets (ruflo-llm, ruflo-resend, ruflo-shell-alerts, plus any optional add-ons) all show SecretSynced=True.
  • The three PVCs are Bound (ruflo-data 5Gi, ruflo-shell-home 10Gi, ruflo-workspace 20Gi).
  • The ruflo-db Bitnami postgresql StatefulSet is 1/1 Running (parked but green — see the building post for the RVF deviation).
kubectl get pods,pvc,externalsecret,svc -n ruflo-system

Expected: one ruflo-… Deployment pod (2/2), one ruflo-db-postgresql-0 StatefulSet pod (2/2), three Bound PVCs, four synced ExternalSecrets, two Services (ClusterIP for ruvocal, LoadBalancer at 192.168.55.222 for SSH+Mosh).
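
For a faster pass than eyeballing the full listing, a small check script covers the same boxes. A minimal sketch using the resource names above (adjust if your manifests differ):

#!/usr/bin/env bash
set -euo pipefail
ns=ruflo-system

# Deployment: should report 1/1 (a single pod with both containers Ready)
kubectl -n "$ns" get deploy ruflo \
  -o jsonpath='ready: {.status.readyReplicas}/{.spec.replicas}{"\n"}'

# PVCs: all three should be Bound
kubectl -n "$ns" get pvc ruflo-data ruflo-shell-home ruflo-workspace \
  -o custom-columns='PVC:.metadata.name,PHASE:.status.phase,SIZE:.status.capacity.storage'

# ExternalSecrets: READY should be True on every row
kubectl -n "$ns" get externalsecret

# SSH reachability via the LoadBalancer
ssh -o ConnectTimeout=5 agent@192.168.55.222 true && echo "ssh: ok"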

Connecting

Web UI

Open https://ruflo.cluster.derio.net. Authentik forward-auth handles the SSO redirect. After login you land on ruvocal’s chat surface. The session cookie is shared with every other Authentik-fronted service on the cluster, so you sign in once.

SSH

Add to ~/.ssh/config:

Host ruflo
  HostName 192.168.55.222
  User agent
  Port 22

Then:

ssh ruflo

Mosh

Mosh works over a UDP port range allocated on the Service (60016–60031). You can wrap it in a shell function or just call it directly:

mosh --ssh="ssh -i ~/.ssh/<your-key>" \
     --server="mosh-server new -p 60016:60031" \
     agent@192.168.55.222

Sixteen ports is plenty of headroom; MOSH_SERVER_NETWORK_TMOUT reaps stuck sessions so the range doesn’t bleed.
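
If you use Mosh often, a tiny wrapper saves retyping the port range. A sketch (the function name is ours, not anything shipped on the pod):

ruflo-mosh() {
  mosh --ssh="ssh -i ~/.ssh/<your-key>" \
       --server="mosh-server new -p 60016:60031" \
       agent@192.168.55.222 "$@"
}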

Authorised-Keys Bootstrap

Authorised keys live in a SOPS-encrypted Secret ruflo-shell-ssh-keys (under secrets/ruflo/). The pod boots whether the Secret exists or not — sshd just rejects key-based logins until the bootstrap is applied. To rotate or seed keys:

# Edit secrets/ruflo/ruflo-shell-ssh-keys.yaml (SOPS will round-trip the encryption)
sops secrets/ruflo/ruflo-shell-ssh-keys.yaml

# Apply
sops -d secrets/ruflo/ruflo-shell-ssh-keys.yaml | kubectl apply -f -

If the pod was running before the bootstrap landed, the cont-init.d/30-authorized-keys hook only fires at boot — so the new keys won’t be live until you trigger a re-copy:

kubectl exec -n ruflo-system deploy/ruflo -c ruflo-shell -- \
  bash -c 'cp /etc/ssh-keys/authorized_keys "${AGENT_HOME:-/home/agent}/.ssh/authorized_keys" \
           && chmod 600 "${AGENT_HOME:-/home/agent}/.ssh/authorized_keys"'

Or just kubectl rollout restart deploy/ruflo -n ruflo-system and let cont-init.d re-fire on the new pod. (The same hook bites every shell sidecar — there’s a frank-gotchas entry for it.)

Adding and Removing Tools

The shell sidecar’s tool inventory is declarative — apps/ruflo/manifests/configmap-shell-inventory.yaml:

data:
  inventory.yaml: |
    mise:
      - python@3.12
      - node@20
      - rust@stable
    npm-global:
      - "claude-flow@alpha"
      - "@openai/codex"
    pipx:
      - black
      - ruff
    cargo:
      - ripgrep
      - eza
    removed:
      mise: []
      npm-global: []
      pipx: []
      cargo: []

Edit, commit, push. ArgoCD syncs the ConfigMap. The boot-time reconciler picks up the new declaration on the next pod restart and installs/removes accordingly. To trigger immediately without a restart:

ssh ruflo -- ruflo-shell-reconcile
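
Before (or after) triggering a reconcile, you can confirm the synced ConfigMap in the cluster matches what you pushed by reading it back. The in-cluster ConfigMap name is an assumption here; check apps/ruflo/manifests/configmap-shell-inventory.yaml if yours differs:

kubectl get configmap ruflo-shell-inventory -n ruflo-system \
  -o jsonpath='{.data.inventory\.yaml}'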

The removed: Arrays

Removing a tool from the upper arrays does NOT uninstall it — that’s intentional, so removing a declaration doesn’t surprise an in-flight session. To actively uninstall, add the tool to the matching removed: list:

removed:
  cargo: [eza]   # forces `cargo uninstall eza` on next reconcile

Once reconcile runs and reports the removal, you can drop the entry from removed: (or leave it as a record).

Interactive Installs (Layer-3 Escape Hatch)

For discovery work, just install on the pod — mise install, npm i -g, pipx install, cargo install. All four managers persist their state under ${AGENT_HOME} (i.e. /home/agent/{.local/share/mise, .cargo/bin, .local/pipx, .local/share/mise/installs/node/20.../lib}) which is mounted from the ruflo-shell-home PVC. Tools survive pod bounces.

ssh ruflo -- cargo install fd-find
kubectl rollout restart deploy/ruflo -n ruflo-system
ssh ruflo -- which fd     # /home/agent/.cargo/bin/fd — survived the bounce

When to Promote to the Inventory

Promote an interactive install when:

  1. You want the tool to survive a PV migration (the inventory ConfigMap is the source of truth — interactive installs are in PV state).
  2. You want the next operator (or a freshly recreated PVC) to inherit the tool.
  3. You want the boot-time reconcile and the Telegram-on-failure alert path to cover it.

Otherwise leave it interactive. Discovery week is meant to lean on this.
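
Promotion itself is just the inventory edit. For the fd-find example above, the change to configmap-shell-inventory.yaml would look roughly like this:

    cargo:
      - ripgrep
      - eza
      - fd-find   # promoted from the interactive `cargo install fd-find`

The next reconcile should report it as already present rather than reinstalling, since the binary is already on the ruflo-shell-home PVC.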

The mise-Activation Workaround (Pending Upstream Fix)

install-inventory.sh in the current agent-shell-base has a known bug: after mise install <tool>, it doesn’t run mise use --global <tool>, so subsequent steps that resolve npm / python fall through to the system installs and fail (EACCES on /usr/lib/node_modules/, missing pyyaml, etc.). Workaround on the live pod:

ssh ruflo -- 'mise use --global node@20 rust@stable python@3.12'
ssh ruflo -- 'mise exec -- pip install pyyaml'
ssh ruflo -- ruflo-shell-reconcile      # re-run after activation

The fix belongs in agent-shell-base. Once it ships, this workaround goes away.

Reading the Install Log

Every reconcile run writes a per-tool log to /var/log/cont-init.d/install-inventory.log and a one-line MOTD summary to /run/motd.dynamic. The next SSH login sees the summary on the banner:

✓ ruflo-shell: 7 installed, 0 already present, 0 removed @ 2026-05-03T14:22:11Z

A failed=N count flips the banner to a warning glyph and triggers a Telegram alert via the ruflo-shell-alerts ExternalSecret. The alert contains the tool, the manager, the exit code, and the last 40 lines of the install log. Recovery flow:

ssh ruflo
sudo less /var/log/cont-init.d/install-inventory.log    # full log
ruflo-shell-reconcile                                   # re-try after fixing

If the Telegram alert path is silent on a known failure, check: (a) the ruflo-shell-alerts Secret exists and is SecretSynced=True; (b) the notify-telegram.sh helper is on PATH on the pod; (c) FRANK_C2_TELEGRAM_BOT_TOKEN and FRANK_C2_TELEGRAM_CHAT_ID exist in Infisical.
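
The first two of those checks are scriptable from the workstation; the Infisical one stays manual. A sketch:

# (a) ExternalSecret present and synced
kubectl get externalsecret ruflo-shell-alerts -n ruflo-system

# (b) the helper is on PATH inside the shell container
ssh ruflo -- which notify-telegram.sh

# (c) confirm FRANK_C2_TELEGRAM_BOT_TOKEN / FRANK_C2_TELEGRAM_CHAT_ID in the Infisical UI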

Bumping Images

ruflo-shell (Layer-1 changes)

Bump the ruflo-shell image when you want to change something baked at build time — i.e. anything that lives outside ${AGENT_HOME}. Examples: a new tool that should live in /usr/local/bin, an s6 service unit, a cont-init.d hook fix, an MOTD template change.

The image is built by the derio-net/agent-images matrix CI. Workflow:

  1. Land the change in the agent-images repo (PR against main).
  2. CI builds and pushes ghcr.io/derio-net/ruflo-shell:<short-sha>.
  3. The lockstep bumper opens a PR in frank updating apps/ruflo/manifests/deployment.yaml to the new SHA.
  4. Merge → ArgoCD syncs → Deployment rolls.
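
The bumper PR is a one-line change to the image tag; roughly this fragment of apps/ruflo/manifests/deployment.yaml (the surrounding field layout is assumed):

containers:
  - name: ruflo-shell
    image: ghcr.io/derio-net/ruflo-shell:<short-sha>   # the line the lockstep bumper rewrites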

ruflo-server (upstream ruvocal)

The ruflo-server image is a thin wrapper around ruvnet/ruflo at a pinned upstream SHA. To bump:

  1. Edit agent-images/ruflo-server/Dockerfile — change the RUFLO_UPSTREAM_SHA=… build arg.
  2. CI rebuilds; new tag ghcr.io/derio-net/ruflo-server:<short-sha>.
  3. Lockstep bumper PR in frank.

Read the upstream changelog before bumping — ruvocal has had a few “the data layer is now …” surprises (Mongo → RVF/Postgres). If DATABASE_URL starts being honored at a new SHA, drop the parked ruflo-db and migrate state out of the RVF JSON file before flipping the image.

When to Add to the Inventory Instead

If the change is “a new CLI tool the operator wants on the shell,” prefer the inventory ConfigMap over rebuilding ruflo-shell. Inventory edits are PR-and-sync; image bumps are PR-build-PR-sync. Bake into the image only when:

  • The tool needs to live in /usr/local/bin (root-owned, system-wide).
  • The tool has heavy dependencies you don’t want to install on every pod re-create.
  • The tool participates in the s6/cont-init.d lifecycle.

Everything else: inventory.

Backup and Recovery

Three Longhorn-backed PVCs:

PVC               Size   Holds
ruflo-data        5Gi    RVF JSON store (/app/db/ruvocal.rvf.json + indices)
ruflo-shell-home  10Gi   mise installs, cargo bin, pipx, claude-flow CLI state, dotfiles
ruflo-workspace   20Gi   shared between containers; project checkouts, scratch space

Plus the ruflo-db StatefulSet’s PVC (20Gi, parked).

Cluster-wide recurring backup policy applies (see Operating on Storage & Backups). Schedule, retention, and offsite (Cloudflare R2) target are inherited from the cluster default.

To restore a single PVC:

# 1. Scale Deployment to 0
kubectl scale deploy/ruflo -n ruflo-system --replicas=0

# 2. Delete the PVC
kubectl delete pvc ruflo-data -n ruflo-system

# 3. In Longhorn UI (192.168.55.201) → Volumes → ruflo-data backup → Restore
#    Restore as PVC named `ruflo-data` in namespace `ruflo-system`.

# 4. Scale back up
kubectl scale deploy/ruflo -n ruflo-system --replicas=1
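
After step 4, confirm the restored store actually came back before declaring victory. A quick check, using the RVF path from the Gotchas section (assumes a standard shell and ls in the server image):

kubectl exec -n ruflo-system deploy/ruflo -c ruflo -- ls -lh /app/db/ruvocal.rvf.json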

The ruflo-data PVC is the one to back up religiously — it holds every hive, run, and conversation. The other two are reproducible from declarative state (image + inventory ConfigMap + git).

Swarm-Run Cookbook

A worked example of running claude-flow orchestrate against the live ruvocal. Minimum pieces:

  1. SSH to the shell.
  2. Confirm the CLI talks to ruvocal.
  3. Define a hive.
  4. Kick a run.
  5. Watch it from the web UI.
ssh ruflo

# 1. Sanity-check the CLI and the gateway path.
claude-flow --version             # should print: ruflo vX.Y.Z
claude-flow status                # reaches into ruvocal at http://localhost:3000

# 2. Verify zero-direct-key egress.
env | grep -E '^(OPENAI|OPENROUTER|LITELLM)_' | sort
# expected:
#   OPENAI_API_KEY=<RUFLO_LITELLM_KEY value, set on the container env>
#   OPENAI_BASE_URL=http://litellm.litellm.svc:4000
#   OPENROUTER_API_KEY=<openrouter key, for direct-OpenRouter code paths>
#   LITELLM_BASE_URL=http://litellm.litellm.svc:4000

# 3. Define a hive (claude-flow's name for a swarm config) and kick a run.
mkdir -p /workspace/swarms/hello-swarm && cd /workspace/swarms/hello-swarm
cat > hive.yaml <<'YAML'
name: hello-swarm
goal: "Refactor README.md and open a PR with conventional-commit message."
agents:
  - role: writer
    model: openrouter/x-ai/grok-4-fast
  - role: reviewer
    model: openrouter/anthropic/claude-sonnet-4.6
budget:
  max_tokens: 200000
  max_minutes: 15
YAML

claude-flow orchestrate --hive hive.yaml --workdir /workspace/projects/<repo>

The run shows up in the ruvocal web UI under the hive name. Tail the run log either through the UI or in the CLI:

claude-flow runs ls
claude-flow runs tail <run-id>

If the run hangs without producing tokens, the LiteLLM gateway is the most likely culprit — check kubectl logs -n litellm deploy/litellm | tail -50 for 401/429/quota errors.

Common Operations

Restarting Ruflo

The Deployment uses strategy: Recreate (every container’s PVC is RWO). Rolling updates would deadlock — the new pod can’t mount any of the three volumes while the old pod holds them.

kubectl rollout restart deploy/ruflo -n ruflo-system
kubectl rollout status deploy/ruflo -n ruflo-system --timeout=120s

Expect ~30–60s of downtime while the new pod attaches the three PVCs and the s6 init in ruflo-shell finishes its cont-init.d chain. The web UI bounces; SSH connections drop.

Forcing a Reconcile Without a Restart

ssh ruflo -- ruflo-shell-reconcile

That re-reads the inventory and installs/removes against the current state on the ruflo-shell-home PVC. Useful after a ConfigMap edit when you don’t want to wait for a pod bounce.

Manually Driving the Database Tier (Parked Postgres)

The ruflo-db StatefulSet is parked but green — kept around in case a future re-vendor flips ruvocal’s data layer back to Postgres. Treat it as inert. If you need to confirm it’s alive:

kubectl exec -n ruflo-system ruflo-db-postgresql-0 -c postgresql -- \
  psql -U ruflo -d ruflo -c '\dt'

If you’re satisfied that the data layer will never flip back, the cleanup is: delete apps/ruflo-db/, delete its Application CR, drop the StatefulSet’s PVC. Out of scope for this layer.

Troubleshooting

Pod Stuck 0/2 or 1/2

Both containers must reach Ready. Common causes:

  • ruflo container Pending or CrashLoopBackOff — most likely the LiteLLM virtual key is wrong (401 from upstream LiteLLM at boot) or the ruflo-data PVC is unwritable. Check:
    kubectl logs -n ruflo-system deploy/ruflo -c ruflo --previous | tail -30
    kubectl describe pvc ruflo-data -n ruflo-system
  • ruflo-shell container Init:Error — the s6-overlay v3 init refuses to start as non-pid-1. If you see s6-overlay-suexec: fatal: can only run as pid 1, check that shareProcessNamespace is not set on the Pod spec (it’s incompatible with agent-shell-base — see the building post and the gotchas file).
  • ruflo-shell container Running but sshd not answering — cont-init.d/30-authorized-keys short-circuited because the SOPS Secret wasn’t applied yet. Apply the Secret and follow the recovery in Connecting → Authorised-Keys Bootstrap.

502 from the Web UI

Traefik returns 502 when ruvocal’s /api/v2/feature-flags readiness probe is failing. The pod is up but the kubelet has marked it NotReady, so the Service has no endpoints.

kubectl get endpoints -n ruflo-system ruflo
kubectl describe pod -n ruflo-system -l app.kubernetes.io/name=ruflo

If the probe is failing, it’s almost always upstream — LiteLLM is down, OpenRouter is rate-limiting, or the LiteLLM virtual key has been revoked.
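
To test the probe path by hand rather than inferring from events (this assumes curl is available inside the ruflo container):

kubectl exec -n ruflo-system deploy/ruflo -c ruflo -- \
  curl -sf http://localhost:3000/api/v2/feature-flags >/dev/null \
  && echo "probe path: ok" || echo "probe path: failing"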

401 Loop on Model Calls (After a Working Boot)

The LiteLLM virtual key (RUFLO_LITELLM_KEY) was revoked or rotated in Infisical. The pod has the cached value; ESO re-syncs every 5 minutes. Force a re-sync:

kubectl annotate externalsecret ruflo-llm -n ruflo-system \
  force-sync=$(date +%s) --overwrite
kubectl rollout restart deploy/ruflo -n ruflo-system

Telegram Alerts Stop

Check the alert ExternalSecret:

kubectl get externalsecret ruflo-shell-alerts -n ruflo-system
kubectl describe externalsecret ruflo-shell-alerts -n ruflo-system

The alert helper resolves FRANK_C2_TELEGRAM_BOT_TOKEN / FRANK_C2_TELEGRAM_CHAT_ID at send time. If those rotate in Infisical, ESO syncs the new values within 5 minutes.

Gotchas

  • shareProcessNamespace: true is incompatible with the shell sidecar. s6-overlay v3 must be pid 1 in its container’s namespace. Cross-container debugging goes through kubectl exec -c <other> instead.
  • / is the wrong probe path. ruvocal SSR-renders the model list at request time, so probes against / are full upstream-dependency checks. Use /api/v2/feature-flags (already configured).
  • OPENAI_API_KEY must be a LiteLLM virtual key, not the OpenRouter key. LiteLLM authenticates against its own key store. Symptom of a wrong key: 401 on every model-list call, 500 on /.
  • The data layer is RVF, not Postgres. Mounting a PVC at /app/db is essential — without it, every restart starts from a fresh empty ruvocal.rvf.json and every hive vanishes.
  • mise install doesn’t activate. Until the upstream agent-shell-base fix lands, manual mise use --global … after first reconcile is required (see workaround above).
  • The cont-init.d/30-authorized-keys hook only fires at pod boot. Rotating SSH keys mid-life requires either a kubectl exec re-copy or a pod bounce.

References