
# Local Inference — Ollama, LiteLLM, and OpenRouter
The cluster has a GPU. Layer 4 installed the NVIDIA operator. Layer 5 gave the mini nodes their Intel iGPUs. But none of that is useful until something actually runs inference.
Layer 10 wires up a unified LLM gateway. Any tool on the network — agentic frameworks, document processors, coding assistants — talks to one OpenAI-compatible endpoint at 192.168.55.206:4000. Behind that endpoint, requests route to either a local model on gpu-1’s RTX 5070 Ti or a free cloud model via OpenRouter. The consumer never needs to know which.
## The Architecture

Three components:
Ollama runs on gpu-1 and serves local models. It manages model downloads, VRAM allocation, and the inference runtime. It exposes a ClusterIP on port 11434 — internal only.
LiteLLM is the gateway. It presents a single OpenAI-compatible API and routes requests to the right backend based on the model name in the request. It also handles virtual API keys, spend tracking, and rate limiting. It runs on any non-GPU node.
OpenRouter aggregates cloud model providers behind one API key. Free-tier models have limits (20 requests/minute, 200/day per model), but that is plenty for a homelab.
```
Consumers (AnythingLLM, Paperless-ngx, agentic frameworks, etc.)
        |
        v
LiteLLM Gateway (192.168.55.206:4000)
        |  unified OpenAI-compatible API
        |  virtual keys, spend tracking, rate limits
        |
        |---> Ollama (gpu-1, ClusterIP)
        |       |-- mistral-small3.2:24b (default, kept warm)
        |       |-- gemma3:12b (multimodal — vision general)
        |       |-- qwen2.5vl:7b-q8_0 (multimodal — OCR/structured)
        |       |-- qwen2.5-coder:14b-instruct-q6_K (code)
        |       +-- qwen3:14b (reasoning, thinking mode)
        |
        +---> OpenRouter (cloud, free tier)
                +-- gemma-31b, nemotron-vl-12b, nemotron-omni-30b,
                    qwen-next-80b, qwen-coder-480b, hermes-405b
```

Any consumer that speaks OpenAI’s API format works out of the box:

```
OPENAI_API_BASE=http://192.168.55.206:4000/v1
OPENAI_API_KEY=<litellm-virtual-key>
```

## Why Not Just Ollama?
Ollama alone handles local models well. But the moment you want cloud fallback, multiple consumers with different keys, or spend tracking, you need a routing layer. LiteLLM adds that without changing how consumers connect.
It also means model migration is invisible to consumers. If a cloud model gets retired or a better local model appears, you update LiteLLM’s config. No consumer reconfiguration.
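Because consumers only ever see the gateway, a request looks identical whether the model runs on gpu-1 or in the cloud. A minimal sketch of the request shape using only the stdlib (`chat_request` and the placeholder key are illustrative, not part of LiteLLM):

```python
import json

def chat_request(model: str, prompt: str) -> dict:
    """Assemble an OpenAI-compatible /v1/chat/completions call against
    the LiteLLM gateway. Only the model name changes per backend."""
    return {
        "url": "http://192.168.55.206:4000/v1/chat/completions",
        "headers": {
            "Authorization": "Bearer <litellm-virtual-key>",  # placeholder
            "Content-Type": "application/json",
        },
        "body": json.dumps({
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        }),
    }

# Same endpoint, same shape; LiteLLM decides local vs. cloud by model name:
local = chat_request("mistral-small-24b", "Summarize this log line.")
cloud = chat_request("qwen-coder-480b", "Refactor this function.")
```

Swapping the backend behind an alias never touches consumer code like this; that is the point of the routing layer.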
## Local Models: What Fits in 16GB?
The RTX 5070 Ti has 16GB of GDDR7. That is the hard constraint. Ollama quantizes models to Q4 by default, which cuts memory roughly in half — but at 16GB there is enough headroom to upgrade specific models to Q6 or Q8 where it matters.
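The budget arithmetic is simple enough to sanity-check by hand. A back-of-the-envelope sketch (the ~4.5 and ~6.5 bits-per-weight figures for Q4_K_M and Q6_K are approximations, and the flat 1.5 GB overhead stands in for KV cache and runtime buffers):

```python
def approx_vram_gb(params_b: float, bits_per_weight: float,
                   overhead_gb: float = 1.5) -> float:
    """Weights take params x bits / 8 bytes; add a flat allowance for
    KV cache and runtime buffers. Activations and vision towers ignored."""
    return params_b * bits_per_weight / 8 + overhead_gb

# A 24B model at ~4.5 bpw squeezes under the 16 GB ceiling:
mistral_q4 = approx_vram_gb(24, 4.5)          # 15.0 GB
# The Q6-over-Q4 premium for a 14B coder model:
coder_q6_premium = approx_vram_gb(14, 6.5) - approx_vram_gb(14, 4.5)  # 3.5 GB
```

The estimates land reasonably close to the measured VRAM column in the lineup table, which is all a capacity plan needs.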
Five models in the current lineup, each chosen to fit alongside ~1.5GB of KV cache (and, for vision models, a ~1.4GB vision tower):
| Alias | Tag | Quant | VRAM | Context | Best For |
|---|---|---|---|---|---|
| mistral-small-24b | mistral-small3.2:24b | Q4_K_M | ~14 GB | 128K | Default general-purpose, function calling |
| gemma-12b | gemma3:12b | Q4_K_M | ~9 GB | 128K | Multimodal — general vision, screenshots, charts |
| qwen-vl-7b | qwen2.5vl:7b-q8_0 | Q8_0 | ~9 GB | 128K | Multimodal — OCR, tables, scanned docs |
| qwen-coder-14b | qwen2.5-coder:14b-instruct-q6_K | Q6_K | ~12 GB | 32K | Code generation and completion |
| qwen-think-14b | qwen3:14b | Q4_K_M | ~10 GB | 32K | Reasoning with native thinking mode |
Only one model stays loaded in VRAM at a time (`OLLAMA_MAX_LOADED_MODELS=1`). The default model is kept warm for 24 hours (`OLLAMA_KEEP_ALIVE=24h`). Switching takes about 5 seconds — Ollama unloads one model and loads the other from the Longhorn PVC.
This is a deliberate trade-off. Loading multiple models simultaneously would leave each with less VRAM for KV cache, reducing effective context length. For a homelab with low concurrency, fast swapping is better than degraded context.
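To see why co-resident models would degrade context, estimate the KV cache directly from attention geometry. A sketch with illustrative dimensions (not the actual shapes of any model in the lineup; real models use grouped-query attention, so KV heads number fewer than attention heads):

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    """Per token, each layer stores K and V: 2 x kv_heads x head_dim
    values at the given element size (2 bytes for fp16)."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * context_len / 1e9

# A hypothetical 40-layer model with 8 KV heads of dim 128 needs roughly
# 5.4 GB of cache for a 32K context; a second resident model would have
# to carve its own cache out of the same 16 GB.
cache = kv_cache_gb(40, 8, 128, 32_768)
```

Cache grows linearly with context length, so halving the VRAM available for cache roughly halves usable context.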
### Why Two Multimodal Models?
gemma-12b and qwen-vl-7b look redundant on paper — both are vision models that fit in VRAM. They are not. Gemma 3’s vision tower was trained on a wide image corpus and excels at “what is in this picture”: general visual reasoning, screenshots, photographs. Qwen2.5-VL was specifically trained on structured visual content — tables, charts with dense text, scanned documents — and produces noticeably more accurate OCR. Picking one would force every vision request through a model that is wrong for half the cases.
### Why Q6 for the Coder?
Code is the one place where quantization quality is measurable in production. At Q4_K_M, 14B-class coding models produce more syntax errors and forget API surface details. At Q6_K the model uses ~3GB more VRAM but the error rate drops noticeably. The 16GB budget makes that trade-off available; the 12GB original config didn’t.
### Why Not the Mini iGPUs?
The three mini nodes each have an Intel Arc iGPU. These share system RAM instead of having dedicated VRAM — which makes them unsuitable for LLM inference where memory bandwidth is the bottleneck. Their value is in media and vision workloads: hardware video transcode via Quick Sync, object detection via OpenVINO, and general OpenCL compute.
## Cloud Models: The Free Tier Treadmill
OpenRouter aggregates providers and offers free tiers for many models. The catch: free model availability shifts constantly. Models get promoted, retired, or rate-limited without notice. This is a maintenance concern, not an architectural one.
The current free model roster (refreshed May 2026 — see “Refresh” below):
| Alias | Model | Context | Modalities | Strengths | Data Policy |
|---|---|---|---|---|---|
| gemma-31b | Gemma 4 31B Instruct | 256K | text + image + video | Flagship multimodal, function calling, 140+ langs | Open-weight |
| nemotron-vl-12b | NVIDIA Nemotron Nano 2 VL | 128K | text + image + video | Document intelligence, video understanding | Open-weight |
| nemotron-omni-30b | NVIDIA Nemotron 3 Nano Omni 30B | 256K | text + image + video + audio | Multimodal + reasoning | Open-weight |
| qwen-next-80b | Qwen3 Next 80B A3B Instruct | 262K | text | Strong reasoning, coding, math | Alibaba; may retain |
| qwen-coder-480b | Qwen3 Coder 480B MoE | 262K | text | Frontier coding | Alibaba; may retain |
| hermes-405b | Hermes 3 (Llama 3.1 405B) | 131K | text | Largest open-weight backstop | Open-weight |
The data policy column matters. Some free providers train on prompts. The config comments document this per model so you can make informed choices about what you send where.
### Keeping the List Current
We built a `/update-openrouter-models` command that automates the refresh cycle: query the OpenRouter API for current free models, compare against the config, replace retired ones, deploy, and verify. Run it when models start returning 404s.
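The heart of that refresh cycle is a filter over the models endpoint. A sketch of the check (field names follow OpenRouter's `/api/v1/models` response, where pricing values arrive as strings; the commented live-check lines and the `configured` set are illustrative):

```python
import json
from urllib.request import urlopen  # used only by the commented live check

def free_model_ids(models_json: dict) -> list[str]:
    """Return the IDs whose prompt and completion pricing are both zero."""
    return [
        m["id"]
        for m in models_json["data"]
        if float(m["pricing"]["prompt"]) == 0.0
        and float(m["pricing"]["completion"]) == 0.0
    ]

# Live check (requires network):
#   models = json.load(urlopen("https://openrouter.ai/api/v1/models"))
#   retired = configured - set(free_model_ids(models))
sample = {"data": [
    {"id": "qwen/qwen3-coder:free",
     "pricing": {"prompt": "0", "completion": "0"}},
    {"id": "some/paid-model",
     "pricing": {"prompt": "0.000002", "completion": "0.00001"}},
]}
still_free = free_model_ids(sample)
```

Anything in the config but missing from the free list is a retirement candidate.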
## Deploying Ollama
Ollama uses the community Helm chart via ArgoCD:
```yaml
# apps/ollama/values.yaml (abbreviated)
ollama:
  gpu:
    enabled: true
    type: nvidia
    number: 1
  models:
    pull: []   # pulled on first request via LiteLLM
    run: []
extraEnv:
  - name: OLLAMA_KEEP_ALIVE
    value: "24h"
  - name: OLLAMA_MAX_LOADED_MODELS
    value: "1"
persistentVolume:
  enabled: true
  size: 200Gi   # 5-model shelf ≈ 55GB at rest, with experimentation room
  storageClass: longhorn
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
```

The GPU resource request and toleration ensure Ollama lands on gpu-1 — the only node with an NVIDIA GPU and the corresponding NoSchedule taint.
## Deploying LiteLLM
LiteLLM uses two ArgoCD apps — one for the Helm chart, one for the ExternalSecret manifest:
| App | Source | Purpose |
|---|---|---|
| litellm | OCI Helm chart (docker.litellm.ai/berriai/litellm-helm) | Gateway + PostgreSQL |
| litellm-extras | apps/litellm/manifests/ | ExternalSecret for API keys |
The model routing config lives in `values.yaml` under `proxy_config.model_list`. Each model entry maps an alias to a backend:
```yaml
proxy_config:
  model_list:
    - model_name: mistral-small-24b
      litellm_params:
        model: ollama/mistral-small3.2:24b
        api_base: http://ollama.ollama.svc.cluster.local:11434
    - model_name: qwen-coder-480b
      litellm_params:
        model: openrouter/qwen/qwen3-coder:free
        api_key: os.environ/OPENROUTER_API_KEY
```

LiteLLM resolves `os.environ/OPENROUTER_API_KEY` at runtime from the pod’s environment, which is injected by the ExternalSecret.
### Secrets Flow
```
Infisical (192.168.55.204)
        |
        v
ExternalSecret "litellm-api-keys" (litellm namespace)
        |  syncs: OPENROUTER_API_KEY, LITELLM_MASTER_KEY
        v
K8s Secret --> env vars in LiteLLM pod
```

No plaintext secrets in the repo. The ExternalSecret refreshes every 5 minutes.
## Gotchas
### LiteLLM Image Tags
The LiteLLM Helm chart generates an image tag from the chart version (e.g., `main-v1.81.13`). That tag does not exist on GHCR. Override it explicitly:
```yaml
image:
  repository: ghcr.io/berriai/litellm-database
  tag: main-stable
  pullPolicy: Always
```

### LoadBalancer IP Pinning
The LiteLLM chart does not expose a `service.loadBalancerIP` field. Use a Cilium annotation instead:
```yaml
service:
  type: LoadBalancer
  annotations:
    lbipam.cilium.io/ips: "192.168.55.206"
```

### Free Model Churn
During deployment, four of the six originally selected cloud models had already been retired from OpenRouter’s free tier. The models that replaced them were verified against the live API (`/api/v1/models`) rather than the marketing page. Trust the API, not the website.
### Multi-tenancy
LiteLLM has built-in virtual key management. Each consumer gets its own key with optional per-key budgets and rate limits. When multi-tenancy via vCluster arrives in a future layer, tenant isolation is a configuration concern — not an architectural change.
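Provisioning a consumer is one authenticated call to the gateway's key-management API. A hedged sketch of the request an admin script would build (targets LiteLLM's `/key/generate` endpoint; the budget and rate-limit field names should be verified against the deployed LiteLLM version, and the master-key placeholder comes from the secrets flow above):

```python
import json

def key_generate_request(models: list[str], max_budget_usd: float,
                         rpm: int) -> dict:
    """Build a /key/generate call that scopes a new virtual key to
    specific model aliases with a budget and a rate limit."""
    return {
        "url": "http://192.168.55.206:4000/key/generate",
        "headers": {
            "Authorization": "Bearer <LITELLM_MASTER_KEY>",  # placeholder
            "Content-Type": "application/json",
        },
        "body": json.dumps({
            "models": models,
            "max_budget": max_budget_usd,
            "rpm_limit": rpm,
        }),
    }

# A key for Paperless-ngx that can only hit the OCR-oriented vision model:
req = key_generate_request(["qwen-vl-7b"], max_budget_usd=5.0, rpm=30)
```

Each consumer then gets its own key, so revoking or re-budgeting one tool never disturbs the others.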
## Refresh — May 2026
The original lineup (`qwen3.5:9b`, `deepseek-coder:6.7b`, plus a six-model OpenRouter shelf) was assembled before the GPU Operator fix landed. Two things changed since:
- The card had 16GB all along. The first version of this post said “12GB” because that is the spec for the non-Ti RTX 5070 — but gpu-1 actually runs the Ti variant with 16GB GDDR7. The 4GB extra unlocked the 24B class at Q4 (Mistral Small 3.2) and let the coder model jump to Q6 quantization.
- The OpenRouter free tier had churned. Mistral Small 3.1 and Step Flash were no longer free; Qwen3-Next, Nemotron-VL, Nemotron-Omni, and Gemma 4 had appeared. The `/update-openrouter-models` skill confirmed the live list against `/api/v1/models`.
The replacement strategy:
- Move Mistral Small 24B from cloud to local — the 16GB card can run it.
- Add two local multimodal models (Gemma 3 12B for general vision, Qwen2.5-VL 7B at Q8 for OCR), removing the old text-only constraint.
- Replace the aging cloud shelf with three multimodal options (Gemma 4 31B, Nemotron-VL, Nemotron-Omni) plus a stronger reasoning option (Qwen3-Next 80B).
- Drop `deepseek-coder:6.7b` and `omnicoder:9b` in favor of one stronger `qwen2.5-coder:14b` at Q6.
- Bump the Ollama PVC from 30Gi to 200Gi — the new lineup occupies ~55GB at rest, and disk is no longer scarce.
Aliases changed in this refresh — consumers using the old names (`qwen3.5`, `deepseek-coder`, `mistral-small`, `gemma-27b`, `llama-70b`, `step-flash`) need to update to the new ones (`mistral-small-24b`, `qwen-coder-14b`, `gemma-31b`, etc.). The data-policy comments per model are kept and re-verified against each provider’s current terms.
## What’s Next
Any consumer on the network can use 192.168.55.206:4000 today — local GPU models, multimodal vision (local + cloud), and frontier-scale reasoning (cloud) are all operational behind one OpenAI-compatible endpoint.
