ADR-022 — On-Phone LocalLLM Server (OpenAI-Compatible)¶
Status: Superseded by ADR-024 (2026-05-28) Date: 2026-05-12 Relates to: ADR-017, ADR-019, ADR-021, ADR-024
2026-05-28 — Superseded. The three-way
PITWALL_ADK_BACKENDselector introduced here was removed post-Sonoma. LocalLLM (theopenaibranch below) is now the sole ADK transport, andgoogle-adk+litellmare base dependencies ofapps/edge-daemonrather than optional extras. See ADR-024 for the consolidation rationale. The history below documents the multi-backend design as it shipped between 2026-05-12 and 2026-05-28.
Context¶
ADR-017 mandated a fully on-device LLM stack. ADR-019–021 implemented the
paddock tier on ADK with a single, hard-coded model client:
Gemini(base_url="http://localhost:8001", model="gemma-4-e4b") against a
separately-launched lit serve Python process. The earlier ADK architecture
doc went further and codified this as a constraint: "Native LiteRT-LM — no
Ollama or LiteLLM."
That setup was workable on a laptop but awkward on the actual deployment
target. On a Pixel 10, lit serve had to run as a second Termux foreground
service alongside the bridge — two wake locks, two processes, two startup
ordering bugs, two things to debug at 4 a.m. before a track day. It also
forced the LLM runtime to live inside the same Termux sandbox as the bridge,
which has no GPU/NPU delegate access on most Android builds.
Meanwhile, a sibling project shipped: LocalLLM — an Apache-2.0 Android APK (github.com/mlnomadpy/localllm) that:
- runs LiteRT-LM natively in an Android app process (
com.google.ai.edge.litertlm:litertlm-android:0.11.0), - exposes an OpenAI-compatible HTTP server on
POST /v1/chat/completionsat port:8099, - accepts
.litertlmmodel bundles from thelitert-communityHuggingFace collection (Gemma 4 family), - supports SSE streaming, signed-bearer-token auth, and
- uses LiteRT's AUTO delegate (GPU → CPU fallback) for acceleration.
LocalLLM is a normal APK with a Catalog / Chat / Dashboard / Settings UI.
A driver installs it once, downloads a model from the in-app catalog, and
the server autostarts. From Pitwall's perspective, there is now a stable,
authenticated, OpenAI-shaped HTTP endpoint at 127.0.0.1:8099/v1 on the
same phone — without Pitwall having to host the model itself.
That changes the primary deployment story.
Decision¶
Adopt LocalLLM as the default on-device LLM server for every pitwall LLM
request — both the warm path (LitertCoach.brief() / debrief()) and the
paddock ADK tier. Refactor the paddock model wiring into a three-way
backend selector chosen by PITWALL_ADK_BACKEND and have LitertCoach
honour the same PITWALL_ADK_OPENAI_URL env (renamed 2026-05; the legacy
PITWALL_LITERT_URL is still accepted with a DeprecationWarning). The
defaults are flipped — fresh installs talk to LocalLLM with no env vars set.
PITWALL_ADK_BACKEND |
Transport | Server | Client class | Used for |
|---|---|---|---|---|
openai (default) |
HTTP → 127.0.0.1 |
LocalLLM APK (:8099/v1) |
LiteLlm(api_base=..., api_key=...) |
Pixel field deployment — primary path |
engine |
In-process (no HTTP) | (none — same process as bridge) | LitertLmModel(BaseLlm) |
Headless Termux setups that already load the engine for the warm path |
litertlm |
HTTP → lit serve |
lit serve Python process |
Gemini(base_url=..., model=...) |
Legacy / desktop dev with lit serve already running |
All three speak to a model on the same phone as the bridge. None of them dial out to a hosted API. The privacy guarantee from ADR-017 is unchanged.
The warm-path LitertCoach follows the same default. Its constructor reads
PITWALL_ADK_OPENAI_URL (default http://localhost:8099/v1; legacy alias
PITWALL_LITERT_URL still honoured) and routes
_generate() over HTTP via stdlib urllib.request. Setting the env to an
empty string opts back into the in-process litert_lm.Engine.
Why LocalLLM specifically¶
- Native Android process, native delegates. LocalLLM runs as a regular Android app and can use LiteRT's GPU delegate on a Pixel's Tensor G5 NPU via the AUTO backend. Termux processes typically cannot.
- Stable HTTP contract. OpenAI's
chat.completionsshape is a widely supported, well-tested protocol surface — ADK'sLiteLlmwrapper speaks it natively, as do dozens of other ecosystems. - No Termux co-tenancy. Crashes in the model runtime don't bring down the bridge; the bridge can reconnect via HTTP. Compared to in-process inference, this is a clean process boundary.
- First-class model catalogue. A driver doesn't need to know about
huggingface-clior.litertlmpaths — they pick a model from the in-app catalog and the server autostarts. - Signed-bearer-token auth. LocalLLM supports keyed access so a
malicious app on the same device can't trivially hit the server.
Pitwall passes the token via
PITWALL_ADK_OPENAI_API_KEY(legacy:PITWALL_LITERT_API_KEY).
Configuration surface¶
| Variable | Default | Used by |
|---|---|---|
PITWALL_ADK_BACKEND |
openai |
paddock selector (engine | litertlm | openai) |
PITWALL_ADK_OPENAI_URL |
http://localhost:8099/v1 |
both warm and paddock HTTP base; empty string ⇒ in-process. Legacy: PITWALL_LITERT_URL |
PITWALL_ADK_OPENAI_MODEL |
gemma3n-e2b |
model id (must match LocalLLM's loaded model). Legacy: PITWALL_LITERT_MODEL |
PITWALL_ADK_OPENAI_API_KEY |
lit-serve-not-required |
LocalLLM bearer token. Legacy: PITWALL_LITERT_API_KEY |
PITWALL_LITERT_SIDECAR_URL |
http://127.0.0.1:8080 |
LiteRT-LM Kotlin sidecar URL. Legacy: PITWALL_LITERTLM_URL |
PITWALL_LITERT_SIDECAR_MODEL |
gemma-4-e2b |
LiteRT-LM Kotlin sidecar model id. Legacy: PITWALL_LITERTLM_MODEL |
PITWALL_LITERTLM_PATH |
(unset) | engine (.litertlm bundle path) |
PITWALL_LITERTLM_BUDGET |
30000 |
engine (KV-cache char budget) |
PITWALL_LITERT_HTTP_TIMEOUT_S |
30 |
warm-path HTTP client timeout |
Env vars renamed 2026-05 (this ADR amended). The
PITWALL_LITERT_*family was easy to confuse withPITWALL_LITERTLM_*(one letter apart, two completely different things). The new names —PITWALL_ADK_OPENAI_*for the ADK→OpenAI-compatible HTTP shim andPITWALL_LITERT_SIDECAR_*for the Kotlin LiteRT-LM sidecar — are self-describing. All legacy names continue to work and emit aDeprecationWarningon first read; the fallback lives insrc/pitwall/_env.py:get_env_with_legacy.Default flipped 2026-05-12. Fresh installs of pitwall talk to LocalLLM with zero env vars set. To restore the previous
lit servebehaviour:PITWALL_ADK_BACKEND=litertlm PITWALL_ADK_OPENAI_URL=http://localhost:8001.
Implementation¶
Paddock (src/pitwall/features/coaching/adk_agents.py) branches at
module-load:
_BACKEND = os.getenv("PITWALL_ADK_BACKEND", "openai").lower()
_MODEL_ID = get_env_with_legacy(
"PITWALL_ADK_OPENAI_MODEL", "PITWALL_LITERT_MODEL", "gemma3n-e2b")
_MODEL_URL = get_env_with_legacy(
"PITWALL_ADK_OPENAI_URL", "PITWALL_LITERT_URL",
"http://localhost:8099/v1")
if _BACKEND == "engine":
_model = LitertLmModel(model=_MODEL_ID) # in-process
elif _BACKEND == "openai": # default
_model = LiteLlm( # OpenAI-compatible HTTP
model=_MODEL_ID,
api_base=_MODEL_URL, # → LocalLLM at :8099/v1
api_key=get_env_with_legacy(
"PITWALL_ADK_OPENAI_API_KEY", "PITWALL_LITERT_API_KEY",
"lit-serve-not-required"),
)
else: # legacy: lit serve
_model = Gemini(model=_MODEL_ID, base_url=_MODEL_URL)
Warm path (src/pitwall/features/coaching/coach_engine.py:LitertCoach)
honours the same env. Its constructor reads PITWALL_ADK_OPENAI_URL
(legacy: PITWALL_LITERT_URL); if non-empty (true by default),
_generate() POSTs to LocalLLM's
/chat/completions via stdlib urllib.request — no new dependency. The
in-process litert_lm.Engine path is reached only when the env is
explicitly set to an empty string.
# coach_engine.py — LitertCoach.__init__
http_url = (get_env_with_legacy(
"PITWALL_ADK_OPENAI_URL", "PITWALL_LITERT_URL",
self.DEFAULT_HTTP_URL) or "").strip()
if http_url:
self._http_url = http_url.rstrip("/") # → :8099/v1
self._http_model = get_env_with_legacy(
"PITWALL_ADK_OPENAI_MODEL", "PITWALL_LITERT_MODEL",
self.DEFAULT_HTTP_MODEL)
self._http_api_key = get_env_with_legacy(
"PITWALL_ADK_OPENAI_API_KEY", "PITWALL_LITERT_API_KEY",
"lit-serve-not-required")
self._llm = "http" # truthy sentinel
return
# else: lazy in-process engine load (legacy path)
LiteLlm ships with google-adk[litellm]. It is an optional install — the
import is wrapped in a HAS_LITELLM flag and the openai branch raises a
clear RuntimeError directing the user to install the extra if it's missing.
The warm-path HTTP client uses only stdlib (urllib.request), so it works
out of the box.
What does not change¶
- Hot path (
RuleCoach+ canonical phrases) is untouched — ADK still never touches the < 100 ms tier (ADR-017). - Warm path (
LitertCoach, in-process Gemma 4 E2B vialitert_lm.Engine) is untouched. The backend selector only governs the paddock ADK tier. - All 18 agents, 15 tools, KV-cache reuse strategy, and the
agent_tracesDuckDB schema (ADR-021) are byte-identical across backends. - Privacy guarantee: every supported backend is local. The bridge still
binds to
127.0.0.1. No cloud round-trip is introduced.
Deployment story (recommended Pixel setup)¶
┌─────────────────────── Pixel 10 ───────────────────────┐
│ │
│ ┌──────────────────────┐ │
│ │ LocalLLM (APK) │ downloads .litertlm from │
│ │ ├─ Catalog UI │ the in-app catalog; │
│ │ ├─ LiteRT-LM 0.11 │ AUTO delegate uses GPU on │
│ │ └─ HTTP :8099 │ Tensor G5 when available │
│ └──────────┬───────────┘ │
│ │ │
│ │ 127.0.0.1:8099/v1/chat/completions │
│ │ Bearer <signed-token> │
│ ▼ │
│ ┌──────────────────────┐ │
│ │ Pitwall bridge │ │
│ │ (Termux foreground) │ │
│ │ PITWALL_ADK_BACKEND │ │
│ │ = openai │ │
│ └──────────────────────┘ │
└─────────────────────────────────────────────────────────┘
Two apps, one phone, one localhost hop, zero cloud.
Consequences¶
Positive:
- Single primary deployment story on Pixel: install LocalLLM APK + Termux bridge; no second Python process to babysit.
- LocalLLM owns model lifecycle (download, switch, unload) via its native UI — drivers don't shell into Termux to swap models.
- Clean process boundary: a model-runtime crash no longer takes the bridge down.
- GPU/NPU access via LiteRT's AUTO delegate, which Termux-hosted runtimes generally can't reach.
- Same
openaicode path works on dev machines pointed at Ollama / LM Studio / llama.cpp / vLLM — one path covers both production and dev.
Negative:
- Two APKs on the phone instead of one. Mitigated: LocalLLM is a published open-source APK with its own release cadence and UI; this is a cleaner separation than embedding the model runtime in the bridge.
- The
openaipath adds an HTTP hop theenginepath doesn't have. The hop is127.0.0.1and measured overhead is sub-millisecond; the paddock tier already operates at 2–15 s latencies, so it's noise. LiteLlm(litellm) is a new optional dependency. Mitigated by gating behindHAS_LITELLMand only requiring it whenPITWALL_ADK_BACKEND=openai.
Neutral:
- Default behaviour (
litertlm→lit serve) is preserved bit-for-bit. Existing deployments need to change nothing. New deployments are explicitly directed to setPITWALL_ADK_BACKEND=openaiand point at LocalLLM.
Validation¶
- ADK tests pass against all three backends.
- Smoke test:
PITWALL_ADK_BACKEND=openai PITWALL_ADK_OPENAI_URL=http://localhost:8099/v1 PITWALL_ADK_OPENAI_MODEL=gemma-4-e2b-it PITWALL_ADK_OPENAI_API_KEY=<token>round-trips throughPitwallOrchestratoragainst a LocalLLM instance loaded with a Gemma 4 E2B.litertlm. (LegacyPITWALL_LITERT_*names still work for one cycle.) - The same wiring also passes against an Ollama instance (
:11434/v1) on a dev macOS box — confirms theopenaibackend is portable across OpenAI-compatible servers. - The constraint "Native LiteRT-LM — no Ollama or LiteLLM" in
docs/adk-agent-architecture.mdis removed and replaced with the backend-selector matrix above.
References¶
- LocalLLM website: https://www.tahabouhsine.com/localllm/
- LocalLLM repo: https://github.com/mlnomadpy/localllm
- ADK
LiteLlmmodel wrapper: https://google.github.io/adk-docs/agents/models/litellm/ - LiteRT-LM Android:
com.google.ai.edge.litertlm:litertlm-android:0.11.0 litert-communitymodel collection: https://huggingface.co/litert-community