Skip to content

ADR-022 — On-Phone LocalLLM Server (OpenAI-Compatible)

Status: Accepted Date: 2026-05-12 Relates to: ADR-017, ADR-019, ADR-021


Context

ADR-017 mandated a fully on-device LLM stack. ADR-019–021 implemented the paddock tier on ADK with a single, hard-coded model client: Gemini(base_url="http://localhost:8001", model="gemma-4-e4b") against a separately-launched lit serve Python process. The earlier ADK architecture doc went further and codified this as a constraint: "Native LiteRT-LM — no Ollama or LiteLLM."

That setup was workable on a laptop but awkward on the actual deployment target. On a Pixel 10, lit serve had to run as a second Termux foreground service alongside the bridge — two wake locks, two processes, two startup ordering bugs, two things to debug at 4 a.m. before a track day. It also forced the LLM runtime to live inside the same Termux sandbox as the bridge, which has no GPU/NPU delegate access on most Android builds.

Meanwhile, a sibling project shipped: LocalLLM — an Apache-2.0 Android APK (github.com/mlnomadpy/localllm) that:

  • runs LiteRT-LM natively in an Android app process (com.google.ai.edge.litertlm:litertlm-android:0.11.0),
  • exposes an OpenAI-compatible HTTP server on POST /v1/chat/completions at port :8099,
  • accepts .litertlm model bundles from the litert-community HuggingFace collection (Gemma 4 family),
  • supports SSE streaming, signed-bearer-token auth, and
  • uses LiteRT's AUTO delegate (GPU → CPU fallback) for acceleration.

LocalLLM is a normal APK with a Catalog / Chat / Dashboard / Settings UI. A driver installs it once, downloads a model from the in-app catalog, and the server autostarts. From Pitwall's perspective, there is now a stable, authenticated, OpenAI-shaped HTTP endpoint at 127.0.0.1:8099/v1 on the same phone — without Pitwall having to host the model itself.

That changes the primary deployment story.


Decision

Adopt LocalLLM as the default on-device LLM server for every pitwall LLM request — both the warm path (LitertCoach.brief() / debrief()) and the paddock ADK tier. Refactor the paddock model wiring into a three-way backend selector chosen by PITWALL_ADK_BACKEND and have LitertCoach honour the same PITWALL_ADK_OPENAI_URL env (renamed 2026-05; the legacy PITWALL_LITERT_URL is still accepted with a DeprecationWarning). The defaults are flipped — fresh installs talk to LocalLLM with no env vars set.

PITWALL_ADK_BACKEND Transport Server Client class Used for
openai (default) HTTP → 127.0.0.1 LocalLLM APK (:8099/v1) LiteLlm(api_base=..., api_key=...) Pixel field deployment — primary path
engine In-process (no HTTP) (none — same process as bridge) LitertLmModel(BaseLlm) Headless Termux setups that already load the engine for the warm path
litertlm HTTP → lit serve lit serve Python process Gemini(base_url=..., model=...) Legacy / desktop dev with lit serve already running

All three speak to a model on the same phone as the bridge. None of them dial out to a hosted API. The privacy guarantee from ADR-017 is unchanged.

The warm-path LitertCoach follows the same default. Its constructor reads PITWALL_ADK_OPENAI_URL (default http://localhost:8099/v1; legacy alias PITWALL_LITERT_URL still honoured) and routes _generate() over HTTP via stdlib urllib.request. Setting the env to an empty string opts back into the in-process litert_lm.Engine.

Why LocalLLM specifically

  • Native Android process, native delegates. LocalLLM runs as a regular Android app and can use LiteRT's GPU delegate on a Pixel's Tensor G5 NPU via the AUTO backend. Termux processes typically cannot.
  • Stable HTTP contract. OpenAI's chat.completions shape is a widely supported, well-tested protocol surface — ADK's LiteLlm wrapper speaks it natively, as do dozens of other ecosystems.
  • No Termux co-tenancy. Crashes in the model runtime don't bring down the bridge; the bridge can reconnect via HTTP. Compared to in-process inference, this is a clean process boundary.
  • First-class model catalogue. A driver doesn't need to know about huggingface-cli or .litertlm paths — they pick a model from the in-app catalog and the server autostarts.
  • Signed-bearer-token auth. LocalLLM supports keyed access so a malicious app on the same device can't trivially hit the server. Pitwall passes the token via PITWALL_ADK_OPENAI_API_KEY (legacy: PITWALL_LITERT_API_KEY).

Configuration surface

Variable Default Used by
PITWALL_ADK_BACKEND openai paddock selector (engine | litertlm | openai)
PITWALL_ADK_OPENAI_URL http://localhost:8099/v1 both warm and paddock HTTP base; empty string ⇒ in-process. Legacy: PITWALL_LITERT_URL
PITWALL_ADK_OPENAI_MODEL gemma3n-e2b model id (must match LocalLLM's loaded model). Legacy: PITWALL_LITERT_MODEL
PITWALL_ADK_OPENAI_API_KEY lit-serve-not-required LocalLLM bearer token. Legacy: PITWALL_LITERT_API_KEY
PITWALL_LITERT_SIDECAR_URL http://127.0.0.1:8080 LiteRT-LM Kotlin sidecar URL. Legacy: PITWALL_LITERTLM_URL
PITWALL_LITERT_SIDECAR_MODEL gemma-4-e2b LiteRT-LM Kotlin sidecar model id. Legacy: PITWALL_LITERTLM_MODEL
PITWALL_LITERTLM_PATH (unset) engine (.litertlm bundle path)
PITWALL_LITERTLM_BUDGET 30000 engine (KV-cache char budget)
PITWALL_LITERT_HTTP_TIMEOUT_S 30 warm-path HTTP client timeout

Env vars renamed 2026-05 (this ADR amended). The PITWALL_LITERT_* family was easy to confuse with PITWALL_LITERTLM_* (one letter apart, two completely different things). The new names — PITWALL_ADK_OPENAI_* for the ADK→OpenAI-compatible HTTP shim and PITWALL_LITERT_SIDECAR_* for the Kotlin LiteRT-LM sidecar — are self-describing. All legacy names continue to work and emit a DeprecationWarning on first read; the fallback lives in src/pitwall/_env.py:get_env_with_legacy.

Default flipped 2026-05-12. Fresh installs of pitwall talk to LocalLLM with zero env vars set. To restore the previous lit serve behaviour: PITWALL_ADK_BACKEND=litertlm PITWALL_ADK_OPENAI_URL=http://localhost:8001.

Implementation

Paddock (src/pitwall/features/coaching/adk_agents.py) branches at module-load:

_BACKEND  = os.getenv("PITWALL_ADK_BACKEND", "openai").lower()
_MODEL_ID = get_env_with_legacy(
    "PITWALL_ADK_OPENAI_MODEL", "PITWALL_LITERT_MODEL", "gemma3n-e2b")
_MODEL_URL = get_env_with_legacy(
    "PITWALL_ADK_OPENAI_URL", "PITWALL_LITERT_URL",
    "http://localhost:8099/v1")

if _BACKEND == "engine":
    _model = LitertLmModel(model=_MODEL_ID)               # in-process

elif _BACKEND == "openai":                                 # default
    _model = LiteLlm(                                      # OpenAI-compatible HTTP
        model=_MODEL_ID,
        api_base=_MODEL_URL,                               # → LocalLLM at :8099/v1
        api_key=get_env_with_legacy(
            "PITWALL_ADK_OPENAI_API_KEY", "PITWALL_LITERT_API_KEY",
            "lit-serve-not-required"),
    )

else:                                                      # legacy: lit serve
    _model = Gemini(model=_MODEL_ID, base_url=_MODEL_URL)

Warm path (src/pitwall/features/coaching/coach_engine.py:LitertCoach) honours the same env. Its constructor reads PITWALL_ADK_OPENAI_URL (legacy: PITWALL_LITERT_URL); if non-empty (true by default), _generate() POSTs to LocalLLM's /chat/completions via stdlib urllib.request — no new dependency. The in-process litert_lm.Engine path is reached only when the env is explicitly set to an empty string.

# coach_engine.py — LitertCoach.__init__
http_url = (get_env_with_legacy(
    "PITWALL_ADK_OPENAI_URL", "PITWALL_LITERT_URL",
    self.DEFAULT_HTTP_URL) or "").strip()
if http_url:
    self._http_url   = http_url.rstrip("/")               # → :8099/v1
    self._http_model = get_env_with_legacy(
        "PITWALL_ADK_OPENAI_MODEL", "PITWALL_LITERT_MODEL",
        self.DEFAULT_HTTP_MODEL)
    self._http_api_key = get_env_with_legacy(
        "PITWALL_ADK_OPENAI_API_KEY", "PITWALL_LITERT_API_KEY",
        "lit-serve-not-required")
    self._llm = "http"                                    # truthy sentinel
    return
# else: lazy in-process engine load (legacy path)

LiteLlm ships with google-adk[litellm]. It is an optional install — the import is wrapped in a HAS_LITELLM flag and the openai branch raises a clear RuntimeError directing the user to install the extra if it's missing. The warm-path HTTP client uses only stdlib (urllib.request), so it works out of the box.

What does not change

  • Hot path (RuleCoach + canonical phrases) is untouched — ADK still never touches the < 100 ms tier (ADR-017).
  • Warm path (LitertCoach, in-process Gemma 4 E2B via litert_lm.Engine) is untouched. The backend selector only governs the paddock ADK tier.
  • All 18 agents, 15 tools, KV-cache reuse strategy, and the agent_traces DuckDB schema (ADR-021) are byte-identical across backends.
  • Privacy guarantee: every supported backend is local. The bridge still binds to 127.0.0.1. No cloud round-trip is introduced.

Deployment story (recommended Pixel setup)

┌─────────────────────── Pixel 10 ───────────────────────┐
│                                                         │
│   ┌──────────────────────┐                              │
│   │ LocalLLM (APK)       │   downloads .litertlm from   │
│   │  ├─ Catalog UI       │   the in-app catalog;        │
│   │  ├─ LiteRT-LM 0.11   │   AUTO delegate uses GPU on  │
│   │  └─ HTTP :8099       │   Tensor G5 when available   │
│   └──────────┬───────────┘                              │
│              │                                          │
│              │  127.0.0.1:8099/v1/chat/completions      │
│              │  Bearer <signed-token>                   │
│              ▼                                          │
│   ┌──────────────────────┐                              │
│   │ Pitwall bridge       │                              │
│   │  (Termux foreground) │                              │
│   │  PITWALL_ADK_BACKEND │                              │
│   │     = openai         │                              │
│   └──────────────────────┘                              │
└─────────────────────────────────────────────────────────┘

Two apps, one phone, one localhost hop, zero cloud.


Consequences

Positive:

  • Single primary deployment story on Pixel: install LocalLLM APK + Termux bridge; no second Python process to babysit.
  • LocalLLM owns model lifecycle (download, switch, unload) via its native UI — drivers don't shell into Termux to swap models.
  • Clean process boundary: a model-runtime crash no longer takes the bridge down.
  • GPU/NPU access via LiteRT's AUTO delegate, which Termux-hosted runtimes generally can't reach.
  • Same openai code path works on dev machines pointed at Ollama / LM Studio / llama.cpp / vLLM — one path covers both production and dev.

Negative:

  • Two APKs on the phone instead of one. Mitigated: LocalLLM is a published open-source APK with its own release cadence and UI; this is a cleaner separation than embedding the model runtime in the bridge.
  • The openai path adds an HTTP hop the engine path doesn't have. The hop is 127.0.0.1 and measured overhead is sub-millisecond; the paddock tier already operates at 2–15 s latencies, so it's noise.
  • LiteLlm (litellm) is a new optional dependency. Mitigated by gating behind HAS_LITELLM and only requiring it when PITWALL_ADK_BACKEND=openai.

Neutral:

  • Default behaviour (litertlmlit serve) is preserved bit-for-bit. Existing deployments need to change nothing. New deployments are explicitly directed to set PITWALL_ADK_BACKEND=openai and point at LocalLLM.

Validation

  • ADK tests pass against all three backends.
  • Smoke test: PITWALL_ADK_BACKEND=openai PITWALL_ADK_OPENAI_URL=http://localhost:8099/v1 PITWALL_ADK_OPENAI_MODEL=gemma-4-e2b-it PITWALL_ADK_OPENAI_API_KEY=<token> round-trips through PitwallOrchestrator against a LocalLLM instance loaded with a Gemma 4 E2B .litertlm. (Legacy PITWALL_LITERT_* names still work for one cycle.)
  • The same wiring also passes against an Ollama instance (:11434/v1) on a dev macOS box — confirms the openai backend is portable across OpenAI-compatible servers.
  • The constraint "Native LiteRT-LM — no Ollama or LiteLLM" in docs/adk-agent-architecture.md is removed and replaced with the backend-selector matrix above.

References