HTTP API¶

Everything except /health and /metrics is gated by the optional API key configured in Settings → API key. When the key is empty (the default), all endpoints are open.

Path	Auth	Notes
`GET /health`	always open	Liveness + cached-engine attempts + AICore status block.
`POST /health/warm`	optional bearer	Force-warm an engine; waits for first-token-ready (60s timeout).
`GET /v1/models`	optional bearer	Lists `.litertlm` files, ONNX embedding models, and the virtual `gemini-nano-aicore` entry.
`POST /v1/chat/completions`	optional bearer	OpenAI-style. Routes to AICore for `gemini-nano-aicore`, otherwise to LiteRT-LM.
`POST /v1/embeddings`	optional bearer	ONNX-backed embeddings. Single or batched `input`.
`POST /v1/documents` · `GET /v1/documents` · `DELETE /v1/documents/{id}`	optional bearer	RAG document store (ObjectBox HNSW, dim 384).
`POST /v1/search`	optional bearer	Top-K vector search over the document store.
`GET /v1/tenants` · `DELETE /v1/tenants/{tenantId}`	optional bearer	Tenant isolation.
`GET /v1/aicore/status`	optional bearer	Detailed Gemini Nano readiness probe. `?probe=all` for per-config breakdown.
`POST/GET /v1/aicore/benchmark`	optional bearer	TTFT + tokens/sec + total-ms speed test.
`GET /metrics`	always open	Prometheus exposition (counters + gauges, per-engine + per-client).

Response headers¶

POST /v1/chat/completions emits these headers before the first SSE byte, so streaming clients can show queue position up front:

Header	Meaning
`X-Request-Id`	Unique 64-bit counter assigned by `RequestTracker`.
`X-Client-Id`	Resolved from `User-Agent` (or `"anonymous"`).
`X-Queue-Position`	1-indexed position at admission.
`X-Queue-Depth`	Total queued at admission.
`X-Estimated-Wait-Ms`	`(position - 1) × avg_latency_ms` from recent history.
`Retry-After`	Seconds to wait — set on `429`.
`X-RateLimit-Client`	Client UA that hit the bucket — set on `429` from rate limiter.

RAG routes also emit X-Tenant-Id, so clients can confirm which tenant was resolved from their X-Client-Id / User-Agent headers.
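As a client-side illustration of the header table above, the following sketch reads the admission headers from a response and reproduces the documented wait formula. The names `QueueInfo`, `parse_queue_headers` and `estimate_wait_ms` are hypothetical, and the fallback values used when a header is missing are assumptions rather than server behaviour.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class QueueInfo:
    request_id: str
    position: int
    depth: int
    estimated_wait_ms: Optional[int]


def parse_queue_headers(headers: Mapping[str, str]) -> QueueInfo:
    """Extract the admission headers from a chat-completions response."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    h = {k.lower(): v for k, v in headers.items()}
    wait = h.get("x-estimated-wait-ms")
    return QueueInfo(
        request_id=h.get("x-request-id", ""),
        # Assumed fallbacks: position 1 and depth 0 when a header is absent.
        position=int(h.get("x-queue-position", "1")),
        depth=int(h.get("x-queue-depth", "0")),
        estimated_wait_ms=int(float(wait)) if wait is not None else None,
    )


def estimate_wait_ms(position: int, avg_latency_ms: float) -> float:
    """Documented formula: (position - 1) * avg_latency_ms."""
    return max(position - 1, 0) * avg_latency_ms
```

A request admitted at position 3 with a 3000 ms average latency would therefore be expected to wait about 6000 ms.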

`GET /health`¶

Never gated. Use this for liveness / readiness probes.

curl -s http://localhost:8080/health

{
  "status": "ok",
  "service": "localllm-android",
  "version": "1.0",
  "queue_depth": 0,
  "engines_loaded": 1,
  "engines": [
    {
      "key": "gemma-4-e2b_model_LITERT_CPU",
      "backend": "LITERT_CPU",
      "attempts": [
        {"backend": "NPU-primer", "result": "expected-fail: no vendor delegate", "duration_ms": 312},
        {"backend": "LITERT_CPU", "result": "ok", "duration_ms": 3168}
      ]
    }
  ]
}

queue_depth — requests currently queued behind the inference mutex.
engines_loaded — LiteRT engines in the LRU cache. AICore requests do not appear here; they run inside the AICore system service.
engines[].key — engine cache key in the shape <model>_<maxTokens|"model">_<backend>.
engines[].backend — the backend declared for this model in the catalog (LITERT_CPU, LITERT_GPU, or LITERT_NPU). Each engine records the single init attempt that built it; no AUTO chain.
engines[].attempts — the init record for the declared backend with result (ok / failed: …) and duration_ms. Single entry now — the AUTO chain was removed.
aicore — readiness of Gemini Nano: {status_code, status, model_id, is_default: true}. On a device where the AICore probe throws (e.g. service not installed), status_code is null and error carries the SDK message.
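A probe script might reduce the payload above to a yes/no answer plus a list of failed init attempts. The sketch below is one way to do that; `is_ready` and `failed_attempts` are hypothetical helpers, and treating "status ok with at least one engine loaded" as ready is an assumption that would not suit an AICore-only setup, since AICore requests never appear in `engines`.

```python
from typing import Any, Dict, List


def failed_attempts(health: Dict[str, Any]) -> List[str]:
    """List every init attempt whose result is anything other than 'ok'."""
    problems = []
    for engine in health.get("engines", []):
        for attempt in engine.get("attempts", []):
            if attempt.get("result") != "ok":
                problems.append(
                    f"{engine.get('key')}: {attempt.get('backend')} -> {attempt.get('result')}"
                )
    return problems


def is_ready(health: Dict[str, Any]) -> bool:
    """Assumed policy: status is 'ok' and at least one LiteRT engine is cached."""
    return health.get("status") == "ok" and health.get("engines_loaded", 0) >= 1
```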

`POST /health/warm`¶

Force-warm an engine and wait until it's ready to emit the first token. Useful before showing a "ready" UI in a sibling app: the endpoint blocks for up to 60 seconds while LiteRT-LM links the JNI runtime or AICore triggers a model download.

curl -s -X POST "http://localhost:8080/health/warm?model=gemma-4-e2b" \
  -H "Authorization: Bearer $LLM_KEY"

Query parameters:

model (optional) — model id to warm. Defaults to Settings.selectedModelId (gemini-nano-aicore).

Responses:

Status	Body
`200`	`{"model": "...", "status": "warm", "engine_loaded": true, "ms": 1500}`
`503`	`{"status": "aicore_not_ready", "aicore_status": "downloadable", "ms": …}`
`504`	`{"status": "timeout", "ms": 60000}`
`503`	`{"status": "error", "error": "...", "ms": …}`

The warm-up uses its own mutex — it does not block concurrent chat requests. For LiteRT models the engine runs a 1-token generation to ensure the JNI side-effects have settled (Tensor SoCs need this); for AICore the readiness probe runs.
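A sibling app could fold the response table above into a single decision. This is a sketch only: `warm_outcome` and the action labels it returns are hypothetical, and the mapping from status to action is a client-side assumption.

```python
from typing import Any, Dict


def warm_outcome(http_status: int, body: Dict[str, Any]) -> str:
    """Map a /health/warm response to a coarse client action."""
    status = body.get("status")
    if http_status == 200 and status == "warm":
        return "ready"
    if http_status == 503 and status == "aicore_not_ready":
        # aicore_status says what is missing, e.g. "downloadable".
        return f"aicore:{body.get('aicore_status', 'unknown')}"
    if http_status == 504 or status == "timeout":
        return "retry"
    return "error"
```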

`GET /v1/models`¶

Lists .litertlm LLMs on disk, ONNX embedding models (with a sibling *-vocab.txt), and the always-present virtual AICore entry.

curl -s -H "Authorization: Bearer $LLM_KEY" \
  http://localhost:8080/v1/models

{
  "object": "list",
  "data": [
    { "id": "gemma-4-e2b",         "object": "model", "created": 1778610084, "owned_by": "local" },
    { "id": "gemini-nano-aicore",  "object": "model", "created": 1778610090, "owned_by": "google-aicore" }
  ]
}

For LiteRT-LM entries, id is the filename with .litertlm stripped. The gemini-nano-aicore entry is listed unconditionally so clients can probe; the chat handler surfaces a clean error if AICore isn't installed or the model isn't downloaded yet.
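The naming rule above is simple enough to mirror on the client, for example to predict the id of a file about to be pushed to the device. Both helpers below are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Any, Dict


def model_id_from_filename(filename: str) -> str:
    """Strip the directory and the .litertlm suffix, as described above."""
    name = PurePosixPath(filename).name
    suffix = ".litertlm"
    return name[: -len(suffix)] if name.endswith(suffix) else name


def has_model(listing: Dict[str, Any], model_id: str) -> bool:
    """Check whether a GET /v1/models payload contains the given id."""
    return any(entry.get("id") == model_id for entry in listing.get("data", []))
```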

`POST /v1/chat/completions`¶

OpenAI-compatible chat completion. Streaming and blocking.

Routing. When model == "gemini-nano-aicore", the request bypasses the LiteRT engine cache and the inference mutex, because AICore runs inside the system service and handles its own serialization. Every other model id is resolved against .litertlm files on disk and goes through the LiteRT-LM path. The two paths differ in five places that clients should know about:

	AICore (`gemini-nano-aicore`)	LiteRT-LM (e.g. `gemma-4-e2b`)
Backend selection	Decided by AICore. No surface.	Declared per-model in the catalog (`Backend.LITERT_CPU` / `_GPU` / `_NPU`). No fallback chain.
`session_id` (KV reuse)	Ignored. Stateless. History is flattened into one prompt.	Honored — see multi-turn.
App lifecycle constraint	Foreground only. Backgrounded calls fail with ErrorCode 30.	None.
`tools` / `tool_choice`	Not supported by the SDK.	Honored.
Multimodal `content` parts	Text only.	Text + `image_url` parts.

Request¶

Field	Type	Default	Notes
`model`	string	required	An `id` from `GET /v1/models`.
`messages`	array	required	Each item is `{role, content}` where `role` is `"system"`, `"user"`, or `"assistant"`. The first system message becomes `ConversationConfig.systemInstruction`; remaining messages are fed as conversation history. The last message must be `user`.
`stream`	bool	`false`	Server-Sent Events when true.
`session_id`	string	`null`	Stable opaque ID for KV-cache reuse across turns. See multi-turn.
`temperature`	float	server default	Per-request sampler temperature.
`top_k`	int	server default	Per-request sampler top-k.
`max_tokens`	int	model default	Per-request total-token budget (input + output). Omit to use the model's compiled budget — overriding it down can trigger `DYNAMIC_UPDATE_SLICE` shape mismatches on big-context Gemma 4 weights.
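The field table above translates into a small request builder. The sketch below checks the two constraints the table states (known roles, last message from `user`) before anything is sent, and leaves optional fields out of the body when they are not set so that server and model defaults apply. `build_chat_request` is a hypothetical helper, not part of the server.

```python
from typing import Any, Dict, List, Optional

VALID_ROLES = {"system", "user", "assistant"}


def build_chat_request(
    model: str,
    messages: List[Dict[str, Any]],
    *,
    stream: bool = False,
    session_id: Optional[str] = None,
    temperature: Optional[float] = None,
    top_k: Optional[int] = None,
    max_tokens: Optional[int] = None,
) -> Dict[str, Any]:
    """Build a /v1/chat/completions body, validating the documented rules."""
    if not messages:
        raise ValueError("messages must not be empty")
    for message in messages:
        if message.get("role") not in VALID_ROLES:
            raise ValueError(f"unknown role: {message.get('role')!r}")
    if messages[-1]["role"] != "user":
        raise ValueError("the last message must have role 'user'")

    body: Dict[str, Any] = {"model": model, "messages": messages, "stream": stream}
    optional = {
        "session_id": session_id,
        "temperature": temperature,
        "top_k": top_k,
        "max_tokens": max_tokens,
    }
    # Omit unset fields so the server and model defaults are used.
    body.update({k: v for k, v in optional.items() if v is not None})
    return body
```

Leaving `max_tokens` unset by default follows the note in the table about keeping the model's compiled budget.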

Blocking response¶

{
  "id": "chatcmpl-1",
  "object": "chat.completion",
  "created": 1778611764,
  "model": "gemma-4-e2b",
  "choices": [
    {
      "index": 0,
      "message": { "role": "assistant", "content": "Hi! How can I help?" },
      "finish_reason": "stop"
    }
  ]
}

Streaming response¶

SSE frames are emitted as the model generates tokens. The first frame contains delta.role = "assistant"; subsequent frames carry delta.content; the final frame carries finish_reason: "stop". The stream terminates with data: [DONE].

data: {"id":"chatcmpl-2","object":"chat.completion.chunk","created":1778610331,"model":"gemma-4-e2b","choices":[{"delta":{"role":"assistant"},"index":0}]}

data: {"id":"chatcmpl-2","object":"chat.completion.chunk","created":1778610333,"model":"gemma-4-e2b","choices":[{"delta":{"content":"Hi"},"index":0}]}

data: {"id":"chatcmpl-2","object":"chat.completion.chunk","created":1778610334,"model":"gemma-4-e2b","choices":[{"delta":{"content":"!"},"index":0}]}

data: {"id":"chatcmpl-2","object":"chat.completion.chunk","created":1778610335,"model":"gemma-4-e2b","choices":[{"delta":{},"finish_reason":"stop","index":0}]}

data: [DONE]

A heartbeat comment (: ka\n\n) is emitted every 10 seconds so intermediaries don't drop the connection during long prefills.
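A client consuming this stream has to skip blank separator lines and comment lines such as the heartbeat, stop at the `[DONE]` sentinel, and concatenate the `delta.content` pieces. The sketch below does that over any iterable of text lines (for example the line iterator of an HTTP client); `iter_sse_events` and `collect_text` are hypothetical names. It also raises if a frame carries an `error` object, which is how this API reports mid-stream failures.

```python
import json
from typing import Iterable, Iterator, Optional, Tuple


def iter_sse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield decoded JSON payloads from SSE lines until the [DONE] sentinel."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        # Blank lines separate events; lines starting with ':' are comments,
        # which covers the ': ka' heartbeat.
        if not line or line.startswith(":"):
            continue
        if not line.startswith("data:"):
            continue
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return
        yield json.loads(data)


def collect_text(lines: Iterable[str]) -> Tuple[str, Optional[str]]:
    """Concatenate delta.content pieces and return (text, finish_reason)."""
    parts = []
    finish: Optional[str] = None
    for event in iter_sse_events(lines):
        if "error" in event:
            raise RuntimeError(event["error"].get("message", "stream error"))
        for choice in event.get("choices", []):
            parts.append(choice.get("delta", {}).get("content", ""))
            finish = choice.get("finish_reason") or finish
    return "".join(parts), finish
```

Fed the four frames shown above, `collect_text` returns `("Hi!", "stop")`.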

Error responses¶

Code	Shape	Meaning
`400`	`ErrorResponse`	Malformed JSON body.
`401`	`ErrorResponse`	Missing / invalid bearer token.
`408`	`ErrorResponse`	Request timeout (configurable in Settings).
`413`	`ErrorResponse`	Prompt or body exceeds the configured cap.
`429`	`ErrorResponse` + `Retry-After: 5`	Queue full (default 8 in flight).
`500`	`ErrorResponse`	Server / engine error.

ErrorResponse schema:

{ "error": { "message": "Inference timeout", "type": "timeout", "code": 408 } }

Errors mid-stream are delivered as a final SSE chunk + the [DONE] sentinel — never as a silently-closed connection:

data: {"error":{"message":"Inference timeout","type":"timeout","code":408}}

data: [DONE]
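For the `429` row in the table above, a client can derive its back-off from `Retry-After`. The sketch below is a hypothetical helper, and the policy it encodes (retry only on `429`, fall back to 5 seconds when the header is missing or unparsable) is an assumption, not something the server mandates.

```python
from typing import Mapping, Optional


def retry_delay_seconds(
    status: int, headers: Mapping[str, str], default: float = 5.0
) -> Optional[float]:
    """Return how long to wait before retrying, or None when not retrying."""
    if status != 429:
        return None
    h = {k.lower(): v for k, v in headers.items()}
    try:
        return max(float(h.get("retry-after", default)), 0.0)
    except ValueError:
        # Retry-After can also be an HTTP date; this sketch does not parse it.
        return default
```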

Multi-turn with `session_id`¶

Pass a stable session_id and the server caches the underlying Conversation object across calls. Follow-up requests only need to re-feed new user turns; assistant turns echoed by the client are already in the model's KV cache and are skipped.

Turn 1

POST /v1/chat/completions
{
  "model": "gemma-4-e2b",
  "session_id": "alice-2026-05-10",
  "messages": [{"role": "user", "content": "Hi"}]
}

Turn 2 — same session_id, full history echoed (OpenAI convention)

POST /v1/chat/completions
{
  "model": "gemma-4-e2b",
  "session_id": "alice-2026-05-10",
  "messages": [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello! How can I help?"},
    {"role": "user", "content": "Tell me about kernels."}
  ]
}

Server-side rules:

Empty / missing session_id → fresh Conversation every request. Default.
Non-empty session_id → cached Conversation. The server validates the replayed prefix via a stable hash of all messages[0..seenCount]. Mismatch → rebuild from scratch.
Sampling parameter change mid-session (temperature / top_k) → rebuild. Those are conversation-construction parameters in LiteRT-LM's SamplerConfig.
Cache size is 4. Oldest evicted. Conversations are also evicted together with their parent engine when the engine is unloaded.

On a 10-turn conversation, request N pays only the cost of prefilling turn N's new user message, not the entire history each time.
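The prefix check in the rules above can be pictured with a small model. This is an illustrative sketch, not the server's code: `prefix_hash` and `SessionState` are hypothetical, SHA-256 over a canonical JSON encoding is an assumed choice of "stable hash", and the real bookkeeping of `seenCount` may differ.

```python
import hashlib
import json
from typing import Dict, List, Optional

Message = Dict[str, str]


def prefix_hash(messages: List[Message], count: int) -> str:
    """Stable hash over the first `count` messages (role and content only)."""
    canonical = json.dumps(
        [[m["role"], m["content"]] for m in messages[:count]],
        ensure_ascii=False,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class SessionState:
    """Tracks how much of the history a cached conversation has already seen."""

    def __init__(self) -> None:
        self.seen_count = 0
        self.seen_hash = prefix_hash([], 0)

    def new_messages(self, messages: List[Message]) -> Optional[List[Message]]:
        """Return the messages still to be fed, or None if a rebuild is needed."""
        if len(messages) < self.seen_count:
            return None
        if prefix_hash(messages, self.seen_count) != self.seen_hash:
            return None
        return messages[self.seen_count:]

    def commit(self, messages: List[Message], reply: str) -> None:
        """Record the request history plus the assistant reply as seen."""
        full = messages + [{"role": "assistant", "content": reply}]
        self.seen_count = len(full)
        self.seen_hash = prefix_hash(full, len(full))
```

With the two turns shown above, the second request would yield only the final user message as new input, while an edited history would return `None` and force a rebuild.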

Structured error envelopes¶

Chat completion failures return a RichErrorResponse with a machine-readable code so clients can react instead of regex-matching the message:

{
  "error": {
    "message": "Gemini Nano is downloading on this device.",
    "type": "aicore_unavailable",
    "code": "AICORE_DOWNLOADING",
    "aicore_status": "downloading",
    "actionable": true,
    "next_steps": ["Wait for status: Available", "Retry"]
  }
}

`code`	HTTP	`actionable`	Meaning
`AICORE_DOWNLOADABLE`	503	✅	Gemini Nano can be downloaded; open Models → tap Download.
`AICORE_DOWNLOADING`	425	✅	Download in progress; retry later.
`AICORE_UNAVAILABLE`	503	❌	Not available on this device; switch to a LiteRT model.
`AICORE_BACKGROUND_BLOCKED`	403	✅	Host app was backgrounded mid-request (ErrorCode 30).
`AICORE_RUNTIME_ERROR`	500	❌	Generic AICore runtime failure.
`LITERT_INIT_FAILED`	503	✅	Engine init failed — check the file, SoC match, vendor delegate.
`AICORE_UNKNOWN`	503	❌	Unrecognised AICore status code.
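Because `code` is machine-readable, a client can branch on it directly. The mapping below is a hypothetical client policy derived from the "Meaning" column of the table; the action labels are made up for illustration.

```python
from typing import Any, Dict

# Assumed client-side actions, keyed by the documented error codes.
ACTIONS = {
    "AICORE_DOWNLOADABLE": "prompt_download",
    "AICORE_DOWNLOADING": "retry_later",
    "AICORE_UNAVAILABLE": "switch_to_litert",
    "AICORE_BACKGROUND_BLOCKED": "bring_to_foreground",
    "LITERT_INIT_FAILED": "check_model_file",
}


def plan_for_error(payload: Dict[str, Any]) -> str:
    """Pick a client action for an error envelope; default to 'fail'."""
    code = payload.get("error", {}).get("code")
    # Numeric codes belong to the plain ErrorResponse shape, not this envelope.
    if not isinstance(code, str):
        return "fail"
    return ACTIONS.get(code, "fail")
```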

`GET /v1/aicore/status`¶

Detailed readiness probe for Gemini Nano. Surfaces the SDK status code, a human-readable label, and the Build.SOC_MODEL / Build.DEVICE / Build.MANUFACTURER values so clients can decide whether to surface AICore as an option at all.

curl -s "http://localhost:8080/v1/aicore/status" -H "Authorization: Bearer $LLM_KEY"

{
  "model_id": "gemini-nano-aicore",
  "status_code": 3,
  "status": "available",
  "available": true,
  "soc_model": "Tensor G5",
  "device": "frankel",
  "manufacturer": "Google"
}

status_code: 3 = available, 2 = downloadable, 1 = downloading, 0 = unavailable.

Add ?probe=all to get a per-(releaseStage × preference) breakdown — useful on Pixel 10 where STABLE may report unavailable but PREVIEW/FAST works.
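The numeric codes listed above map to labels as follows. `aicore_label` and `should_offer_aicore` are hypothetical helpers; offering AICore whenever the model is available, downloadable or downloading is an assumed UI policy.

```python
from typing import Any, Dict, Optional

AICORE_STATUS = {3: "available", 2: "downloadable", 1: "downloading", 0: "unavailable"}


def aicore_label(status_code: Optional[int]) -> str:
    """Translate status_code into its documented label."""
    if status_code is None:
        return "unknown"
    return AICORE_STATUS.get(status_code, "unknown")


def should_offer_aicore(status: Dict[str, Any]) -> bool:
    """Assumed policy: show AICore unless it is unavailable or unrecognised."""
    return status.get("status_code") in (1, 2, 3)
```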

`POST/GET /v1/aicore/benchmark`¶

Runs a TTFT + tokens/sec speed test against AICore. POST runs a fresh benchmark; GET serves the last cached result (404 until the first run).

curl -s -X POST http://localhost:8080/v1/aicore/benchmark \
  -H "Authorization: Bearer $LLM_KEY" \
  -H "Content-Type: application/json" \
  -d '{"prompts": ["Say hi.", "Count to ten."], "warmup": 1}'

prompts and warmup are both optional; the server falls back to its defaults when they are omitted. warmup is clamped to [0, 5].

{
  "model_id": "gemini-nano-aicore",
  "status": "available",
  "warmup_runs": 1,
  "results": [
    { "prompt": "Say hi.",        "output_tokens": 5,  "inference_ms": 124, "tokens_per_sec": 40.3 },
    { "prompt": "Count to ten.",  "output_tokens": 22, "inference_ms": 612, "tokens_per_sec": 35.9 }
  ],
  "total_tokens": 27,
  "total_ms": 736,
  "avg_tokens_per_sec": 38.1
}

The endpoint refuses with 503 when AICore isn't STATUS_AVAILABLE. The Dashboard tab's AICore Benchmark card calls this endpoint and renders the last result inline.
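The figures in the sample response can be checked by hand: each `tokens_per_sec` is `output_tokens / (inference_ms / 1000)`, and the sample's `avg_tokens_per_sec` of 38.1 matches the unweighted mean of the two per-prompt rates. Whether the server computes the average exactly this way is an assumption; the sketch below only reproduces the arithmetic, and `tokens_per_sec` and `summarize` are hypothetical names.

```python
from typing import Any, Dict, List


def tokens_per_sec(output_tokens: int, inference_ms: float) -> float:
    """Throughput for one prompt; 0.0 when no time was recorded."""
    return output_tokens / (inference_ms / 1000.0) if inference_ms > 0 else 0.0


def summarize(results: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Recompute the totals and the mean of per-prompt rates."""
    rates = [tokens_per_sec(r["output_tokens"], r["inference_ms"]) for r in results]
    return {
        "total_tokens": sum(r["output_tokens"] for r in results),
        "total_ms": sum(r["inference_ms"] for r in results),
        "avg_tokens_per_sec": round(sum(rates) / len(rates), 1) if rates else 0.0,
    }
```

Note that dividing `total_tokens` by `total_ms` instead would give a different, time-weighted figure (about 36.7 for the sample).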

`GET /metrics`¶

Prometheus text exposition for the on-device server. Never gated.

curl -s http://localhost:8080/metrics

# TYPE localllm_requests_total counter
localllm_requests_total 42
# TYPE localllm_requests_completed_total counter
localllm_requests_completed_total 40
# TYPE localllm_inference_seconds_avg gauge
localllm_inference_seconds_avg 3.1375
# TYPE localllm_queue_depth gauge
localllm_queue_depth 2
# TYPE localllm_inflight gauge
localllm_inflight 1
# TYPE localllm_engine_loaded gauge
localllm_engine_loaded{key="gemma-4-e2b_model_LITERT_CPU",backend="LITERT_CPU"} 1
# TYPE localllm_client_requests_total counter
localllm_client_requests_total{client="MyApp/1.0"} 35

Metrics emitted:

Counters: localllm_requests_total, localllm_requests_completed_total, localllm_requests_errored_total, localllm_requests_cancelled_total, localllm_stream_chunks_total, localllm_inference_seconds_total, localllm_client_requests_total{client}.
Gauges: localllm_inference_seconds_avg, localllm_chunks_per_second_avg, localllm_queue_depth, localllm_inflight, localllm_engines_loaded, localllm_engine_loaded{key,backend}, localllm_client_inference_seconds_avg{client}.

Point Grafana Agent / OpenTelemetry Collector at it like any other Prometheus target — Tailscale or adb forward is the easy way to get to it from off-device.
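For ad hoc checks without a Prometheus stack, the exposition text is easy to read directly. The sketch below is a minimal parser for output shaped like the sample above, not a full implementation of the exposition format: it assumes one sample per line with no trailing timestamp. `parse_metrics` is a hypothetical helper.

```python
from typing import Dict


def parse_metrics(text: str) -> Dict[str, float]:
    """Map each series ('name' or 'name{labels}') to its value."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and '# TYPE' / '# HELP' comments.
        if not line or line.startswith("#"):
            continue
        # The value is whatever follows the last space, so label values that
        # contain spaces still work as long as no timestamp is appended.
        series, _, value = line.rpartition(" ")
        samples[series.strip()] = float(value)
    return samples
```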

Service discovery (mDNS / NSD)¶

When Settings → Bind to LAN is on, the service advertises itself via android.net.nsd:


Field	Value
Service type	`_localllm._tcp.`
Service name	`LocalLLM`
TXT records	`api=openai-compat`, `path=/v1`, `health=/health`

Loopback-only binds skip the advert. Any zero-conf browser (dns-sd -B _localllm._tcp . on macOS, Bonjour Browser, Avahi) discovers it.

What's not implemented (yet)¶

The OpenAI-compat surface is intentionally partial. Missing pieces, in roughly the order they'll land:

logprobs / top_logprobs — LiteRT-LM exposes them but they're not wired through.
n > 1 — single completion per request.
stop sequences — parsed but ignored.
tools / function calling — LiteRT-LM supports it natively (ConversationConfig.tools), the HTTP layer doesn't.
Vision / audio content blocks — LiteRT-LM Content.ImageBytes and Content.AudioBytes exist but the route only extracts Content.Text.

If you need any of these, open an issue or jump to development.
