Development

How to build, extend, and ship the app yourself.

Project layout

localllm/
├── app/                               main Gradle module
│   ├── build.gradle.kts
│   ├── proguard-rules.pro             LiteRT-LM keep rules (R8 runs on release builds)
│   └── src/
│       ├── main/
│       │   ├── AndroidManifest.xml    foregroundServiceType=specialUse
│       │   ├── java/com/localllm/app/ Kotlin sources (see Architecture page)
│       │   └── res/
│       │       ├── drawable/          ic_launcher_foreground + ic_launcher_background (adaptive)
│       │       ├── mipmap-anydpi-v26/ adaptive icon manifests
│       │       └── values/            strings, themes, colors
│       └── test/                      JVM unit tests
├── docs/                              this mkdocs site
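├── gradle/libs.versions.toml          Version catalog
├── macrobenchmark/                    Macrobenchmark + Baseline Profile generator
├── mkdocs.yml
├── scripts/release.sh                 one-command release script
├── settings.gradle.kts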
└── .github/workflows/build.yml        CI: lint + test + assembleDebug

Local build

git clone https://github.com/mlnomadpy/localllm.git
cd localllm
echo "sdk.dir=$HOME/Library/Android/sdk" > local.properties   # macOS
# Linux (Android Studio default): echo "sdk.dir=$HOME/Android/Sdk" > local.properties
./gradlew :app:assembleDebug
adb install -r app/build/outputs/apk/debug/app-debug.apk

Requirements:

JDK 17 (Temurin works). AGP 8.7 requires JDK 17 or newer to run; the app's source and target compatibility is Java 11.
Android SDK API 35 installed.
Gradle is pinned by the wrapper (8.x); use ./gradlew rather than a separately installed Gradle.

Toolchain and library versions are pinned in gradle/libs.versions.toml:

Component	Version
Kotlin	2.2.21
AGP	8.7.3
Ktor	3.4.3
Compose BOM	2024.09.02
LiteRT-LM	0.12.0
ML Kit GenAI Prompt	1.0.0-beta2
ONNX Runtime Android	1.18.0
ObjectBox	4.0.3

Adding a model to the catalog

ModelCatalog.kt is the source of truth. Each model is one ModelInfo entry:

ModelInfo(
    id          = "gemma-4-e2b",                     // bare name shown in /v1/models
    name        = "Gemma 4 E2B IT",                  // human label in the Catalog UI
    description = "Instruction tuned, ~2.6 GB.",
    url         = "https://.../gemma-4-E2B-it.litertlm",
    filename    = "gemma-4-e2b.litertlm",            // .litertlm extension required
    sha256      = "181938105e0eefd105961417e8da75903eacda102c4fce9ce90f50b97139a63c"
),

A few rules:

filename must end in .litertlm. Anything else is ignored by the on-device file scan.
sha256 is optional. When it is set, the file is verified after download; on a mismatch the file is deleted and the user sees a toast. Leave it unset for non-HF mirrors whose content you don't control.
For HuggingFace xet-backed artifacts, the x-linked-etag HTTP header is the SHA-256. Copy it from curl -sI <url> instead of downloading 2.6 GB just to hash it.
Each entry also declares its Backend (see Picking a backend on your device).
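The download-side rules above can be sketched with plain JDK APIs. This is a minimal illustration, not the app's downloader: the helper names isModelFile, sha256Hex, verifyOrDelete and etagToSha256 are hypothetical, and showing the toast is left to the caller. The hash is computed in 64 KiB chunks so a multi-GB model never has to fit in memory.

```kotlin
import java.io.File
import java.security.MessageDigest

// Hypothetical helpers illustrating the catalog rules; not the app's real API.

/** The on-device scan only considers regular files with the .litertlm extension. */
fun isModelFile(file: File): Boolean = file.isFile && file.name.endsWith(".litertlm")

/** Streams the file through SHA-256 and returns the lowercase hex digest. */
fun sha256Hex(file: File): String {
    val digest = MessageDigest.getInstance("SHA-256")
    file.inputStream().use { input ->
        val buffer = ByteArray(1 shl 16) // 64 KiB chunks
        while (true) {
            val read = input.read(buffer)
            if (read < 0) break
            digest.update(buffer, 0, read)
        }
    }
    return digest.digest().joinToString("") { "%02x".format(it) }
}

/**
 * Returns true when [expected] is null (no pin) or matches the file's hash.
 * On a mismatch the file is deleted and false is returned.
 */
fun verifyOrDelete(file: File, expected: String?): Boolean {
    if (expected == null) return true
    if (sha256Hex(file).equals(expected, ignoreCase = true)) return true
    file.delete()
    return false
}

/** Header values may arrive quoted; strip quotes and accept only a 64-char hex digest. */
fun etagToSha256(headerValue: String): String? {
    val value = headerValue.trim().trim('"').lowercase()
    return value.takeIf { it.length == 64 && it.all { c -> c in '0'..'9' || c in 'a'..'f' } }
}
```

To pin a HuggingFace artifact, pass the x-linked-etag value from curl -sI <url> through etagToSha256 and use the result as the entry's sha256.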

Running on a real device

The app runs fine on emulators, but inference on an x86_64 AVD is very slow: expect 5+ seconds per token on Gemma 4 E2B. For development, use a real ARM phone; most phones released since 2022 should manage 10–30 tok/s on CPU.

Tested on Pixel-class devices. NPU variants are SoC-specific: the Qualcomm .litertlm variants require a Snapdragon device, and the MediaTek and Tensor G5 NPU entries likewise need matching silicon.

Picking a backend on your device

There is no AUTO chain. Each catalog entry declares its Backend directly (AICORE / LITERT_CPU / LITERT_GPU / LITERT_NPU) in ModelCatalog.kt. Side-loaded models default to LITERT_CPU.

On Google Tensor SoCs (Pixel 6 / 9 / 10), a one-shot NPU primer runs before the real LITERT_CPU / LITERT_GPU init. The primer is expected to fail (stock hardware has no vendor delegate), but its JNI side effects work around a known cold-init bug. It is not a fallback: if the declared backend fails, init stops there with LITERT_INIT_FAILED.
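A sketch of that init order, assuming hypothetical names: Backend mirrors the enum values listed above, while Attempt, EngineInitException, initEngine, runPrimer and initBackend are illustrative only. The returned list has the same shape as the attempts array reported by /health; the real primer's one-shot bookkeeping and native calls are left out.

```kotlin
// Hypothetical sketch of the LiteRT init order; names are illustrative, not the app's classes.
enum class Backend { AICORE, LITERT_CPU, LITERT_GPU, LITERT_NPU }

data class Attempt(val backend: String, val result: String, val durationMs: Long)

class EngineInitException(val code: String, message: String) : Exception(message)

fun initEngine(
    declared: Backend,
    isTensorSoC: Boolean,
    runPrimer: () -> Unit,          // NPU primer; expected to throw on stock Tensor hardware
    initBackend: (Backend) -> Unit, // the one real init on the declared backend
): List<Attempt> {
    require(declared != Backend.AICORE) { "AICore picks its own backend internally" }
    val attempts = mutableListOf<Attempt>()

    // Primer: only before a CPU/GPU init on Tensor. Its outcome is logged, never acted on.
    if (isTensorSoC && declared != Backend.LITERT_NPU) {
        val t0 = System.currentTimeMillis()
        val result = try {
            runPrimer()
            "ok"
        } catch (e: Exception) {
            "expected-fail: ${e.message}"
        }
        attempts += Attempt("NPU-primer", result, System.currentTimeMillis() - t0)
    }

    // Exactly one attempt on the declared backend: a failure here is final, no fallback.
    val t1 = System.currentTimeMillis()
    try {
        initBackend(declared)
    } catch (e: Exception) {
        throw EngineInitException("LITERT_INIT_FAILED", "${declared.name} init failed: ${e.message}")
    }
    attempts += Attempt(declared.name, "ok", System.currentTimeMillis() - t1)
    return attempts
}
```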

NPU variants additionally check Build.SOC_MODEL against requiredSocMarker before init, so an SoC mismatch fails fast with a clear message instead of a cryptic native error.
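The SoC gate can be as small as the function below. It is a sketch under one stated assumption: that requiredSocMarker is matched as a case-insensitive substring of Build.SOC_MODEL (the app's exact matching rule may differ). checkSoc and SocCheck are hypothetical names. On a device you would pass android.os.Build.SOC_MODEL, which exists from API 31; on older releases pass null and an NPU entry fails the check.

```kotlin
// Hypothetical sketch of the pre-init SoC gate for NPU variants.
// Assumption: the marker is a case-insensitive substring of Build.SOC_MODEL.
sealed interface SocCheck {
    object Ok : SocCheck
    data class Mismatch(val message: String) : SocCheck
}

fun checkSoc(socModel: String?, requiredSocMarker: String?): SocCheck {
    // Non-NPU entries carry no marker and always pass.
    if (requiredSocMarker == null) return SocCheck.Ok
    if (socModel != null && socModel.contains(requiredSocMarker, ignoreCase = true)) return SocCheck.Ok
    // Fail fast with a readable message instead of a native error later.
    return SocCheck.Mismatch(
        "This model needs an SoC matching '$requiredSocMarker', " +
            "but this device reports '${socModel ?: "unknown"}'."
    )
}
```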

AICore (gemini-nano-aicore) doesn't expose a backend selector — the AICore system service picks NPU/GPU/CPU internally.

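To see what your engine actually loaded, query /health. When curling from your workstation rather than the phone, forward the port first:

adb forward tcp:8080 tcp:8080
curl -s http://localhost:8080/health | jq '.engines[0]'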
# {
#   "key": "gemma-4-e2b_model_LITERT_CPU",
#   "backend": "LITERT_CPU",
#   "attempts": [
#     {"backend":"NPU-primer","result":"expected-fail: no vendor delegate","duration_ms":312},
#     {"backend":"LITERT_CPU","result":"ok","duration_ms":3168}
#   ]
# }

Tests

./gradlew :app:testDebugUnitTest

38 unit-test files, all JVM-runnable (Robolectric for anything that needs Context). Grouped roughly:

Core logic — RequestTrackerTest, RateLimiterTest, RateLimiterEdgeTest, LogManagerTest, MessageHelpersTest.
Settings & config — SettingsTest, SettingsRepositoryTest.
Wire types — ApiTypesTest, ApiTypesContentTest, TenantApiTypesTest.
RAG — ChunkerTest, ChunkerEdgeTest, DocumentChunkTest, DocumentStoreTest, TenantResolverTest.
Embeddings — EmbeddingServiceTest, WordPieceTokenizerTest.
Inference — EngineKeyTest, AiCoreNotReadyExceptionTest, AICoreEngineStatusTest, AICoreBenchmarkTokenCountTest, TensorSoCDetectorTest, LlmMessageConverterTest.
Model management — ModelCatalogTest, ModelDirectoryScannerTest.
Routes (Ktor testApplication) — HealthRouteTest, ModelsRouteTest, ChatRouteTest, EmbeddingsRouteTest, DocumentsRouteTest, AICoreRouteTest, BenchmarkRouteTest, MetricsRouteTest, AuthorizeTest, RouteSupportTest, NsdBroadcasterTest.
Background — WarmupWorkerTest.

There is no on-device instrumentation test for the LLM path today; end-to-end inference is verified manually with curl over adb forward against a real device. A roadmap item to close that gap (androidTest with a tiny fixture model) is tracked in GitHub Issues.

Baseline profiles & macrobenchmark

The :macrobenchmark module (under macrobenchmark/) holds the Macrobenchmark + Baseline Profile generator. It uses the com.android.test plugin and androidx.baselineprofile, targets :app via targetProjectPath = ":app", and runs alongside the target with android.experimental.self-instrumenting = true.
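For orientation, a macrobenchmark/build.gradle.kts with that shape could look like the following. It is a config sketch built from the settings named above, not a copy of the project's file; the namespace, SDK levels and the Kotlin plugin line are assumptions, and the dependencies block is omitted.

```kotlin
// macrobenchmark/build.gradle.kts (hypothetical sketch; values marked "assumed" are not from the project)
plugins {
    id("com.android.test")
    id("org.jetbrains.kotlin.android") // assumed
    id("androidx.baselineprofile")
}

android {
    namespace = "com.localllm.macrobenchmark" // assumed
    compileSdk = 35

    defaultConfig {
        minSdk = 28 // assumed
        testInstrumentationRunner = "androidx.test.runner.AndroidJUnitRunner"
    }

    // Benchmark the :app module; the test APK instruments itself alongside the target.
    targetProjectPath = ":app"
    experimentalProperties["android.experimental.self-instrumenting"] = true
}

baselineProfile {
    // Generate on a connected USB or wireless-ADB device rather than a managed emulator.
    useConnectedDevices = true
}
```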

Two test classes are wired up:

StartupBenchmark — cold-start StartupTimingMetric with CompilationMode.None / Partial / Full, five iterations each.
BaselineProfileGenerator — walks Catalog → Dashboard → Console → Chat → Settings so the produced profile covers the hot composables.

Both require a connected device (USB or wireless ADB) — they are not part of assembleDebug. To regenerate the profile:

./gradlew :app:generateReleaseBaselineProfile
# Outputs app/src/main/baseline-prof.txt, consumed by R8 at release-build time.

Baseline profiles only apply to release builds (R8-compiled), so :app:assembleDebug won't show the improvement; measure with a release variant.

Continuous integration

.github/workflows/build.yml runs on every push and PR against main:

- ./gradlew lint testDebugUnitTest assembleDebug --no-daemon --stacktrace

The Gradle cache is keyed on gradle/libs.versions.toml and **/*.gradle.kts. The lint report and the debug APK are uploaded as workflow artifacts (14-day retention). The job uses JDK 17 (Temurin) and no Gradle daemon: fresh VMs don't benefit from one, and the daemon's resident heap occasionally OOMs the 7 GB runner.

GitHub Actions currently unavailable on this repo

The hosting account has Actions administratively restricted (Trust & Safety review in progress with GitHub Support). While that's pending, CI parity is enforced locally:

./gradlew lint testDebugUnitTest assembleDebug --no-daemon

Contributors run this before opening a PR; maintainers run the same command before merging. The build.yml workflow stays in tree and will auto-resume once Actions is re-enabled — no changes needed.

Docs deploy (manual, while Actions is out)

The mkdocs Material site publishes to GitHub Pages from the gh-pages branch (legacy branch-source mode, bypassing Actions). One-time setup per machine:

python3 -m venv .venv-docs
.venv-docs/bin/pip install -r docs/requirements.txt

To publish the current docs/ state:

.venv-docs/bin/mkdocs gh-deploy --force --remote-branch gh-pages

That builds the site, commits to gh-pages, and pushes. Pages picks it up within a minute at the repo's custom domain (http://www.tahabouhsine.com/localllm/). When Actions is restored, flip Pages source back to "GitHub Actions" and the existing docs.yml workflow takes over.

Cutting a release

One command:

./scripts/release.sh v1.2.3

scripts/release.sh (see source for details):

Refuses to run on a dirty working tree.
Runs ./gradlew :app:assembleDebug.
Runs mkdocs gh-deploy to publish the docs.
Tags the current commit and pushes the tag.
Reads release notes from the matching ## [1.2.3] section in CHANGELOG.md and calls gh release create with the APK attached.

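The tag must have the form v<semver> (for example v1.2.3); the leading v is dropped when looking up the matching CHANGELOG.md section. Notes are sourced from CHANGELOG.md so the changelog stays the source of truth. If you forget to add a section, the release goes out with a placeholder note and a warning on stderr.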
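The tag and changelog handling can be expressed as two small functions. This is a Kotlin re-expression for illustration only: release.sh is a shell script, and versionFromTag and releaseNotes are hypothetical names. The semver check is simplified (MAJOR.MINOR.PATCH with optional pre-release and build suffixes), and a missing section returns null, which corresponds to the placeholder-note case.

```kotlin
// Hypothetical Kotlin re-expression of the tag/changelog steps; release.sh itself may differ.
val SEMVER_TAG = Regex("""v(\d+\.\d+\.\d+(?:-[0-9A-Za-z.-]+)?(?:\+[0-9A-Za-z.-]+)?)""")

/** "v1.2.3" -> "1.2.3"; null when the tag isn't v<semver>. */
fun versionFromTag(tag: String): String? = SEMVER_TAG.matchEntire(tag)?.groupValues?.get(1)

/**
 * Returns the body of the "## [version]" section, i.e. every line up to the next
 * "## " heading, or null when the section is missing or empty.
 */
fun releaseNotes(changelog: String, version: String): String? {
    val lines = changelog.lines()
    // The closing bracket keeps "1.2.3" from matching "## [1.2.30]".
    val start = lines.indexOfFirst { it.trim().startsWith("## [$version]") }
    if (start < 0) return null
    val body = lines.drop(start + 1).takeWhile { !it.startsWith("## ") }
    return body.joinToString("\n").trim().ifEmpty { null }
}
```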

Don't ship the APK that release.sh attaches to the Play Store: it comes from :app:assembleDebug, so it is unminified, signed with the debug key, and has a hardcoded version code. For a production artifact, use the release build described under Release builds and signing.

Extending the HTTP API

Since the LLMServerService.kt split (see Roadmap), route handlers live under server/routes/ and are mounted inside the embeddedServer(Netty, ...) { routing { ... } } block. A new endpoint is one route registration plus, usually, one Gson DTO in ApiTypes.kt.

For any endpoint that runs inference, follow the existing contract:

Call authorize(call) first if the endpoint should sit behind the bearer-token gate.
Admit the request atomically via RequestTracker.tryEnqueue(...) so the global queue cap is honored.
Hold inferenceMutex.withLock { ... }: only one inference runs at a time per device.
Wrap the inference in withWakeLock(needWakeLock, timeoutMs) { ... }.
Enforce the time budget with withTimeout(timeoutMs) { ... }. On timeout, call conversation.cancelProcess() so the native engine actually stops burning compute.
Mind the SSE error path: once a streaming response has started, capture the writer (streamWriter = this@respondBytesWriter) and emit a writeSseError chunk on failure. Don't call call.respond after the headers have been committed.
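A minimal sketch of that ordering, assuming kotlinx.coroutines on the classpath (Mutex.withLock and withTimeout are its APIs). AdmissionQueue, Cancellable and runInference are hypothetical stand-ins for RequestTracker, the LiteRT conversation and the route code; authorize(call), withWakeLock and the SSE writer are left out because they need Ktor and Android types.

```kotlin
import java.util.concurrent.atomic.AtomicInteger
import kotlinx.coroutines.TimeoutCancellationException
import kotlinx.coroutines.sync.Mutex
import kotlinx.coroutines.sync.withLock
import kotlinx.coroutines.withTimeout

// Hypothetical stand-ins: the app's RequestTracker, conversation and route code differ.
class QueueFullException : Exception("inference queue is full")

/** Global admission cap with an atomic check-and-increment (stand-in for RequestTracker.tryEnqueue). */
class AdmissionQueue(private val cap: Int) {
    private val inFlight = AtomicInteger(0)

    fun tryEnqueue(): Boolean {
        while (true) {
            val current = inFlight.get()
            if (current >= cap) return false
            if (inFlight.compareAndSet(current, current + 1)) return true
        }
    }

    fun release() {
        inFlight.decrementAndGet()
    }
}

/** Stand-in for the native conversation handle. */
fun interface Cancellable {
    fun cancelProcess()
}

val inferenceMutex = Mutex()

suspend fun <T> runInference(
    queue: AdmissionQueue,
    conversation: Cancellable,
    timeoutMs: Long,
    block: suspend () -> T,
): T {
    if (!queue.tryEnqueue()) throw QueueFullException()
    try {
        // Only one inference at a time per device.
        return inferenceMutex.withLock {
            try {
                // Assumes block() suspends while the native engine generates on its own
                // thread; withTimeout can only interrupt at suspension points.
                withTimeout(timeoutMs) { block() }
            } catch (e: TimeoutCancellationException) {
                // Cancelling the coroutine is not enough: stop the native work too.
                conversation.cancelProcess()
                throw e
            }
        }
    } finally {
        queue.release() // free the admission slot on success, timeout or error
    }
}
```

Taking the admission slot before the mutex means a full queue is rejected immediately instead of piling up waiters behind the lock.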

Release builds and signing

The release build pipeline is wired in app/build.gradle.kts:

isMinifyEnabled = true + isShrinkResources = true — R8 + the resource shrinker run on every :app:assembleRelease.
signingConfigs.release reads four properties from ~/.gradle/gradle.properties or from environment variables. If any of the four is missing, the config is silently left empty and the release build falls back to the debug signing key, so :app:assembleRelease completes for every contributor without access to the production keystore. A debug-signed build is fine for local testing, but Play Console rejects it on upload.
splits.abi ships per-ABI APKs for arm64-v8a plus a universal APK (isUniversalApk = true). armeabi-v7a and the x86 family are intentionally dropped — see the gotcha below.

One-time keystore setup

keytool -genkey -v -keystore localllm-release.keystore \
  -alias localllm -keyalg RSA -keysize 4096 -validity 10000

Move it somewhere outside the repo (the project .gitignore blocks *.keystore and *.jks, but keeping it out of the source tree is safer still). Then add the four properties to your user-level ~/.gradle/gradle.properties:

LOCALLLM_KEYSTORE_PATH=/Users/you/keys/localllm-release.keystore
LOCALLLM_KEYSTORE_PASSWORD=********
LOCALLLM_KEY_ALIAS=localllm
LOCALLLM_KEY_PASSWORD=********

Environment variables with the same names also work — handy for CI. If both are set, the Gradle property wins.
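The lookup order can be sketched as a pure function, assuming the four property names shown above. ReleaseSigning, SIGNING_KEYS and resolveSigning are hypothetical names, and treating a blank value as missing is an assumption; the real logic sits in app/build.gradle.kts. A null result is the case where the release build falls back to the debug key.

```kotlin
// Hypothetical sketch of the signing lookup; app/build.gradle.kts may be structured differently.
data class ReleaseSigning(
    val storePath: String,
    val storePassword: String,
    val keyAlias: String,
    val keyPassword: String,
)

val SIGNING_KEYS = listOf(
    "LOCALLLM_KEYSTORE_PATH",
    "LOCALLLM_KEYSTORE_PASSWORD",
    "LOCALLLM_KEY_ALIAS",
    "LOCALLLM_KEY_PASSWORD",
)

/**
 * Gradle property first, environment variable second (blank counts as missing).
 * Returns null as soon as any of the four values is absent.
 */
fun resolveSigning(
    gradleProps: Map<String, String>,
    env: Map<String, String>,
): ReleaseSigning? {
    val values = SIGNING_KEYS.map { key ->
        gradleProps[key]?.takeIf { it.isNotBlank() }
            ?: env[key]?.takeIf { it.isNotBlank() }
            ?: return null // map is inline, so this returns from resolveSigning
    }
    return ReleaseSigning(values[0], values[1], values[2], values[3])
}
```

Inside a Gradle build script, the first map would come from the project's Gradle properties and the second from System.getenv().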

Build outputs

./gradlew :app:assembleRelease   # per-ABI + universal APKs
./gradlew :app:bundleRelease     # .aab for Play Store / Internal App Sharing

app/build/outputs/apk/release/ will contain:

app-arm64-v8a-release.apk — per-ABI, smallest (~28 MB)
app-universal-release.apk — fat APK with all included ABIs (~39 MB)

Verify the JNI payload of each split:

unzip -l app/build/outputs/apk/release/app-arm64-v8a-release.apk | grep '\.so$'

Only lib/arm64-v8a/... entries should appear in the per-ABI APK.
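The same check can be scripted without unzip, for example from a test or a Gradle task. A minimal sketch using java.util.zip; nativeAbis is a hypothetical helper name.

```kotlin
import java.util.zip.ZipFile

// Hypothetical helper: the set of ABI directories under lib/ that contain .so files.
fun nativeAbis(apkPath: String): Set<String> =
    ZipFile(apkPath).use { zip ->
        zip.entries().toList()
            .map { it.name }
            .filter { it.startsWith("lib/") && it.endsWith(".so") }
            .map { it.split('/')[1] } // "lib/arm64-v8a/libLiteRt.so" -> "arm64-v8a"
            .toSet()
    }
```

For the per-ABI split, the expected result is setOf("arm64-v8a"); any extra entry means a stray ABI made it into the APK.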

LiteRT-LM ABI gotcha

The com.google.ai.edge.litertlm:litertlm-android:0.12.0 AAR ships JNI .so files for arm64-v8a and x86_64 only:

lib/arm64-v8a/: libLiteRt.so, libLiteRtClGlAccelerator.so, liblitertlm_jni.so
lib/x86_64/: same three libs (emulator-only convenience)
lib/armeabi-v7a/: none — LiteRT-LM doesn't target 32-bit ARM.

splits.abi.include is set to arm64-v8a only. Re-adding armeabi-v7a would produce an APK that crashes on first inference with UnsatisfiedLinkError. x86_64 is omitted from the per-ABI split list because emulator inference is unusably slow, but the universal APK still carries x86_64 so an emulator install via the universal APK works for smoke tests.
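In AGP's Kotlin DSL, the split setup described above corresponds to a block along these lines. It is a sketch of the relevant part only, not a copy of app/build.gradle.kts.

```kotlin
// app/build.gradle.kts (sketch of the ABI split block; other settings omitted)
android {
    splits {
        abi {
            isEnable = true
            reset()                 // start from an empty ABI list
            include("arm64-v8a")    // LiteRT-LM ships no armeabi-v7a .so files
            isUniversalApk = true   // the universal APK still carries x86_64 for emulator smoke tests
        }
    }
}
```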

Roadmap

Tracked in GitHub Issues.

Shipped this cycle

AICore (Gemini Nano) as the default engine (Settings.DEFAULT_MODEL_ID = "gemini-nano-aicore").
AUTO backend chain removed — Backend enum declared per-model in the catalog; no fallback.
Feature-sliced split of LLMServerService.kt (2287 → ~366 lines). Routes under server/routes/, engines under inference/litert/ and inference/aicore/.
Structured error envelopes (RichErrorResponse / RichErrorDetails): AICORE_DOWNLOADABLE, AICORE_DOWNLOADING, AICORE_UNAVAILABLE, AICORE_BACKGROUND_BLOCKED, AICORE_RUNTIME_ERROR, LITERT_INIT_FAILED.
GET /v1/aicore/status and POST/GET /v1/aicore/benchmark (TTFT, tok/s, total-ms).
aicore block in /health.
Multimodal image_url content blocks (LiteRT path).
Tool / function calling.
Qualcomm + MediaTek + Tensor G5 NPU catalog entries.
Release signing + R8 production build.

Open

Content.AudioBytes multimodal input (LiteRT-LM supports it; the OpenAI-compat layer doesn't yet).
Per-IP token-bucket rate limiting (currently per-User-Agent).
Persistent log buffer + Sentry/Crashlytics integration.
androidTest end-to-end with a tiny fixture model.
Multi-process isolation for engine crashes (issue #11).