Pitwall Server Improvement Plan
Author: Taha Bouhsine
Date: 2026-05-15
Companion: docs/reports/server-audit-termux.md (findings)
Branch: aim-mxp-yaml-pipeline
Translates the audit findings into an executable, phased roadmap. Each phase is a self-contained unit of work with a goal, a list of deliverables with effort estimates, an exit criterion that can be tested, and an explicit answer to "what breaks if we skip this phase and try to ship?"
The sequencing favours derisking the next demo / track day first, then operability, then scale, then security. Phases 1–3 should land before any unsupervised on-car run; phases 4–6 can ship incrementally afterwards.
Phase 0: Already shipped on this branch
Context for what follows. None of this needs work; listed so we don't redo it.
| Item | Where | Notes |
|---|---|---|
| YAML-driven per-car pipeline | data/cars/bmw_e46_m3.yaml, formula.py, car_config.py | 22 signal pipelines, 2 cross-derived, 1 method, AST-allowlisted expressions |
| Shared formula library | data/formulas/standard.yaml | 29 formulas, zero-Python additions for new conversions |
| AiM MXP synthetic simulator | src/simulator/aim_mxp_simulator.py + --simulate flag | Headless dev/test without a real car |
| DB corruption recovery | db.py::reset_live_session | Rotates .corrupted-<ts> aside, recreates fresh schema, re-seeds registry |
| Lazy capabilities | bp_signals.py::session_capabilities_get | /session/<sid>/capabilities works for _live without explicit recompute |
| Hardware spec docs | docs/reports/aim-mxp-can-validation.tex/.pdf, data/cars/bmw_e46_m3.yaml | 29/29 channel math validated against live bus |
| DuckDB out of git | .gitignore, data/pitwall_sessions.duckdb untracked | Each clone starts with a fresh DB |
Phase 1: Stop the bridge being its own worst enemy
Goal: Make python -m pitwall survive an unclean shutdown without
poisoning state, deadlocking, or dropping in-flight HTTP requests.
Why first: Every other improvement assumes the bridge can boot reliably. We just spent an hour rotating a corrupted DB during testing; that should not happen twice.
Total budget: 2 hours.
Items
| # | Title | Cost | File |
|---|---|---|---|
| 1.1 | SIGTERM + SIGINT handler → can_reader.stop() → simulator.stop() → DuckDB CHECKPOINT; → sys.exit(0) | 30 min | src/pitwall/__main__.py |
| 1.2 | Mirror the same body in an atexit handler (belt + suspenders) | 5 min | src/pitwall/__main__.py |
| 1.3 | state.db_lock = threading.Lock() → threading.RLock(); same for burst_lock, bundles_lock (qa_lock was removed entirely in PR #30) | 5 min | src/pitwall/state.py (shipped: locks are now RLock() at lines 57/66/70) |
| 1.4 | Swap Flask dev server for waitress: waitress.serve(app, host="127.0.0.1", port=port, threads=8, channel_timeout=30) | 1 h | src/pitwall/__main__.py:168 + pyproject.toml deps |
| 1.5 | Add --dev flag that keeps the Flask dev server path for local debugging | 10 min | src/pitwall/__main__.py |
| 1.6 | Release Termux wake-lock on graceful stop via subprocess.run(["termux-wake-unlock"], check=False) | 5 min | shutdown handler from 1.1 |
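Items 1.1 and 1.2 together mean the same shutdown body can be reached twice: once from the signal handler and again from atexit when sys.exit(0) unwinds. A minimal sketch of a run-once wrapper follows. The names make_shutdown and install_handlers are hypothetical, and the step callables stand in for can_reader.stop(), simulator.stop() and the DuckDB CHECKPOINT, whose real signatures are not shown in this plan.

```python
import atexit
import signal
import sys
import threading


def make_shutdown(steps):
    """Return a callable that runs each (name, step) pair at most once overall."""
    once = threading.Lock()

    def shutdown():
        # A non-blocking acquire acts as an atomic "first caller wins" flag.
        # The lock is never released, so every later call returns immediately.
        if not once.acquire(blocking=False):
            return False
        for name, step in steps:
            try:
                step()
            except Exception as exc:
                # Keep going so a failing step cannot skip the ones after it.
                print(f"shutdown step {name!r} failed: {exc}", file=sys.stderr)
        return True

    return shutdown


def install_handlers(shutdown):
    """Wire SIGTERM/SIGINT and atexit to the same shutdown body."""

    def on_signal(signum, frame):
        shutdown()
        sys.exit(0)

    signal.signal(signal.SIGTERM, on_signal)
    signal.signal(signal.SIGINT, on_signal)
    atexit.register(shutdown)


# Hypothetical wiring; the real objects live in src/pitwall/__main__.py:
# install_handlers(make_shutdown([
#     ("can_reader", can_reader.stop),
#     ("simulator", simulator.stop),
#     ("checkpoint", lambda: conn.execute("CHECKPOINT")),
# ]))
```

The non-blocking acquire is used instead of a plain boolean so that a signal arriving while the atexit path is already running cannot start a second pass or block on a held lock.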
Exit criteria
- `kill -TERM <pid>` followed by a fresh start produces zero `FATAL: Failed to delete all rows from index` errors, with no `.corrupted-*` file generated.
- Starting + stopping `--simulate` 10 times in a row leaves the DB size stable (no zombie WAL entries).
- The bridge boot log no longer prints `WARNING: This is a development server.`
Risk if skipped
Already realised today: one rough kill produces an invalidated DB. The
recovery path in db.py masks the symptom but loses all prior
session data when it fires.
Phase 2: Make a stock Pixel actually able to read CAN
Goal: Replace the documentation lie that /dev/ttyACM0 exists on
non-rooted Termux with a code path that genuinely works.
Why before phase 3: Until this lands, no real-car deployment is
possible; everything else is academic. --simulate is the only
working path today on a stock Pixel.
Total budget: 1 working day.
Items
| # | Title | Cost | File |
|---|---|---|---|
| 2.1 | Add --can-fd <int> argument to CanReader + __main__ that accepts a pre-opened USB file descriptor | 1 h | src/pitwall/features/telemetry/can_reader.py, src/pitwall/__main__.py |
| 2.2 | Wrap python-can's slcan interface so it can use the FD (python-can's SerialBus accepts a serial.Serial constructed from the FD via fdopen) | 2 h | new helper in can_reader.py |
| 2.3 | Write a Termux:Boot shim script that runs termux-usb -e <handoff.sh> <vendor:product> and exec python -m pitwall --can-fd "$1" | 1 h | deploy/termux/boot/start-pitwall-with-can |
| 2.4 | Update deploy/termux/INSTALL.md: explicit "USB host on non-rooted Android" section + the three paths (termux-usb FD handoff, root + udev, separate Android app forwarding) with the FD handoff marked recommended | 1 h | docs |
| 2.5 | Add --no-car-config + --simulate to INSTALL.md as the "no-car smoke-test" path so new operators have a working bridge from minute zero | 30 min | docs |
| 2.6 | Open-on-disconnect retry: outer loop in _reader_loop with 1s → 2s → 5s → 10s exponential backoff; reset _latest cache on reconnect; surface in state() for the PWA | 1 h | can_reader.py:355-369 |
| 2.7 | frames_dropped counter exposed via state() (incremented when python-can's RX buffer overflows) | 30 min | can_reader.py |
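The CLI half of item 2.1 could look roughly like the sketch below. It only covers parsing and sanity-checking the descriptor that the Termux shim passes as "$1". How that descriptor is turned into a CAN bus (item 2.2) is deliberately left out. The helper names fd_type and build_parser are hypothetical.

```python
import argparse
import os


def fd_type(text):
    """argparse type: a non-negative int that refers to an open descriptor."""
    try:
        fd = int(text)
    except ValueError:
        raise argparse.ArgumentTypeError(f"not an integer: {text!r}") from None
    if fd < 0:
        raise argparse.ArgumentTypeError("file descriptor must be >= 0")
    try:
        os.fstat(fd)  # raises OSError (EBADF) when the descriptor is not open
    except OSError as exc:
        raise argparse.ArgumentTypeError(f"fd {fd} is not open: {exc}") from None
    return fd


def build_parser():
    """Hypothetical subset of the pitwall CLI, enough to show --can-fd."""
    parser = argparse.ArgumentParser(prog="pitwall")
    parser.add_argument(
        "--can-fd",
        type=fd_type,
        default=None,
        help="pre-opened USB file descriptor handed over by termux-usb",
    )
    parser.add_argument("--simulate", action="store_true")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print("can_fd =", args.can_fd, "simulate =", args.simulate)
```

Failing fast on a closed or malformed descriptor at parse time keeps the error next to its cause, rather than surfacing later inside the reader thread.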
Exit criteria
- A fresh Pixel 10 with Termux + Termux:API + the CANable plugged in goes from `git clone` to "bridge ingesting real CAN frames" in under 10 minutes, following only INSTALL.md.
- Unplugging the CANable for 5 s and plugging it back in resumes ingest within 5 s without bridge restart, and the `/signals/registry?include_can_state=true` snapshot reflects the re-connection.
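The retry schedule from item 2.6, which the reconnect criterion above depends on, can be sketched as follows. This is a sketch under assumptions: open_with_retry and its open_bus callable are hypothetical, OSError is assumed to be what a failed open raises, and the final 10 s step is assumed to repeat until the device comes back.

```python
import itertools
import time

BACKOFF_STEPS_S = (1, 2, 5, 10)  # schedule from item 2.6; the last value repeats


def backoff_delays(steps=BACKOFF_STEPS_S):
    """Yield each step once, then repeat the final step forever."""
    yield from steps
    yield from itertools.repeat(steps[-1])


def open_with_retry(open_bus, sleep=time.sleep, max_attempts=None):
    """Call open_bus() until it succeeds; return (bus, failed_attempts)."""
    delays = backoff_delays()
    failures = 0
    while True:
        try:
            return open_bus(), failures
        except OSError:
            failures += 1
            if max_attempts is not None and failures >= max_attempts:
                raise
            sleep(next(delays))
```

With this schedule, attempts land at roughly 0, 1, 3 and 8 s after the disconnect (ignoring how long each open attempt takes), so a 5 s unplug would be picked up about 3 s after the device returns. Once the delay has grown to 10 s, a longer outage could take more than 5 s to recover, which may matter for the criterion above.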
Risk if skipped
No prod track day. We've already validated the upstream chain (CAN bus on the Mac), but the Pixel-side ingest path does not exist.
Phase 3: Operability for an unattended bridge
Goal: When the bridge runs unattended in the car for hours, we can answer "is anything wrong?" from logs alone, and the bridge recovers from common transient failures without manual intervention.
Total budget: 1 working day.
Items
| # | Title | Cost | File |
|---|---|---|---|
| 3.1 | Structured logging: top-of-main() logging.basicConfig with %(asctime)s %(levelname)s %(name)s %(message)s; replace print(...) calls throughout __main__.py and blueprints; add --log-level CLI | 1 h | __main__.py, all bp_*.py |
| 3.2 | Cap /session/<sid>/signals window: reject if rate_hz × (t_to − t_from) > 10_000 with HTTP 413 and a "narrow your window" message; default to last 60s if neither bound given | 15 min | bp_signals.py:111-123 |
| 3.3 | Periodic RSS log line every 60s; flag if growth > 10 MB/min | 30 min | __main__.py background task |
| 3.4 | LocalLLM fast health probe: HEAD /v1/models with 2s timeout before every /coach/brief call; on failure return rules-only response immediately, log warn, expose state.litert_up in /health | 1 h | src/pitwall/features/coaching/litert_coach.py (was in monolithic coach_engine.py pre PR #30) |
| 3.5 | Watchdog thread that detects a stuck reader: if frames_per_second == 0 AND last_frame_age_s > 30 AND loaded == True, log error and restart the reader | 1 h | __main__.py |
| 3.6 | _latest and _tall_id_cache bounded to 1000 entries each (LRU via collections.OrderedDict) | 15 min | can_reader.py |
| 3.7 | /health extended with can.fps, can.connected, litert.up, simulator.running, wide_rows.last_5min, tall_rows.last_5min, so a single curl tells the operator everything | 30 min | __init__.py or bp_diagnostics.py |
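The window cap in item 3.2 is simple arithmetic, sketched below as a framework-free function. The name resolve_window is hypothetical. The plan only specifies the default when neither bound is given. Filling a single missing bound (t_to from now, t_from from t_to minus 60 s) is an assumption made here for illustration.

```python
MAX_POINTS = 10_000       # cap from item 3.2
DEFAULT_WINDOW_S = 60.0   # "last 60s" default from item 3.2


def resolve_window(rate_hz, now, t_from=None, t_to=None):
    """Return (200, (t_from, t_to)) or (413, message) for a signals query."""
    # Assumption: a missing upper bound means "up to now" and a missing
    # lower bound means "60 s before the upper bound".
    if t_to is None:
        t_to = now
    if t_from is None:
        t_from = t_to - DEFAULT_WINDOW_S
    points = rate_hz * (t_to - t_from)
    if points > MAX_POINTS:
        message = (
            f"narrow your window: {points:.0f} points requested, "
            f"limit is {MAX_POINTS}"
        )
        return 413, message
    return 200, (t_from, t_to)
```

For example, a 5 Hz query with no bounds covers 300 points and passes, while a 100 Hz query over 1000 s asks for 100,000 points and would be rejected.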
Exit criteria
- A 4-hour idle simulator run produces a clean log with no errors, bounded RSS, and a `/health` snapshot the on-call can read in 5 seconds.
- Killing LocalLLM mid-session causes coach responses to fall back to rules instantly (no 30s stalls); `/health` flips `litert.up=false`.
- Unplugging the CANable mid-session triggers the watchdog within 35 s and reconnect succeeds (combines with item 2.6).
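The stuck-reader condition from item 3.5 behind the last criterion can be written as a small predicate plus one watchdog tick. This is a sketch only: ReaderState is a hypothetical stand-in for whatever can_reader's state() returns, and restart is whatever callable restarts the reader.

```python
from dataclasses import dataclass

STALL_AGE_S = 30.0  # threshold from item 3.5


@dataclass
class ReaderState:
    """Hypothetical snapshot of the fields the watchdog needs."""
    frames_per_second: float
    last_frame_age_s: float
    loaded: bool


def reader_is_stuck(state, stall_age_s=STALL_AGE_S):
    """True when a loaded reader has produced no frames for too long."""
    return (
        state.loaded
        and state.frames_per_second == 0
        and state.last_frame_age_s > stall_age_s
    )


def watchdog_tick(state, restart):
    """Run one check; call restart() and return True if the reader is stuck."""
    if reader_is_stuck(state):
        restart()
        return True
    return False
```

Since the condition fires only after 30 s of silence, calling watchdog_tick every 5 s or less would keep detection inside the 35 s bound. The polling interval itself is not specified in the plan.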
Risk if skipped
We won't know the bridge has degraded until the driver complains.
Phase 4: Scale & disk hygiene
Goal: Bridge survives a full season of track days on a single device without manual cleanup.
Total budget: 2 working days.
Items
| # | Title | Cost | File |
|---|---|---|---|
| 4.1 | Session archive: POST /session/<sid>/end exports session rows to archive/<sid>.duckdb, DELETEs them from live DB, VACUUMs | 3 h | db.py, new blueprint endpoint |
| 4.2 | Automatic archive on session-end (lap detector EOL, or 30 min of zero frames) | 2 h | db.py, bp_session.py |
| 4.3 | Daily VACUUM at 03:00 local time via a small scheduler thread (only when frames_per_second < 1, so we never block live ingest) | 1 h | __main__.py |
| 4.4 | Per-thread DuckDB read connection via threading.local; one shared writer connection owned by the CAN reader thread; HTTP endpoints get read-only connections | 3 h | db.py:32 |
| 4.5 | Migrate the few HTTP endpoints that do INSERT/DELETE (notes, capabilities recompute) to queue work to the writer via a small queue, returning 202 Accepted | 2 h | bp_*.py writers |
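The single-writer arrangement in items 4.4 and 4.5 can be sketched with the standard library alone. In the sketch below, WriterQueue is a hypothetical name and apply_write stands in for whatever function actually executes the INSERT or DELETE on the writer's DuckDB connection.

```python
import queue
import sys
import threading


class WriterQueue:
    """Funnel all writes through one thread; callers only enqueue."""

    def __init__(self, apply_write):
        self._apply = apply_write
        self._jobs = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, job):
        """Enqueue a job and return the HTTP status the endpoint would send."""
        self._jobs.put(job)
        return 202  # Accepted: the write has been queued, not yet applied

    def _run(self):
        while True:
            job = self._jobs.get()
            if job is None:  # sentinel pushed by close()
                self._jobs.task_done()
                return
            try:
                self._apply(job)
            except Exception as exc:
                # One bad job must not kill the only writer thread.
                print(f"write failed: {exc}", file=sys.stderr)
            finally:
                self._jobs.task_done()

    def close(self):
        """Drain everything already queued, then stop the writer thread."""
        self._jobs.put(None)
        self._thread.join()
```

Because jobs are applied in FIFO order by a single thread, HTTP handlers never contend for the write connection. The trade-off is that a 202 response only says the write was queued, so a client that needs read-your-writes semantics has to poll.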
Exit criteria
- After 50 simulated sessions, the live `pitwall_sessions.duckdb` is under 100 MB; older sessions live in `archive/`.
- `/session/<sid>/signals` 5 Hz polling load test: 100 RPS for 5 minutes against the bridge with the simulator running stays under 150 ms p99 latency.
Risk if skipped
Slow degradation: disk fills, queries slow down, eventually the phone's Doze killer or filesystem corruption finishes the job.
Phase 5: Security hardening
Goal: Bridge is safe even if other untrusted apps run on the same phone, and if the phone itself is lost.
Total budget: 1 working day.
Items
| # | Title | Cost | File |
|---|---|---|---|
| 5.1 | Bearer token auth: PITWALL_TOKEN env var; if set, all requests must include X-Pitwall-Token; /health exempt | 1 h | __init__.py (Flask before_request) |
| 5.2 | Rate limit /coach/* and /session/*/signals at e.g. 20 RPS per source | 1 h | __init__.py (small token-bucket middleware) |
| 5.3 | Verify Android disk encryption is on at install time (getprop ro.crypto.state) and log a warning to INSTALL if it's unencrypted | 30 min | install script |
| 5.4 | Cap request body size (1 MB) and Content-Type allow-list (application/json only on POSTs) | 15 min | __init__.py |
| 5.5 | Audit CORS: drop CORS(app) (allow-all) → CORS(app, origins=["http://localhost:*"]) so the bind-to-127.0.0.1 promise extends to the browser layer | 15 min | __init__.py:49 |
| 5.6 | Document the threat model + the assumptions we make (single-user device, no remote exposure, Android FDE) | 30 min | new docs/security.md |
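The decision logic behind item 5.1 is small enough to sketch without Flask. Here auth_status is a hypothetical helper that a before_request hook could call. It takes the headers as a plain dict, so unlike Flask's header object the lookup is case-sensitive.

```python
import hmac

TOKEN_HEADER = "X-Pitwall-Token"       # header name from item 5.1
EXEMPT_PATHS = frozenset({"/health"})  # /health stays open per item 5.1


def auth_status(path, headers, expected_token):
    """Return 200 if the request may proceed, 401 otherwise."""
    if not expected_token:
        return 200  # PITWALL_TOKEN unset: auth is disabled
    if path in EXEMPT_PATHS:
        return 200
    supplied = headers.get(TOKEN_HEADER, "")
    # compare_digest avoids leaking the match position through timing.
    if hmac.compare_digest(supplied.encode(), expected_token.encode()):
        return 200
    return 401
```

In the bridge, expected_token would presumably come from os.environ.get("PITWALL_TOKEN"), read once at startup.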
Exit criteria
- A second app on the phone calling `/coach/brief` without the token receives 401.
- Loading 1000 requests/second against `/coach/ask` produces 429s, not OOM.
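The token-bucket middleware from item 5.2, which the 429 criterion relies on, could be shaped like the sketch below. The class names are hypothetical, the burst size of 20 is an assumption (the plan only gives the 20 RPS rate), and the clock is injectable so the behaviour can be checked without sleeping.

```python
import threading
import time


class TokenBucket:
    """Refill at rate_per_s up to burst; each allowed request costs one token."""

    def __init__(self, rate_per_s, burst, clock=time.monotonic):
        self.rate = rate_per_s
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.clock = clock
        self.updated = clock()

    def allow(self):
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class RateLimiter:
    """One bucket per source; returns the HTTP status for each request."""

    def __init__(self, rate_per_s=20, burst=20, clock=time.monotonic):
        self._args = (rate_per_s, burst, clock)
        self._buckets = {}
        self._lock = threading.Lock()  # request threads share the bucket table

    def status_for(self, source):
        with self._lock:
            bucket = self._buckets.get(source)
            if bucket is None:
                bucket = self._buckets[source] = TokenBucket(*self._args)
            return 200 if bucket.allow() else 429
```

Each rejected request costs only a dictionary lookup and a few float operations, which is the property the "429s, not OOM" criterion is after. The bucket table itself is unbounded in this sketch, which is acceptable only while the set of sources stays small.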
Risk if skipped
Today the surface is benign (single-user device), but the first time
we ship a partner app or expose adb forward to a workstation, the
attack surface opens.
Phase 6: Hot-path & latency tuning
Goal: Sustain the real AiM MXP rate (350 fps peak) on the Pixel without dropping frames or smearing coaching latency.
Total budget: 2 working days (mostly profiling + small fixes).
Items
| # | Title | Cost | File |
|---|---|---|---|
| 6.1 | cProfile capture of _consume over a 60-s 350 fps simulator run; flag anything > 100 µs/frame | 2 h | profiling |
| 6.2 | Cross-derived dedupe: skip evaluating a derived: entry if all bind: inputs have identical values to last call | 1 h | car_config.py |
| 6.3 | Lazy-load ADK agents: keep only the active "intent" agent in RAM; spawn others on demand | 4 h | src/pitwall/features/coaching/adk_agents.py |
| 6.4 | Batch tall-store inserts across multiple frames (50 ms window) instead of per-frame executemany | 2 h | can_reader.py |
| 6.5 | Move DuckDB writes into a dedicated thread fed by a queue.Queue from the reader; reader never blocks on the DB lock | 4 h | can_reader.py |
| 6.6 | Re-measure all numbers after each change; commit a docs/reports/perf-baseline.md with the before/after | 1 h | docs |
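The dedupe in item 6.2 amounts to memoising each derived entry on its last input tuple. A minimal sketch follows. DerivedSignal and the signal names in the usage are hypothetical and do not reflect the actual structure of car_config.py.

```python
class DerivedSignal:
    """Recompute fn only when at least one bound input has changed."""

    def __init__(self, name, inputs, fn):
        self.name = name
        self.inputs = tuple(inputs)  # keys into the latest-values mapping
        self.fn = fn
        self._last_args = None
        self._last_value = None
        self.evaluations = 0         # counter kept only to make the saving visible

    def value(self, latest):
        args = tuple(latest[key] for key in self.inputs)
        if args != self._last_args:
            self._last_value = self.fn(*args)
            self._last_args = args
            self.evaluations += 1
        return self._last_value


# Hypothetical usage with made-up signal names:
total_brake = DerivedSignal(
    "brake_total_bar",
    ["brake_front_bar", "brake_rear_bar"],
    lambda front, rear: front + rear,
)
```

The comparison itself costs one tuple build per call, so the saving only materialises for expressions that are more expensive than that. The 6.1 profile would show which derived entries qualify.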
Exit criteria
- At 350 fps synthetic load, `frames_dropped == 0` over 10 minutes.
- Coach round-trip latency (CAN frame → cue arriving in PWA) under 100 ms p95.
Risk if skipped
Bridge keeps up at idle / pit-lane rates (the conditions we've tested), but might miss frames at hot-lap fast-IMU rates on Pixel CPU under thermal throttle.
Cross-phase: testing & CI
Items applicable across phases; should run in CI continuously.
| # | Title | Phase |
|---|---|---|
| C.1 | Rebuild the broken tests/features/telemetry/test_can_pipeline.py against the current AiM MXP DBC; today it expects messages that don't exist | 1 |
| C.2 | Smoke test: pytest -k simulate boots the bridge with --simulate, hits /health, queries /session/_live/capabilities, validates pipeline-derived signals are present | 1 |
| C.3 | Add a "kill-mid-write" test: --simulate for 5 s, kill -9 (SIGKILL), restart, assert no .corrupted-* file | 4 |
| C.4 | Long-run nightly: 4-hour --simulate, assert RSS bounded, no errors in log | 3 |
| C.5 | CAN reconnect test: start --simulate, swap virtual channels mid-run, assert reconnect | 2 |
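The final assertion of C.3 needs a way to tell whether the recovery path fired. One possible helper is sketched below. The function name is hypothetical, and the glob pattern assumes the rotated file keeps ".corrupted-" followed by a timestamp somewhere in its name, as the Phase 0 table suggests. The exact naming in db.py is not shown here.

```python
from pathlib import Path


def corrupted_db_files(data_dir):
    """List files that look like a DB rotated aside by the recovery path."""
    # Assumption: rotated files contain ".corrupted-" in their name.
    return sorted(Path(data_dir).glob("*.corrupted-*"))


# In the kill-mid-write test, after restarting the bridge:
# assert corrupted_db_files("data") == []
```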
Sequencing summary
NOW ──► Phase 1 (2h) ─► Phase 2 (1d) ─► Phase 3 (1d) ─► [first track day]
│
▼
Phase 4 (2d) ─► Phase 5 (1d) ─► Phase 6 (2d)
- Phase 1 unblocks everything: required hygiene; ≤2 h.
- Phase 2 unblocks the first track day: required to read CAN on a stock Pixel.
- Phase 3 makes the first track day not panicky: operability + observability.
- Phases 4–6 are the "long-term ownership" investment: ship them incrementally as track-day learnings come in.
Total to "ready for first track day": ~2.25 working days.
The audit findings doc (server-audit-termux.md) is the authoritative
source for the what and why of each item. This doc is the
authoritative source for the order and acceptance.