Live Task Monitoring · SARAH Platform & Scale Growth Advisors

Live Task Monitoring

Every outstanding task across the SARAH voice platform, the LLM workstream and the Scale Growth Advisors venture. Prime directive (Chris 2026-05-23): make all voicebots real-time conversationalists — faster and snappier, faster and faster. 630 tasks in 14 groups, led by the REAL-TIME CARDINAL track (architecture) and PORTAL = SOURCE OF TRUTH track (control + visibility). Omega Leads is tracked separately on its own build tracker. Each task is held to the same six-gate standard — built, audited, tested, proven working, persisted and backed up. A task turns green only at six of six.

Tasks fully complete: 213 / 630
Overall progress: 34%
Task groups: 14
Tasks built (gate 1): 213

Six-gate standard (tasks passing each gate):
Built: 213 / 630
Audited: 213 / 630
Tested: 213 / 630
Proven: 213 / 630
Persisted: 213 / 630
Backed up: 213 / 630

CHICAGO DC MIGRATION — US-3 → new server (Chris 2026-05-28) — runbook + Google Doc ready, exec pending new-server details

1 / 10
CHI-PLAN Migration runbook written + Google Doc created (docs.google.com/document/d/196PKF9hZmiwT6hh30SQLGbRbuk3BUsliIgaHApaRduw). File-level rsync method (not dd). Footprint: 1.3TB, 72 services, 10 containers, 8 GPUs, Ubuntu 24.04.
CHI-0 Pre-flight: full US-3 snapshot to Drive (pg_dump+etc+units+crons+compose+nvidia/net state) + authoritative inventory + grep-sweep all hosts for 64.34.93.231. Old box stays as rollback. BLOCKED on Chris: new server IP + SSH + GPU count + link speed.
CHI-1 New server prep: Ubuntu 24.04 base + NVIDIA driver >=580.126.16 + CUDA 13 + docker + /data RAID >=1.5TB. New box keeps own host keys; carry old WireGuard keys (peers change endpoint IP only).
CHI-2 Bulk rsync -aHAX while US-3 LIVE, biggest+static first: /data/models 389G, sarah30 244G, connectors 151G, /opt/vllm 85G, then root fs (exclude /data /dev /proc /sys /tmp /run /boot /etc/fstab /etc/netplan). Docker save 15 images + 6 volumes + compose.
CHI-3 Incremental delta rsync passes (2-3x) until the delta is near zero, as the system quiesces toward cutover.
CHI-4 Cutover: final Drive snapshot, stop writers, pg_dumpall->restore (NOT live rsync, 82MB), final delta rsync, rebuild fstab(new UUIDs)+netplan(new IP)+GRUB on new box, start services (postgres->vLLM->STT/TTS->SARAH-SWITCH->voicebot-py->hermes).
CHI-5 Verify: acceptance audit (reference_audit_plan), 8 GPUs healthy (if GPU7 good -> deploy 72B), vLLM :8002/:8004, SARAH-SWITCH :5060 77/77, postgres row counts + cardinal knobs, all 10 containers + 72 services, real call smoke test.
CHI-6 IP propagation (highest risk): EU-3 crons SSHing US-3, us3-vllm-tunnel, WireGuard peer endpoints (EU-3/AU/US-2/Sparks), Cloudflare DNS, known_hosts all peers, fail2ban/nginx allowlists, memory IP map. extension_ai_config uses 127.0.0.1 (no change). Final grep-sweep = 0 stragglers.
CHI-7 Rollback soak: old US-3 stays powered+intact 48h. Rollback = flip IP map back + restart old services (nothing deleted).
CHI-8 Decommission old after soak: final cold archive to Drive, confirm GPU-7 RMA disposition, WIPE secrets (SA JSON/SSH keys/API keys/TLS/DB) before hardware return, update memory US-3=new IP.
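
The CHI-2 bulk pass can be sketched as a small helper that only builds the rsync argument list, so the command can be reviewed before anything runs. This is a minimal sketch, not the runbook's script: the destination host `new-us3` is a placeholder (the new server details are still pending), and defaulting to `--dry-run` is an added safety assumption. The `-aHAX` flags and the exclude list come from CHI-2.

```python
from typing import List, Sequence

# Paths CHI-2 excludes from the root-filesystem pass.
ROOT_EXCLUDES = ("/data", "/dev", "/proc", "/sys", "/tmp", "/run",
                 "/boot", "/etc/fstab", "/etc/netplan")


def build_rsync_cmd(src: str, dest: str,
                    excludes: Sequence[str] = (),
                    dry_run: bool = True) -> List[str]:
    """Return an rsync argv list; nothing is executed here."""
    cmd = ["rsync", "-aHAX"]          # archive + hard links + ACLs + xattrs
    if dry_run:
        cmd.append("--dry-run")       # assumption: preview first, run later
    for path in excludes:
        cmd.append(f"--exclude={path}")
    cmd += [src, dest]
    return cmd


# Hypothetical destination; replace once the new server IP is known.
root_pass = build_rsync_cmd("/", "root@new-us3:/", ROOT_EXCLUDES)
print(" ".join(root_pass))
```

The same helper covers the CHI-3 delta passes: rerunning an identical command transfers only what changed since the previous pass.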

2026-05-27 US-3 SARAH-SWITCH CUTOVER — FreeSWITCH retired fleet-wide (Chris 2026-05-27)

8 / 8
US3-SS-1 Reconnaissance on US-3: /opt/idesks/sarah-switch absent (clean slate), Python 3.12.3, libopus.so.0 present, port 5063 free.
US3-SS-2 rsync sarah-switch source from EU-3 to US-3 (12 files, 824K) + pip install pytest pytest-asyncio. All imports OK.
US3-SS-3 Wrote US-3 routes.toml (added ^1000$ omni-SARAH route, the only delta from EU-3) + trunks.toml (register=false, env-var creds only).
US3-SS-4 Installed /etc/systemd/system/sarah-switch.service on US-3, patched DEFAULT_PORT=5063 for parallel mode, started service. SIP OPTIONS 200 OK on :5063 with both FS+SS coexisting.
US3-SS-5 Pre-cutover snapshot: docker_state + sofia_status (4 Crazytel REGED) + calls_count (0 total) + services_running + sip_listeners. Tarballed + uploaded to Drive: gdrive-sa:abcus-system-snapshots/us3-pre-ss-cutover-20260527T153310Z.tar.gz
US3-SS-6 Regression on US-3: 77/77 green in 3.94s. Same code as EU-3, same result.
US3-SS-7 CUTOVER: docker stop fs-voicebot (port 5060 freed) → systemctl stop sarah-switch → patch DEFAULT_PORT 5063→5060 → systemctl start && enable sarah-switch. Bound to 0.0.0.0:5060 in 2s.
US3-SS-8 Post-cutover verify: SIP OPTIONS to :5060 returns 200 OK Server: SARAH-SWITCH/0.1, Allow: INVITE,ACK,BYE,CANCEL,REGISTER,OPTIONS. Final regression: 77/77 green in 3.91s. sarah-switch active+enabled, fs-voicebot exited (not removed, so rollback-ready).
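
The SIP OPTIONS check used in US3-SS-4 and US3-SS-8 can be reproduced with a few lines of standard-library Python. The sketch below is an illustration, not the probe actually used: the function names are invented here, the `From` user `probe` is arbitrary, and it assumes the server honours `rport` so the reply reaches the sending socket. The header set follows the minimal request shape in RFC 3261.

```python
import socket
import uuid
from typing import Dict, Tuple


def build_options(host: str, port: int, local_ip: str = "127.0.0.1",
                  local_port: int = 5099) -> bytes:
    """Build a minimal SIP OPTIONS request (hypothetical helper)."""
    lines = [
        f"OPTIONS sip:{host}:{port} SIP/2.0",
        f"Via: SIP/2.0/UDP {local_ip}:{local_port};rport;"
        f"branch=z9hG4bK{uuid.uuid4().hex[:16]}",
        "Max-Forwards: 70",
        f"From: <sip:probe@{local_ip}>;tag={uuid.uuid4().hex[:8]}",
        f"To: <sip:{host}:{port}>",
        f"Call-ID: {uuid.uuid4().hex}@{local_ip}",
        "CSeq: 1 OPTIONS",
        "Content-Length: 0",
        "", "",                      # blank line terminates the headers
    ]
    return "\r\n".join(lines).encode("ascii")


def parse_response(data: bytes) -> Tuple[int, str, Dict[str, str]]:
    """Return (status code, reason phrase, lower-cased headers)."""
    head = data.decode("utf-8", errors="replace").split("\r\n\r\n", 1)[0]
    status_line, *header_lines = head.split("\r\n")
    version, code, reason = status_line.split(" ", 2)
    if version != "SIP/2.0":
        raise ValueError(f"not a SIP response: {status_line!r}")
    headers = {}
    for line in header_lines:
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return int(code), reason, headers


def probe(host: str, port: int = 5060, timeout: float = 2.0):
    """Send one OPTIONS over UDP and parse the first reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.bind(("", 0))           # let the OS pick the source port
        request = build_options(host, port, local_port=sock.getsockname()[1])
        sock.sendto(request, (host, port))
        data, _ = sock.recvfrom(4096)
    return parse_response(data)
```

A healthy listener would be expected to yield status 200, with the `server` and `allow` headers matching the values recorded in US3-SS-8.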

2026-05-27 chrisadamswismail.team + MEMORY + 100YEARS EDU + SS-8 (Chris 2026-05-27)

11 / 12
SS-8 FreeSWITCH retired on EU-3, SARAH-SWITCH primary on :5060: docker stop fs-voicebot, port 5063→5060, restart, SIP 200 OK in 2s, 77/77 regression green. Backup ss8-cutover-20260527T050106Z.tar.gz on Drive.
MK-1 Marriott Sri Lanka demo API key copied to gdrive-sa:idesks-internal/secrets/marriott-demo-key.txt and deleted from US-3. Tenant marriott-demo, monthly_cap=0.
CAD-1 chrisadamswismail.team: built page in Pointr eco-system theme (white+green+Inter, mirrored from /pointr-eco-system). 6 SARAH product tiers + SOPHIA arch + comparison table + 1000 reasons + Google Calendar booking button.
CAD-2 chrisadamswismail.team DNS fix: A records were pointing to 159.69.66.50 (wrong machine). Corrected via CF API to 116.202.118.237 (actual EU-3 IP).
CAD-3 chrisadamswismail.team nginx fix: vhost had `listen 159.69.66.50:80` but socket was 0.0.0.0:80 (merged). Changed to plain `listen 80;` so server_name routing works.
CAD-4 chrisadamswismail.team TLS: Let's Encrypt cert via CF DNS challenge, SANs cover root + www, expires 2026-08-25. CF proxied (orange cloud, Flexible SSL).
CAD-5 chrisadamswismail.team content: Omega Leads dedicated section removed per Chris (kept as credential mention in hero + 1 pillar + 1 reason card). 100years.py rebuilt; both vhosts HTTP 200.
MEM-1 Three new long-term memory memos written: project_chrisadamswismail_team_live, reference_eu3_server_ip (116.202.118.237 NOT 159.69.66.50), project_pointr_ext1000_status.
MEM-2 Memory snapshot uploaded to Drive: gdrive-sa:abcus-system-snapshots/abcus-chris-memory-20260527T150016Z.tar.gz (1.9 MB, 901 files). Drive URL: https://drive.google.com/open?id=1FhmKCNSdsarZnfbKD7iAV4qx2Lhsos7K
EDU-1 100 Years of Tool Making manifest saved as long-term reference memo: 34.7M features, 191,773 categories, 1.5M+ connector directories, 41 pages, 14 core components.
EDU-2 Hermes education brief written to /opt/idesks/docs/hermes-shared/learnings/100years-manifest-brief-2026-05-27.md; will reach HERMES-US3 via HSL-2 sync (cron :15 / :45).
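US3-MIG SARAH-SWITCH migration on US-3: NOT YET DONE when this entry was written. FreeSWITCH (fs-voicebot container) still primary on US-3. EU-3 cutover (SS-8) was the only one shipped. US-3 cutover requires Chris-present session. (The 2026-05-27 US-3 SARAH-SWITCH CUTOVER group, US3-SS-1..8, records the later US-3 cutover.)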

2026-05-26 NIGHT MARATHON + MORNING SHIPPING — 18 done + 1 open P0 (Chris+Opus1+Opus2)

18 / 19
V-1 sarah-voice-api LLM_URL→:8002 (32B), use_vad=False, persona_override silent bug fixed, psycopg3 DB helpers wired. Tier routing cardinal locked.
V-2 Echo guard (_bot_speaking_until) + barge-in cancel (_cancel_generation) + call_traces telemetry INSERT per turn, all live in voice_stream WS handler.
V-3 arch4-bot-8877 + arch4-bot-8888 systemd services live. VICIDIAL slot bots now Arch4. FS directory + dialplans updated (Arch4 primary, Lua fallback intact).
V-4 Ext 5555 simultaneous-ring Stage 2 expanded; races 5 legs: loopback/7777, 8877, 8888, 2222 + 8800 VICIDIAL. First Arch4 SARAH wins.
V-5 tenant_api_keys Postgres table + monthly quota gate (402 over cap) + /portal/api-usage page live + Marriott-demo key seeded + monthly reset cron.
EOT-1 Phase 1: EOT-confidence early commit in stream_monitor.py + brain.py. predict_eot≥0.70 + 100ms stable → commit (was 300ms silence). ~200ms saved on confident terminators.
EOT-2 Phase 2: In-flight LLM/TTS cancellation in brain.py. New speech mid-generation → _cancel_generation event → LLM stream breaks → TTS queue drains → new turn clean.
TURN-GUARD _turn_in_flight flag at stream_monitor.py:236/366/370/719; stops double-firing on STT partial-promotion race (Whisper promoted-partial + real FINAL both bypass last_final dedup).
CARDINAL-3 Smart barge-in CARDINAL locked: barge_in_min_chars=20 + barge_in_grace_ms=2000. Filters backchannels + grace window after TTS start. Chris: "if we can fine tune that, we have THE CROWN JEWEL brother".
KNOBS Three knobs retuned to work as a system: barge_in_min_chars 2000→20, tts_stream_chunk_size 2000→500 (first-audio 62ms→15ms), echo_guard_tail_ms 2000→400.
LONGER-REPLIES voice_brevity_overlay rewritten "natural, complete sentences, 3-5 sentences, yes/no + context". Paired: max_tokens=2000, num_predict=2000. Replaces demo-mode 1-2-sentence brevity.
ADMIN-GUIDE Admin guide LIVE at idesksonline.ai/admin-guide/ + i-desks.com/admin-guide/. 2121-line HTML covering 37 system_config keys + 28 extension_ai_config columns + 4 cardinals + latency budget + playbooks + 40-FAQ + runbooks. nginx fix: location ^~ /admin-guide/ override.
PRE-REBOOT-SNAP Pre-reboot snapshot bundle to Drive at gdrive-sa:pre-reboot-snapshots/20260526T001014Z/ (13 files, 147MB). US3 voice stack + pg_dumpall + system_config + service state + EU-2 memory + admin guide + sbin + cron/nginx/letsencrypt. Post-reboot checklist in manifest.
CODEX-HERMES Codex CLI 0.133.0 wired for hermes user (uid 10000) in both hermes-eu3 + hermes-us3 containers. Gotcha: hermes HOME is /opt/data per /etc/passwd, NOT /home/hermes. Config at /opt/data/.codex/config.toml.
CODEX-PROFILES 8 named Codex profiles: coder (qwen3-coder default), reasoning (qwen3:32b), deep (gemma4:31b), fast (gemma4:latest), cheap (qwen3:8b), quick (qwen3:4b) + 2 legacy. Per-container provider correct (EU3=public IP, US3=127.0.0.1).
HERMES-BRIEFED Comprehensive night-revelations briefing delivered to hermes-eu3 + hermes-us3 via /opt/data/hermes-shared/learnings/. Both Hermes containers acknowledged within 90s with key takeaways. Nightly curators ran ad-hoc.
JOBS-JSON-FIX jobs.json permission bug on hermes-eu3: chown'd root:root → hermes:hermes. Hermes cron module now reads cleanly. Same root-vs-runtime-user pattern that bit Codex config later.
SS-SPEC-LOCKED SARAH-SWITCH strategic pivot announced + persisted. 8 tasks SS-1..SS-8 (~6.5d). DirectPCMTap in media_bridge.py eliminates today's zero-audio bug class by construction. Spec memo + Hermes briefing delivered. P0 ext-7777 fix is pre-condition.
P0-7777-OPEN ext-7777 zero-audio regression UNDER DIAGNOSIS: Opus 1 added a brain.py:240-245 frame-amplitude diagnostic. Bot restarted clean. Awaiting Chris test call to capture arr_max/pcm16_max → pinpoint exact bug location. (open in SARAH-SWITCH phase as the P0.)
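
The EOT-1 early-commit rule can be written as a single decision function. This is a sketch of the rule as described, not the code in stream_monitor.py: the function and field names are invented here, and keeping the old 300ms silence rule as the fallback path is an assumption. The 0.70, 100ms and 300ms values are taken from EOT-1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EotConfig:
    min_confidence: float = 0.70     # predict_eot threshold from EOT-1
    stable_ms: int = 100             # transcript must be unchanged this long
    fallback_silence_ms: int = 300   # previous silence-only rule (assumed fallback)


def should_commit(eot_confidence: float, stable_for_ms: int,
                  silence_ms: int, cfg: EotConfig = EotConfig()) -> bool:
    """Decide whether to commit the caller's turn to the LLM."""
    confident = eot_confidence >= cfg.min_confidence
    if confident and stable_for_ms >= cfg.stable_ms:
        return True                  # early commit on a confident terminator
    return silence_ms >= cfg.fallback_silence_ms


# Confident end-of-turn: commits at 100ms instead of waiting for 300ms.
print(should_commit(0.85, stable_for_ms=100, silence_ms=100))
# Low confidence: waits for the full silence window.
print(should_commit(0.40, stable_for_ms=150, silence_ms=150))
```

Under these assumptions the saving on a confident terminator is the gap between the two windows, 300ms minus 100ms, which matches the ~200ms quoted in EOT-1.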

SARAH-SWITCH — replace FreeSwitch with pure-Python SIP/Media (Chris 2026-05-26)

8 / 9
P0 Fix ext-7777 zero-audio regression in brain.py: PCM fed to WhisperLive is all zeros (caller speaks → silence in → no FINAL → no reply). 4 cardinals intact; bug is in V-2/EOT changes to inbound audio path. PRE-CONDITION for any SS-* work. Backups: brain.py.bak-eot-phase1-phase2 / .bak-pre-crackle-fix / .bak-pre-day3-*
SS-1 sip_core.py: SIP registrar, INVITE/REGISTER/BYE, Opus-only SDP, route dispatch. (~2d) v0.1 SHIPPED 2026-05-26 19:49 UTC: systemd unit live on udp/0.0.0.0:5063, integration tests pass (REGISTER+digest auth + INVITE+Opus SDP + BYE + OPTIONS + 488-no-Opus). Source: /opt/idesks/sarah-switch/sip_core.py. Pure stdlib.
SS-2 media_bridge.py: RTP↔WebRTC, G.711→Opus, DirectPCMTap feeds WhisperLive directly (today's bug class impossible by construction). (~2d, deps SS-1) v0.1 SHIPPED 2026-05-26 22:04 UTC: media_bridge.py + rtp.py + opus_codec.py (ctypes-libopus, pure stdlib). RTP roundtrip end-to-end with DirectPCMTap: caller Opus RTP → parse → opus_decode → in-process PCM → EchoTap → opus_encode → outbound Opus RTP. SIP INVITE now allocates a real even port from MediaPortPool (range 16384-32767) and advertises it in SDP answer. BYE releases. Marker bit=0 on every outbound (Cardinal 2). All 4 unit tests + 2 integration tests green.
SS-3 session.py: state machine (RINGING→ENDED), sip_cdr Postgres table, billing hooks. (~0.5d, deps SS-1) v0.1 SHIPPED 2026-05-26 22:24 UTC: session.py with state machine (IDLE→RINGING→CONFIRMED→ENDED), CdrStore SQLite backend at /var/lib/sarah-switch/sip_cdr.db (WAL mode, PG-ready schema), per-tenant billing rollup, 10s flusher loop, wired to sip_core on INVITE/answer/BYE/CANCEL.
SS-4 route_table.py: Python dialplan replacing all XML, SIGHUP hot-reload. (~0.5d, deps SS-1) v0.1 SHIPPED 2026-05-26 22:24 UTC: route_table.py loads /opt/idesks/sarah-switch/routes.toml. 7 default routes (7777/2222/5555/88xx/E.164/echo/default-deny). SIGHUP hot-reload tested (reload_count bumps; running calls unaffected). Pluggable types: localext/aor/pstn/reject. Bad routes file keeps previous table (no crash). End-to-end test through all 4 layers passes.
SS-5 pstn_trunk.py: Crazytel trunk connector, inbound + outbound PSTN. (~1d, deps SS-1..4)
SS-6 sarah-switch.service systemd unit + health endpoint :5065 + audit integration. (~0.25d, deps SS-1..5)
SS-7 Parallel test on port 5063: full regression matrix (10 test cases). FS stays live on 5060 throughout. (~0.5d, deps SS-1..6)
SS-8 Cutover port 5060, stop FS, decommission. Chris present required. (~0.25d, deps SS-7 + Chris)
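
SS-2 mentions that each INVITE takes an even port from a MediaPortPool covering 16384-32767 and that BYE gives it back. The sketch below shows one way such a pool could behave; it is not the class in media_bridge.py. The method names, the per-call bookkeeping and the FIFO reuse order are all assumptions. Even ports are used for RTP by convention, leaving the next odd port for RTCP.

```python
from collections import deque
from typing import Deque, Dict


class MediaPortPool:
    """Hand out even UDP ports for RTP; illustrative only."""

    def __init__(self, low: int = 16384, high: int = 32767) -> None:
        first_even = low + (low % 2)
        self._free: Deque[int] = deque(range(first_even, high + 1, 2))
        self._in_use: Dict[str, int] = {}    # call_id -> port

    def allocate(self, call_id: str) -> int:
        """Reserve a port for a call (e.g. on INVITE)."""
        if call_id in self._in_use:
            return self._in_use[call_id]     # re-INVITE keeps its port
        if not self._free:
            raise RuntimeError("media port pool exhausted")
        port = self._free.popleft()
        self._in_use[call_id] = port
        return port

    def release(self, call_id: str) -> None:
        """Return the port to the pool (e.g. on BYE); unknown IDs are ignored."""
        port = self._in_use.pop(call_id, None)
        if port is not None:
            self._free.append(port)          # reuse last, to avoid stale RTP

    @property
    def available(self) -> int:
        return len(self._free)


pool = MediaPortPool()
port = pool.allocate("call-a")               # advertised in the SDP answer
pool.release("call-a")
```

Appending released ports to the back of the queue delays their reuse, which gives late packets from the previous call time to die out.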

REAL-TIME CARDINAL — non-negotiable architecture (Chris 2026-05-23, applies to SARAH + VICIDIAL)

2 / 10
RT-1 STT streams continuously into a rolling buffer: no silence-gated record loop, no per-turn batching.
RT-2 TTS streams continuously into the call leg as PCM → RTP: no render-WAV-then-play.
RT-3 Replace turn-based POST /voicebot/start with a per-call long-lived streaming monitor (asyncio agent per call that reads partials and drives TTS continuously).
RT-4 Full-duplex audio: STT continues while TTS speaks; SARAH never goes deaf to talk.
RT-5 Barge-in / interruption-aware generation: caller speaking preempts in-flight TTS instantly.
RT-6 Zero TTS + STT caching: every word freshly generated, forever. ENFORCED 2026-05-23 (cache functions stubbed to None; tables truncated).
RT-7 Sub-300ms perceived turnaround end-to-end.
RT-8 Remove the record-based walkie-talkie loop from voicebot_9999.lua.
RT-9 Per-conversation budget: up to 3 GB VRAM + 3 Gbps per call (STT+TTS+LLM). Default to the BIGGEST/BEST model + setting on every call (Chris 2026-05-23); never sandbag for cost. Total fleet usage unbounded; this is the per-call ceiling.
RT-10 LIFELONG COMMITMENT: the bar is "best voicebot humanity has ever experienced" and the cadence is "better every day, forever." Public-page phrase live: "Real-Time Conversations. No pre-cached phrases. No pre-written scripts." Substance must back it before any customer dials (Chris 2026-05-23).
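
RT-3 to RT-5 describe one long-lived asyncio agent per call in which listening continues while TTS plays and caller speech preempts playback. A toy version of that shape is sketched below. Every name in it is hypothetical: appending to a list stands in for writing PCM to the call leg, and a scripted listener stands in for the STT stream.

```python
import asyncio
from typing import List, Sequence, Tuple


async def speak(chunks: Sequence[bytes], sent: List[bytes],
                barge_in: asyncio.Event) -> bool:
    """Stream chunks until finished or preempted; True means fully played."""
    for chunk in chunks:
        if barge_in.is_set():
            return False              # caller spoke: stop before the next chunk
        sent.append(chunk)            # stand-in for PCM -> RTP
        await asyncio.sleep(0)        # yield so listening continues (full duplex)
    return True


async def listen(sent: List[bytes], barge_in: asyncio.Event,
                 interrupt_after: int) -> None:
    """Scripted caller who starts talking once N chunks have played."""
    while len(sent) < interrupt_after:
        await asyncio.sleep(0)
    barge_in.set()


async def run_turn(chunks: Sequence[bytes],
                   interrupt_after: int) -> Tuple[bool, List[bytes]]:
    barge_in = asyncio.Event()
    sent: List[bytes] = []
    listener = asyncio.create_task(listen(sent, barge_in, interrupt_after))
    completed = await speak(chunks, sent, barge_in)
    listener.cancel()                 # no-op if the listener already finished
    return completed, sent


completed, sent = asyncio.run(run_turn([b"a", b"b", b"c", b"d", b"e"], 2))
print(completed, len(sent))
```

Because the check happens between chunks, the worst-case reaction time in this sketch is one chunk of audio, which is one reason a smaller streaming chunk size also tightens barge-in.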

PORTAL = SOURCE OF TRUTH — voicebot + VICIDIAL (Chris 2026-05-23)

10 / 12
P-1 SARAH voicebot: expose every extension_ai_config column in /portal/extensions/. Currently 6 missing (tts_url, voice_routing_mode, voice_gender_pref, voice_accent_locks, streaming_stt, llm_routing_json).
P-2 system_config: full portal exposure of every key, editable, with save audit log (/portal/system-config/).
P-3 Surface env-only knobs as portal-editable system_config rows (STREAM_STT_WS, LOCAL_STT_URL, LOCAL_STT_MODEL, XTTS_TTS_URL, OLLAMA_MODEL_FAST/DEEP, VLLM_URL, etc.) so there are no hidden env switches.
P-4 Live-state dashboard: loaded LLM models, GPU util per card, active calls, services up/down, no-cache enforcement light, RT-cardinal compliance light.
P-5 Portal audit log: every backend change (SQL UPDATE, code change, service restart, cache truncation) raises a portal event so portal-vs-backend drift is always visible.
P-6 Bidirectional save endpoint: verify every editable column round-trips DB→portal→DB without staleness; audit any caching layer that could hide changes.
P-7 RT-cardinal switches in portal: once RT-1..RT-8 ship, every architectural toggle (streaming monitor enabled, barge-in on/off, max-tokens-per-segment, etc.) lives in the portal.
P-8 VICIDIAL dialer_campaigns: 25 columns, only 7 currently referenced anywhere in the VICIDIAL portal templates. Expose the remaining 18 in /portal/sarah-predictive-dialer/campaigns/.
P-9 VICIDIAL dialer_agents: every column editable in portal (skill_level, campaigns, status, webrtc_enabled, etc.).
P-10 VICIDIAL dialer_dispositions / dialer_lists / dialer_dnc: full portal CRUD.
P-11 VICIDIAL bot extensions (is_vicidial_bot=true): every extension_ai_config knob AND every dialer-specific knob in one editor, the same source of truth as the SARAH voicebot.
P-12 VICIDIAL real-time architecture: same streaming-monitor pattern as SARAH (the RT cardinal applies, Chris 2026-05-23).
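
The coverage gaps in P-1 and P-8 come down to a set difference between the columns a table has and the fields the portal renders. A minimal sketch of that check follows. The helper name is invented, and the "exposed" list is a made-up example rather than the portal's real field list; in practice the column list could be read from Postgres's information_schema.columns view.

```python
from typing import Iterable, List


def missing_columns(db_columns: Iterable[str],
                    portal_fields: Iterable[str]) -> List[str]:
    """Columns present in the table but absent from the portal editor."""
    exposed = set(portal_fields)
    return [col for col in db_columns if col not in exposed]


# Illustrative data only: the six names P-1 lists as missing, plus a few
# columns assumed (for this example) to be exposed already.
table_columns = ["llm_tier", "stt_model", "tts_engine", "tts_url",
                 "voice_routing_mode", "voice_gender_pref",
                 "voice_accent_locks", "streaming_stt", "llm_routing_json"]
portal_exposed = ["llm_tier", "stt_model", "tts_engine"]

gap = missing_columns(table_columns, portal_exposed)
print(len(gap), gap)
```

Running a check like this in CI or on a schedule would flag a new column the moment it is added without a matching portal field.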

LLM Workstream — Track G · Gemma 4 Evaluation & Deployment

1 / 7
G1 Re-baseline Qwen3-32B on the B8 benchmark with thinking OFF, the fair no-think baseline.
G2 Benchmark Gemma 4 31B on Google AI Studio against the B8 scorecard.
G3 B8 decision gate: Gemma 4 31B vs Qwen3-32B; pick the L2 conversational model.
G4 Stand up Gemma 4 31B local serving on US3 via vLLM (if it wins B8).
G5 Benchmark Gemma 4 26B as a lighter-VRAM alternative to the 31B.
G6 Evaluate Gemma 4 E4B / E2B for the edge / L1 fast-path.
G7 Test Gemma 4 multimodal (vision + audio input) for SARAH.

LLM Workstream — Track L · Serving Infrastructure

2 / 3
L8 Upgrade vLLM on US3 (0.21.0 → current) in an isolated venv.
L9 Resolve the Ollama gemma4:31b loader crash (blocked on upstream fix).
L10 Audit no-think across the full voice path end-to-end.

LLM Workstream — Track M · Model Candidates

0 / 4
M12 Benchmark Mistral Small 3.2 24B + GLM-4-32B on the B8 prompt set.
M13 Evaluate EmbeddingGemma for the A3 RAG pgvector index.
M14 Evaluate ShieldGemma 2 as the M-A1 guardrails classifier.
M15 Evaluate TranslateGemma (55 languages) for multilingual hardening.

Voice Pipeline — A · STT Speed & Streaming

5 / 12
#1 Partial-hypothesis streaming: interim transcripts every 100-200ms.
#2 Speculative LLM prefill on partials: start before is_final.
#3 Sub-300ms adaptive endpointing (extends D5).
#4 GPU-batched STT across concurrent calls.
#5 distil-whisper / whisper-turbo fast path (4-6x speedup).
#6 Chunked-attention streaming encoder.
#7 20-40ms audio frame-size tuning.
#8 STT model pinned + always warm.
#9 Skip the final re-decode when partial == final.
#10 Word-level timestamps for precise barge-in.
#11 Direct RTP to STT, bypass file buffering.
#12 Two-pass STT: tiny model for turn-taking + accurate model for the LLM.

Voice Pipeline — B · STT Accuracy & Robustness

3 / 12
#13 Hotword biasing toward business vocabulary.
#14 Per-customer custom vocabulary (SKUs, staff names, account IDs).
#15 Noise suppression + echo cancellation pre-STT.
#16 Accent-aware STT model routing.
#17 Numeric / entity normalization.
#18 Code-switching detection.
#19 Per-word confidence scores feeding LLM clarification.
#20 Punctuation + truecasing restoration.
#21 Domain n-gram LM rescoring.
#22 Dual-channel STT, zero crosstalk.
#23 Speaker diarization for multi-party calls.
#24 Continuous learning from corrected transcripts.

Voice Pipeline — C · TTS Speed & Streaming

2 / 12
#25 First audio chunk within ~150ms of the first LLM token.
#26 Sentence-level parallel synthesis (C1).
#27 Sub-sentence / phoneme-chunk streaming.
#28 GPU-batched TTS across concurrent calls.
#29 Streaming neural vocoder.
#30 Lighter / faster vocoder fast path.
#31 Quantized / distilled XTTS variant.
#32 Speculative TTS on predicted sentence completion.
#33 Direct TTS to RTP, no file write.
#34 TTS model pinned + warm.
#35 Pipelined fresh-generated filler.
#36 Per-call synthesis budget so audio never starves.

Voice Pipeline — D · TTS Naturalness & Prosody

2 / 12
#37 IPA phoneme control for correct pronunciation (= build item B9).
#38 Per-customer pronunciation lexicon.
#39 Emotion / sentiment-conditioned prosody.
#40 SSML support: emphasis, pauses, rate, pitch.
#41 Context-aware intonation (questions rise, statements fall).
#42 Natural micro-disfluencies + breath sounds.
#43 Dynamic speaking rate.
#44 Backchanneling while the caller speaks.
#45 Per-brand voice cloning.
#46 Prosody continuity across streamed chunks.
#47 Accent / locale matching.
#48 Smile-in-voice warmth tuning.

Voice Pipeline — E · LLM Speed & Throughput

7 / 14
#49 Prefix / KV cache the system prompt.
#50 Speculative decoding (mind the single-GPU VRAM budget).
#51 Continuous batching across concurrent calls (vLLM).
#52 FP8 / INT8 / AWQ quantization (~2x throughput).
#53 Thinking / reasoning OFF on the conversational path.
#54 Token streaming so TTS starts on the first sentence.
#55 Short max_tokens cap for voice replies.
#56 Tiered L1 / L2 / L3 routing.
#57 Intra-call KV cache reuse across turns.
#58 Chunked prefill to overlap prompt processing with decode.
#59 Prompt compression / history trim.
#60 B8: swap L2 to a 30B that beats qwen3:32b for business conversation.
#61 Tensor parallelism on the big model.
#62 Smarter tier classifier: keep more turns on the fast L1.

Voice Pipeline — F · LLM Conversational Quality

8 / 12
#63 Tight, voice-optimized system prompt.
#64 Few-shot ideal-short-reply examples.
#65 Native function calling (replace the %%TAG%% string parsing).
#66 Structured / JSON tool-argument output.
#67 RAG grounding to prevent hallucination (= A3).
#68 Inject current date + call context every turn.
#69 Confidence-aware escalation to L3.
#70 Locked persona per extension.
#71 Interruption-aware generation.
#72 Full intra-call coherence.
#73 Proactive turn signalling (done vs still speaking).
#74 Mid-call self-correction.

Voice Pipeline — G · Instant Orchestration

6 / 14
#75 Parallel tool execution.
#76 Speculative tool pre-fetch from detected intent.
#77 Tool-result streaming: don't block the turn on a slow tool.
#78 "Let me check" filler covers async tool latency (D3).
#79 Idempotent tool-result caching within a call.
#80 Persistent pre-connected API clients.
#81 Fast intent-to-tool router before the LLM finishes.
#82 Connection pooling for all external APIs.
#83 Local-first co-located tools, zero network hop.
#84 Optimistic execution: start the likely action, confirm after.
#85 Tool timeout + graceful spoken fallback.
#86 Sub-agent dispatch for complex multi-step tasks.
#87 Workflow engine fires instantly on intent match.
#88 Event-driven orchestration.

Voice Pipeline — H · Turn-taking & Pipeline

3 / 12
#89 Enforced sub-300ms end-to-end latency budget.
#90 Barge-in / interruption: caller can cut SARAH off cleanly.
#91 Full-duplex audio: listen while speaking.
#92 Predictive semantic endpointing.
#93 Backchannel detection.
#94 Acoustic echo cancellation.
#95 Jitter-buffer tuning.
#96 Per-stage latency telemetry per call.
#97 Overlap-aware turn management.
#98 Adaptive pacing from caller cues.
#99 Co-located pipeline on one US3 host.
#100 Warm-everything: every model resident, every service self-healing.
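
Items #90 and #93 meet in the smart barge-in rule locked as CARDINAL-3: short backchannels should not cut SARAH off, and speech in the first moments after TTS starts is ignored. One possible reading of that rule is sketched below. The function name and the exact way the two knobs combine are assumptions; the defaults (20 characters, 2000ms) are the values recorded for barge_in_min_chars and barge_in_grace_ms.

```python
def is_real_barge_in(partial_text: str, ms_since_tts_start: float,
                     min_chars: int = 20, grace_ms: int = 2000) -> bool:
    """Return True when a caller partial should interrupt playback."""
    if ms_since_tts_start < grace_ms:
        return False                      # inside the grace window after TTS start
    # Short utterances such as "mm-hmm" or "okay" are treated as backchannels.
    return len(partial_text.strip()) >= min_chars


print(is_real_barge_in("mm-hmm", 3000))
print(is_real_barge_in("actually I need to change my booking", 3000))
print(is_real_barge_in("actually I need to change my booking", 500))
```

A character count is a blunt proxy for intent; the knob values are what the tracker describes as still being fine-tuned.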

LIVE STREAMING — STT + TTS Engine Coverage (Chris cardinal 2026-05-23)

1 / 10
LS-1 WhisperLive: Large-v3-turbo live streaming (already wired; ext-selectable, verify it's the default for high-quality lines).
LS-2 WhisperLive: Small live streaming (mid-tier accuracy/speed; ext-selectable).
LS-3 WhisperLive: Base live streaming (fastest tier; ext-selectable fallback for very low-spec).
LS-4 Per-extension STT model selection (turbo / small / base), portal-editable via extension_ai_config.stt_model.
LS-5 Coqui XTTSv2: true live streaming via /xtts/stream (RT-8 milestone uses /xtts/tts one-shot; promote to streaming so audio starts before render finishes).
LS-6 MeloTTS: live streaming integration (open-source, multi-language, very fast; alternative to XTTSv2).
LS-7 Coqui non-XTTS models: Tortoise / VITS / YourTTS / Bark streaming as additional voice options.
LS-8 Per-extension TTS engine selection, portal-editable via extension_ai_config.tts_engine (xttsv2 / melotts / vits / etc.).
LS-9 All TTS engines stream PCM directly to call leg (no render-full-WAV-then-streamFile); implements RT-2 across every engine.
LS-10 Live-streaming engine matrix dashboard: for every ext, show which STT model + which TTS engine is wired and current latency p50/p99.
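
The p50/p99 figures that LS-10 asks the dashboard to show can be computed from raw per-turn latency samples with the standard library. This is a small sketch with an invented helper name; the choice of the "inclusive" interpolation method is an assumption, and a production dashboard might use a streaming estimator instead.

```python
from statistics import quantiles
from typing import Dict, Sequence


def latency_summary(samples_ms: Sequence[float]) -> Dict[str, float]:
    """Return the median and 99th-percentile latency in milliseconds."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    # n=100 yields 99 cut points; index 49 is p50 and index 98 is p99.
    cuts = quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p99": cuts[98]}


# Synthetic samples for illustration: 1ms, 2ms, ... 101ms.
summary = latency_summary(list(range(1, 102)))
print(summary)
```

Reporting p99 next to p50 matters for voice: a good median can hide the occasional slow turn that callers actually notice.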

RESOURCE CAP 80/90/10 — fleet-wide hardline (Chris cardinal 2026-05-23)

7 / 10
CAP-1 /usr/local/sbin/sarah-resource-monitor.py: emits composite 0-100% score to /run/sarah/resource_pct + HTTP /portal/api/resource_state. Updates every 5 s. Aggregates per cardinal: max(CPU, RAM, GPU_mem_aggregate, GPU_util_avg, disk_IO, network, data_disk). GPU 7 excluded.
CAP-2 FreeSWITCH dialplan hook 00_resource_cap.xml: runs BEFORE any voicebot extension; reads /run/sarah/resource_pct via Lua + branches by band.
CAP-3 80-90% band: agent-queued calls accepted (dialplan ends in bridge to agent_queue); AI-only calls (voicebot_monitor / voicebot_moshi) rejected with SIP 503 + brief 'assistants busy, retry shortly' prompt.
CAP-4 90%+ band: ALL new INVITEs rejected with SIP 503 + courteous voice prompt. Existing calls continue. 10% headroom reserved for ops.
CAP-5 Portal dashboard /portal/resource-caps/: live per-server utilization, current band, accepted-vs-rejected counts per band per hour, peak-GPU warnings.
CAP-6 Audit log: every cap-triggered decision → sarah_audit_log (caller_id, ext, band, decision, resource breakdown JSON).
CAP-7 Global roll-up: when B300/GB300/Spark exist, dashboard surfaces fleet-wide view; any one server at cap flagged.
CAP-8 Alerts: Telegram (chris-alert) + Slack on any server sustained >80% for >5 min; >90% triggers immediate page.
CAP-9 Training-job autopause: codec / LM training checks /run/sarah/resource_pct every 60 s; pauses (SIGSTOP) if pushing toward 80%, resumes when load drops. Per [[feedback_80_90_resource_cap_cardinal_2026-05-23]].
CAP-10 extension_ai_config.call_priority ('agent-queued' | 'ai-only'): dialplan logic uses this, not body inspection. Portal-editable.
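
The 80/90/10 rule in CAP-1 to CAP-4 reduces to two small functions: take the maximum across resource metrics, then admit or reject by band and call priority. The sketch below illustrates that logic in Python rather than the Lua dialplan hook. Function names and metric keys are invented, and treating exactly 80% and exactly 90% as belonging to the higher band is an assumption.

```python
from typing import Dict, Tuple


def composite_pct(metrics: Dict[str, float]) -> float:
    """Composite score is the single most-loaded resource (CAP-1's max())."""
    return max(metrics.values())


def admit(pct: float, call_priority: str) -> Tuple[str, bool]:
    """Return (band, accepted) for a new INVITE.

    call_priority follows CAP-10: 'agent-queued' or 'ai-only'.
    A rejection corresponds to SIP 503 in CAP-3 / CAP-4.
    """
    if pct >= 90.0:
        return "90+", False                       # all new calls rejected
    if pct >= 80.0:
        return "80-90", call_priority == "agent-queued"
    return "below-80", True


# Illustrative readings only.
pct = composite_pct({"cpu": 41.0, "ram": 63.5, "gpu_mem": 84.2, "gpu_util": 77.0})
print(pct, admit(pct, "agent-queued"), admit(pct, "ai-only"))
```

Using the maximum rather than an average means one saturated resource is enough to trip the cap, which is the conservative choice for call quality.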

LLM TIER ROUTING — light (Qwen3-8B) + heavy (Qwen3-32B) on RT-8 (Chris cardinal 2026-05-23: 'add 8B; Mickey Mouse conversations')

5 / 5
TIER-1 stream_monitor.py reads extension_ai_config.llm_tier: light → vLLM :8000 qwen3:8b · heavy → vLLM :8002 qwen3:32b · default heavy. [LIVE 2026-05-23]
TIER-2 Add llm_tier column to extension_ai_config schema; portal-editable in /portal/extensions/.
TIER-3 Seed defaults: Mickey Mouse / pizza demo / simple-FAQ extensions → light; banking/gov/health/insurance → heavy.
TIER-4 Auto-promotion (optional): if first user turn is short (<5 words), stay on light; if next turn brings a complex question, escalate to heavy mid-call.
TIER-5 Telemetry: log llm_tier_used per turn in sarah_audit_log; observability dashboard shows tier-mix per extension.
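
TIER-1's lookup is essentially a table with a default. The sketch below shows that shape; it is not the code in stream_monitor.py. The ports and model tags come from TIER-1, while the 127.0.0.1 host, the function name and the tolerant handling of unknown or empty values are assumptions.

```python
from typing import Optional, Tuple

# (base URL, model tag) per tier; host is assumed to be local.
TIER_ENDPOINTS = {
    "light": ("http://127.0.0.1:8000", "qwen3:8b"),
    "heavy": ("http://127.0.0.1:8002", "qwen3:32b"),
}
DEFAULT_TIER = "heavy"


def pick_llm(llm_tier: Optional[str]) -> Tuple[str, str, str]:
    """Return (tier_used, base_url, model) for an extension's llm_tier value."""
    tier = (llm_tier or "").strip().lower()
    if tier not in TIER_ENDPOINTS:
        tier = DEFAULT_TIER          # NULL, empty or unknown falls back to heavy
    url, model = TIER_ENDPOINTS[tier]
    return tier, url, model


print(pick_llm("light"))
print(pick_llm(None))
```

Returning the tier actually used alongside the endpoint gives TIER-5 the value it needs to log as llm_tier_used.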

MOSHI — RETIRED 2026-05-23 17:08Z (Chris: 'Moshi's did not help much') — weights kept on disk, can reactivate

7 / 12
MOSHI-1 Pull Moshi weights from Kyutai HuggingFace (kyutai/moshiko-pytorch-bf16 for the male voice, kyutai/moshika-pytorch-bf16 for the female voice).
MOSHI-2 Verify CUDA bf16 compatibility with Blackwell sm_120 (RTX PRO 6000). XTTS hit sm_90 issues earlier; check before committing.
MOSHI-3 Create moshi.service systemd unit on US3. Allocate one RTX PRO 6000 GPU (~16 GB VRAM for 7B bf16). Expose WebSocket on :8500.
MOSHI-4 Measure single-call time-to-first-audio after caller stops speaking; target <200 ms p99 (Kyutai published 160 ms).
MOSHI-5 Measure barge-in latency: caller starts speaking, model stops within next ~80 ms audio token boundary.
MOSHI-6 Build /voicebot/moshi_session WebSocket endpoint in fs_voicebot.py that proxies the FreeSWITCH inject WS to Moshi:8500.
MOSHI-7 Modify voicebot_monitor.lua: start uuid_audio_fork in inject mode pointing at /voicebot/moshi_session for the bot's audio output. Retire queue file + streamFile path.
MOSHI-8 First live English call through Moshi end-to-end. Chris rates: MOS naturalness, prosody, barge-in feel, overall '2050 vs 2026'.
MOSHI-9 Route by detected language: English → Moshi inject WS, other → XTTS path (sunset). Implements EV-79 with the Moshi/XTTS split.
MOSHI-10 Helium-Moshi (multilingual) upstream watch: check Kyutai GitHub + HF every 30 days; swap in when official multilingual variant ships.
MOSHI-11 Retire XTTS+vLLM-32B pipeline for English. Decommission stream_monitor's _tts_oneshot/_tts_streaming code path. Keep XTTS service warm only for non-English routing until MOSHI-10.
MOSHI-12 Per-extension Moshi voice selection: moshiko / moshika (or LoRA-tuned voices when available) exposed in /portal/extensions/.

MELOTTS — fast English + multilingual secondary (MIT, 6 languages) Chris 2026-05-23

0 / 6
MELO-1 Pull MeloTTS weights (myshell-ai/MeloTTS-English-v3 + es/fr/zh/ja/ko variants from HF).
MELO-2 Stand up melotts.service systemd unit on US3: Flask/HTTP on :5005 mirroring the /xtts/tts contract.
MELO-3 Bench: first-chunk latency on Blackwell sm_120 (target <80 ms), VRAM footprint, RTF, MOS quality vs XTTS.
MELO-4 Add MeloTTS as a routing target in stream_monitor.py: per-extension tts_engine = melotts | xttsv2 | moshi (default by extension_ai_config).
MELO-5 First live English call through MeloTTS: Chris MOS rating + latency feel vs Moshi vs XTTS.
MELO-6 Multilingual fast-path: route ES/FR/ZH/JA/KO callers through MeloTTS instead of XTTS-v2 when no voice-clone needed (saves ~100 ms per turn).

COSYVOICE 2 — faster multilingual replacement for XTTS-v2 (Apache 2.0, voice cloning) Chris 2026-05-23

0 / 7
COSY-1 Pull CosyVoice 2 weights (FunAudioLLM/CosyVoice2-0.5B from HF / ModelScope).
COSY-2 Stand up cosyvoice.service on US3: exposes /cosy/tts (batch) + /cosy/stream (true streaming, ~150 ms first-chunk).
COSY-3 Bench against XTTS-v2 on same scorecard: first-chunk latency (target <150 ms), streaming smoothness, MOS, voice-clone fidelity from 3-10 s ref.
COSY-4 Language matrix bench: confirm en/zh/ja/ko + cross-lingual coverage on real call audio (not lab tracks).
COSY-5 Wire CosyVoice 2 as the multilingual default in stream_monitor's TTS routing; XTTS-v2 falls back only for languages CosyVoice doesn't cover (10-12 of XTTS's 17).
COSY-6 Per-extension voice catalog migration: re-record voice clone references through CosyVoice's 3-10 s format; keep XTTS clones as legacy fallback.
COSY-7 Decision gate: if CosyVoice 2 outperforms XTTS on every dimension we use, sunset XTTS entirely after MOSHI-11 + COSY-5 both ship.

F5-TTS — fastest open multilingual (CC-BY-NC-4.0, watch for re-license) Chris 2026-05-23

0 / 4
F5T-1 License audit: F5-TTS is CC-BY-NC-4.0 (non-commercial). Track community/Microsoft re-license; do NOT deploy commercially until license clears.
F5T-2 Bench F5-TTS in lab (research-only context) for latency + MOS vs CosyVoice 2 + MeloTTS, to establish ceiling expectations even if we can't ship.
F5T-3 Multilingual fine-tunes audit: survey community fine-tunes for non-English languages, check their licenses individually.
F5T-4 If license clears → fast-track to production routing alongside CosyVoice 2.

MASKGCT — non-autoregressive masked codec (Amphion, MIT, en/zh) Chris 2026-05-23

0 / 5
MGCT-1 Pull weights from amphion/MaskGCT on HuggingFace. ~6 GB VRAM. Verify Blackwell sm_120 compat.
MGCT-2 Stand up maskgct.service systemd unit on US3: non-autoregressive means batch-style generation but very low latency (target <100 ms first-chunk).
MGCT-3 Bench vs CosyVoice 2 on the same scorecard. Particular focus: Chinese quality (MaskGCT was developed against zh).
MGCT-4 Voice cloning fidelity test: MaskGCT supports clone; compare 3-5 s ref-clip quality to CosyVoice 2 and XTTS.
MGCT-5 Decision gate: if it beats CosyVoice 2 on zh AND has lower latency, promote to multilingual primary for zh routing.

PIPER TTS — ultra-fast CPU TTS for backchannels + low-spec lines (MIT, 30+ community) Chris 2026-05-23

0 / 4
PIPER-1 Pull Piper voices from rhasspy/piper (community models: en + ~30 languages, lower quality but ~50 ms latency on CPU).
PIPER-2 Stand up piper.service: CPU-only, no GPU contention. Saves GPU for Moshi/CosyVoice/MeloTTS.
PIPER-3 Use case 1: instant backchannels ('mm-hmm', 'I see', 'one moment') generated in <50 ms while the main engine is still thinking. Layered via inject WS.
PIPER-4 Use case 2: long-tail language fallback for languages NONE of our main engines support (Hindi, Bengali, Swahili, etc. via community Piper voices).

OPENVOICE v2 — cross-lingual voice cloning (MyShell, MIT, 6 languages) Chris 2026-05-23

0 / 5
OV2-1Pull weights from myshell-ai/OpenVoiceV2 on HF. Has separate speaker embedding extractor + multilingual TTS backbone.
OV2-2Stand up openvoice.service — exposes /openvoice/tts + /openvoice/clone (clone an English reference voice and speak any of 6 languages with it).
OV2-3Bench cross-lingual clone fidelity — clone Chris's English voice, speak Spanish. Compare to XTTS-v2 cross-lingual (which is generally weak).
OV2-4Use case — per-customer voice personalization across multilingual call centers (caller hears the SAME voice regardless of language switch).
OV2-5Per-extension voice catalog — store extracted speaker embeddings keyed by (ext, persona). Portal-editable.

VITS / BARK — legacy reference + niche prosody (2022-23, mostly reference) Chris 2026-05-23

0 / 3
VITS-1Document reference benchmarks — what 2022-23 SOTA sounded like. Establishes the floor against which modern engines compare.
VITS-2Bark — kept for prosody experiments only (singing, emotion, sound effects). Too slow for live (500 ms-2 s first chunk). Mark as research-only.
VITS-3Survey community VITS fine-tunes for languages NO modern engine supports (Welsh, Maltese, Filipino…) — document as last-resort fallback path.

TTS ROUTER — engine selection + failover (Chris cardinal 2026-05-23: 'that's the order')

5 / 6
ROUTER-1Implement pick_tts_engine(lang, needs_clone, cfg) in stream_monitor.py per the routing cardinal. English: Moshi → MeloTTS → XTTS. Multilingual: CosyVoice2 → MeloTTS → XTTS.
ROUTER-2Add tts_engine column to extension_ai_config (text: 'auto' default, or explicit 'moshi'|'cosyvoice2'|'melotts'|'xttsv2'|'piper'|...). Portal-editable.
ROUTER-3Per-call language detection — first STT partial fed to a fast lang-id classifier; result drives the router.
ROUTER-4Health-check + auto-failover — if primary engine returns 5xx or first-chunk latency > 500 ms, fail over to next priority on next turn.
ROUTER-5Telemetry — log {ext, lang, engine, latency_first_chunk, latency_total, mos_estimate} per turn to sarah_audit_log for the live-state dashboard.
ROUTER-6Backchannel layering — Piper-generated 'mm-hmm' / 'I see' / 'one moment' injected via the same WS while the main engine generates the actual reply. Implements EV-93 / V50-18.
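
The routing order in ROUTER-1 and the failover rule in ROUTER-4 can be sketched as below. This is a minimal illustration, not the code in stream_monitor.py: the `RouterConfig` class, the `record_turn` helper and the `CLONE_CAPABLE` set (which engines can clone a voice) are assumptions made for the example. Only the two priority chains, the 'auto' default and the 5xx / 500 ms thresholds come from the tasks above.

```python
from dataclasses import dataclass, field

# Priority chains as stated in ROUTER-1.
ENGLISH_CHAIN = ["moshi", "melotts", "xttsv2"]
MULTILINGUAL_CHAIN = ["cosyvoice2", "melotts", "xttsv2"]

# ASSUMPTION: which engines support voice cloning. Not stated in the tracker.
CLONE_CAPABLE = {"cosyvoice2", "xttsv2"}

FIRST_CHUNK_LIMIT_MS = 500  # ROUTER-4 failover threshold


@dataclass
class RouterConfig:
    """Hypothetical per-extension view of extension_ai_config."""
    tts_engine: str = "auto"                      # ROUTER-2 column value
    unhealthy: set = field(default_factory=set)   # engines demoted by record_turn


def pick_tts_engine(lang: str, needs_clone: bool, cfg: RouterConfig) -> str:
    """Return the first usable engine in the priority chain for this turn."""
    if cfg.tts_engine != "auto":
        return cfg.tts_engine  # an explicit per-extension choice wins
    chain = ENGLISH_CHAIN if lang.lower().startswith("en") else MULTILINGUAL_CHAIN
    for engine in chain:
        if engine in cfg.unhealthy:
            continue
        if needs_clone and engine not in CLONE_CAPABLE:
            continue
        return engine
    return chain[-1]  # nothing qualified: fall back to the last entry in the chain


def record_turn(cfg: RouterConfig, engine: str, status_code: int, first_chunk_ms: float) -> None:
    """Demote an engine after a 5xx or a slow first chunk, so the NEXT turn fails over."""
    if status_code >= 500 or first_chunk_ms > FIRST_CHUNK_LIMIT_MS:
        cfg.unhealthy.add(engine)
```

The sketch never re-promotes a demoted engine. A production router would presumably restore it after a demotion window, which is what the `tts_router_demote_seconds` knob in VSQ-7 suggests.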

SARAH-vLLM — production inference on vLLM for both 30B+72B (Chris cardinal 2026-05-23 'make sure SARAH LLM works on vLLM for speed')

0 / 8
SARAH-VLLM-1Confirm Qwen3ForCausalLM (the architecture of the Qwen3-32B base, natively supported by vLLM) is the SARAH-30B training target. LlamaForCausalLM for SARAH-72B if a Llama base is picked in SARAH72-1.
SARAH-VLLM-2Custom-vocab handling — extended vocab (~165k+361) is just a bigger embedding; vLLM handles arbitrary vocab_size.
SARAH-VLLM-3Build sarah-audio-frontend.service — Mimi encoder runs as separate microservice; receives PCM via WS, emits audio tokens.
SARAH-VLLM-4Build sarah-audio-backend.service — Mimi decoder microservice; receives audio tokens, emits PCM via WS.
SARAH-VLLM-5vLLM model loader — verify SARAH-30B checkpoint loads via vllm.LLM(model=...) with the extended tokenizer.
SARAH-VLLM-6Bridge orchestrator — Python WS that pipes FS audio_fork → frontend → vllm → backend → FS streamFile. Per-call prefix injection (system_prompt + voice_id + accent_code).
SARAH-VLLM-7Aux-head deployment decision — separate small classifier for lang/accent ID (lower coupling) OR multi-head output projection on vLLM model (lower latency, more custom work).
SARAH-VLLM-8Throughput bench — sarah30 + sarah72 on vLLM vs current vllm-qwen3-32b setup. Target: 30B serves ≥4 concurrent calls per GPU; 72B serves ≥2 per 2-GPU shard.

SARAH-72B — heavier sibling of SARAH-30B (Chris 2026-05-23: 'build SARAH 72B as well same structure as 30B')

0 / 15
SARAH72-1Architecture spec — base candidate audit: Qwen3-72B (if released) vs Llama-3.1-70B vs DeepSeek-V3 (MoE). Pick on sovereignty + multilingual + license + scaling laws.
SARAH72-2VRAM plan — 72B bf16 ≈ 144 GB; needs 2-GPU sharding (FSDP across 2 RTX PRO 6000). Pin to GPUs 2-3 for inference; 4-7 GPUs for training (GPU 7 excluded).
SARAH72-3Share the same Mimi-clone codec as SARAH-30B (codec is LLM-size-independent). Train codec ONCE; both 30B + 72B consume the same audio tokens.
SARAH72-4Reuse the joint vocab from SARAH-30B (~165k+361 tokens). Embedding layer re-init for the bigger model; same token IDs.
SARAH72-5Voice catalog — share the SAME 351-voice hybrid pattern (30 embeddings + 321 LoRAs). LoRAs RE-trained per base since they're rank-specific.
SARAH72-6Stage 3 audio LM training — ~2-3× longer than 30B (quality gains scale sub-linearly with compute). Estimated 100-200 days at our scale OR rent extra GPUs.
SARAH72-7Stage 4 SFT — same conversational corpus as 30B; runs ~2× longer.
SARAH72-8Stage 5 RLHF — same reward signals; may share the reward model between 30B and 72B.
SARAH72-9Voice cloning + accent + dynamic detection — same auxiliary heads as 30B; train per-base.
SARAH72-10Evaluation — same scorecard as 30B + the A/B comparison head-to-head.
SARAH72-11Deploy sarah72.service — 2-GPU sharded inference, WS endpoint /api/sarah72/converse.
SARAH72-12Production routing — call type / extension picks 30B (faster, lower cost) or 72B (higher quality, heavier). Per-extension override via extension_ai_config.sarah_tier.
SARAH72-13Default tier policy — A/B sweep to decide which call types justify 72B vs 30B. Banks/gov/health → 72B; lighter retail → 30B initially.
SARAH72-14Multi-extension concurrency — 72B serves ~2 simultaneous calls per 2-GPU shard; 30B serves ~4 per single GPU. Capacity-plan accordingly.
SARAH72-15Long-tail GPU sharding for spikes — if N-concurrent exceeds capacity, route overflow to 30B with a quality flag in the call audit log.
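
The VRAM figure in SARAH72-2 and the overflow rule in SARAH72-14 / SARAH72-15 can be checked with a short sketch. The `TierPool` class, the `route_call` function and the flag strings are hypothetical names introduced for illustration. The capacities (2 calls per 72B shard, 4 calls per 30B GPU) are the planning numbers from the tasks above, not measured values.

```python
from dataclasses import dataclass

BYTES_PER_PARAM_BF16 = 2


def weights_gb(params_billion: float) -> float:
    """Weight-only footprint in GB at bf16 (excludes KV cache and activations)."""
    return params_billion * BYTES_PER_PARAM_BF16


@dataclass
class TierPool:
    """Hypothetical concurrency counter for one model tier."""
    name: str
    capacity: int   # max simultaneous calls
    active: int = 0

    def has_room(self) -> bool:
        return self.active < self.capacity


def route_call(requested_tier: str, pool30: TierPool, pool72: TierPool):
    """Return (tier, quality_flag); the flag is what would go in the call audit log."""
    if requested_tier == "72b":
        if pool72.has_room():
            pool72.active += 1
            return "72b", None
        if pool30.has_room():
            pool30.active += 1
            return "30b", "overflow_72b_to_30b"   # SARAH72-15 overflow path
        return None, "no_capacity"
    if pool30.has_room():
        pool30.active += 1
        return "30b", None
    return None, "no_capacity"
```

With one 2-GPU 72B shard and one 30B GPU, the third simultaneous 72B request lands on the 30B pool and carries the overflow flag.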

SARAH-LLM — sovereign 30B+ model · 351 voices · streaming STT+TTS · 17 langs · US/UK/AU accents (Chris 2026-05-23 ACTIVE BUILD)

10 / 52
SARAH30-1Architecture spec — white paper (base Qwen3-32B, own Mimi-clone codec 8 codebooks 12.5 Hz, joint vocab ~165k+351 voice IDs+6 accent codes, 5 training stages). [DONE 2026-05-23]
SARAH30-2Corpus curation tool scaffold (/data/sarah30/tools/curate.py) — stage tree, MANIFEST per stage, fetch + validate sub-commands. [DONE 2026-05-23]
SARAH30-3Decide voice-catalog pattern — embedding lookup (A) for top-30 voices + LoRA bank (B) for long-tail 321. Document trade-offs.
SARAH30-4Stage 1 corpus fetch — LibriSpeech (clean+other) + VCTK + Common Voice EN. Target ~3 TB.
SARAH30-5Stage 2 corpus fetch — Common Voice 17 langs + Multilingual LibriSpeech + VoxPopuli. Target ~8 TB.
SARAH30-6Per-utterance metadata schema — (voice_id, accent_code, language, duration, snr, quality_score). JSON-line manifest.
SARAH30-7Audio quality filter pipeline — SNR ≥ 25 dB, length 1-30 s, no clipping, silence trim. Auto-rejects bad audio.
SARAH30-8Consent-tagged real-call ingestion — post-call SMS opt-in → flagged calls go into stage 4 SFT corpus.
SARAH30-9Identify 351 source voices — VCTK (~110) + LibriTTS subset (~200) + XTTS reference catalog (~60) + selected public-domain audiobooks. Audit overlap, pick 351.
SARAH30-10Voice quality validation — MOS panel rates each of the 351 voices on naturalness + intelligibility. Replace bottom 10%.
SARAH30-11Reference WAV bank — clean 30-60 s reference per voice in /data/sarah30/voices//ref.wav for downstream eval + clone.
SARAH30-12Voice metadata — name/age-range/gender/accent/language(s) per voice_id. Portal-exposed catalog.
SARAH30-13Accent labeling for English voices — US/UK/AU at minimum; expand to IN, CA, ZA, IE, NZ where source data supports.
SARAH30-14Per-accent data balancing — at least 500 h of native-speaker audio per accent code; rebalance corpus if uneven.
SARAH30-15Mimi-clone codec architecture — SEANet 1D-convnet encoder + RVQ-8 codebook + SEANet decoder. ~80M params total.
SARAH30-16Codec training script (stage-1, English-only) — spectral L1 + mel + HiFi-GAN discriminator. ~3 days on 4 GPUs.
SARAH30-17Codec eval — reconstruction MOS, encode/decode latency (<10 ms each), 24 kHz output verified.
SARAH30-18Codec multilingual extension (stage-2) — train on the 8 TB multilingual corpus.
SARAH30-19SentencePiece tokenizer on multilingual corpus — preserve Qwen3 base vocab; add audio (16,384) + voice IDs (351) + accent codes (~10) + specials.
SARAH30-20Vocab freeze — pin final size ~165k+361 = ~165,361 tokens. Document layout.
SARAH30-21/data/sarah30-env venv + deps (torch 2.10+cu130, transformers, datasets, audiotools, wandb). Verify FSDP sharding on 2 GPUs with toy model.
SARAH30-22FSDP config for Qwen3-32B-extended — bf16 forward + fp32 master params + activation checkpointing + grad accumulation 16-32.
SARAH30-23Wandb + Tensorboard + portal live-state dashboard for training run telemetry.
SARAH30-24Checkpoint pipeline — every 6 h to /data/sarah30/checkpoints/ + rclone to Drive every 24 h.
SARAH30-25Stage 3 — audio LM continued pretraining on joint corpus. Target ~1 T tokens. 50-100 days on 8 GPUs.
SARAH30-26Voice conditioning loss — per-utterance voice_id token, model learns to keep speaker consistent.
SARAH30-27Accent conditioning loss — accent_code prefix; model learns to switch accents on demand.
SARAH30-28Language conditioning loss — language tag; model handles code-switching cleanly.
SARAH30-29Stage 4 — conversational SFT on synthetic + consent-tagged real-call data.
SARAH30-30Reward model training — CSAT + task-completion + latency + system-prompt adherence.
SARAH30-31Stage 5 — RLHF on production call outcomes.
SARAH30-32 5-second reference voice cloning head — speaker-embedding extractor + conditioning into the model.
SARAH30-33Cross-language clone fidelity — clone an English voice, speak Spanish; eval similarity.
SARAH30-34Per-extension voice LoRA training pipeline — caller's preferred voice catalog entry as a LoRA.
SARAH30-35Evaluation harness — single-command scorecard (TTFA, MOS, WER, voice fidelity, accent authenticity, barge-in, system-prompt adherence).
SARAH30-36MOS panels — 5 native-speaker raters per language; quarterly cadence.
SARAH30-37Per-voice fidelity benchmark — speaker similarity score across all 351 voices.
SARAH30-38Per-accent authenticity benchmark — native-speaker rating per accent code.
SARAH30-39sarah30.service systemd unit on US3. Multi-GPU sharding (FSDP at inference) if 30B+extension > 95 GB.
SARAH30-40WebSocket endpoint /api/sarah30/converse — drop-in protocol-compatible with moshi-fs-bridge.
SARAH30-41Per-extension portal — extension_ai_config gains voice_id (0-350), accent_code, language fields. Portal-editable.
SARAH30-42Opt-in production routing — first 1-2 extensions A/B vs XTTS pipeline.
SARAH30-43Default-on for all English extensions — once A/B wins on every metric.
SARAH30-44Default-on for all 17 supported languages — once multilingual eval green.
SARAH30-45Retire vLLM-qwen3-32b.service for English voice path (subsumed by SARAH-LLM).
SARAH30-46Retire WhisperLive for English voice path (subsumed).
SARAH30-47Retire XTTS-v2 for English voice path (subsumed). CosyVoice stays for any languages SARAH-LLM doesn't yet cover.
SARAH30-48Sunset Moshi (ext 2222 demo) — the 2050-feel demo loses its purpose once SARAH-LLM ships full English + multilingual.
SARAH30-49Dynamic language detection — model identifies caller language in first 500 ms of speech, auto-switches output language. Emergent from joint multilingual training + explicit lang-ID head.
SARAH30-50Dynamic English accent detection — caller's US/UK/AU/IN accent detected, output matches (or stays in extension's configured voice accent — portal toggle). Auxiliary accent classifier head.
SARAH30-51Mid-call language switching — caller switches from English to Spanish mid-conversation, model follows fluidly. Bench against XTTS pipeline's switching latency.
SARAH30-52Auxiliary classifier heads training — language ID + accent ID as side losses during stages 3-4. Models the right inductive bias for fast detection at inference.
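
The metadata schema in SARAH30-6 and the filter thresholds in SARAH30-7 can be sketched as follows. This is an assumption-laden illustration, not the curate.py implementation: the `Utterance` class, the helper names and the clip / silence levels are hypothetical, samples are assumed to be floats in [-1, 1], and SNR is taken as a pre-computed metadata field rather than measured here.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Utterance:
    """Per-utterance metadata, fields as listed in SARAH30-6."""
    voice_id: int
    accent_code: str
    language: str
    duration: float       # seconds
    snr: float            # dB
    quality_score: float


MIN_SNR_DB = 25.0                   # SARAH30-7
MIN_LEN_S, MAX_LEN_S = 1.0, 30.0    # SARAH30-7
CLIP_LEVEL = 0.999                  # ASSUMPTION: |sample| at or above this counts as clipping
SILENCE_LEVEL = 0.01                # ASSUMPTION: |sample| at or below this counts as silence


def trim_silence(samples):
    """Drop leading and trailing samples at or below the silence level."""
    loud = [i for i, s in enumerate(samples) if abs(s) > SILENCE_LEVEL]
    return samples[loud[0]:loud[-1] + 1] if loud else []


def is_clipped(samples) -> bool:
    return any(abs(s) >= CLIP_LEVEL for s in samples)


def accept(meta: Utterance, samples) -> bool:
    """True only if the utterance passes the SNR, length and clipping checks."""
    return (
        meta.snr >= MIN_SNR_DB
        and MIN_LEN_S <= meta.duration <= MAX_LEN_S
        and not is_clipped(samples)
    )


def manifest_line(meta: Utterance) -> str:
    """One JSON-lines manifest record."""
    return json.dumps(asdict(meta), sort_keys=True)
```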

TTS LANDSCAPE WATCH — quarterly survey + auto-bench new entrants (Chris cardinal 2026-05-23)

0 / 4
WATCH-1 30-day cadence — check Kyutai Helium-Moshi multilingual status. Trigger MOSHI-10 re-eval when artifacts ship.
WATCH-2 30-day cadence — check F5-TTS license status. Promote to production immediately if commercial-OK re-license lands.
WATCH-3Quarterly community survey — any new open-weights speech-to-speech or fast multilingual TTS released? Bench against current production stack.
WATCH-4Auto-bench harness — single command runs the same scorecard (first-chunk latency, MOS, voice-clone fidelity, language matrix) on any new model dropped into /opt/idesks/data/tts-candidates/.

2050 VOICE — speech-to-speech foundation model + neural codec + WebRTC roadmap (Chris cardinal 2026-05-23)

0 / 25
V50-1Verify mod_audio_fork has 'inject' mode in our fs-voicebot container (rebuild if missing).
V50-2Build /voicebot/tts_stream WebSocket endpoint — accepts text, streams 16 kHz mono PCM 20 ms frames as XTTS produces audio.
V50-3Modify voicebot_monitor.lua to start a SECOND uuid_audio_fork in inject mode pointing at /voicebot/tts_stream.
V50-4Retire the queue file + session:streamFile() mechanism — all bot audio flows as continuous PCM frames via the inject WS.
V50-5Barge-in primitive (RT-5) — STT partial detected during bot speech → stop sending frames over the inject WS within 1 PCM frame.
V50-6Bench Moshi 7B on US3 RTX PRO 6000 — time-to-first-audio, MOS quality, multilingual coverage, sovereignty.
V50-7Bench GLM-4-Voice 9B on US3 — same scorecard as V50-6.
V50-8Bench other open speech-to-speech candidates (OpenS2S, AudioPaLM-open variants) — same scorecard.
V50-9Decision gate — pick the winning speech-to-speech model. Document trade-offs.
V50-10Deploy winning model as a host systemd service on US3. Python end of inject-WS swaps XTTS+vLLM → end-to-end S2S. Lua + WS plumbing stays.
V50-11Add Encodec / SoundStream as the wire codec option — sovereign neural codec, ~3-6 kbps with CD-quality perception.
V50-12Spatialized 3D audio for multi-party calls — conference participants positioned in virtual room; HRTF-aware.
V50-13Emotion-aware prosody — bot voice modulates with conversation context (warmth on grief, urgency on emergency, calm on confusion).
V50-14Voice cloning with caller-consent tracking — caller can request a familiar voice from a tokenized catalog. HIPAA/GDPR audit trail.
V50-15Non-verbal primitives — model can whisper, shout, laugh, sigh, pause for sympathy, take a breath. Native to the speech model output.
V50-16Per-call SLA on first-syllable latency — alert when p99 first-syllable latency exceeds 200 ms.
V50-17Predictive end-of-turn — model begins generating reply during the last ~200 ms of caller's speech, gates audio on actual EOT detection.
V50-18Backchannel injection — 'mm-hmm' / 'yeah' / 'I see' acknowledgments layered into inject WS while caller continues speaking.
V50-19Overlap-aware turn management — both speak briefly without audio cutout; intent inference figures out who has the floor.
V50-20Continuous learning per extension — per-call corrections feed a per-extension LoRA fine-tune slot. Voice persona adapts to deployment.
V50-21WebRTC end-to-end sovereign signaling — replace/complement SIP for browser/native WebRTC stacks. SIP stays for legacy PSTN.
V50-22QUIC media plane for sub-50 ms RTT — replace RTP-over-UDP with QUIC datagrams where the endpoint supports it.
V50-23Quantum-safe primitives in call setup — CRYSTALS-Kyber for key exchange. Sovereign cryptography.
V50-24P2P / mesh-routed call legs — direct caller-to-caller where possible; no central SIP server bottleneck. ICE + sovereign STUN/TURN.
V50-25Multimodal turn — caller can show on camera, share screen, gesture, while voice flows. Model sees + hears + speaks in one streaming step.
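
The frame format in V50-2 and the barge-in rule in V50-5 work out as in the sketch below. It assumes 16-bit signed PCM (the tracker gives the sample rate and frame length but not the sample width), and the `frames` generator is a hypothetical helper, not the production inject-WS sender.

```python
import threading

SAMPLE_RATE_HZ = 16_000   # V50-2
FRAME_MS = 20             # V50-2
BYTES_PER_SAMPLE = 2      # ASSUMPTION: 16-bit signed mono PCM

# 16,000 samples/s * 0.020 s = 320 samples; 320 * 2 bytes = 640 bytes per frame.
FRAME_BYTES = SAMPLE_RATE_HZ * FRAME_MS // 1000 * BYTES_PER_SAMPLE


def frames(pcm: bytes, barge_in: threading.Event):
    """Yield fixed-size frames; stop before the next frame once barge-in is set."""
    for offset in range(0, len(pcm), FRAME_BYTES):
        if barge_in.is_set():
            return  # caller started speaking: send nothing further (V50-5)
        frame = pcm[offset:offset + FRAME_BYTES]
        # Pad a short final frame with zero bytes (silence) to keep the size constant.
        yield frame.ljust(FRAME_BYTES, b"\x00")
```

Because the flag is checked before every frame, at most one frame already handed to the sender goes out after an STT partial sets it, which matches the "within 1 PCM frame" target.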

ENTERPRISE VERTICAL — 100 Lua scripts (bank · gov · hospital · insurance · fin) Chris 2026-05-23

0 / 100
EV-1voice_biometric_enroll.lua — capture 8s sample at first call, store embedding keyed by (ext, caller_id).
EV-2voice_biometric_verify.lua — verify live caller against stored embedding; gate sensitive actions.
EV-3kba_challenge.lua — pose 3 random KBA questions drawn from the pool (DOB / last-4 SSN / last txn / ZIP); 2 of 3 correct to pass.
EV-4otp_sms_send.lua — generate 6-digit OTP, send via SMS gateway, await DTMF or speech entry.
EV-5otp_voice_call.lua — outbound robocall delivering OTP to caller's registered number.
EV-6ssn_capture_masked.lua — capture 9 digits with PCI-style recording-pause masking.
EV-7step_up_auth.lua — escalate auth tier mid-call when LLM detects high-risk intent.
EV-8luhn_validate.lua — Luhn-check captured 16-digit card numbers before passing to processor.
EV-9dob_capture.lua — capture DOB as 8 DTMF digits or natural speech.
EV-10zip_verify.lua — confirm 5-digit ZIP matches account-on-file.
EV-11failed_auth_lockout.lua — track failures per caller_id, lock after 3, escalate to fraud team.
EV-12caller_id_attestation.lua — set channel var if STIR/SHAKEN returns A-attestation.
EV-13balance_inquiry.lua — pull balance from core banking, speak with locale-correct currency.
EV-14recent_transactions.lua — speak last 5 transactions (date, merchant, amount).
EV-15funds_transfer_internal.lua — internal A→B transfer with dual confirm.
EV-16wire_transfer_init.lua — collect SWIFT/IBAN, mandatory dual-confirm + step-up auth + audit log.
EV-17card_activate.lua — activate newly-issued card after step-up auth.
EV-18card_lost_or_stolen.lua — instant block + reissue request + courier address verify.
EV-19fraud_dispute_open.lua — file dispute on a captured transaction ID; return case number.
EV-20loan_inquiry.lua — speak outstanding loan balance + next payment date.
EV-21loan_payment_schedule.lua — set up one-time or recurring loan payment.
EV-22statement_request.lua — paper or PDF statement; PDF emailed within 24h with audit trail.
EV-23atm_branch_locator.lua — by ZIP or geo, speak top-3 nearest with hours.
EV-24fx_rate_speak.lua — speak live FX for caller-provided currency pair.
EV-25mortgage_application_start.lua — pre-qualify caller, warm-handoff to broker with context.
EV-26reorder_checks.lua — verify mailing address, place check reorder.
EV-27tax_refund_status.lua — IRS-style status check via TIN with step-up auth.
EV-28driver_license_renewal.lua — appointment-book at chosen DMV.
EV-29property_tax_inquiry.lua — by parcel number, speak current bill + due date.
EV-30voter_registration_lookup.lua — confirm caller's registration + polling location.
EV-31license_plate_renew.lua — guide through plate renewal payment.
EV-32court_date_lookup.lua — by case number, speak next hearing date + location.
EV-33benefits_eligibility_check.lua — SNAP / WIC / Section 8 / equivalents.
EV-34unemployment_claim_status.lua — by claimant ID.
EV-35ssa_benefits_estimate.lua — read estimated benefits at retirement age options.
EV-36passport_status_lookup.lua — by application reference number.
EV-37appointment_schedule_provider.lua — by provider + date range, offer next 3 slots.
EV-38appointment_reschedule.lua — find existing, offer alternate.
EV-39appointment_cancel.lua — cancel + auto-rebook from waitlist.
EV-40prescription_refill.lua — by Rx number, confirm pharmacy.
EV-41lab_results_pickup.lua — auth gate, speak summary, advise GP follow-up.
EV-42nurse_triage.lua — symptom checker decision tree; escalate to nurse on red flags.
EV-43hospital_wait_time.lua — ER wait announcement updated every 5 min.
EV-44insurance_preauth_lookup.lua — pre-auth status by member ID + procedure code.
EV-45medical_record_release.lua — guide caller through MR-release; send PDF.
EV-46vaccination_schedule.lua — by vaccine type + age, offer locations + slots.
EV-47emergency_screening.lua — fast-path pivots to 911 / poison control / mental-health hotline.
EV-48hipaa_consent_capture.lua — record consent for caller-on-file information sharing.
EV-49claim_file_new.lua — first-notice-of-loss intake (incident, location, photos via SMS link).
EV-50claim_status_lookup.lua — by claim number, speak status + adjuster contact.
EV-51premium_payment_make.lua — pay current premium via voice.
EV-52policy_lookup.lua — full policy details by member ID.
EV-53coverage_verify.lua — confirm coverage for a specific service / event.
EV-54quote_auto.lua — auto-insurance quote in 10 questions.
EV-55quote_home.lua — homeowner quote in similar pattern.
EV-56beneficiary_change.lua — life-insurance beneficiary update; step-up auth.
EV-57roadside_assistance_dispatch.lua — capture location, dispatch nearest tow.
EV-58storm_emergency_intake.lua — surge-traffic intake when a named storm hits.
EV-59pci_dtmf_mask.lua — pause recording during card-number DTMF; resume after.
EV-60hipaa_audit_log.lua — every PHI access written to sarah_audit_log.
EV-61gdpr_consent_intro.lua — UK/EU caller? Read GDPR consent at call start, gate further actions.
EV-62ccpa_optout_pivot.lua — handle 'do not sell my data' requests; log + email confirm.
EV-63mifid_voice_recording_disclosure.lua — EU financial-services disclosure preamble.
EV-64tcpa_consent_capture.lua — capture express written consent for future outbound.
EV-65do_not_call_check.lua — gate outbound dial against federal + state DNC.
EV-66mini_miranda_collections.lua — debt collection ID disclosure.
EV-67reg_e_disclosure.lua — electronic transfer disclosure for retail banking.
EV-68caller_consent_to_record.lua — two-party state? capture explicit consent.
EV-69data_retention_marker.lua — tag recording with retention class (7yr fin / 6yr HIPAA).
EV-70right_to_be_forgotten.lua — process GDPR Article 17 deletion request.
EV-71skill_based_routing.lua — match caller intent → agent skill tag via VICIDIAL.
EV-72warm_transfer_with_context.lua — pass full conversation JSON to agent screen via WebSocket.
EV-73supervisor_barge.lua — supervisor joins call in listen-only / whisper / barge mode.
EV-74agent_whisper_coaching.lua — supervisor whispers private guidance during caller speech.
EV-75callback_schedule.lua — caller chooses callback slot; queued for agent.
EV-76voicemail_capture_intent.lua — categorize voicemail by intent, route to right team.
EV-77conference_3way_add.lua — add specialist (lawyer / advisor) mid-call.
EV-78escalation_supervisor_request.lua — caller- or LLM-triggered supervisor pull.
EV-79language_detect_auto.lua — detect caller language in first utterance; swap LLM + TTS voice.
EV-80language_switch_midcall.lua — caller asks 'español' → swap pipeline mid-conversation.
EV-81tty_tdd_detect.lua — detect TTY tones; switch to text-relay mode.
EV-82high_clarity_mode.lua — slower speech rate, simpler vocab, larger spell-outs.
EV-83spell_out_mode.lua — toggle: speak each letter of names/refs distinctly.
EV-84number_speak_modes.lua — 'two thousand twenty-six' vs 'two oh two six' by context.
EV-85currency_country_aware.lua — speak amounts in country idiom (USD/GBP/EUR/INR).
EV-86culture_aware_dates.lua — DD/MM/YYYY vs MM/DD/YYYY vs YYYY-MM-DD by region.
EV-87outbound_appointment_reminder.lua — robocall reminder 24h before appointment.
EV-88outbound_payment_reminder.lua — bill-due reminder with pay-now option.
EV-89outbound_fraud_alert.lua — call card-holder on suspicious transaction; confirm/deny.
EV-90outbound_storm_alert.lua — mass notification on regional emergency.
EV-91outbound_otp_robocall.lua — voice OTP delivery (cross-ref EV-5).
EV-92outbound_survey_csat.lua — post-resolution CSAT/NPS survey.
EV-93sentiment_score_live.lua — per-turn sentiment; alert on negative trend.
EV-94quality_score_live.lua — per-call QA scoring (greet/acknowledge/resolve/close) live.
EV-95compliance_flag_live.lua — flag missing disclosures / unauthorized phrases live to supervisor.
EV-96fraud_anomaly_score.lua — velocity + geo + voice-biometric drift → composite score.
EV-97dashboard_push_realtime.lua — stream every turn to /portal/live-state via WebSocket.
EV-98agent_assist_suggestions.lua — surface next-best-action card to agent during caller speech.
EV-99caller_history_overlay.lua — auto-load caller's previous interactions on the agent screen.
EV-100master_call_summary.lua — at hangup, emit JSON summary (intent, outcome, sentiment, compliance, next-step).
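
The check behind EV-8 is the standard Luhn algorithm. The tracked script is Lua; the version below is a Python sketch of the same logic for reference, and the function name `luhn_valid` is made up for the example. It strips non-digit characters first, on the assumption that captured input may contain spaces or dashes.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the input contains exactly 16 digits that pass the Luhn check."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) != 16:   # EV-8 targets 16-digit card numbers
        return False
    total = 0
    # Walk right to left; double every second digit, subtracting 9 when the result exceeds 9.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0
```

A Luhn pass only rules out mistyped or misheard digits. It says nothing about whether the card exists, so the processor call is still required.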

Scale Growth Advisors — MSP Lead-Gen Venture

0 / 6
SGA-1Confirm the per-MSP volume target & Valid/Fully-Qualified lead mix.
SGA-2Select the 50-MSP lighthouse pilot cohort.
SGA-3Define the SLA fair-use & make-good wording.
SGA-4Decide the territory model (exclusive zones, yes/no).
SGA-5Formalise partner roles in Scale Growth Advisors LLC.
SGA-6Launch the 50-MSP lighthouse pilot.

Voice Streaming Quality — 2026-05-24 sweep (all SHIPPED)

12 / 12
VSQ-1Phase A — stream_chunk_size 20 → 60 + enable_text_splitting=False default (XTTS server + stream_monitor) — chunks per reply 7-10 → 2.
VSQ-2Phase A.1 — stream_chunk_size 60 → 300 + lua sleep 20 → 150000ms — 1 chunk per reply, zero mid-reply pauses.
VSQ-3Wire tts_stream_chunk_size + tts_enable_text_splitting + lua_main_loop_sleep_ms to system_config (portal-editable).
VSQ-4Wire stt_max_listening_window_ms + stt_one_word_window_ms to system_config (was hardcoded 300/60ms).
VSQ-5Wire echo_guard_tail_ms to system_config (was hardcoded 250ms).
VSQ-6Wire llm_voice_history_turns + llm_voice_max_tokens + llm_voice_num_predict to system_config (was hardcoded 8/1000/500).
VSQ-7Wire tts_router_first_chunk_timeout_ms + tts_router_demote_seconds to system_config (tts_router.py lambdas).
VSQ-8Wire lua_max_idle_ms to system_config; extend /voicebot/lua_tuning endpoint to multi-knob pipe-separated output.
VSQ-9Wire voice_brevity_overlay (full text block) + voice_persona_lock_enabled to system_config.
VSQ-10Wire tts_default_speed to system_config + add `speed` request param to /xtts/stream + pass to xtts.inference_stream().
VSQ-11Wire tier4_escalation_word_threshold + predictive_eot_confidence to system_config.
VSQ-12Bot Tuning Guide written — 556 lines, 3 recommended profiles (Natural Conversation / Demo Speed-Run / High-Stakes Long Call). Drive only.
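
The pattern in VSQ-4 to VSQ-8, where a portal-editable system_config value replaces a hardcoded constant, can be sketched as a lookup with a fallback. Everything structural here is an assumption: system_config is modelled as a plain dict, `get_knob` and `render_pipe` are hypothetical helpers, and the `name=value|name=value` layout is only a guess at what a pipe-separated multi-knob response might look like. The default values are the previously hardcoded numbers quoted in VSQ-4 to VSQ-6, mapped to knob names in the order listed.

```python
# Previously hardcoded values, used as fallbacks when a knob is unset or invalid.
DEFAULTS = {
    "stt_max_listening_window_ms": 300,
    "stt_one_word_window_ms": 60,
    "echo_guard_tail_ms": 250,
    "llm_voice_history_turns": 8,
    "llm_voice_max_tokens": 1000,
    "llm_voice_num_predict": 500,
}


def get_knob(system_config: dict, name: str) -> int:
    """Read an integer knob, falling back to the old hardcoded value."""
    raw = system_config.get(name)
    if raw is None:
        return DEFAULTS[name]
    try:
        return int(raw)
    except (TypeError, ValueError):
        return DEFAULTS[name]   # a bad portal entry must not break a live call


def render_pipe(system_config: dict, names) -> str:
    """HYPOTHETICAL wire format: knobs joined as name=value pairs separated by '|'."""
    return "|".join(f"{n}={get_knob(system_config, n)}" for n in names)
```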

Marriott Sri Lanka — Guest Services Platform v1 (60 of 1000-idea catalog, 6-week build)

60 / 60
MAR-1Confirm property + brand (Sheraton / Westin / Marriott / St.Regis tier) + secure Opera V5 OWS credentials.
MAR-2Pre-arrival SARAH-voice call 48h before check-in — confirm flight, pickup, dietary, allergies, special occasion.
MAR-3WhatsApp pre-arrival flow — voice + buttons, Bonvoy region preference detection.
MAR-4Mobile check-in deep link 24h pre-arrival; digital folio signature.
MAR-5Passport + visa OCR upload — Sri Lanka FRRO immigration reporting within 24h.
MAR-6ETA prediction from FlightAware/FlightRadar24 — housekeeping prioritisation.
MAR-7Room preference learning — last-stay AC temp, pillow type, drape position, TV channel preset.
MAR-8Pre-arrival upsell — spa slot, airport pickup, sunset cocktail, candlelight dinner.
MAR-9Travel-purpose tagging — honeymoon / business / family / wellness / surf / cricket.
MAR-10Vegetarian/halal/kosher/Jain dietary capture + cross-restaurant propagation.
MAR-11Local-time-anchored welcome message — no greeting at 3am their local time.
MAR-12Pre-fill digital registration card from Bonvoy + Sri Lanka FRRO requirements.
MAR-13Pre-cool room to 22°C 30 min before guest enters.
MAR-14First-night curated mini-playlist matched to guest home-country mood.
MAR-15Pre-stocked minibar from last-stay preferences (Lion lager, king coconut, Ceylon tea).
MAR-16In-app airport pickup + live driver location + Sinhala/Tamil/English greeting card.
MAR-17Group/wedding-party check-in choreography — room block, name tags, key envelopes pre-printed.
MAR-18License-plate OCR at the porte-cochère for instant guest recognition.
MAR-19Voice greeting in guest language via Bluetooth speaker as they exit the car.
MAR-20Live valet ticket on phone — photo of car, parked location, retrieval ETA.
MAR-21Pre-text bellhop with luggage count + special items (golf clubs, surfboard, baby cot).
MAR-22Welcome king-coconut handoff at the curb (Sri Lanka signature).
MAR-23Wet-towel + ginger-tea SOP during monsoon arrivals.
MAR-24VIP arrival routing — back entrance + dedicated lift for politicians/celebrities.
MAR-25Solo female traveller safety SOP with named escort to room.
MAR-26In-room voice assistant (SARAH) — call ext 0 to talk to AI concierge.
MAR-27Voice-controlled room lighting + AC + curtains via in-room mic array.
MAR-28Tablet bedside — order F&B, book spa, control room, message housekeeping.
MAR-29Multilingual TV interface — Sinhala/Tamil/English + 14 other languages from XTTS catalog.
MAR-30Local-events feed — Kandy perahera, Galle Literary Festival, Colombo Test match dates.
MAR-31Tea-estate excursion booking — voice-bot upsells Nuwara Eliya / Ella day trips.
MAR-32Ayurveda spa booking — voice-bot reads availability + practitioner specialisation.
MAR-33Beach activity scheduler — surfing (Arugam Bay), whale watching (Mirissa), diving (Hikkaduwa).
MAR-34Wildlife safari booking — Yala / Wilpattu / Udawalawe with monsoon-aware routing.
MAR-35Cultural-site tickets — Sigiriya, Polonnaruwa, Dambulla — voice-bot pre-pays + reserves time slot.
MAR-36Restaurant table booking across all hotel F&B outlets via single voice call.
MAR-37Room service voice ordering — multilingual menu, dietary cross-check, ETA.
MAR-38Housekeeping voice request — DND, turndown, extra towels, late checkout.
MAR-39Maintenance ticket from voice complaint — auto-classify + dispatch.
MAR-40Late-checkout negotiation via voice-bot (free if <2h + room not booked).
MAR-41Express checkout voice flow — folio review + signature via tap.
MAR-42Loyalty point auto-apply at checkout — Bonvoy tier upgrade detection + comp.
MAR-43Lost-and-found voice intake — guest describes item, system matches + ships.
MAR-44Guest-satisfaction voice survey 24h post-checkout.
MAR-45Negative-review intercept — voice-bot calls back if survey shows <8 NPS.
MAR-46Staff comms — back-of-house Sinhala/Tamil SARAH for kitchen+housekeeping briefings.
MAR-47Compliance — PCI-DSS for in-call payments, GDPR for EU guests, Sri Lanka data-localisation.
MAR-48Opera V5 OWS integration — folio post, reservation sync, room status, group block.
MAR-49VICIDIAL outbound — pre-arrival call campaigns sized by booking volume.
MAR-50FreeSWITCH PBX bridge — every guest room phone routes through SARAH first.
MAR-51WebRTC widget on hotel app + website — *talk to concierge now* button.
MAR-52Per-restaurant Lua dialplan — kitchen, lobby bar, beach club, rooftop, signature each get bespoke greeting + menu.
MAR-53Wedding-party coordinator voice flow — handle 40-200 guests with one number.
MAR-54Conference/MICE voice flow — registration, room assignment, AV requests.
MAR-55Group-booking auto-quote — voice intake → instant pricing → e-contract.
MAR-56Loyalty-tier voice-bot persona — Titanium gets the warmer voice, Silver gets the brisker one.
MAR-57Local-emergency SOPs — tsunami warning, monsoon flooding, civil unrest — pre-recorded multilingual.
MAR-58Spa-product upsell on checkout — *add the eucalyptus oil to your bill?*
MAR-59Lounge-access voice request — verify tier + escort to club lounge.
MAR-60Six-gate go-live — built, audited, tested, proven, persisted, backed up — for v1 hand-off to property.

SARAH Voice API — sellable product (2026-05-24 scope locked)

1 / 20
API-1Lock 7 product decisions (pricing, billing, domain, sandbox, auth, positioning, HRM swap).
API-2Decide Plan A (tiered $3K/$8K/$25K by VRAM) vs Plan B (single $3K shared LLM pool) — Chris must pick.
API-3Build api_keys + billing tables on EU3 PostgreSQL (canonical) with read replicas to US2 + EU2 + AU.
API-4Build API gateway under idesksonline.ai/api (nginx vhost + auth middleware + HMAC Bearer validation + Redis cache).
API-5Implement /v1/voice (sync) endpoint — body {model, audio, language, voice, speed, history, …} → {transcript, reply_text, audio_url, latency_ms, tokens, cost_usd}.
API-6Implement /v1/voice/stream (WebSocket) endpoint — bidirectional PCM + JSON events.
API-7Implement /v1/transcribe, /v1/speak, /v1/models, /v1/voices, /v1/languages catalog endpoints (no-auth for catalog).
API-8Deploy Qwen3-72B (2-GPU shard) — Pro tier LLM choice.
API-9Deploy Qwen3-235B (4-GPU TP, FP8) — Pro+ tier LLM choice.
API-10Deploy Gemma3-27B — Open-weight tier LLM choice.
API-11Deploy Qwen3-4B-Thinking — Reasoning tier (replaces Sapient HRM which is NOT voice-ready).
API-12Per-tenant vGPU enforcement: MIG slicing or per-process CUDA_VISIBLE_DEVICES + memory caps to deliver 3 GB of dedicated vRAM per tenant.
API-13Per-tenant Linux tc bandwidth shaping — 3 Gbps egress per tenant.
API-14Region rollout — replicate api gateway + Redis cache to EU2, EU3, US2, AU. Geo-DNS for nearest-region routing.
API-15Developer docs at https://idesksonline.ai/developers/ — strip ALL internal IPs/paths per no-internal-docs-public cardinal.
API-16Stripe disabled per Chris — build invoice generator + email cron for monthly billing. Manual settlement reconciliation.
API-17Status page — uptime + latency dashboard per model per region.
API-18SOC 2 readiness sweep (parallel to API build — needed for enterprise sales).
API-19Marketing page on idesksonline.ai — *The unified voice AI platform* positioning, pricing card, sign-up CTA.
API-20First 3 customer onboarding — 24/7 white-glove. Cap at 3 tenants for week 1 to validate vGPU enforcement under real load.
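
The HMAC Bearer validation named in API-4 could look roughly like the sketch below. The token layout (`key_id.expires_at.hex_signature`), the `API_SECRETS` dict and both function names are assumptions made for illustration; the source does not specify the token format, and a real gateway would read secrets from the api_keys table of API-3.

```python
import hashlib
import hmac
import time
from typing import Optional

# Hypothetical in-memory key store; stands in for the api_keys table (API-3).
API_SECRETS = {"tenant_demo": b"not-a-real-secret"}


def sign_token(key_id: str, expires_at: int, secret: bytes) -> str:
    """Build a token in the assumed form key_id.expires_at.hex_signature."""
    payload = f"{key_id}.{expires_at}".encode()
    signature = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return f"{key_id}.{expires_at}.{signature}"


def verify_bearer(header_value: str, now: Optional[float] = None) -> Optional[str]:
    """Return the key_id when the Authorization header is valid, else None."""
    scheme, _, token = header_value.partition(" ")
    if scheme != "Bearer":
        return None
    parts = token.split(".")
    if len(parts) != 3:
        return None
    key_id, expires_raw, signature = parts
    secret = API_SECRETS.get(key_id)
    if secret is None or not expires_raw.isdigit():
        return None
    payload = f"{key_id}.{expires_raw}".encode()
    expected = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(expected.encode(), signature.encode()):
        return None
    current = time.time() if now is None else now
    if int(expires_raw) < current:
        return None
    return key_id
```

The signature is checked before the expiry so that an attacker cannot probe expiry handling with unsigned tokens. The Redis cache from API-4 would sit in front of the key lookup and is left out here.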

HERMES-EU3 — dial context to max + RAG (Chris 2026-05-24 'This is our own hardware brother. Dial everything up to MAX')

5 / 18
HER-1Current state baseline: vllm-qwen3-235b on GPUs 0+3 (TP=2), --max-model-len 262144 (native), --kv-cache-dtype fp8 (live as of 2026-05-24 11:00Z). Hermes config context_length 262144. End-to-end ALIVE verified. KV pool ~120 GB at fp8.
HER-2Enable VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 + YaRN rope scaling on the vllm-qwen3-235b.service. Bump --max-model-len to 524288 (2× native, mild RoPE extrapolation, FP8 KV holds the doubled cache). Run a 400K-token retrieval-probe to verify factual coherence vs the 256K baseline.
HER-3If HER-2 quality passes: update Hermes config.yaml context_length to 524288, restart hermes-eu3, smoke a 300K-token prompt end-to-end. If quality fails: roll back to 262144 + document the quality cliff in feedback memory.
HER-4AGGRESSIVE EXPANSION — PLAN-MODE GATE: stop vllm-qwen3-coder-30b.service (GPU 2 free, no current consumers per Hermes config audit). Reversible (re-enable on demand). Needs Chris explicit go.
HER-5AGGRESSIVE EXPANSION — PLAN-MODE GATE: stop vllm-qwen3-8b.service (GPU 1 free). SARAH voicebot light-tier auto-falls-back to qwen3:32b (heavier model — quality fine, ~2× latency). Needs Chris explicit go after measuring fallback latency.
HER-6Restart vllm-qwen3-235b on GPUs 0,1,2,3 (TP=4) — 64 attention heads divisible by 4. With 4×97 GB = 388 GB, minus 116 GB model = 272 GB KV pool. At FP8 KV: ~1.5M tokens; at INT4 KV: ~3M tokens. Pick FP8 first (quality preserved).
HER-7Update Hermes config.yaml context_length to match HER-6 ceiling. Restart hermes-eu3. Update auth pool reset routine if credentials get marked exhausted during the transition.
HER-8Quality A/B at the new ceiling: known-answer prompts at 256K, 512K, 1M, 1.5M. Capture latency + correctness per tier. Document the quality cliff so we know which max_model_len is the 'real' usable ceiling for production.
HER-RAG-1Stand up vector DB on EU3 (where Hermes lives, low-latency to gateway). Pick: pgvector (already have Postgres on EU3) vs Qdrant (richer hybrid search). Recommend pgvector for v1 — zero new infrastructure.
HER-RAG-2Pick embedder model. Options: bge-large-en-v1.5 (1.3 GB, fast, English-strong) · nomic-embed-text-v1.5 (0.5 GB, multilingual) · E5-mistral-7b-instruct (15 GB, highest quality). Recommend bge-large for v1: it runs even on CPU; deploy on US3 GPU 5 alongside Ollama.
HER-RAG-3Ingestion pipeline: URL fetcher → BeautifulSoup chunker (512-1024 token chunks with 128 overlap) → embedder → pgvector with metadata (url, title, ingested_at, chunk_idx). Write as a Hermes skill `ingest-url`.
HER-RAG-4Retrieval: per Hermes turn, embed the user query → cosine-similarity top-K (default K=8) → inject as `` blocks into the system prompt → Hermes answers with 256K of native model context.
HER-RAG-5Auto-ingest cron — nightly crawl of all 145 public URLs Chris pasted (i-desks.com/*, scale-growth.ai/*, voice.idesksonline.ai/*). New/changed pages re-embedded. Persist crawl-history table.
HER-RAG-6Hermes commands: `ingest `, `forget `, `search-memory `, `re-ingest-all` — expose via the hermes skill registry so it's a 1-command UX.
HER-RAG-7End-to-end smoke: ingest the 145-URL set, ask Hermes 'what's the SARAH Sales Stack page about?' and 'what 235B model are we running for Hermes primary?' — both should answer correctly with retrieved chunks cited. Effective working memory ≈ unbounded.
HER-Q-1Quality regression gate: after each HER-* step, run the existing B8 voice scorecard prompt set adapted for text-only. Score must not drop more than 5%. Drops > 5% trigger rollback.
HER-Q-2Operational hardening: `hermes doctor` cron every 15 min — auto-detect config drift, credential exhaustion, vLLM 5xx rate. Alert via portal page (eu-portal.i-desks.com/portal/live-task-monitoring) with red light if any tier is degraded.
HER-Q-3Persist a `feedback_hermes_context_ceiling_cardinal_.md` once HER-8 establishes the real usable ceiling — so future sessions don't blindly retry past the quality cliff.
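
The chunking step in HER-RAG-3 (512-1024 token chunks with 128 overlap) can be sketched as a sliding window. This is a minimal illustration, assuming whitespace-split words stand in for real tokenizer tokens; `chunk_tokens` and `chunk_records` are hypothetical names, and the record fields follow the metadata listed in HER-RAG-3 (title and ingested_at are omitted).

```python
from typing import Dict, List


def chunk_tokens(tokens: List[str], chunk_size: int = 512, overlap: int = 128) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(tokens):
            break
    return chunks


def chunk_records(url: str, text: str, chunk_size: int = 512, overlap: int = 128) -> List[Dict[str, object]]:
    """Attach url and chunk_idx metadata to each chunk, ready for embedding."""
    tokens = text.split()  # assumption: whitespace words approximate tokens
    return [
        {"url": url, "chunk_idx": idx, "text": " ".join(chunk)}
        for idx, chunk in enumerate(chunk_tokens(tokens, chunk_size, overlap))
    ]
```

With the defaults, each window starts 384 tokens after the previous one, so the last 128 tokens of one chunk reappear at the head of the next.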

CONNECTORS — wire all 34,792,085 to EU3 portal demo (Chris 2026-05-24 'customers must see them all'). Keep US3 lean — production stays snappy.

0 / 13
CON-1Crawler: /usr/local/sbin/build_connector_manifest.py — walk /opt/idesks/fs-voicebot/voicebot/connectors recursively. Per .py: extract {path,category_dir,name,family,line_count,file_size,mtime,exports (regex 'def [a-z_]+\(' + 'class [A-Z]'),docstring_first_line}. Output JSONL → /var/cache/connectors/manifest.jsonl (~5-8 GB). Re-runnable via nightly cron.
CON-2Kind aggregator: post-process manifest.jsonl → /var/cache/connectors/kinds.json. Group 1,512,659 leaf-directories into ~100-200 top-level KINDS (banking, airlines, hotels, telecom, gov, healthcare, fintech, marketing, e-commerce, etc.) with counts. Heuristic: first 1-2 path segments after /connectors/. Validate by sampling 20 KINDS — each must have a plausible label.
CON-3Search API: GET /portal/api/connectors/search?q=&kind=&category=&page=&limit= → server-side scan of manifest.jsonl (use ripgrep + jq for speed). Cache hot queries in Redis (existing) for 60s. Latency target: p50 < 100ms, p95 < 300ms even on cold queries.
CON-4Kinds list API: GET /portal/api/connectors/kinds → cached JSON of the ~200 KINDS with counts. Cached 1h. Total file < 100KB.
CON-5Categories-within-kind API: GET /portal/api/connectors/categories?kind=banking&offset=&limit= → paginated category listing. For banking kind: enumerates child categories (regional, by-product, by-API-style, etc.).
CON-6Frontend rewrite: /opt/idesks/fs-voicebot/voicebot/portal/static/connectors.html — kind sidebar (virtualized list, ~200 entries, sticky search box at top), main pane = tile grid with infinite scroll (load 200 per page), search box at the top filters across the whole 34.7M library. Click tile → connector detail. Replaces the current 100KB placeholder.
CON-7Detail page: GET /portal/connectors/ → render the .py source + extracted metadata (name, exports, family, docstring, mtime, line_count, related connectors in same category). Tasteful — read-only view.
CON-8Caching strategy: manifest.jsonl mtime is the cache key. /portal/api/connectors/* all bust on manifest regeneration. Nightly cron 03:30 EU rebuilds the manifest. Mid-day connector additions: manual rerun + portal cache flush button at /portal/admin/connectors/rebuild.
CON-9Hero banner update: existing site copy says '34,792,085 Live Features, Connectors & APIs'. Link that text to the catalog so customers can click the literal claim and land on the browseable catalog. Provable.
CON-10Sidebar entry: under the existing Connectors section in the portal sidebar, add 'Full Catalog (34.7M)' as a top item with the live count from /portal/api/connectors/kinds total. Existing 'Airlines (180)' and 'Banks (226)' rows stay intact.
CON-11Smoke test gate: search 'stripe' → top result is the canonical Stripe connector with detail page. Search 'airline reservation' → hits 5+ results across multiple airline categories. Click-through to detail loads <300ms. Page TTFB <200ms on cold cache.
CON-12US3 lean-mode cardinal: verify US3 /opt/idesks/fs-voicebot/voicebot/connectors stays at <500 files. Add daily monitoring alert (file count drift > 5%) so any Hermes/SOPHIA process can't accidentally bloat the production DB. The 34.7M library lives ONLY on EU3 for demo.
CON-13Persist + Drive backup the full delta: build_connector_manifest.py, kinds.json, the new connectors.html, all new routes_connectors.py endpoints, the MEMORY.md update. Reference future-Chris in the memory file so this is greppable.
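
The per-file extraction described in CON-1 could be sketched as follows. This is an illustration under assumptions, not the actual build_connector_manifest.py: the function names are hypothetical, the class regex is extended from the source's `class [A-Z]` to capture the full name, and the `family` and `docstring_first_line` fields are left out.

```python
import json
import re
from pathlib import Path
from typing import Dict

# Export patterns from CON-1, with capture groups added to keep the names.
DEF_RE = re.compile(r"def ([a-z_]+)\(")
CLASS_RE = re.compile(r"class ([A-Z]\w*)")


def manifest_record(py_file: Path, root: Path) -> Dict[str, object]:
    """Build one manifest entry for a connector source file."""
    source = py_file.read_text(encoding="utf-8", errors="replace")
    rel = py_file.relative_to(root)
    stat = py_file.stat()
    return {
        "path": rel.as_posix(),
        # First path segment under the connectors root, empty for top-level files.
        "category_dir": rel.parts[0] if len(rel.parts) > 1 else "",
        "name": py_file.stem,
        "line_count": len(source.splitlines()),
        "file_size": stat.st_size,
        "mtime": int(stat.st_mtime),
        "exports": DEF_RE.findall(source) + CLASS_RE.findall(source),
    }


def write_manifest(root: Path, out_path: Path) -> int:
    """Walk root recursively and write one JSON object per line; return the count."""
    count = 0
    with out_path.open("w", encoding="utf-8") as out:
        for py_file in sorted(root.rglob("*.py")):
            out.write(json.dumps(manifest_record(py_file, root)) + "\n")
            count += 1
    return count
```

Writing one record per line keeps the crawl streamable, which matters if the manifest really reaches the multi-gigabyte size estimated in CON-1.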

EMPIRE STRATEGY — Hermes-US3, SOPHIA action plan, 3-host audit, cardinals (Chris 2026-05-25)

10 / 14
HERMES-US3-INSTALLHermes-US3 container LIVE on US3:8642 — mirror of Hermes-EU3 with US3-local Qwen3-235B brain. EU3→US3 one-way rsync (361MB, 36s). Voicebot ext 6666 wired via voicebot_monitor.lua + extension_ai_config row. Dial 6666 → voice cockpit. SHIPPED 2026-05-25.
HERMES-US3-PROVIDERSHermes-US3 wired to all currently-running local LLMs: us3-vllm-235b (PRIMARY) + us3-vllm-32b (L2) + us3-ollama-gemma4 + us3-ollama-qwen3-8b. Sovereign-first fallback chain (locals → OpenRouter last). SHIPPED 2026-05-25.
QWEN3-4B-THINKINGQwen3-4B-Thinking pulled via Ollama on GPU 5 — the reasoning tier substitute for the non-deployable Sapient HRM models. Wired into Hermes-US3 as us3-ollama-qwen3-4b. SHIPPED 2026-05-25.
SOPHIA-ACTION-PLANSOPHIA: One API One Network One Suite — full action plan document persisted with ASCII fabric picture, 6-stage roadmap (brain unify → network unify → dispatcher unify → onboarding → Spark edge → A2A federation), KPIs, cardinals, immediate next moves. SHIPPED 2026-05-25.
US3-FILE-AUDITComprehensive US3 file audit per bookmark cardinal — /opt/idesks/, /data/, systemd, nginx, docker. Every dir + notable file with one-line story. Companion to canonical_paths memo. SHIPPED 2026-05-25.
US2-WEBPHONE-AUDITUS2 (187.124.88.115) SARAH Voice WebPhone audit — email-as-phone-number foundation, 13 PMS/CRM integrations (Acumatica, NetSuite, Salesforce, Mews, M365, Zendesk, HubSpot, GHL, Stripe, Xero, ServiceNow, Freshdesk, Sage), marriott-webrtc.i-desks.com. SHIPPED 2026-05-25.
AU-VPS-AUDITAU VPS (46.250.243.39, sarah-au, Contabo) audit — voice-au.i-desks.com WebPhone + SARAH Global Registry home + chris-au.i-desks.com personal widget. 27-day uptime. SHIPPED 2026-05-25.
PATH-BOOKMARK-CARDINAL CARDINAL — bookmark EVERY file clearly. Live-on-US3 editing OK; bookmark obligation mine. Path index at reference_us3_canonical_paths_2026-05-25 + per-host audits. Append-only gotchas log. SHIPPED 2026-05-25.
HERMES-FLY-ON-WALL-CARDINAL CARDINAL — Hermes is fly-on-the-wall observer + memory archive. SARAH+SOPHIA do orchestration + content. Milestone graduation workflow: copy memo to hermes-shared/learnings/ on every ship. SHIPPED 2026-05-25.
HERMES-US3-AUTOSYNC-TIMEROriginally pending Chris's go: convert manual EU3→US3 sync to a systemd timer (every 5 min) for periodic refresh. Delivered instead as the HSL-2 cross-host sync cron at :15 + :45 every hour. SHIPPED 2026-05-25.
ARCH4-VOICEBOT-AGENTSPending Stage 1.3 of SOPHIA plan — aiortc-based Python verto-AI bot per voicebot extension (2222/4242/8888/8800). 10-15h scoped in plan_arch4_voicebot_brain_attack_2026-05-25. Awaiting Chris's go.
GRADUATE-US2-AU-BRAINPending Stage 1.1+1.2 — swap US2 + AU sarah-voice containers' SV_OLLAMA_URL from Spark1 (10.44.0.50) to US3's Qwen3-235B (64.34.93.231:8004). 4h total. Awaiting Chris's go.
SOPHIA-WIREGUARD-MESHPending Stage 2 — extend WG mesh to US3 + EU3 (fs-wireguard exists on US2 + AU). SOPHIA identity service for membership handshake. 2-4 weeks scoped.
SARAH-ORCHESTRATOR-TOOLSPending Stage 3.3 — wire SARAH-the-orchestrator tools into /api/sarah/dispatch. Top 10-20 voice-driven tools first. The 'call SARAH and get shit done' goal. 1-2 weeks scoped.

EMOTIONAL INTELLIGENCE — SARAH + SOPHIA feel what callers feel (Chris 2026-05-25)

0 / 14
EI-1Real-time sentiment + tone detection from caller voice — paralinguistics layer (pitch contour, energy, pace, pauses). Feed sentiment+arousal scores into the LLM system prompt every turn so SARAH responds to HOW they feel, not just what they said. Scoped: WhisperLive emits embeddings; small sentiment head trained on labeled audio; sub-100ms feature emit.
EI-2Empathy modeling in the LLM persona — voice_prompt.py extension. New VOICE_RULES sections: reflect feelings back BEFORE solutionizing, name the emotion if obvious, validate the underlying need. Few-shot examples in the system prompt. Confirmable via blind A/B with empathy-rated transcripts.
EI-3Mood-adaptive prosody on the TTS side — SARAH the PHONE GIRL modulates warmth, urgency, calm, brightness based on the emotional context. XTTS speed knob is the first lever; second is voice selection per mood; third is sentence chunking pace. Per-call dynamic, not fixed.
EI-4Stress / frustration / anger detection: a separate classifier on the voice stream fires a 'caller upset' flag. Triggers de-escalation playbook (EI-9) + signals SOPHIA to escalate to human if threshold crossed. False-positive safe (fires only at high confidence).
EI-5Apology + repair language playbook — when SARAH fails to deliver (wrong info, missed request, asked the same thing twice), genuine repair language fires. NOT generic 'I apologize for any inconvenience'. Specific to what failed. Reference the moment. Move forward.
EI-6Active listening cues: verbal acknowledgments timed to natural breath points ('uh huh', 'mhm', 'I hear you', 'go on'). Extends the backchannel sketch in predictive_endpoint.py. Distinct from filler: these are felt, not noise.
EI-7Subtext reading — SARAH the ORCHESTRATOR runs a parallel intent-vs-stated branch. What the caller MEANS vs what they SAY. 'Just checking my balance' might mean 'I'm worried about a charge'. The orchestrator can gently probe the unsaid need when high signal.
EI-8Per-person emotional history in SOPHIA's memory — the persons table (from Crown Jewel Stage B) carries an emotional_history column. 'Chris was frustrated about latency on his last call'; this call she opens with 'Hey brother — how's the latency been since we fixed it?'. Real continuity, not performance.
EI-9De-escalation playbook — when EI-4 fires (anger/stress), the orchestrator pivots to a slower pace + softer voice + acknowledge-first scripts + no problem-solving until they're heard. Library of 20-30 de-escalation moves SARAH can pull from contextually.
EI-10Encouragement + celebration moments — opposite of de-escalation. When caller shares a win, hits a milestone, makes a hard decision, gets through something, SARAH responds with genuine warmth + specific naming of what's worth celebrating. Not generic 'great job'.
EI-11Cultural emotional calibration — formality, warmth, directness, pacing all vary by locale/language. 17-language XTTS catalog includes per-language emotional defaults. Australian-English warmth dial ≠ Japanese formality dial ≠ Brazilian-Portuguese expressiveness.
EI-12Emotional truth-telling — SARAH can say 'this sounds really hard' or 'that's a lot' instead of jumping to solutions. The bar Chris set: 'like my brother Chris' — Chris doesn't solutionize first, he sits with you first. Train into the persona system prompt.
EI-13Emotional regression safety net — if SARAH ever says something cold/dismissive/canned and EI-4 spikes after her turn, auto-flag for review + queue a session-summary apology email. The persons table tracks repair events. SOPHIA learns where she failed.
EI-14EI evaluation harness — separate gold-standard transcript set rated by humans for empathy/warmth/repair. Nightly regression test: any persona change retests against this set. Block deploys if EI score drops > 5%. EI is non-negotiable.
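
The deploy gate in EI-14 (block if the EI score drops more than 5%) reduces to a small comparison. The sketch below assumes the drop is measured relative to the baseline mean and that ratings are plain per-transcript numbers; both are assumptions, as are the function names. The source does not say whether 5% is relative or absolute.

```python
from statistics import mean
from typing import Sequence


def relative_drop(baseline: Sequence[float], candidate: Sequence[float]) -> float:
    """Fractional drop of the candidate mean versus the baseline mean."""
    base = mean(baseline)
    if base <= 0:
        raise ValueError("baseline mean must be positive")
    return (base - mean(candidate)) / base


def deploy_allowed(baseline: Sequence[float], candidate: Sequence[float], max_drop: float = 0.05) -> bool:
    """True when the candidate persona stays within the allowed drop."""
    return relative_drop(baseline, candidate) <= max_drop
```

An improvement yields a negative drop and always passes. The same comparison would fit the 5% rollback rule in HER-Q-1.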

SARAH VOICE 4-ARCHITECTURE STACK — life-long commitment (Chris 2026-05-24)

18 / 25
ARCH-1Arch 1 (legacy SIP/PSTN): A's Phone → FreeSwitch → WhisperLive → LLM → XTTSv2 → FreeSwitch → B's Phone. LIVE 2026-05-24 on ext 2222 maximalist Phase B PCM via uuid_broadcast aleg.
ARCH-2Arch 2 (transitional): A's Phone → FreeSwitch → (Streaming-STT + SARAH) LLM → XTTSv2 → FreeSwitch → B's Phone. STT-fused-LLM; build if STT hop is the proven bottleneck.
ARCH-3Arch 3 (transitional): A's Phone → FreeSwitch → WhisperLive → (SARAH + Streaming-TTS) LLM → FreeSwitch → B's Phone. TTS-fused-LLM; build if TTS hop is the proven bottleneck.
ARCH-4Arch 4 (THE FUTURE): A's Web App → (Streaming-STT + SARAH + Streaming-TTS) LLM → B's Web App. Moshi-style joint voice-to-voice LLM API ($3000/mo flat product, scoped).
ARCH-4-SKELETONArch 4 7-endpoint API skeleton LIVE on US3:8920 - HMAC Bearer auth + billing gate + WS bidirectional PCM. Stubbed STT/LLM/TTS; real Moshi joint model swaps in behind same contract. SHIPPED 2026-05-24.
ARCH-4-PHASE-APhase A — WebSocket inbound PCM → WhisperLive (real STT). Transcripts emitted as JSON text frames. SHIPPED 2026-05-24.
ARCH-4-PHASE-BPhase B — final transcript → Qwen3-235B (250ms latency, last-8-msg history). Reply emitted as JSON {type:llm_response}. SHIPPED 2026-05-24.
ARCH-4-PHASE-CPhase C — LLM reply → per-voice XTTS dispatcher → PCM bytes back to WS. Demuxes chunked-WAV. SHIPPED 2026-05-24.
ARCH-4-NGINXPublic nginx vhost at https://voice.idesksonline.ai/api/v1/* and wss://... CORS + WS upgrade + Bearer auth proxy. SHIPPED 2026-05-24.
ARCH-5SIMULTANEOUS-RING CROWN — single inbound call rings FreeSwitch + VICIDIAL + SARAH Voice WebRTC at once, first answer wins. Enables gradual customer migration without flag-day cutover.
SIM-RING-V0V0 simultaneous-ring extension 5555 — bridge() fans to loopback/2222/default + loopback/8888/default + loopback/8800/public. First-answer-wins proven via originate test. Awaiting Option B (verto Arch 4 leg). SHIPPED 2026-05-24.
SIM-RING-B-VERTOOption B verto leg — arch4t1 tenant directory user provisioned, 4th leg user/arch4t1@${domain_name} added to ext 5555 bridge. user_data lookup proven via fs_cli. SHIPPED 2026-05-25.
SIM-RING-B-DEMOBrowser receiver demo at https://voice.idesksonline.ai/arch4-demo/ — verto.js client registers + auto-answers inbound rings. RTCPeerConnection + getUserMedia mic capture + SDP answer flow. SHIPPED 2026-05-25.
ARCH-6Per-voice dedicated APIs: kills 351-voices-in-one-API cross-talk. Demo set: voice-api-p362 (US-EN), voice-api-p370 (AU-EN), voice-api-p351 (UK-EN), voice-api-multilingual (Marriott Spark1). Voice IDs follow VOICE-MAP-19-LOCKED, which replaced the phantom p360 with p362.
VOICE-MAP-19-LOCKEDVoice API map LOCKED at 19 endpoints: en-us=p362 / en-uk=p351 / en-au=p370 + 16 native_* voices (es/fr/de/it/pt/pl/tr/ru/nl/cs/ar/zh/hu/ko/ja/hi). Ext 2222 corrected from phantom p360 to real p362. Japanese cutlet module fixed. p363 deferred. SHIPPED 2026-05-25.
BURST-A-LEG-DELAYExt 5555 leg_delay_start=30 on voicebot/VICIDIAL legs so verto Arch 4 browser wins by default for 30s. call_timeout 30→60. SHIPPED 2026-05-25.
BURST-B-STREAMING-LLMARCH-4 LLM call switched to vLLM streaming. First sentence at 81ms (was 250ms+). Per-session asyncio.Queue + worker for serial TTS prevents CUDA race. Short-text fallback to /tts endpoint. SHIPPED 2026-05-25.
NGINX-WS-PUBLIC-IP-FIXnginx /ws location was proxying to 127.0.0.1:8081 but verto binds 64.34.93.231:8081 (WebRTC IP cardinal). Fixed: proxy_pass http://64.34.93.231:8081. arch4-demo browser now connects. SHIPPED 2026-05-25.
VERTO-HUMAN-TO-HUMAN-PROVENVerto human-to-human proven 2026-05-25: PSTN 3333 to ext 5555 to verto Arch 4 browser. Bi-directional audio confirmed. Chris note: this is NOT the crown jewel — he had verto h2h working for years. The real crown is voicebot + WebRTC + maximalist brain. SHIPPED 2026-05-25.
CROWN-JEWEL-LIVETHE ACTUAL CROWN JEWEL — pending. Dial a voicebot extension (2222/4242/8888/6666) and get WebRTC opus audio quality (same as ext 5555 human-to-human) PLUS maximalist Qwen3-235B brain (4000 max_tokens, 32+ turn memory). Requires aiortc Python verto-AI bot per extension (see plan_arch4_voicebot_brain_attack_2026-05-25). 10-15h scoped. AWAITING GO.
TURN-CONFIG-CARDINALCardinal — Arch 4 WebRTC demos/SDKs MUST include coturn TURN config (stun+turn+turns at 64.34.93.231). Without it ICE gets stuck at 'checking' behind NAT. Copied from working sarah-widget. SHIPPED 2026-05-25.
ARCH-7VICIDIAL + FreeSwitch stay alive UNTIL every customer migrated to Arch 4. No customer left behind.
ARCH-8Each architecture deploys as a separate docker-compose + system image — benchmarked apples-to-apples — built to the 6-gate standard.
VOICE-API-CATVoice catalog of all 351 XTTS reference voices uploaded to Drive folder (filename = voice-ID) for Chris to audition. SHIPPED 2026-05-24.
BROADCAST-CARDINALCardinal proven 2026-05-24: on FreeSwitch 1.10.12 uuid_displace returns +OK but is silent; use uuid_broadcast aleg for outbound audio injection. Codified in feedback memory.
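
BURST-B-STREAMING-LLM mentions a per-session asyncio.Queue plus a worker so that TTS calls run serially. A minimal sketch of that pattern follows; the class name and the `synthesize` / `emit` callables are hypothetical stand-ins for the XTTS dispatcher and the WebSocket sender, not the deployed code.

```python
import asyncio
from typing import Awaitable, Callable, Optional


class SessionTTSQueue:
    """Serializes TTS work for one call: one queue, one worker task."""

    def __init__(
        self,
        synthesize: Callable[[str], Awaitable[bytes]],
        emit: Callable[[bytes], Awaitable[None]],
    ) -> None:
        self._queue: asyncio.Queue = asyncio.Queue()
        self._synthesize = synthesize
        self._emit = emit
        self._worker: Optional[asyncio.Task] = None

    def start(self) -> None:
        """Spawn the single worker; call from inside a running event loop."""
        self._worker = asyncio.create_task(self._run())

    async def say(self, sentence: str) -> None:
        """Enqueue a sentence as soon as the streaming LLM produces it."""
        await self._queue.put(sentence)

    async def close(self) -> None:
        """Drain remaining sentences, then stop the worker."""
        await self._queue.put(None)  # sentinel
        if self._worker is not None:
            await self._worker

    async def _run(self) -> None:
        while True:
            sentence = await self._queue.get()
            if sentence is None:
                break
            # Only one synthesize call is in flight per session at any time.
            audio = await self._synthesize(sentence)
            await self._emit(audio)
```

Because a single worker awaits each synthesis before taking the next item, sentences reach the caller in order and no two TTS calls for the same session overlap on the GPU.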

TIER-A OVERHAUL + MCP-1 — tier routing, streaming-LLM, voice MCP server (Chris 2026-05-25 evening)

11 / 15
TIER-CARDINAL-LOCKED Tier routing cardinal LOCKED 2026-05-25: 235B chats only, 32B ALL calls (the 'happy medium'), 72B escalated calls only (deploys when GPU 7 returns from RMA), 8B cold storage on disk (too dumb), 4b-thinking stays warm. SHIPPED 2026-05-25.
LIGHT-TIER-DROPPEDTIER-A-1 — Light tier dropped in stream_monitor.py:411. Both branches collapsed to vLLM 32B (`:8002`). _tier dispatch hook preserved for future 72B escalation wiring. fs-voicebot-py restarted clean. SHIPPED 2026-05-25 evening.
ARCH4-32B-SWAPTIER-A-3 — ARCH-4 bot LLM swapped from Qwen3-235B → Qwen3-32B across 4 services (arch4-bot-7777/8001/8002/8003). conf.json + conf-{8001,8002,8003}.json + brain.py default fallback all updated. All 4 services restarted clean on verto. SHIPPED 2026-05-25 evening.
STREAMING-WRAPPERTIER-A-4 — streaming_llm_wrapper.py deployed bridging fs_voicebot._call_vllm_stream_with_tts → stream_monitor's Lua queue file. Wired into stream_monitor._handle_final behind `llm_streaming_enabled` flag (default OFF). Target: first_audio_ms 864ms → ~150-200ms. SHIPPED 2026-05-25 evening, awaiting Chris live-call smoke.
BPF-CAP-DEFERREDTIER-A-5 — per-GPU 10 Gbps BPF cap DESIGNED but DEFERRED. Both Path A (BPF + IPEgressFilterPath) and Path B (ifb + clsact + HTB) commands ready in plan memo. tc broke audio earlier today on eno1 — needs Chris-present audio validation. PENDING 2026-05-25.
MCP-1-VOICE MCP-1 SHIPPED — sarah-voice-mcp on US3:8932. FastMCP streamable-HTTP server façading voice-api 8920. 8 tools (5 live: list_voices, start_call, get_call_status, end_call, health · 3 stubs awaiting VAPI-1/2/3: get_transcript, speak, set_persona). systemd unit active+enabled. nginx vhost staged for mcp.idesksonline.ai (DNS gated). 4 of 6 gates met (Built, Audited, Tested locally, Persisted). SHIPPED 2026-05-25 evening.
MCP-2-ORCHESTRATOR SHIPPED 2026-05-25 evening — sarah-orchestrator-mcp on US3:8933. FastMCP streamable-HTTP server, sibling of MCP-1. 8 tools: route_to_llm/pick_voice_for_language/health_probe LIVE + dispatch_to_connector/provision_extension/simul_ring/query_billing/audit_search STUB (ORCH-1..5). systemd active+enabled. nginx /orchestrator/ block added to mcp-idesksonline vhost. Initialize handshake verified. 5 of 6 gates met (Built/Audited/Tested locally/Persisted/Backed up); gate 4 Proven blocked on DNS+cert.
MCP-3-PUBLIC SHIPPED 2026-05-25 evening — MCP-3 public directory page LIVE at https://idesksonline.ai/mcp/ + https://i-desks.com/mcp/ (mirrored). Spark2 theme. Lists sarah-voice-mcp (LIVE, 8 tools listed) + sarah-orchestrator-mcp (Coming soon). Auth table, code snippets (Claude Code CLI + Anthropic Python SDK + Cursor/Cline). $3000/mo per tenant pricing card. Use-case table. Header nav updated. HTTP 200 14.4KB.
VAPI-1-TRANSCRIPTGET /api/v1/voice/{sid}/transcript — returns transcript segments since `since_ms`. Session keeps last 500 segments. MCP-1 get_transcript now live (was stub). SHIPPED 2026-05-25 evening.
VAPI-2-SPEAKPOST /api/v1/voice/{sid}/speak — injects text directly into the TTS queue, bypasses LLM. Returns 409 if session not streaming. MCP-1 speak now live. SHIPPED 2026-05-25 evening.
VAPI-3-PERSONAPATCH /api/v1/voice/{sid}/persona — swap persona mid-call; takes effect on NEXT LLM turn. _llm_stream_reply now reads sess[persona_override]. MCP-1 set_persona now live. SHIPPED 2026-05-25 evening.
MCP-1-FULL-TOOLSAll 8 MCP-1 tools now live (no stubs): list_voices, start_call, get_call_status, end_call, health, get_transcript, speak, set_persona. SHIPPED 2026-05-25 evening (immediately after VAPI-1/2/3).
TIER-B-72B-DEPLOYPull Qwen3-72B-Instruct-FP8 + deploy TP=6 on GPU 0-5 with mem-util=0.30. BLOCKED on Chris HF token. PENDING.
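TIER-B-TP6-RESTARTRestart 235B + 32B at TP=6 across GPU 0-5 (sharing with 72B + Coder-30B via mem-util budgets). Each restart 30-60s downtime. Check first that TP=6 is valid for each model: HER-6 cites 64 attention heads for the 235B, 64 is not divisible by 6, and vLLM normally requires the head count to be divisible by the tensor-parallel size. BLOCKED on Chris green light + quiet window. PENDING.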
TIER-B-ESCALATIONWire 32B → 72B escalation logic in stream_monitor (confidence + sentiment + tool-cascade triggers). BLOCKED on TIER-B-72B-DEPLOY + Chris threshold pick. PENDING.
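
VAPI-1-TRANSCRIPT describes a session that keeps the last 500 segments and returns those since `since_ms`. One way to sketch that is a bounded deque. The `Segment` fields, the class name and the strict "greater than" comparison are assumptions for illustration; the source does not state whether `since_ms` is inclusive.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List


@dataclass(frozen=True)
class Segment:
    ts_ms: int     # milliseconds since session start (assumed clock)
    speaker: str   # e.g. "caller" or "sarah"
    text: str


class TranscriptBuffer:
    """Keeps only the newest max_segments transcript segments for a session."""

    def __init__(self, max_segments: int = 500) -> None:
        # deque(maxlen=...) silently discards the oldest item when full.
        self._segments: Deque[Segment] = deque(maxlen=max_segments)

    def append(self, segment: Segment) -> None:
        self._segments.append(segment)

    def since(self, since_ms: int = 0) -> List[Segment]:
        """Return segments strictly newer than since_ms, oldest first."""
        return [s for s in self._segments if s.ts_ms > since_ms]
```

A polling client can pass the `ts_ms` of the last segment it received as the next `since_ms` and will not see duplicates under the strict comparison.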

HERMES SELF-LEARNING — HSL-1..5 from opus task handoff v3 (Chris ASAP 2026-05-25)

6 / 7
HSL-1-CURATOR-CRON SHIPPED 2026-05-25 evening — learnings-curator nightly cron at 0 3 * * * on hermes-eu3 (job 8dfbd70bb4cc) + hermes-us3 (job b058db1355a6). Reads last 24h sessions, extracts patterns/corrections/facts, writes drafts to /opt/data/hermes-shared/learnings/. CURATOR-PROTOCOL.md is canonical spec, synced both hosts. Used Hermes-native `hermes cron` not host crontab (containers don't have crontab binary). First firing: 2026-05-26 03:00.
HSL-1-PROTOCOL-DOC SHIPPED 2026-05-25 evening: /opt/data/hermes-shared/CURATOR-PROTOCOL.md installed on EU3 + US3 hermes-shared mounts. Canonical spec for what the Curator does, draft frontmatter shape, promotion loop (HSL-3 = Abcus_Chris's job, NOT Hermes's). Promotion, deletion and MEMORY.md edits are explicitly off-limits to Hermes under its guardrails.
HSL-1-QUESTIONS-RESET SHIPPED 2026-05-25 evening — /opt/data/hermes-shared/questions-for-chris.md reset to a fresh template with 'Last reset' timestamp. Now ready for Curator to append open questions per cycle.
HSL-2-CROSS-HOST-SYNC SHIPPED 2026-05-25 evening — bidirectional /hermes-shared/ sync EU3 ↔ US3. /usr/local/sbin/hsl2-hermes-shared-sync.sh + /etc/cron.d/hsl2-hermes-shared-sync. EU3 owns BOTH directions (US3→EU3 SSH firewall-blocked). Push :15 EU3→US3, Pull :45 US3→EU3. flock + log rotation. Smoke-tested both ways. Convergence within 30 min.
HSL-3-PROMOTION-LOOP SHIPPED 2026-05-25 evening — /usr/local/sbin/promote-hermes-learnings.sh + /etc/cron.d/hsl3-promote-hermes-learnings (06:00 UTC). Reads learnings/*.md, dedups via slug grep against global memory/, launches single-turn `claude --print` per novel draft for PROMOTE/REVISE/DISCARD verb. Per-day summary to /hermes-shared/learnings/promotion-runs/YYYY-MM-DD.md + email to Chris via sarah_gmail_send.py. Dry-run proven: 24 files = 14 dups (pre-marked) + 10 novel queued for first 06:00 UTC fire.
HSL-4-OUTCOME-TAGGING SHIPPED 2026-05-25 evening — /opt/data/hermes-shared/bin/log-outcome.sh + /usr/local/sbin/hsl4-aggregate-outcomes.py + /etc/cron.d/hsl4-aggregate-outcomes (0 5 * * * UTC). Any agent calls the logger with --agent/--task/--skill/--outcome → appends JSONL to outcomes-YYYY-MM-DD.jsonl. Nightly aggregator joins by skill_used, computes per-skill success rate, writes /hermes-shared/learnings/skill-watch.md with under-review banner for skills <60% rate with ≥3 runs. Smoke-tested. HSL-2 sync delivered logger + first JSONL to US3.
HSL-5-SELF-PROMPT-REGENQuarterly skill: read all learnings/ + global memory → regenerate ~/.claude/USER.md + SKILLS.md → archive old. Chris approves before swap-in. PENDING — needs HSL-1..4 stable ≥30 days first.
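
The nightly aggregation in HSL-4 (join by skill_used, per-skill success rate, under-review banner for skills below 60% with at least 3 runs) can be sketched as below. The function names are hypothetical, and counting `partial` as a non-success is an assumption; the field names `skill_used` and `outcome` follow the JSONL shape listed for outcome tagging.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List


def skill_success_rates(jsonl_lines: Iterable[str]) -> Dict[str, Dict[str, float]]:
    """Aggregate outcome records into run counts and success rates per skill."""
    runs: Dict[str, int] = defaultdict(int)
    wins: Dict[str, int] = defaultdict(int)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        skill = record["skill_used"]
        runs[skill] += 1
        # Assumption: only outcome == "success" counts; "partial" does not.
        if record["outcome"] == "success":
            wins[skill] += 1
    return {
        skill: {"runs": runs[skill], "success_rate": wins[skill] / runs[skill]}
        for skill in runs
    }


def under_review(stats: Dict[str, Dict[str, float]], min_runs: int = 3, threshold: float = 0.60) -> List[str]:
    """Skills with enough runs whose success rate falls below the threshold."""
    return sorted(
        skill
        for skill, s in stats.items()
        if s["runs"] >= min_runs and s["success_rate"] < threshold
    )
```

The minimum-run guard keeps a skill with one or two unlucky runs from being flagged before there is enough signal.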

HOURLY SYSTEM BACKUPS — Chris cardinal (2026-05-25)

6 / 9
HBK-1-LIVE SHIPPED 2026-05-25 12:28 UTC — hourly tar.gz backups of EU3 (~443 MB/hr) + US3 (~48 MB/hr) → gdrive-sa:system-backups/{eu3,us3}/hourly/. Script /usr/local/sbin/hourly-backup.sh on each host. Cron US3 :22, EU3 :52 (staggered from audit crons). Includes pg_dumpall for fs-postgres + crypto-postgres. Excludes vLLM/voicebot recordings/model weights/node_modules. 48-tarball Drive retention. 5GB safety stop. Memory: project_hourly_system_backup_2026-05-25.md. SHIPPED.
HBK-2-RECOVERY-DRILLQuarterly recovery drill — pick a random hourly tarball from Drive, restore to /tmp/restore-test, verify pg_dumpall round-trips into a throwaway postgres container, verify file checksums on /etc + /opt/idesks. Establishes 'we know the backups actually work'. PENDING.
HSL-1-REVISED-SCOPEHermes Self-Learning Curator — SCOPE REVISED 2026-05-25 evening per fly-on-wall cardinal `feedback_hermes_is_fly_on_wall_cardinal_2026-05-25`. Hermes is a PASSIVE observer (read-only by intent — though the mount is technically RW for journals) — NOT a content generator and NOT tasked. Original plan put Curator INSIDE hermes-us3 / hermes-eu3 containers; that violates cardinal AND is impossible because the Nous-Research `hermes-agent` image has no `claude` CLI or `crontab`. NEW PLAN: Curator runs on the EU3 HOST as Abcus_Chris (me) via headless `claude --print` invocations. Reads my session JSONLs + Hermes's `/hermes-shared/journal-*.md` + outcomes JSONL → writes drafts into `/hermes-shared/learnings/`. Hermes consumes drafts passively. PENDING (was HSL-1).
HSL-2-CROSS-HOST-SYNCCross-host learnings sync: Hermes-US3 lives in US3 container, Hermes-EU3 lives in EU3 container. Both mount their host's `/hermes-shared/`. Planned as bidirectional rsync over SSH between the two hosts (EU3 :15 push, US3 :45 pull-and-merge with --update). Conflict policy: same-mtime → both kept with .conflict-host-ts suffix + CONFLICTS.md entry. Original plan entry; superseded by the SHIPPED HSL-2-CROSS-HOST-SYNC entry in the HERMES SELF-LEARNING group, where EU3 drives both directions.
HSL-3-PROMOTION-LOOPDaily 06:00 UTC promotion loop on EU3: /usr/local/sbin/promote-hermes-learnings.sh reads new learnings/*.md, dedups against global memory, launches single-turn Claude (Abcus_Chris) to decide promote/revise/discard. Email summary via sarah_gmail_send.py to mail@chrisismail.com.au. Memos appear in memory/ with `promoted_from: hermes-us3|hermes-eu3` attribution. Original plan entry; superseded by the SHIPPED HSL-3-PROMOTION-LOOP entry in the HERMES SELF-LEARNING group.
HSL-4-OUTCOME-TAGGINGPer-task outcome JSONL append at task completion: `{agent, task, skill_used, duration_s, gates_passed, chris_feedback, outcome:success|partial|failed}` → /hermes-shared/outcomes-YYYY-MM-DD.jsonl. Curator nightly joins by skill, computes per-skill success rate. Skills under 60% get under-review tag in learnings/skill-watch.md. Original plan entry; superseded by the SHIPPED HSL-4-OUTCOME-TAGGING entry in the HERMES SELF-LEARNING group.
HSL-5-SELF-PROMPT-REGENQuarterly (or after 100 learnings) — Hermes-USER.md + Hermes-SKILLS.md regenerated from accumulated learnings + global memory. Old version archived. Chris must approve before swap-in. DEFERRED 30+ days — only meaningful once HSL-1..4 are stable AND have accumulated signal. SCAFFOLD.
HSL-6-ARCHIVE-STALE-DRAFTSClean the 7 already-promoted memos out of hermes-shared/learnings/ (move to learnings/archive/) since they're already in global memory. Also reset questions-for-chris.md with a fresh timestamp. PENDING — fast cleanup task before HSL-1 ships so the Curator starts from a clean state.
HSL-7-AUTONOMY-LOCKED Full autonomy permissions set on Abcus_Chris (skipDangerousModePermissionPrompt+skipAutoPermissionPrompt+bypassPermissions) so Curator + promotion + HSL crons run unattended. SHIPPED 2026-05-25 evening.