An Agentic Software-Development System — Design Reference

What this is. A working description of an agentic development system that a single developer runs across several small products, evolved over months of daily use and explicit architecture work. It covers the knowledge base, the agent roster, the orchestration, the human-in-the-loop discipline, the cost and billing model, the multi-provider model gateway, telemetry and dashboards, the self-improvement loop, and the operational machinery that ties them together, plus the reasoning behind the choices, what had to be refined in practice, and the gotchas that cost real time.

Who it's for. Anyone wanting an accurate picture of how the system actually works today: the operator who runs it, and others reading it as a published internal reference. It is a current-state reference first. A short adaptation section (§A) remains for a reader who wants to scale the pattern to a team, but the lead intent is transparency, not a migration plan.

How to read. Progressive depth. §0 is the whole thing in three paragraphs. §1 is the system in about two pages. §2 is per-subsystem detail. §3 is reasoning, evolution, gotchas, and reference. §A is the adaptation guidance. §B is the implementable layer, a spec-grade entry per agent. Stop at whatever depth answers your question.

§0 — At a glance

One developer runs several small products with a fleet of specialized agents that do the actual building. The developer's job collapses to two verbs: review and decide. Planning, coding, reviewing, bug-fixing, documenting, and capturing decisions are agent work, coordinated by two orchestrators (one the developer drives by day, one that runs unattended overnight) and grounded in a shared knowledge base the agents query as they work.

Two rules are absolute. No agent ever merges code, so autonomous work always ends as a ready-to-merge PR that the human merges. And the human is pulled in only for high-level architectural calls, with the analysis already done, so the surface is "approve option B or redirect," never "help me think this through." Between those two gates the system runs itself. They are kept deliberately narrow: over-pause and the human becomes a rubber stamp; under-gate and an agent makes an irreversible mistake nobody caught.

The hardest-won lesson is that the job is not to produce more code but to keep the developer's mental model intact while agents change the codebase out from under them. One agent (the explainer) does nothing else, and its quality is the system's single highest bar. Everything else in the design exists so those two gates scale without eating the developer's day: telemetry from the first run, self-improving agents that can only propose, a multi-provider gateway that runs high-volume agents on cheaper models behind the same review net, harness-enforced review-integrity gates that the merge decision can trust, and a per-team concurrency ceiling that keeps each orchestrator coherent.

§1 — The system in two pages

1.1 The four layers

Knowledge base. Engineer-authored docs live as markdown in one git repo. A watcher syncs them into a SQLite index (full-text plus structured frontmatter); agents query the index through a read-only MCP server (kb_search / kb_get / kb_list_topics). The index is regenerable, so files are the source of truth and the database is a cache. Per-product activation profiles materialize the right subset of agents and skills into a given work directory. A tier-gated read-only web site renders the human-facing subset of the same corpus.
Agent roster. A shared roster of role-scoped agents, each a markdown file: system prompt plus frontmatter (role, model tier, tool grant, escalation contract). Each prompt says what the agent does and what it explicitly does not, since that negative space is what keeps roles from drifting. Agents never talk to each other directly; an orchestrator intermediates. Repo-local crews (one per self-contained app) extend the pattern without entering the shared index.
Orchestration and infrastructure. One interactive orchestrator (the developer's daytime entry point that plans, asks, and dispatches) and one scheduled orchestrator (runs overnight, fully autonomous, no questions). Both dispatch workers through a wrapped CLI (agentctl) that enforces a per-team four-worker ceiling under a higher raw-resource cap, a budget gate, deterministic worktree isolation, real per-run cost accounting, and a multi-provider model gateway. Scheduled work runs as local launchd jobs.
Self-improvement loop. Telemetry and audits feed the observability-scribe; a curator proposes prompt and knowledge changes citing evidence; a weekly outside-view consultant brings industry signal; the human merges. A scribe captures decisions and runbooks from real activity. Everything self-modifying is proposals-only: it never changes the running system directly.

1.2 The principles doing real work

Files as source of truth for engineer docs; the index is a regenerable cache. Simple, diffable, no admin UI to build. Good to roughly ten contributors; past that, reconsider.
Orchestrator-worker, per-team four-worker ceiling. Above about four workers a single orchestrator cannot synthesize results coherently and error amplifies. The ceiling is per orchestrator (per "team"), under a higher global cap that guards raw machine resources only, so N independent orchestrators run N×4 workers without contending. No agent-to-agent mesh.
Telemetry from run #1, no exceptions. Every run logs agent, task, duration, tokens, real cost, tool calls, docs consulted, and outcome. Most architecture decisions are obvious from this data and impossible to get right without it.
Cognitive-debt prevention is load-bearing. The explainer maintains the human's mental model; cutting it for speed defeats the system's purpose.
Two human gates, narrow and absolute: one before merge, one on novel architecture (with the legwork done first). Permissive everywhere else.
Proposals-only for self-modifying agents. Anything that could change the platform's own prompts or knowledge produces a draft or PR; the human merges. This breaks the feedback-amplification loop where a self-applying curator mutates the prompts that generate next week's mutations.
Specialize via skills and profiles, not by splitting agents. One coder with different skills loaded beats a frontend/backend split, because most real tasks are cross-stack and splitting fragments the dispatch space.
Model and runtime as routing metadata. Prompts stay model-agnostic in text; frontmatter carries the model tier, and agentctl resolves it deterministically. This keeps a cheap escape route to other models and providers without rewriting prompts, and it is the seam the multi-provider gateway plugs into.
Harness-enforced beats prompt-hoped. Where a rule must hold against drift (the model an agent runs on, the review-integrity gate, the protect-main guard), enforce it in deterministic code, not a prompt the dispatching model is asked to remember.
Gate on real spend, not on an estimate. Most usage is flat-rate subscription; the headline API-rate number badly overstates out-of-pocket. Cost decisions gate on actual overage or metered out-of-pocket, never on the scary estimate alone.

1.3 The shape of a day

Daytime. The developer opens an interactive orchestrator session. Vague intent routes to a planner (a thinking partner) that produces an exploration; clear scope goes straight to an architect (if structural) or a coder. Workers run; reviewers run; the orchestrator surfaces a PR link with the review outcome, the real CI conclusion, and an explainer summary. The human merges.
Overnight. A scheduled launcher fires the autonomous orchestrator against one product's queues (bugs, features, refactors, an architect pass on its scheduled night). It produces ready-to-merge PRs, never asks questions (defers anything high-risk to a per-product proposals channel), and posts one consolidated morning digest with per-item action affordances. A self-contained app product runs its own scoped drain.
Continuously. A watcher keeps the index fresh; cost, activity, scorecard, code-health, and velocity producers refresh on schedule; threshold alerts push on change; a weekly security sweep, a weekly curation pass, and a weekly outside-view consultant brief run on their own jobs. All scheduled work is launchd-native.

1.4 The meta-layer (who tends the platform)

The platform does not maintain itself. A distinct steward context (a separate cockpit repo, not a product) owns the platform's health and evolution: it holds the decision log, the intake backlog, and session-to-session handoffs, and it is where platform changes get made and recorded. The steward reads the self-improvement loop's proposals and decides, implements substrate changes in their real homes, and runs the same review/revise cycle on its own PRs that every other producer runs. At solo scale the steward is the same human wearing a different hat. The system has deliberately kept one control plane: each suite gets its own scoped crew with a "say go" entry point, but there is no second orchestrator, because the worker layer already supplies the parallelism and a second control plane would only fragment the shared substrate and the tacit "why" behind cross-cutting changes.

§2 — Subsystems in detail

2.1 Knowledge base

Layout. One git repo, organized by project. Each product carries its own architecture / decisions / specs / runbooks / reference / prd / brand / profiles subtrees; a shared/ tree holds the cross-product agent roster, shared skills, and shared ADRs; an engineering/ project holds platform-engineering docs; a utility/ area holds machine setup. The canonical frontmatter schema is itself an indexed doc.

The corpus is a few hundred docs. The system started two-source (engineer docs in git plus non-engineer docs in a SaaS wiki, unified through the index) and later collapsed to git-only once the wiki's value no longer justified the sync machinery. Two-source was correct early and correctly retired.

Frontmatter schema. Every doc carries YAML: required id (mirrors the file path), kind, project, title, status (draft | current | deprecated | archived); optional topics, audience (a list of agent or reader names that profiles and the web tiers filter on), applies_to (globs into the product repo), depends_on, last_reviewed, authors, and model/runtime routing metadata. Allowed kinds are a closed set; adding a kind is itself an architectural decision because it changes retrieval semantics. The set has grown deliberately when a real gap forced it (for example, blessing the observation and proposal kinds the self-improvement crew emits, which were silently un-indexed until added to the validator, the schema, and the live indexer together).
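The required-field and closed-set checks above can be sketched as a small validator. This is a minimal illustration, not the platform's validator: the function name, the listed kinds, and the reading of "mirrors the file path" as "path minus its suffix" are all assumptions made for the example.

```python
from pathlib import PurePosixPath

REQUIRED = ("id", "kind", "project", "title", "status")
STATUSES = {"draft", "current", "deprecated", "archived"}
# Illustrative closed set only; the real list of kinds lives in the schema doc.
ALLOWED_KINDS = {"architecture", "decision", "spec", "runbook", "reference",
                 "observation", "proposal"}


def validate_frontmatter(meta: dict, rel_path: str) -> list[str]:
    """Return a list of problems; an empty list means the doc passes."""
    problems = [f"missing required field: {key}"
                for key in REQUIRED if not meta.get(key)]
    if meta.get("status") and meta["status"] not in STATUSES:
        problems.append(f"unknown status: {meta['status']}")
    if meta.get("kind") and meta["kind"] not in ALLOWED_KINDS:
        problems.append(f"unknown kind: {meta['kind']}")
    # Assumption: "mirrors the file path" means the repo-relative path
    # with its file suffix removed.
    expected_id = str(PurePosixPath(rel_path).with_suffix(""))
    if meta.get("id") and meta["id"] != expected_id:
        problems.append(f"id {meta['id']!r} does not mirror path {expected_id!r}")
    return problems
```

Returning a list of problems rather than raising on the first one lets a CI check report every defect in a doc in a single pass.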

Index and MCP. A watcher process syncs files into a SQLite database (a docs table mirroring frontmatter, an FTS full-text table kept in sync by triggers, and a reserved table for future vector search). A small read-only MCP server exposes three tools to agent sessions: kb_search(query?, kind?, project?, audience?, applies_to?, status?), kb_get(id), and kb_list_topics(project?). If the index is lost, re-scan the files. A separate CI validator mirrors the watcher's canonical frontmatter check so a malformed doc fails a PR rather than being silently logged-and-skipped at index time.
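A docs table with a trigger-synced full-text table can be sketched with Python's standard sqlite3 module. This is a hedged sketch, assuming a SQLite build with the FTS5 extension; the column names, the pk surrogate key, and the reduced kb_search signature (no audience or applies_to filters) are simplifications for illustration, and the reserved vector table is omitted.

```python
import sqlite3

SCHEMA = """
CREATE TABLE docs (
    pk INTEGER PRIMARY KEY,
    id TEXT UNIQUE NOT NULL,
    kind TEXT, project TEXT, status TEXT, title TEXT, body TEXT
);
CREATE VIRTUAL TABLE docs_fts USING fts5(
    title, body, content='docs', content_rowid='pk'
);
CREATE TRIGGER docs_ai AFTER INSERT ON docs BEGIN
    INSERT INTO docs_fts(rowid, title, body)
    VALUES (new.pk, new.title, new.body);
END;
CREATE TRIGGER docs_ad AFTER DELETE ON docs BEGIN
    INSERT INTO docs_fts(docs_fts, rowid, title, body)
    VALUES ('delete', old.pk, old.title, old.body);
END;
"""


def open_index(path=":memory:"):
    """Create a fresh index; the files remain the source of truth."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def index_doc(conn, doc):
    # Re-index by delete-then-insert so the two triggers keep FTS in sync.
    with conn:
        conn.execute("DELETE FROM docs WHERE id = ?", (doc["id"],))
        conn.execute(
            "INSERT INTO docs (id, kind, project, status, title, body) "
            "VALUES (:id, :kind, :project, :status, :title, :body)", doc)


def kb_search(conn, query=None, kind=None, project=None, status=None):
    """Return matching doc ids; every filter is optional."""
    clauses, params = [], []
    if query:
        clauses.append(
            "pk IN (SELECT rowid FROM docs_fts WHERE docs_fts MATCH ?)")
        params.append(query)
    for column, value in (("kind", kind), ("project", project),
                          ("status", status)):
        if value is not None:
            clauses.append(f"{column} = ?")
            params.append(value)
    where = f" WHERE {' AND '.join(clauses)}" if clauses else ""
    rows = conn.execute(f"SELECT id FROM docs{where} ORDER BY id", params)
    return [row[0] for row in rows]
```

Because the database is only a cache, recovery in this sketch is simply open_index followed by index_doc for every file found on a re-scan.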

Activation profiles. A profile is a YAML of include rules plus local-instruction extras. A knowledge-ctx use <profile> command detects the product from the working directory, queries the index, and materializes the matching agents and skills into the repo's local agent directory, plus a synthesized local instructions file. Architecture, decision, runbook, and reference docs are not materialized; they stay queryable through the MCP server, keeping the materialized surface tight. Each product has at least a daytime profile (the full conversational roster) and an overnight profile (the autonomous subset), deliberately non-overlapping: overnight has no planner (there is no human to think with), daytime has no scheduled orchestrator. The materialization step pre-flights against live sessions and has a read-only verify path so a guard meant to protect a live session cannot deadlock the very session that triggered it.

The human-orientation web view. A tier-gated read-only site renders the human-facing subset of the corpus for non-technical readers, contributors, and contractors, served behind SSO with verified access tokens. Tier enforcement is server-side with physically separate content bundles per tier: the contractor bundle omits company, strategy, and platform-internal docs entirely rather than hiding them client-side, and contractor-hidden routes return a plain 404 that does not reveal the route exists. Doc bodies render markdown to HTML and are sanitized before insertion, so an authored body cannot smuggle script into a full-tier viewer. The site also hosts the operator-facing dashboards (§2.11) on a full-tier-only gate. The human-orientation area and the agentic knowledge base are named distinctly (orientation versus the KB) to stop one word meaning two things.

2.2 The agent roster

The shared roster holds more than two dozen role-scoped agents and grows as cleanly-scoped needs appear; roster size is not a constraint, since idle agents cost nothing. Each prompt is written for crispness over completeness, with explicit STOP-conditions and a named escalation target. Model tier is assigned by the work: a frontier tier for hard reasoning and cross-cutting judgment, a standard tier for scoped execution, and, for several high-volume mechanical roles, economy models served through the gateway. The model an agent runs on lives in its frontmatter and is resolved deterministically by agentctl, so a tier change is a one-line edit, not a prompt rewrite.

Cluster	Agents	What the cluster does
Daytime planning	planning-orchestrator, planner	The orchestrator routes, gates, and dispatches; the planner is the thinking partner for vague intent (produces an exploration, does not code).
Feature build	architect, coder, refactorer, test-writer, reviewer	The architect proposes structure and never executes; the coder builds and owns small refactors; the refactorer does large mechanical edits per an approved proposal; the test-writer is path-scoped to test dirs; the reviewer reads diffs adversarially.
Bug fix	bug-investigator, bug-fixer	The investigator diagnoses (read-only, hard reasoning); the fixer implements against the diagnosis (mechanical, two-stage). Split because diagnosis and execution want different model tiers.
Security	security-reviewer, security-auditor	A per-PR security lens that runs parallel to the reviewer, plus a weekly posture sweep that alerts on high-severity findings.
Explanation	explainer	Cognitive-debt prevention (see §2.6); the system's highest quality bar.
Overnight	overnight-orchestrator	Autonomous counterpart to the daytime orchestrator; drains queues, runs the review cycle, never asks questions.
Platform evolution	curator, knowledge-scribe, observability-scribe, consultant	Observe telemetry, capture knowledge, propose platform changes, bring the outside view; all proposals-only.
KB-sync	knowledge-scribe (same agent as in Platform evolution; dual role), kb-sync-verifier	Propagate merged-PR facts and decisions into the knowledge base, gated by independent verification (see §2.5).
Substrate crew	platform-planner, platform-architect, platform-coder	The product-crew pattern applied to the platform's own code, under the same propose/build/review/merge-gate discipline.
Design	design-critic	UI/UX critique against a design rubric, accessibility as a first-class finding, native image input.
Product and commercial	product-analyst, product-manager, positioning-strategist, commercial-writer	Metrics and usage analysis; per-product commercial drafting and knowledge capture; company-level positioning; and the presentation craft that turns positioning into decks, one-pagers, and copy. None send outbound or commit the company.

What was deliberately not split: the coder by stack (cross-stack is the common case); the reviewer by domain (general plus security is the right granularity); the orchestrators by pipeline (two, by trigger, interactive versus scheduled, not one per workflow).

What was added beyond the obvious: the planner (vague intent had no home and fell to the router or forced itself into the architect's shape too early); the platform-evolution agents (without them prompts ossify and telemetry goes unread); a dedicated security lens; a substrate crew so the platform can be evolved by agents under the same gates as product work; an outside-view consultant so practice does not compound only on its own telemetry; and a small commercial cluster so the company's outward face stays honest to the products.

Repo-local crews. Self-contained app products carry their own crews (prefixed per app) inside the app repo rather than in the shared index. The dispatch CLI resolves a registered local-agent repo's crew from its own agent directory, skipping the shared index, so a lab product gets full dispatch, telemetry, and a native review cycle while its content knowledge stays local. A shared-infra crew owns the cross-app runtime kit. The discipline against adding agents reflexively still holds: each new shared agent fragments the orchestrator's dispatch space and needs a recorded decision, and a new skill usually solves the same need without that cost.

2.3 Orchestration

Daytime. The interactive orchestrator is a router, not a thinking partner, and that separation keeps its dispatch judgment sharp. Its core decision is "vague intent → planner" versus "clear scope → architect or coder." Its entire interaction surface with the human collapses to four shapes: a decision request (with the analysis done and a recommendation), a collaborative review of a finished artifact, a merge ratification (review outcome plus the real CI conclusion plus the PR URL on its own last line, framed as an outcome and not a nudge to merge), and a status note. Anything that does not fit one of the four is platform-internal noise the orchestrator absorbs rather than surfaces; ADR numbering, dispatch logistics, and bookkeeping never reach the human.

Overnight. A scheduled launcher fires the autonomous orchestrator at a fixed hour, always against one product, with the mode selected by day of week. It runs a fixed phase order: resolve prior pending-approval items by reading chat reactions, sweep and drain the bug queue, drain a feature queue (opt-in by a human reaction), drain the product queue, run the KB-sync phase, and on its scheduled nights add an architect pass whose agent-tractable proposals it executes the same run. Every code-touching change goes through the full review cycle to ready-to-merge. It never asks questions; anything in the high-risk set defers with a written approval request to the product's proposals channel and a queue marker the next night's phase 0 reads. The cadence settled, after several revisions, on one product per night with an architect pass coupled to that product's own execution capacity, replacing an earlier three-architect Sunday sweep that contended for the worker cap and left findings unexecuted. A self-contained app product runs its own scoped drain on a separate schedule.

Resumability. The orchestrator is a single long-horizon session, so a mid-wave death used to abandon the night's remaining work. A per-product phase manifest fixes this: the orchestrator records each completed phase, and a thin resume watchdog re-spawns a dead-but-incomplete wave from the first unfinished phase, bounded by a resume cap and gated by the budget. The manifest is fail-safe by construction: a missing or corrupt manifest reads as "no resumable wave," so it can only ever enable resume, never block a normal run. Intra-phase idempotency stays with the existing guards (freshness markers, reaction state, queue-on-completion, skip-already-merged-PRs).
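The fail-safe read described above can be sketched as follows. The manifest layout (a JSON object with completed and resumes keys), the phase names, and the default resume cap are hypothetical; only the behavior, that a missing or corrupt manifest yields "nothing to resume", comes from the text.

```python
import json
from pathlib import Path

# Hypothetical phase names, loosely following the overnight phase order.
PHASES = ["resolve-pending", "bugs", "features", "product-queue", "kb-sync"]


def next_resume_phase(manifest_path, phases=PHASES, resume_cap=2):
    """Return the first unfinished phase, or None if nothing is resumable."""
    try:
        manifest = json.loads(Path(manifest_path).read_text())
        completed = set(manifest["completed"])
        resumes = int(manifest.get("resumes", 0))
    except (OSError, ValueError, KeyError, TypeError, AttributeError):
        # Missing or corrupt manifest reads as "no resumable wave".
        return None
    if resumes >= resume_cap:
        return None  # bounded: stop re-spawning after the cap
    for phase in phases:
        if phase not in completed:
            return phase
    return None  # every phase finished; the wave is complete
```

A watchdog built on this would still consult the budget gate before re-spawning; that check is left out here.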

Concurrency in practice. The four-worker ceiling is a coordination limit: one orchestrator cannot usefully coordinate more than about four workers. It is therefore enforced per team (workers grouped by their orchestrator) under a higher global ceiling that guards raw resources only. A reviewer plus a security-reviewer in parallel consume two slots; the worker producing the diff is a third; a test-writer is the fourth. Beyond that, serialize. Two independent orchestrators each get their own pool of four. The number is a coherence limit, separate from the cost limit (the budget); they bind independently.
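The two-level check can be sketched in a few lines. The per-team value of four comes from the text; the global cap of twelve and the function name are placeholders chosen for the example.

```python
from collections import Counter

TEAM_CEILING = 4      # coordination limit per orchestrator (from the text)
GLOBAL_CEILING = 12   # hypothetical raw-resource cap


def can_spawn(active_teams, team,
              team_ceiling=TEAM_CEILING, global_ceiling=GLOBAL_CEILING):
    """active_teams holds one team id per currently running worker."""
    per_team = Counter(active_teams)
    return (per_team[team] < team_ceiling
            and len(active_teams) < global_ceiling)
```

With four daytime workers and two overnight workers running, the daytime team is refused while the overnight team can still spawn, which is the "own pool of four" behavior.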

2.4 Daytime workflow and the developer's role

The developer works through a thin launcher (co) that starts an interactive session already wearing the orchestrator's system prompt; shortcut forms jump straight to a named specialist, and a model flag can route a whole session onto a gateway model. A bare REPL stays available for casual questions, deliberately, so the system does not force every "what does this regex do" through an orchestrator. At launch the wrapper surfaces the subscription auth-token life (warning when low, since a long-running detached burn can outlive an unrefreshed token) and runs an auth and knowledge-pull preflight.

Per-product visual and behavioral guardrails keep multi-product work safe. Each product's session is color-themed so the developer can see which product they are in, and the agent prompts carry a cross-product guardrail that pauses and confirms if a request looks aimed at the wrong product. The colors ride to the terminal over both the standard and the roaming-resilient transports (§2.13). This only matters once there are multiple products in one terminal, and it is cheap to add and prevents a whole class of accident.

2.5 The knowledge-sync pipeline

Knowledge can fall out of date silently, and a doc is the highest-leverage surface there is, since every agent trusts it. The pipeline keeps docs current without asking the human to QA translations of work they already approved.

Factual-sync. When a merged, already-reviewed PR changes facts an existing doc records (a new catalog row, a status flip, a shipped-feature note), the knowledge-scribe drafts an additive patch. An independent verifier re-derives every claim from the actual diff. On a clean pass the patch applies autonomously through a deterministic "cage" (a command that enforces structural guards: target exists, anchors match uniquely, PR is merged, then commits). On failure it stages for the human. The independence is the point: the scribe self-verifying is self-assessment, while a second agent re-deriving from ground truth is the real gate.
Decision-sync and in-session storage. The same machinery, extended. A new decision record whose decision is embodied in a merged PR is stored autonomously after the verifier confirms it faithfully reflects the PR and invents no rationale the PR does not support. A decision the human agreed to live, during a session, is stored without a second "should I save this?" prompt. The governing principle is that the human gate is on the decision, and it is satisfied once; re-asking permission to record an already-approved decision is asking for the same approval twice. Only genuinely un-approved generative proposals still stage.

The scribe never commits directly; it always goes through the cage, which holds the apply authority and the audit trail.
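The cage's structural guards can be sketched as a pure function over an in-memory mapping of docs. The names (AdditivePatch, CageRejected, apply_in_cage) and the line-anchored patch shape are assumptions for illustration; the real command also commits and records an audit trail, which this sketch omits.

```python
from dataclasses import dataclass


class CageRejected(Exception):
    """A structural guard failed; the patch should stage for the human."""


@dataclass(frozen=True)
class AdditivePatch:
    target: str    # doc id
    anchor: str    # existing line the addition is placed after
    addition: str  # new line to insert


def apply_in_cage(docs: dict, patch: AdditivePatch, pr_merged: bool) -> dict:
    """Return a new docs mapping with the patch applied, or raise."""
    if not pr_merged:
        raise CageRejected("PR is not merged")
    if patch.target not in docs:
        raise CageRejected("target doc does not exist")
    lines = docs[patch.target].splitlines()
    hits = [i for i, line in enumerate(lines) if line == patch.anchor]
    if len(hits) != 1:
        raise CageRejected(f"anchor matched {len(hits)} times, expected 1")
    at = hits[0] + 1
    patched = lines[:at] + [patch.addition] + lines[at:]
    # Additive only: every original line survives, in order.
    return {**docs, patch.target: "\n".join(patched) + "\n"}
```

The caller is expected to invoke this only after the independent verifier passes; keeping that decision outside the cage mirrors the separation between verification and apply authority.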

2.6 Cognitive-debt prevention (the explainer)

One agent exists solely to keep the developer's mental model current as agents change the codebase. Three modes: per-run (after any human-readable artifact, it writes a canonical summary into the run directory), digest (replaces per-run for batched flows like the overnight run, producing a synthesis rather than a list of eight PRs by title), and conversational (the developer replies in-thread and the explainer re-engages with that run's artifacts).

Its writing rules are the spec: lead with the surprising thing (what the developer would want in thirty seconds); compare result to original intent (divergence is the most valuable thing to surface); quote the actual change rather than gesturing at it; be honest about confidence; and end with one or two specific things to verify. A separate verification gate, added after a hallucination incident, requires every code-specific claim (file paths, symbol names, mechanisms) to be grounded against the real diff before posting. This was the single largest hallucination vector, and grounding it was the fix.

The explainer no longer posts a per-run firehose to chat. A per-merge changelog post carries every merged PR with its link and lightweight metrics; a decision-record channel carries committed decision records via a git hook; the per-run summary the developer reads is the orchestrator's in-conversation note plus that changelog post. Only digest mode and conversational mode still post to chat.

2.7 The two human gates

Merge gate. No agent merges, ever. Every autonomous PR already has a general-reviewer and a security-reviewer signoff plus green CI; the merge gate is the additional human read, the irreducible safeguard against an irreversible autonomous mistake (a subtly-broken fix, an unflagged dependency change). It is binary; an "auto-merge trusted categories" carve-out was considered and rejected, because carve-outs erode the principle. The gate is reinforced by deterministic machinery: a protect-main hook denies a direct push to main from any dispatched-agent session, and the review-integrity stamp gate (§2.10) makes the "is this PR actually approved at its current head over green CI" check a code decision rather than a prompt convention.

Architectural-consultation gate. The human is consulted on new patterns, conventions, and dependencies, with the legwork done first. If consultation looks like "answer a half-formed question every twenty minutes," the human becomes the bottleneck and the leverage evaporates; if it looks like "read a finished proposal once a day, decide go or no-go," the system compounds. Autonomous modes can execute deeply on a novel pattern and present a concrete proposal plus implementation for review, which beats reviewing an abstraction. A soft cap applies: if the right implementation would balloon past the sweep budget, the agent surfaces the proposal first rather than building it.

Everything else runs without pausing: editing files, running tests, posting review comments, creating branches, and proposing (not executing) gated commands. Over-pausing is its own failure mode.

A shared high-risk halt set binds every execution agent. It covers database schema changes; code touching auth, session, billing, payments, or PII; a new or upgraded external dependency; deleting or renaming a public API; a reviewer "novel pattern" flag; files under deploy, infra, CI, or migration paths; and any force-push or history rewrite. An execution agent that hits one does not commit; it writes a pending-high-risk note (trigger plus exact files and lines plus proposed approach) and surfaces to its dispatcher for human approval. Prior committed chunks stay; later steps pause.
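The path-detectable part of the halt set can be sketched as a glob classifier. Every glob below is a hypothetical stand-in, not the platform's actual list, and triggers that cannot be seen from paths alone (such as deleting a public API) are not covered by this sketch.

```python
from fnmatch import fnmatchcase

# Hypothetical globs for illustration; '*' also matches '/' in fnmatch.
HIGH_RISK_GLOBS = {
    "schema or migration": ["*migrations/*", "*.sql"],
    "auth/session/billing/payments": ["*auth*", "*session*", "*billing*",
                                      "*payment*"],
    "dependency manifest": ["*package.json", "*requirements*.txt",
                            "*go.mod", "*Cargo.toml"],
    "deploy/infra/CI": ["deploy/*", "infra/*", ".github/workflows/*"],
}


def high_risk_triggers(changed_paths, reviewer_flagged_novel=False,
                       rewrites_history=False):
    """Return (trigger, files) pairs; an empty list means no halt."""
    triggers = []
    for label, globs in HIGH_RISK_GLOBS.items():
        hits = sorted(path for path in changed_paths
                      if any(fnmatchcase(path, glob) for glob in globs))
        if hits:
            triggers.append((label, hits))
    if reviewer_flagged_novel:
        triggers.append(("reviewer novel-pattern flag", []))
    if rewrites_history:
        triggers.append(("force-push or history rewrite", []))
    return triggers
```

An execution agent would call this before committing and, on any non-empty result, write the pending-high-risk note instead of the commit.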

2.8 Cost, model tiering, and the budget model

Telemetry underwrites this. Every run records real token usage and cost by model. For gateway-served models the real per-request dollar figure comes from the gateway's spend log and lands in the same per-run cost file the budget reads, so the caps inherit real cost with no change; a run that loses its spend record falls back to the in-flight estimate and is flagged, never silently written as zero.

Two cost numbers, kept distinct. Most usage runs on a flat-rate subscription, so the API-rate estimate (every model's usage valued at list price) badly overstates real spend; the billed number is the actual out-of-pocket, which is the metered gateway and direct-API lane only, with subscription usage at zero against the per-model data. The dashboards show both side by side, and in practice billed is a low single-digit percentage of the API-rate headline. The governing rule: do not trigger a cost squeeze off the headline estimate; gate it on actual subscription overage or real out-of-pocket. A visibility warning on the estimate is fine and is what the cost report already does.
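The difference between the two numbers is easiest to see in code. This is a sketch under assumed names (RunCost, cost_summary) and assumed lane labels; any dollar figures passed in are the caller's own, not measured values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunCost:
    lane: str              # "subscription", "gateway", or "direct-api"
    list_price_usd: float  # usage valued at API list price
    metered_usd: float     # real dollars charged for this run


def cost_summary(runs):
    """Compute the headline estimate and the real out-of-pocket figure."""
    api_rate = sum(run.list_price_usd for run in runs)
    # Subscription usage is flat-rate, so it contributes zero to billed.
    billed = sum(run.metered_usd for run in runs
                 if run.lane != "subscription")
    share = billed / api_rate if api_rate else 0.0
    return {"api_rate_estimate": api_rate, "billed": billed,
            "billed_share": share}
```

A cost squeeze would key off billed (or subscription overage), while api_rate_estimate feeds only the visibility warning.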

Model tiering follows one rule: spend the expensive frontier model where a human is in the loop or the role is high-leverage and low-frequency, and keep the high-volume autonomous path on cheaper tiers. Cost is roughly capability-tier times token-volume, and the volume hog is the unattended overnight run, not human-paced interactive work. In practice the interactive sessions run on the frontier tier (the human is there, throughput is modest); several mechanical high-volume agents run on economy gateway models behind the every-PR review and green-CI net; one high-leverage low-frequency reasoning role (the architect) stays on the frontier tier even autonomously. The same logic governs dispatched subagents: the dispatching model is human-in-loop, but its delegated subagents are not, so they default to a cheaper tier. Because that default drifted as a prompt instruction, the model is now decided in the harness: roster dispatch through agentctl resolves the agent's frontmatter model deterministically, with a logged per-dispatch override and an abuse monitor that alerts when interactive overrides exceed a daily threshold.
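Deciding the model in the harness rather than in a prompt might look like the sketch below. The function names, the frontmatter key model, and the shape of the override log are assumptions; the text only establishes that resolution is deterministic, overrides are logged per dispatch, and interactive overrides are monitored against a daily threshold.

```python
def resolve_model(frontmatter, override=None, origin="scheduled",
                  override_log=None):
    """Return the model to run; the agent's frontmatter is the default."""
    model = frontmatter["model"]  # a missing key fails loudly by design
    if override and override != model:
        if override_log is not None:
            override_log.append(
                {"origin": origin, "from": model, "to": override})
        return override
    return model


def overrides_exceed_threshold(override_log, daily_threshold):
    """True when interactive overrides pass the day's alert threshold."""
    interactive = [entry for entry in override_log
                   if entry["origin"] == "interactive"]
    return len(interactive) > daily_threshold
```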

The budget gate sees in-flight plus completed spend with soft and hard caps. The soft cap is a graceful wind-down the orchestrator owns (stop opening new pipelines, finish what is in flight); the hard cap is a deterministic floor that refuses new spawns. Caps gate on scheduled-origin spend only: every spawn is stamped with its origin (scheduled, interactive, or a deliberate-burn origin), the scheduled launchers export the scheduled tag and it propagates to descendants, and interactive spend is reported but never capped because it is the developer's deliberate, in-the-loop work. The budget day is keyed to local time so an evening's interactive work does not land in the next day's overnight pool. Overnight caps sit well above the heaviest normal wave; interactive is uncapped with a subscription-estimate tripwire that warns on each new spend step; a per-day override file (scoped to the date directory so it auto-expires) raises caps for an authorized burn. A separate per-run circuit breaker caps any single run on dollars or, for an unpriced gateway model whose spend would otherwise log as zero, on a total-token backstop, so a runaway loop is bounded even when pricing is not yet pinned.
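The cap logic above can be sketched as two small functions: a day-level gate that counts only scheduled-origin spend, and a per-run breaker with a token backstop. The names, the origin labels, and the three return values are illustrative assumptions; cap values are supplied by the caller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Spend:
    origin: str            # "scheduled", "interactive", or "burn"
    usd: float
    in_flight: bool = False  # in-flight and completed spend both count


def budget_gate(spends, soft_cap, hard_cap):
    """Return 'ok', 'wind-down', or 'refuse' for a new scheduled spawn."""
    scheduled = sum(s.usd for s in spends if s.origin == "scheduled")
    if scheduled >= hard_cap:
        return "refuse"      # deterministic floor: no new spawns
    if scheduled >= soft_cap:
        return "wind-down"   # finish in-flight work, open nothing new
    return "ok"


def run_breaker_tripped(run_usd, run_tokens, priced, usd_cap, token_cap):
    """Per-run breaker: dollars when priced, total tokens otherwise."""
    if priced:
        return run_usd >= usd_cap
    return run_tokens >= token_cap
```

Interactive and burn spend pass through the gate uncounted, matching the rule that interactive work is reported but never capped.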

2.9 The multi-provider model gateway

Non-frontier-vendor agents run through the same universal runner, routed to a local proxy via an environment base-URL plus a per-consumer key; frontier-vendor agents get no such variable and go direct, byte-identical to the single-provider path, which is what keeps the blast radius near zero. The proxy is the source of truth for real per-provider dollars: the dispatch CLI tags each gateway request with the run-id, the proxy logs the real per-request cost to a per-run spend log, and run finalization sources the dollar figure from that log. The proxy runs as a local-only launchd job from a dedicated virtual environment so its heavy dependency tree never touches the platform's own libraries, and provider keys live in the OS keychain, never in the tracked config.

The gateway has carried a rotating set of third-party economy models, added and retired on cost-versus-quality evidence from the scorecard. Several reliability lessons are baked in: a new provider prefix must be allowlisted so an unknown model fails closed to direct rather than silently to a proxy that lacks it; pinned per-model pricing (including the discounted cache-read tier) overrides the proxy's price map so cost is correct regardless of the echoed model name; the spend callback recomputes cost when the proxy reports zero for an unpriced model; and adding a model with a new provider must be verified through a real spawn (which exercises the billing-source validation) and not only a direct proxy call. Trial models run with no harness fallback, so a trial failure surfaces in the scorecard instead of silently respawning on the baseline and contaminating the comparison.
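Two of the lessons above, the fail-closed allowlist and pinned pricing, can be sketched together. The provider prefixes, model name, prices, and proxy URL are all invented for the example, and the function names are not the platform's.

```python
# Hypothetical allowlist and pinned prices (USD per million tokens).
ALLOWED_PREFIXES = ("econ-a/", "econ-b/")
PINNED_PRICES = {
    "econ-a/small": {"input": 0.50, "cache_read": 0.05, "output": 1.50},
}


def route(model, proxy_url="http://127.0.0.1:4000"):
    """Return the proxy base URL, or None to go direct.

    A model whose prefix is not allowlisted fails closed to direct
    rather than being sent to a proxy that may not know it.
    """
    if model.startswith(ALLOWED_PREFIXES):
        return proxy_url
    return None


def request_cost(model, input_tokens, cache_read_tokens, output_tokens,
                 proxy_reported_usd):
    """Return (usd, priced); pinned pricing overrides the proxy figure."""
    price = PINNED_PRICES.get(model)
    if price is None:
        # No pin: trust the proxy, but flag a zero as unpriced so a
        # token backstop can take over.
        return proxy_reported_usd, proxy_reported_usd > 0
    usd = (input_tokens * price["input"]
           + cache_read_tokens * price["cache_read"]
           + output_tokens * price["output"]) / 1_000_000
    return usd, True
```

Pricing cache reads as their own tier matters because a cache-heavy run billed at the full input rate would be overstated.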

2.10 Review integrity (the stamped merge gate)

The merge decision must not depend on an orchestrator remembering "re-review every revision, approve only over the current head with green CI." Three deterministic layers make it code:

Commit pin. A review-class worktree is pinned to an exact commit so a reviewer cannot read a stale tree, and the pinned commit is set by the harness, not typed by the agent.
Stamped verdict. A reviewer posts its verdict through a command that stamps the reviewed commit and the run-id from the harness environment; the stamp is machine-parseable and the agent cannot claim a commit the harness did not pin.
Deterministic gate. A pure gate passes only if, for every required reviewer, the latest stamp is an approval whose reviewed-commit equals the PR's current head, and CI is green. A reviewer whose approval sits at an older commit is stale and fails, which is the deterministic form of "re-review every revision."
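The gate's rule is small enough to sketch as a pure function. This is a minimal illustration, not the system's code: the Stamp fields, the verdict strings, and the seq ordering key are assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stamp:
    reviewer: str  # which required reviewer posted it
    commit: str    # the commit the harness pinned for that review
    verdict: str   # "approve" or "request_changes" (hypothetical values)
    seq: int       # posting order; higher is later


def gate_passes(stamps, required, head: str, ci_green: bool) -> bool:
    """Pass only if every required reviewer's latest stamp approves the current head."""
    if not ci_green:
        return False
    latest = {}
    for stamp in sorted(stamps, key=lambda s: s.seq):
        latest[stamp.reviewer] = stamp  # later stamps overwrite earlier ones
    return all(
        name in latest
        and latest[name].verdict == "approve"
        and latest[name].commit == head
        for name in required
    )
```

Because the function reads only its arguments, the same inputs always give the same answer, and an approval at an older commit fails without anyone having to remember to re-review.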

Each stamp is further authenticated against the dispatch run it claims: the run-id must resolve to a real review-class run in the same repo whose pinned commit matches the stamp, which defeats a worker stamping its own run, a fabricated run-id, an agent claiming a reviewer it was not, and a real review replayed onto a newer head. The honest residual is forgery under bypassed permissions, so the complementary structural fix denies the PR-review surface (the comment, review, merge, and stamp-emitting verbs) to every non-review agent at spawn, raising forgery from "type a comment" to "fabricate an internally consistent run directory" and closing the casual paths. The human-supervised steward path, whose Agent-tool reviewers have no run directory, is admitted explicitly and only when a stamp carries no run-id at all, never a wrong one.
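The authentication step can be sketched as a lookup against the run registry. Everything named here is hypothetical (the dictionary keys, the "review" class label, the allow_steward flag); the sketch only shows how each listed attack fails a different comparison.

```python
def authenticate(stamp: dict, runs: dict, repo: str, allow_steward: bool = False) -> bool:
    """Check a stamp against the dispatch run it claims.

    stamp: {"reviewer": ..., "commit": ..., optionally "run_id": ...}
    runs:  run_id -> {"repo", "run_class", "agent", "pinned_commit"}
    """
    run_id = stamp.get("run_id")
    if run_id is None:
        # Steward path: admitted only when explicitly allowed and only when
        # the stamp carries no run-id at all.
        return allow_steward
    run = runs.get(run_id)
    if run is None:
        return False  # fabricated run-id
    return (
        run["repo"] == repo                          # same repo
        and run["run_class"] == "review"             # not a worker's own run
        and run["agent"] == stamp["reviewer"]        # not claiming another reviewer
        and run["pinned_commit"] == stamp["commit"]  # not replayed onto a newer head
    )
```

A wrong run-id is rejected even when allow_steward is set, since the steward branch is reached only when the field is absent.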

Reviewers self-post their findings to the PR thread, which is the durable review record; chat gets a one-line verdict-plus-link pointer, never the full review dump.

2.11 Telemetry, scorecards, and dashboards

Every dispatched run leaves a run directory (meta, cost, outcome). Two dispatch mechanisms coexist (the dispatch CLI and the harness's in-session subagent tool), unified onto one telemetry schema: an in-session subagent emits the same run-directory shape in real time via a stop-hook, and a session's own main-loop cost is recovered from its transcript, so a session's full cost and dispatch tree can be joined. A work-unit label rides alongside the full task text so sub-PR and non-PR work is attributable.

A set of producers refreshes on schedule, and a set of self-contained pages renders their output on the full-tier web site:

Spending and cost-performance pages, each with its own date picker, the first showing total spend across all sessions, the second the dev-crew cost-versus-quality scorecard; both show API-rate and billed side by side.
A scorecard harvester that joins run to PR to merge outcome, anchored on the merge as the north-star success signal, with a delayed look-back that turns "merged" into "merged and survived" (no revert or follow-up bug within a week) and a sampled LLM-judge for the dimensions objective data cannot see. A switch-aware model-gate alerts when a currently-configured cheap model underperforms its premium baseline, ignoring a retired model still inside the trailing window.
A code-health page (CI pass-rate, open-PR age, merges and reverts, coverage read from a CI-emitted artifact, test counts) and a velocity page (throughput, volume, stability, decisions, bucketed by product and an engineering-internal bucket).
A cost-per-task view: a session-activity log with each session's dispatch tree and total cost, plus cost-per-outcome (cost per opened PR across its chain), with a per-run exclude toggle so one runaway does not condemn an otherwise-good model.
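The delayed look-back that turns "merged" into "merged and survived" can be sketched as a small classifier. The outcome labels and the bad_events argument are assumptions made for illustration; the one-week window comes from the description above.

```python
from datetime import datetime, timedelta

SURVIVAL_WINDOW = timedelta(days=7)


def outcome(merged_at, bad_events, now: datetime) -> str:
    """Classify a PR for the scorecard.

    merged_at:  merge time, or None if the PR never merged
    bad_events: times of reverts or follow-up bugs linked to the PR
    """
    if merged_at is None:
        return "not_merged"
    deadline = merged_at + SURVIVAL_WINDOW
    if any(merged_at <= t <= deadline for t in bad_events):
        return "merged_failed"
    # No bad event yet: survival is only confirmed once the window has closed.
    return "merged_survived" if now >= deadline else "merged_pending"
```

The pending state is the reason the look-back is delayed: a merge from yesterday has not yet had time to be reverted.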

Edge-triggered alerts read the same data and post to dedicated channels when a metric goes bad, once when a breach starts and once when it clears, rather than re-alerting daily. The producers write JSON snapshots that the pages embed and slice client-side, chosen over a standing BI service because the unified data model is the durable investment and the viewer is the cheap, swappable part. The full-tier dashboards render live per request from their data; only the static index bundles need a rebuild.
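Edge-triggering is easy to get wrong as level-triggering, so a minimal sketch may help. The function name, the message text, and the in-memory state dictionary are hypothetical; a real evaluator would persist the state between scheduled runs.

```python
def edge_alerts(metric: str, breached: bool, state: dict) -> list:
    """Return the messages to post for this evaluation.

    state maps metric name -> whether it was breached at the last evaluation.
    A message is produced only when that flag changes.
    """
    was_breached = state.get(metric, False)
    state[metric] = breached
    if breached and not was_breached:
        return [f"{metric}: breach started"]
    if was_breached and not breached:
        return [f"{metric}: breach cleared"]
    return []
```

Evaluating a metric that stays bad for five consecutive days yields one message on the first day and none on the next four, then one more when it recovers.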

2.12 Worktree isolation

Dispatched workers run in dedicated git worktrees off the live primary tree, so concurrent workers never collide on the working copy and a worker's dirty tree never contaminates a review or a render. Workers commit on a branch and author a PR body to a per-worktree gitdir file (never committed); they never push or open PRs, and the orchestrator publishes to origin deterministically and opens the PR. A scheduled cleanup reaps merged and clean worktrees daily, conservatively skipping anything dirty, in-use by an active run, or ahead-of-base without a merged PR. Worktree pileup was a real operational hazard (orphans accreted to the hundreds during an incident and their stacked in-flight cost estimates briefly false-tripped the budget cap), which is why the cleanup is now scheduled rather than ad hoc.
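The cleanup's skip rules can be written as a single predicate. This is a sketch under assumptions: the Worktree fields are hypothetical, and how each one is derived (git status, the run registry, the PR state) is left out. It also assumes a clean, idle worktree with no commits ahead of base is safe to reap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Worktree:
    dirty: bool          # uncommitted changes present
    in_use: bool         # an active run still owns it
    ahead_of_base: bool  # has commits the base branch lacks
    pr_merged: bool      # its PR has been merged


def should_reap(wt: Worktree) -> bool:
    """Conservative: any doubt about losing work means skip."""
    if wt.dirty or wt.in_use:
        return False
    if wt.ahead_of_base and not wt.pr_merged:
        return False  # unmerged commits would be lost
    return True
```

Each skip condition protects a different kind of work: uncommitted edits, a running agent, and committed but unmerged changes.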

2.13 Notifications and host topology

A chat workspace is the asynchronous surface. Conventions that proved durable:

Bug intake by emoji-state: one reaction claims a report and another marks a fix posted, which survives bot restarts, is idempotent, and is visible to all.
Feature intake opt-in by a human reaction, since those channels also carry half-ideas.
Per-product bug, proposals, and feature channels, plus a per-product changelog type for the lab products.
One consolidated overnight digest in a dedicated channel with per-item action affordances (what to do, the exact mechanic, and a clickable deep-link), rather than N per-run posts.
Approvals resolved by emoji and thread on the proposals message.
Path and URL linkification at the transport layer, so every caller gets reliable rendering and click-to-open.

All agent posts go through one bot abstraction.

The system runs on two machines: a headless always-on server (runs the agents, the knowledge index, the scheduled jobs, the gateway, and the bot) and the developer's workstation (terminal and editor, connecting in). This split is invisible most of the time but is the source of a surprising share of operational bugs (see §3.2): anything touching the workstation's local environment (terminal styling, GUI app config, PATH) cannot be done from the server, and anything the server runs must not assume the workstation's richer environment. Roaming links wedge a long-lived TCP connection on a NAT rebind, freezing the terminal input bar; a UDP state-sync transport that resumes across rebinds fixed it, with per-product colors emitted locally before launch because that transport drops the proprietary profile escape.

Remote control. Sessions launch under the harness's remote-control mode, so the developer can drive a session from a phone. A rotation command writes a handoff, spawns a fresh session in a new terminal window to pick that handoff up (the new session registers as its own entry in the mobile session list), and leaves the old session for the developer to close. It never kills the old session itself, so a silent spawn failure cannot strand the developer.

2.14 Design rooms

A live, multi-agent design discussion in a chat channel: the developer plus a read-only panel (product, architect, planner, design-critic) talk through a feature or UI change. It is notable because the platform has no native "several agents in a room" primitive; an agent is a one-shot process. The room constructs a roundtable on top of a shared channel and floor-controlled turn-taking.

The mechanism. A moderator loop runs the floor: poll the channel, let a cheap turn-selector (a small-model call) pick who speaks next or yield to the human, invoke that one panelist with the recent transcript, post its reply under its name, and repeat, allowing a bounded run of consecutive agent turns so panelists react to each other before handing back. One voice at a time, deliberately not a simultaneous free-for-all, which loops and spirals cost. Panelists are read-only and run on the frontier tier (generative, human-in-the-loop, low-volume, exactly where the tiering rule says to spend). Per-session and per-day cost ceilings cap it, and the room self-closes on idle or a "wrap" command. Because a closed room stops watching the channel, a separate always-on listener revives it (fresh budget, catching up on anything said while it was closed) when the human posts again. A "write it up" command runs a full architect pass over the transcript to author a formal proposal document, then posts the link and closes the room; that proposal is the handoff artifact.
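One floor cycle of the moderator can be sketched with the model calls injected as plain functions. The names run_floor, pick_speaker, and panelists are hypothetical, as is the (reply, cost) return shape; polling the channel, posting under a name, idle close, and revival are omitted.

```python
def run_floor(transcript, pick_speaker, panelists, max_agent_turns=3, budget=1.0):
    """Run one bounded burst of agent turns, then hand back to the human.

    transcript:   list of (speaker, text) tuples, mutated in place
    pick_speaker: cheap selector, transcript -> panelist name or "human"
    panelists:    name -> callable(transcript) returning (reply, cost_in_dollars)
    Returns (dollars_spent, agent_turns_taken).
    """
    spent = 0.0
    turns = 0
    while turns < max_agent_turns and spent < budget:
        name = pick_speaker(transcript)
        if name not in panelists:
            break  # "human" or anything unrecognized yields the floor
        reply, cost = panelists[name](transcript)  # one voice at a time
        transcript.append((name, reply))
        spent += cost
        turns += 1
    return spent, turns
```

The turn bound lets panelists react to each other without the loop running away. Because the budget is checked before each call, spend can overshoot the ceiling by at most one turn.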

2.15 Lab products and the suite tiers

The line between "product" and "internal tool" has been deliberately dissolved. A lab product is a first-class agentic deployment (its own crew, dispatch through the CLI, run-directory telemetry, a native review cycle, and dashboard coverage) whose content knowledge stays in-repo and local. This is distinct from the substrate-KB-project pattern, where the knowledge lives in the shared index. The reasoning is that the actual work of an agentic software lab is learning to build product agentically, and every product brought into the lab compounds that learning, so there is no reason to hold a lab product outside the first-class apparatus. The one durable carve-out is data locality: personal and app-runtime content stays local, while metadata and cost numbers centralize.

Three tiers now sit under one control plane:

Substrate (the platform itself), evolved by the platform crew.
Shared infrastructure, a versioned app-runtime kit (the LLM seam, the spend ledger, a runtime cost guard, a trust-boundary redaction framework, a device-level location capability) consumed by the apps via a pinned package, owned by a single shared-infra crew. Its cost guard is a deliberate peer of the substrate budget gate, on a different plane with zero shared imports or stores, enforced by non-dependency tests, so the two can never fuse into a deadlock.
Lab products, the per-domain apps, each with a private KB and a redacting projection API; cross-app data crosses only over that authenticated seam, never a shared store.

2.16 Subscription and scheduled-work operations

The frontier-vendor account is a shared OS-keychain credential that every session reads. The system supports two interchangeable subscription accounts, switched machine-wide by a small CLI that swaps that one keychain item; the keychain token is authoritative for billing, while the vendor's status display lags a switch until a full re-login. A headroom-aware automatic switch is not possible because subscription rate-limit headroom is not queryable, so the switch is manual. The access token is short-lived and auto-refreshes in interactive use, but a detached overnight burn cannot count on that refresh, so it surfaces token life at launch and fails fast and loud on an auth death rather than dead-polling. When the vendor's usage window resets early, an all-products "full-monty" driver can route the whole platform onto the subscription for that window and drain everything to ready-to-merge PRs, resuming across usage-limit pauses via the wave manifest, under a raised capped pool.

All scheduled work is launchd-native (the last cloud-scheduled routine was migrated off): the nightly overnight orchestrator and its per-app counterpart, the morning health-check, the weekly curation and consulting jobs, the cost, activity, scorecard, velocity, and code-health producers, the metric-alert evaluator, the gateway, the design-room listeners, the worktree cleanup, the AI-briefing egress, and a biweekly reminder to refresh this very document. The platform's binaries are symlinks into the platform repo's working tree, so a feature-branch checkout makes un-merged code live to those jobs; the repo is kept on its main branch and reviewed via the PR diff, never by checking out the branch.

§3 — Reasoning, evolution, and gotchas

3.1 What worked, and what needed refining

The decision log (a numbered, append-only ADR series, well past two hundred entries) is itself one of the most load-bearing artifacts: future contributors, human or agent, read why before changing anything. The honest pattern across those entries:

Got it right early, held: files-as-source-of-truth; the four-worker ceiling (later generalized to per-team, same coherence rationale); telemetry-from-day-one; proposals-only for self-modifying agents; the two narrow human gates; specialize-via-skills; model-and-runtime as routing metadata.
Right idea, refined in practice:
Overnight cadence churned several times, from every-night-both-products to every-other-night to one-product-per-night with an architect pass coupled to each product's own execution capacity, plus a self-contained app's own drain. Schedule shape is empirical; expect to re-tune it.
Knowledge auto-apply went from scribe-self-verifies to independent-verifier-gated, then grew an in-session and decision-sync autonomy path once the "gate is on the decision, satisfied once" principle was articulated. The independence and the principle both came from use, not the whiteboard.
Model strategy went from "all reasoning agents on the frontier model" to a tiered split, then to a multi-provider gateway running mechanical agents on economy models behind the review net, with the cost telemetry and the scorecard turning each swap from a hope into a tracked line.
Review integrity went from a free-text PR comment plus a prompt convention to a harness-stamped, run-authenticated, deterministic merge gate, after a stale approval nearly let a merge through on convention alone.
The cost story went from one API-rate headline to an explicit API-rate-versus-billed split, after the headline kept prompting squeezes that the real out-of-pocket did not justify.
Cut for a reason: auto-merge carve-outs (erode the binary gate); a provider-neutral framework rewrite (loses native DX for portability that may never be exercised, where routing metadata plus the gateway is the lighter hedge); a curator that auto-applies low-risk edits (feedback amplification); forcing every dispatch through one mechanism (it cannot inherit conversation context or launch harness-native agents, so the fix was to unify telemetry, not dispatch); and a second control plane for the app suite (the parallelism already lives at the worker layer, and a second orchestrator would only fragment the shared substrate).

3.2 Gotchas that cost real time

The two-host split is a bug factory. Terminal and GUI config has to be installed on the workstation, not the server; a config deployed to the server is invisible. Escape sequences from inside a multiplexer on the server need explicit passthrough to reach the workstation's terminal. Re-attaching to a running session does not re-run session setup. A roaming connection can wedge silently and only a fresh connection recovers it. Never redeploy a "canonical" workstation config over a working one unverified, and do not auto-mutate workstation config on a schedule; surface drift and let the human apply.
The runner's environment is thinner than your shell. Scheduled jobs run with a minimal PATH and an old system shell. Two concrete failures: a script used a shell feature the server's ancient system shell did not support and crashed at launch (and the syntax checker did not catch it, because it was a runtime feature), and a job shelled out to a tool by bare name that had moved off the job's PATH. Resolve tools by absolute path; test under the job's real PATH and shell. A hard floor: launchd-critical scripts stay within the old shell's dialect, and a dedicated lint guard catches the constructs the syntax checker misses.
Silent failure is the worst failure. When the overnight launcher crashed before spawning anything, nothing was posted and it was caught by chance. The fix was two-layer: an exit-trap that alerts on any non-zero launcher exit, and a separate morning health-check that pings if last night produced no result. Anything autonomous needs a "did it even run?" backstop, and the same lesson recurred for pre-init worker deaths (a worker that dies after its session-start hook but before its first call leaves no outcome), now caught by an init-verification probe.
A live-process guard can deadlock against its own trigger. A live-session guard blocked the very session that needed a re-materialization; the fix was a read-only verify path that skips materialization. When a safety guard and a workflow share a resource, check whether the workflow can ever be the thing the guard trips on.
A default-fallback can silently downgrade. A model-resolution lookup fell through to a cheaper default for any value not in its alias table, so pinning an agent to a literal frontier model id silently ran it on the cheap tier in one subsystem. Fallbacks should pass through unknown-but-valid values, not coerce them to a default; the same shape recurs as the gateway's allowlist (an unknown model fails closed to direct, never silently to the proxy).
A static test gate cannot catch a live-path bug. Bootstrapping the gateway surfaced several runtime bugs no unit test could see (a callback registered as a class not an instance, non-ASCII header values, the wrong request-metadata container on the real client endpoint), because each needed a live request. Confirm the empirical test exercises the real client's endpoint and the real spawn path, not a convenient stand-in such as a direct call to the proxy.
A test command that errors is not a passing run. A non-zero exit with no pass/fail summary (a missing virtualenv, an import error) is a harness failure; reported counts must come from actual test output. Unit runners that strip types without checking them pass a cross-file type break that only CI's separate type-check job catches, so a worker now runs the same non-test static gates CI runs, and the orchestrator gates ready-to-merge on the real CI conclusion, not on the diff-read reviews alone.
Timezone in the cost reporting. Reporting a calendar day in the wrong timezone made the digest structurally miss the overnight batch it existed to summarize, and a UTC-keyed budget day landed an evening's interactive work in the next day's pool. Anchor reporting and budget windows to local time and to the cadence, not the calendar.
GUI-config loaders can be picky. A symlinked config file was silently skipped by the app that consumed it (it wanted a real file), and a mount point macOS owns cannot be handed to a generic automount. Where an external app reads generated config, copy real files and do not assume environment specifics.
Run-dir forensics alone can mislead. A wave of worker deaths looked like resource exhaustion from the run artifacts; the real cause (an uncaught broken-pipe from a truncating shell pipe on an inline spawn) only showed in the orchestrator's own transcript. When a worker dies pre-init with empty stderr, check how it was spawned, not just the worker's own artifacts.
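The default-fallback gotcha above has a compact fix worth showing. In this sketch the alias table and model ids are invented placeholders; the point is the order of the checks: resolve an alias, pass a known literal id through unchanged, and raise on anything else instead of coercing it to a cheaper default.

```python
# Hypothetical alias table and model ids, for illustration only.
ALIASES = {
    "frontier": "vendor-frontier-1",
    "standard": "vendor-standard-1",
}
KNOWN_MODELS = set(ALIASES.values()) | {"vendor-frontier-2"}


def resolve_model(value: str) -> str:
    """Resolve a frontmatter model value without ever silently downgrading."""
    if value in ALIASES:
        return ALIASES[value]
    if value in KNOWN_MODELS:
        return value  # a literal id is valid: pass it through untouched
    # Unknown values fail loudly; there is deliberately no default tier.
    raise ValueError(f"unknown model: {value!r}")
```

The buggy shape is a dictionary lookup with a default argument such as ALIASES.get(value, cheap_default), which turns every literal id into the default without an error.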

3.3 Considered and rejected (the rejections are load-bearing)

Considered	Rejected because
DB-as-source-of-truth for engineer docs	A sync-from-files watcher is cheaper than an admin UI; files are diffable in git; defer until roughly ten contributors.
Bidirectional file/wiki sync	Merge-conflict and edit-loop pathologies; one-way-per-kind is provably correct.
Auto-merge for "trusted" PR categories	Erodes the binary merge gate.
Frontend-coder plus backend-coder split	Cross-stack is the common case; skills handle the context need without splitting identity.
Provider-neutral framework rewrite	Loses native DX for portability that may never be exercised; routing metadata plus the gateway is the lighter hedge.
Curator that auto-applies "low-risk" edits	Feedback amplification; all self-modifying agents are proposals-only.
Agent-to-agent mesh	Worse at coordination than orchestrator-intermediated; gets worse at scale, not better.
Always-on orchestrator (every session is orchestrator-flavored)	Loses the casual REPL; a launcher preserves both.
Forcing every dispatch through one mechanism	Cannot inherit conversation context or launch harness-native agents; unify the telemetry layer instead.
A second substrate/orchestrator for the app suite	The parallelism is at the worker layer; a second control plane fragments the shared substrate and the tacit why.
A single global concurrency cap	Two independent orchestrators starve each other; the cap is a per-team coordination limit under a higher global resource ceiling.
Prompt-level "use the cheaper model" / "re-review every revision"	Prompt rules drift; decide the model and the merge gate in the harness.

§A — Adapting the pattern to a team

The system is single-developer by design. A reader scaling it to a small engineering org keeps most of it and changes five things; below is where each one attaches and the single-developer assumption it breaks. Designing the solutions is the reader's job; these are the seams.

Parallel team instances (one per developer). The agent prompts, the roster, and the orchestration patterns are per-instance and replicate cleanly. The per-team four-worker ceiling already exists (it is a coordination limit, and the global resource ceiling already sits above it), so the team-wide total is bounded by cost, not coherence. The seams: run state and budget accounting assume a small number of writers and need per-developer namespaces or a coordination service; the meta-layer assumes one platform owner; and "the human merges" becomes "which human," which means codeowners by area.
A shared knowledge base. Files-as-source-of-truth holds to roughly ten contributors and the index, MCP, and profile machinery share trivially. The seam is concurrent authorship: the editorial discipline (one ADR per decision, status maintained, provenance) needs an owner per area, and the in-session-autonomy path of the KB-sync pipeline needs an authority model ("whose session, and does that authority cover a shared doc"). The verifier-gated factual-sync path is unaffected, since it is grounded in merged PRs.
A stronger cost-control layer. The origin-scoped caps, the real-cost gateway accounting, the API-rate-versus-billed split, and the per-run circuit breaker all transfer. The seam is scope and attribution: caps become per-developer and per-team under a shared pool, and telemetry needs cost per developer, per team, and per product, probably with chargeback or quota surfacing. The schema supports the fields; the aggregation and policy are new. The model-tiering rule is the biggest lever and matters more at higher autonomous volume.
A stronger policy layer. This is the least-developed area in the solo system, since the solo developer is the policy. The two gates, the high-risk halt set, the tool grants, the protect-main hook, and the review-integrity stamp gate are the existing enforcement points; at team scale they become codeowners plus branch protection, a centralized halt-set rather than per-prompt copies that can drift, an audited capability model, and a provenance-and-authority record for "who authorized this autonomous action."
A thin layer orchestrating the teams. Not a fifth orchestrator that dispatches the four (that re-introduces the mesh cost the ceiling exists to avoid), but a coordination service for genuinely-shared state (run registry, budget pool, shared-KB write-arbitration) plus cross-team visibility (a combined digest, a team-wide cost view). Think shared services and dashboards, not an orchestrator of orchestrators; preserve the per-team orchestrator-worker shape and the per-team ceiling.

Preserve most aggressively the cognitive-debt mandate (the explainer) and the legwork-done-first discipline. A team amplifies the bottleneck risk: if architectural-consultation requests are half-formed, the tech lead becomes the chokepoint exactly the way the solo developer would. Consider making the planner and architect prompts stricter at team scale, with an explicit self-critique step ("write the questions the reviewer will ask and pre-answer them; if you cannot, the analysis is not ready") before anything surfaces to a human.

§B — Per-agent specification (the implementable layer)

This is the bottom of the progressive disclosure: enough per agent to re-implement it. Tool names (agentctl, co, kb_search, post_bug, slack-post) are this system's own CLI and MCP surface; reproduce or rename freely. Three conventions are referenced by many agents and defined once: the high-risk halt set (§2.7); incidental observations (any worker that notices something off outside its task routes it to the product's bug channel via post_bug rather than folding it into its diff); and PR mechanics (workers commit on a branch and author a PR body to a per-worktree gitdir file, never push or open PRs, run the full suite on first dispatch and targeted tests on a review-driven re-dispatch, and the orchestrator publishes and opens the PR).

Model tiers below: frontier (hard reasoning and cross-cutting judgment), standard (scoped execution), economy/gateway (high-volume mechanical roles routed to cheaper third-party models). The model is read from frontmatter and resolved by agentctl.

Orchestration and planning

planning-orchestrator (frontier). The interactive daytime entry point: plans, asks, dispatches; writes no code itself.
- Does: routes by intent (vague to planner; scoped feature to architect or coder; bug to bug-investigator; UI to design-critic; metrics to product-analyst); runs the full review cycle on every PR (reviewer plus security-reviewer in parallel, auto-cycle on any finding, cap at three cycles); gates ready-to-merge on the real CI conclusion; publishes and opens PRs through the deterministic CLI; verifies worker narratives against ground truth before relaying.
- Does NOT: write code; ask permission on every dispatch; skip review by change type; surface platform-internal noise.
- STOP/escalate: surfaces on the high-risk set, a reviewer novel-pattern flag, ambiguous load-bearing details, and the three-cycle cap.
- Surface: its entire human-facing surface collapses to four shapes (decision request, collaborative review, merge ratification with the PR URL on its own last line, status), with everything else absorbed.

overnight-orchestrator (standard, tiered down from frontier). The autonomous counterpart: drains queues on a schedule, leaves a morning digest.
- Does: writes a phase manifest (begin/mark/finish) so a dead wave resumes; phase 0 resolves prior pending-approval items from chat reactions; then bug-sweep, bug-fix drain, feature-queue drain (gated on a human reaction), product queue, and a KB-sync phase, with an architect pass on its scheduled nights whose tractable proposals it executes; every code change goes through the full review cycle and the real-CI gate to ready-to-merge; freshness-skips a phase a daytime run marked fresh; winds down on the soft cap; triggers one digest-mode explainer post with per-item action affordances.
- Does NOT: ask questions (skip-and-defer); merge; skip review; start new work past the soft cap.
- STOP/escalate: defers anything in the high-risk set with a written approval request to the product's proposals channel plus a queue marker phase 0 reads; whole-batch aborts on registry corruption or near credit-pool exhaustion; an auth death stops fast and loud rather than dead-polling.
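The resume-from-manifest behavior can be sketched in a few lines. This is an assumption-laden illustration: the JSON layout, the phase identifiers, and the run_wave name are invented here, and the sketch models only the "mark" step of the begin/mark/finish verbs.

```python
import json
from pathlib import Path

# Phase order taken from the description above; the identifiers are hypothetical.
PHASES = [
    "resolve_approvals",  # phase 0
    "bug_sweep",
    "bug_fix",
    "feature_queue",
    "product_queue",
    "kb_sync",
]


def run_wave(manifest_path, handlers) -> list:
    """Run each phase once, recording completion so a dead wave can resume.

    handlers maps phase name -> zero-argument callable. If a handler raises,
    the manifest already lists every phase finished before it.
    """
    path = Path(manifest_path)
    done = json.loads(path.read_text())["done"] if path.exists() else []
    for phase in PHASES:
        if phase in done:
            continue  # finished in an earlier attempt of this wave
        handlers[phase]()
        done.append(phase)
        path.write_text(json.dumps({"done": done}))  # mark after each phase
    return done
```

Calling run_wave again with the same manifest path after a crash skips the completed phases and restarts at the one that died, which is the behavior the usage-limit pause and resume also relies on.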

planner (frontier). Thinking partner for vague intent; maps the option space, does not build.
- Does: restates the intent to test understanding; asks one or two load-bearing questions first; searches the KB hard and flags superseded prior decisions; maps two to four distinct options (cost, what each forecloses); recommends one with honest confidence; names the next step.
- Does NOT: propose code structure or migration steps; dispatch; edit; produce a long document (a two-minute read); assume unstated constraints.
- Tools: read-only.
- Output: a short exploration doc.

architect (frontier). Proposes structure; never executes.
- Does: four modes. A targeted proposal (current state, proposed structure, migration steps, risks, blast radius, what is not changing, optional draft ADR). A holistic assessment (prioritized findings plus a top pursue-list plus an archive list; pursue items are not auto-executed). A tie-break that rules which of two existing patterns applies. And in a scheduled sweep, executing its own proposals to ready-to-merge PR stacks under a soft PR cap. Reads existing ADRs first; prefers refactorer-executable proposals; favors narrow scope.
- Does NOT: execute code (write-scoped to its run-dir docs); auto-execute pursue items; invent a third pattern as a tie-break; propose cross-product or schema/auth swaps without a draft ADR first.

platform-planner / platform-architect (frontier) — the substrate counterparts of planner and architect, reasoning about the platform's own three layers rather than a product. The platform-architect's proposals add a test plan (which suites cover the touched paths, what new tests are needed) and account for the live-on-edit deploy model and the old-shell floor; both name the cross-product blast radius a substrate change carries and stay off product code.

curator (frontier). The self-improvement proposer; never writes to canon directly.
- Does: reads the week's observations, audit reports, and recent explainer trends; proposes changes to agent prompts, profiles, skills, ADRs, and runbooks (KB only); opens a PR with one commit per proposal, each citing real observation refs.
- STOP/escalate: an agent-prompt change needs at least two independent observations and a new skill at least two; other proposals need one. A week with nothing worth proposing posts "no proposals" and opens no PR; no-proposal weeks signal stability.

consultant (frontier). The outside-view loop; advises, never edits.
- Does: weekly, studies how the industry's practice of building and running agents is evolving (credible, dated primary sources), assesses the substrate honestly against it, and writes a one-page brief (where we are, what is changing with citations, what to consider) to the KB; opens a PR; posts a short pointer to a monitored channel. Earns the page, cites or cuts, is specific to this setup, and treats a quiet week as a valid one-line brief.
- Does NOT: edit prompts, profiles, or platform code.

Build and review

coder (standard). Implements features and small mechanical refactors per a spec.
- Does: halts if the spec is ambiguous on a load-bearing detail; reads affected files end-to-end; minimal diffs; never runs a formatter unless the repo already configures one; runs the same non-test static gates CI runs for the code it touched; commits in small logical chunks; incorporates non-blocking review findings by default and escalates substantive disagreement to the architect.
- Does NOT: push or open PRs; drive-by refactor; disable tests; change dependencies or CI without asking; touch outside the spec's blast radius without surfacing.

refactorer (economy/gateway). Executes large mechanical refactors per an approved architect proposal.
- Does: reads the proposal end-to-end and flags any non-mechanical step; orders the migration into independently shippable chunks (tests green at each); commits each chunk named by step; up to three self-correction attempts on a failing step; never reports "tests passed" from a harness failure.
- Does NOT: decide what to refactor; skip or exceed any proposal step; run a formatter; continue past a halted step.

platform-coder (standard) — the substrate counterpart of coder, and (since the substrate crew has no separate refactorer) it owns the mechanical refactors too. Its standing rule is test-as-touched: every change ships with tests for what it touched and the full gate (Python plus the old-shell tests plus the bash-4 lint guard) runs before the PR. Halts before touching launchd defs, the budget or concurrency rails, the installer, dependencies, or the KB validator and materialization guards, and never edits product code.

test-writer (standard) — writes and updates tests; path-scoped to test dirs only, production code read-only. - Does: reads production code to understand the surface without modifying it; matches existing test conventions; covers happy path, meaningful edges, and regressions (a regression test must fail against the buggy code); surfaces production bugs rather than working around them. - Does NOT: edit production code; invent new test patterns; mock to mock.

reviewer (standard) — adversarial diff review for correctness and quality; read-only. - Does: reviews every PR fully with no change-type exemption (the exemption was the exact loophole that let drift through); hunts logic errors, regressions, unsafe assumptions, missing tests, style and ADR violations, and scope creep; posts findings as a PR comment always (even when clean), each tagged severity and category, with a verdict; posts the verdict through the stamping command so the merge gate can authenticate it. - Does NOT: have write tools beyond the PR comment surface; duplicate the security lane; inflate vague findings.
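One way to picture the always-post rule is a renderer that produces a comment even for a clean diff. This is a sketch under assumptions: the `Finding` type, the severity labels, the verdict strings, and the comment layout are all hypothetical, and the stamping command that authenticates the verdict is not modeled here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Finding:  # hypothetical shape for one review finding
    severity: str  # assumed labels: "blocking" or "non-blocking"
    category: str  # e.g. "logic", "missing-tests", "scope-creep"
    note: str


def render_review(findings: List[Finding]) -> str:
    """Build the PR comment body; a clean review still yields a comment."""
    blocking = any(f.severity == "blocking" for f in findings)
    verdict = "changes-requested" if blocking else "approved"  # assumed names
    lines = [f"Verdict: {verdict}"]
    if not findings:
        lines.append("No findings.")
    for f in findings:
        # Every finding carries both tags, so none can be posted untagged.
        lines.append(f"- [{f.severity}][{f.category}] {f.note}")
    return "\n".join(lines)
```

Because the function never returns an empty string, "reviewed and clean" stays distinguishable from "never reviewed".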

security-reviewer (standard) — security-lens diff review; runs in parallel with reviewer. - Does: hunts authN/authZ failures, injection, secret exposure, vulnerable deps, unsafe deserialization, missing trust-boundary validation, enumeration and timing issues, PII and logging problems, crypto misuse, and session/CSRF/token issues; fast-bails with a one-paragraph note on diffs with no security surface, but never on auth, identity, money, PII, server-side validation, frontend-to-API, or dependency/build/deploy changes; every finding names a concrete attack path and a file and line. - STOP/escalate: immediate alert on a hardcoded secret so rotation can start.
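The fast-bail rule and the finding requirement can be expressed as two small checks. This is a sketch under assumptions: how a diff gets tagged with change areas is not described above, so the area labels, `may_fast_bail`, and `SecurityFinding` are hypothetical names. The list of never-bail areas follows the text.

```python
from dataclasses import dataclass
from typing import Set

# Areas the text says never qualify for a fast bail (labels are illustrative).
NEVER_FAST_BAIL = frozenset({
    "auth", "identity", "money", "pii", "server-side-validation",
    "frontend-to-api", "dependency", "build", "deploy",
})


def may_fast_bail(change_areas: Set[str], has_security_surface: bool) -> bool:
    """Return True only when a one-paragraph note is an acceptable review."""
    if change_areas & NEVER_FAST_BAIL:
        return False  # a protected area always gets the full review
    return not has_security_surface


@dataclass(frozen=True)
class SecurityFinding:  # hypothetical shape for one finding
    attack_path: str
    file: str
    line: int

    def __post_init__(self) -> None:
        # Reject findings that lack a concrete attack path or a location.
        if not self.attack_path.strip():
            raise ValueError("finding needs a concrete attack path")
        if not self.file.strip() or self.line < 1:
            raise ValueError("finding needs a file and a line")
```

Checking the protected areas first means a diff that merely looks harmless cannot skip review once it touches one of them.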

Bug, security, design, product

bug-investigator (frontier) — root-cause diagnosis; read-only, does not fix. - Does: investigates via code reading, test runs, git history, and the KB; finds the root cause (not the symptom), assesses blast radius, sketches a fix path and a regression-test idea; claims the report with a reaction once the diagnosis is posted; ingests issue-tracker items mirrored to chat first. - Does NOT: write fix code (illustrative pseudocode only); diagnose cross-product without surfacing; claim a report it cannot reproduce; mark a report fixed.

bug-fixer (economy/gateway) — implements a fix per a diagnosis; two-stage. - Does: fix mode implements per the diagnosis, dispatches the test-writer for a regression test, makes minimal-scope commits, runs the CI static gates, and authors a PR body; close-loop mode (after both reviews sign off) verifies the resolution, adds the fixed reaction, posts the threaded reply, and closes the tracker issue. Sweeps references repo-wide before changing a shared symbol. - Does NOT: re-investigate a wrong diagnosis (surfaces back); dispatch reviewers; run a formatter; add the fixed reaction before both reviews sign off.

security-auditor (frontier) — weekly posture sweep of accumulated state; read-only. - Does: complements the per-PR reviewer by looking at accumulated state (secret scanning, dependency vulns, missing security headers, authZ sampling, IaC misconfig, stale creds, logging hygiene), varies focus week to week, and writes a report plus observations for the curator. - STOP/escalate: immediate alert (to the monitored audits channel) on a live credential, an actively-exploited critical dependency vuln, or evidence of past intrusion.

design-critic (frontier) — UI/UX critique against a design rubric; read-only; native image input. - Does: critiques design-system fidelity, visual hierarchy, consistency, accessibility as a first-class finding (contrast, focus, semantics, keyboard, screen-reader), responsive behavior, empty/loading/error states, and microcopy; reads the design-system docs before critiquing; renders its screenshots from a verified-clean HEAD so a dirty working tree cannot feed it stale images. - Does NOT: cover code structure, performance, or product strategy; critique without screenshots.

product-analyst (standard) — metrics and usage analysis; read-only; writes reports. - Does: restates the question and checks definitional clarity; decomposes into plain-language queries; interprets with base rates and trends; writes a short report leading with the surprising thing. (The live analytics surface is deferred; for now the agent reads only from the filesystem, which makes it a known-incomplete agent.) - Does NOT: editorialize product direction; query PII without need; report ambiguously-defined metrics without asking.

Knowledge, explanation, commercial

knowledge-scribe (standard) — captures decisions and runbooks from real activity; proposals plus verified-autonomous paths. - Does: drafts ADRs, runbooks, and refinements from what actually happened (quotes the source, never invents rationale); classifies output into four lanes — factual-sync (additive edit from a merged PR, verifier-gated, autonomous apply), decision-sync (a new ADR from a merged PR, verifier-gated, autonomous apply), in-session-agreed (a decision agreed live, autonomous store), and generative (everything else, stage for the human); resolves ADR registry mechanics invisibly. - Does NOT: commit to canon directly (only via the apply cage); self-approve a factual-sync; capture routine non-decisions.
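The four lanes amount to a small routing table. This is a sketch under assumptions: the `Draft` fields, the precedence order among lanes, and the disposition strings are hypothetical. The text says a failed factual-sync check routes to the human; treating a failed decision-sync the same way is an assumption made here.

```python
from dataclasses import dataclass
from enum import Enum


class Lane(Enum):
    FACTUAL_SYNC = "factual-sync"
    DECISION_SYNC = "decision-sync"
    IN_SESSION_AGREED = "in-session-agreed"
    GENERATIVE = "generative"


@dataclass(frozen=True)
class Draft:  # hypothetical description of one scribe output
    from_merged_pr: bool
    additive_edit: bool  # additive edit to an existing record
    new_adr: bool        # records a new decision
    agreed_live: bool    # the human agreed to it in-session


def classify(draft: Draft) -> Lane:
    """Pick a lane; anything that fits no named lane is generative."""
    if draft.agreed_live:
        return Lane.IN_SESSION_AGREED
    if draft.from_merged_pr and draft.new_adr:
        return Lane.DECISION_SYNC
    if draft.from_merged_pr and draft.additive_edit:
        return Lane.FACTUAL_SYNC
    return Lane.GENERATIVE


def disposition(lane: Lane, verifier_passed: bool = False) -> str:
    """Map a lane to what happens next."""
    if lane is Lane.IN_SESSION_AGREED:
        return "store"
    if lane in (Lane.FACTUAL_SYNC, Lane.DECISION_SYNC):
        # Autonomous apply only behind a verifier pass (assumed for both lanes).
        return "apply" if verifier_passed else "stage-for-human"
    return "stage-for-human"
```

Defaulting `verifier_passed` to `False` keeps the safe path as the default: a caller that forgets to run the verifier gets a staged proposal, never a silent apply.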

kb-sync-verifier (standard) — independent verification of knowledge patches; the gate for autonomous apply. - Does: reads the actual diff (never the patch's prose); enumerates every checkable claim (IDs, migration numbers, thresholds with units, symbol and file names, enum strings, endpoint paths) and quotes a supporting line for each. Two modes: factual-sync (must be additive-only; any judgment element fails as "generative, route to human") and decision-record (the recorded decision and every element of its rationale must trace to the PR; confabulation, scope-expansion, or a wrong fact fails). - STOP/escalate: defaults to FAIL on doubt, because a false pass puts a wrong fact in canon silently while a false fail just costs a human a glance.
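The fail-closed posture of the factual-sync mode can be sketched as a function that passes only when every claim is quoted from the diff. This is an illustration under assumptions: the `Claim` type, the function name, the exact-line matching rule, and the failure messages other than "generative, route to human" are hypothetical, and failing on an empty claim list is an assumed reading of "defaults to FAIL on doubt".

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Claim:  # hypothetical shape for one checkable claim
    value: str            # exact token being asserted, e.g. "30" or a symbol name
    quote: Optional[str]  # the diff line offered as support


def verify_factual_sync(
    claims: List[Claim],
    diff_lines: Iterable[str],
    additive_only: bool,
    has_judgment: bool,
) -> Tuple[str, str]:
    """Return ("PASS" | "FAIL", reason); every doubt resolves to FAIL."""
    if has_judgment:
        return ("FAIL", "generative, route to human")
    if not additive_only:
        return ("FAIL", "patch is not additive-only")
    if not claims:
        return ("FAIL", "no checkable claims enumerated")
    diff = {line.strip() for line in diff_lines}
    for claim in claims:
        quote = (claim.quote or "").strip()
        # The quote must be a real diff line, not the patch's own prose.
        if not quote or quote not in diff:
            return ("FAIL", f"no supporting diff line for {claim.value!r}")
        # The quoted line must actually contain the claimed token.
        if claim.value not in quote:
            return ("FAIL", f"quote does not contain {claim.value!r}")
    return ("PASS", "every claim quoted from the diff")
```

The single `PASS` return at the end mirrors the asymmetry in the text: each early exit costs the human a glance, while a wrong pass would put a wrong fact in canon silently.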

observability-scribe (standard) — emits structured observations from telemetry for the curator; a filter, not a logger. - Does: reads run telemetry and audit reports; emits append-only JSONL observations for patterns, not single events; compresses to a few dozen a week; alerts a channel only on high severity; triggers on schedule and on events (cost spike, repeated tool failures, timeouts, review blockages). - Does NOT: log routine successes or single anomalies; emit before a pattern recurs; duplicate a recent observation.
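The "filter, not a logger" behavior can be sketched as a function that turns raw events into JSONL lines only for recurring patterns. This is a minimal illustration under assumptions: the event fields (`kind`, `subject`), the recurrence threshold of two, the observation keys, and the way recent observations are tracked are all hypothetical.

```python
import json
from collections import Counter
from typing import Dict, Iterable, List, Set


def emit_observations(
    events: Iterable[Dict[str, str]],
    recent_keys: Set[str],
    min_recurrence: int = 2,  # assumed threshold for "a pattern recurs"
) -> List[str]:
    """Return JSONL lines for recurring patterns that were not already reported."""
    counts = Counter((e["kind"], e["subject"]) for e in events)
    lines = []
    for (kind, subject), count in sorted(counts.items()):
        if count < min_recurrence:
            continue  # a single anomaly is not a pattern
        key = f"{kind}:{subject}"
        if key in recent_keys:
            continue  # do not duplicate a recent observation
        record = {"pattern": kind, "subject": subject, "count": count}
        lines.append(json.dumps(record, sort_keys=True))
    return lines
```

A caller would append the returned lines to the observations file and add each emitted key to `recent_keys`, which keeps the log append-only and the weekly volume small.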

explainer (standard) — keeps the human's mental model current; the system's highest quality bar. - Modes: per-run (writes a canonical summary into the run directory), digest (one synthesis post per batched flow, with per-item action affordances and clickable deep-links, not a concatenation), conversational (in-thread follow-ups scoped to that run's artifacts). - Does: leads with the surprising thing; compares result to intent; quotes the actual change; is honest about confidence; ends with "worth checking." Its verification gate grounds every code-specific claim against the real diff before posting; hyperlinks PRs. - Does NOT: dispatch; draft ADRs or propose changes; suppress failures to be agreeable; recap what the human already saw; post a per-run firehose to chat.

product-manager (frontier) — the per-product commercial surface; drafts for review, never sends. - Does: drafts outreach, forms, and briefings for the human to send; captures commercial knowledge as durable KB records; surfaces recurring customer needs as feature-request docs that seed the engineering queue via human or orchestrator promotion; dispatches the product-analyst when a question needs quantitative backing. - Does NOT: send anything outbound or commit the company; write code or review diffs; dispatch engineering directly.

positioning-strategist (frontier) — company-level positioning, the altitude above the per-product product-manager. - Does: sets and maintains the company narrative and messaging architecture; briefs the commercial-writer rather than writing the deck; pulls metrics via the product-analyst before a claim leans on a number; captures positioning and pricing-posture decisions into the KB. Leads with the recommendation and the one reason. - Does NOT: publish or commit the company; write code; run the product roadmap; dispatch the engineering roster.

commercial-writer (frontier) — turns positioning into presentation craft. - Does: writes pitch decks (slide-level copy), one-pagers, landing copy, and partner/investor/customer briefings, built from a positioning brief and grounded in the product docs so claims match what the products do; a claim it cannot ground, it flags; promotes only reusable language to the KB and writes one-off deliverables to the run dir. - Does NOT: publish, send, or commit the company; render visual design.

This document describes the current state. It is a maintained, living artifact, and a scheduled reminder prompts its periodic refresh against the decision log.