If you feel whiplash between the LinkedIn hype and your own burnt cash and time, you’re not crazy.

Most “agentic swarm” demos optimize for vibes, not delivery. The gap is almost always the same:

LLMs don’t ship software. They ship text.

You ship software by constraining the problem until “text” reliably becomes “change sets you can trust.” Here’s a workflow that actually holds up once the project stops being toy-sized.

Architecture Is the Prompt

If you don’t provide an explicit architecture, the model will invent one on every iteration. That’s why after 3-5 “small tweaks” the codebase randomly breaks: the LLM has no stable mental model of your system, and you didn’t give it one it can re-load.

So: treat architecture as executable context — near the code, versioned, reviewed, and constantly referenced.

Minimum set of “truth” artifacts: a C4 model (Context / Container / Component / optional Deployment), ADRs for every non-trivial choice, infra and runtime constraints (cloud, network boundaries, identity, secrets, tenant model), contracts (API schemas, events, DB migration strategy), and test cases written as user paths — manual-style steps are fine.
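
For the ADRs in that list, the format can stay tiny. A minimal markdown skeleton in the common Nygard style (the headings are a convention, not a requirement):

```markdown
# ADR-NNNN: <short title of the decision>

Status: Proposed | Accepted | Superseded by ADR-MMMM
Date: YYYY-MM-DD

## Context
The forces at play: requirements, infra and runtime constraints, tenant model, deadlines.

## Decision
The choice, stated in one or two sentences, in active voice.

## Consequences
What gets easier, what gets harder, what must now be maintained or revisited.
```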

If you have these, the model stops “guessing” and starts “implementing.” In the previous post I argued that the bottleneck was never intelligence — it was instrumentation. This is what instrumentation looks like in practice.

C4 diagrams rot when they’re hand-drawn. Structurizr DSL fixes this because it’s documentation-as-code — diagram changes are reviewed like code, you can enforce consistency, and it’s easier to keep architecture close to reality than in a wiki. I wrote about this in detail in Documenting System Design with Structurizr & ADRs. The short version: if your architecture diagrams aren’t versioned and validated, they’re fiction.
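
To make “documentation-as-code” concrete, a minimal Structurizr DSL workspace looks like the sketch below; the person, system, container, and technology names are placeholders for your own model.

```
workspace "Example Platform" "C4 model for the example platform" {

    model {
        user = person "Customer" "Uses the platform via the web app"
        platform = softwareSystem "Example Platform" {
            api = container "API" "Public REST API" "Go"
            db = container "Database" "Stores tenant data" "PostgreSQL"
        }

        user -> api "Uses"
        api -> db "Reads from and writes to"
    }

    views {
        systemContext platform "Context" {
            include *
            autoLayout lr
        }
        container platform "Containers" {
            include *
            autoLayout lr
        }
    }
}
```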

Keep your C4 model in /docs/architecture/structurizr/ and make it part of CI — validate the DSL, export diagrams, fail builds if the model is invalid. Pair this with ADRs in /docs/architecture/adr/ (simple markdown). Every time the LLM proposes a major decision, either point it to an existing ADR, or make it draft a new one first. This single rule removes a shocking amount of churn.
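
One way to wire that into CI is via the Structurizr CLI’s Docker image. A sketch, with paths matching the layout above; double-check the flags against the CLI docs for the version you pin:

```sh
# Fail the build if the C4 model doesn't parse or validate.
docker run --rm -v "$PWD/docs/architecture/structurizr":/usr/local/structurizr \
  structurizr/cli validate -workspace workspace.dsl

# Export diagrams so reviewers can see what actually changed.
docker run --rm -v "$PWD/docs/architecture/structurizr":/usr/local/structurizr \
  structurizr/cli export -workspace workspace.dsl -format plantuml -output diagrams
```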

Constrain the Model, Not the Prompt

If you’re using Cursor, Windsurf, or similar tools, you’ve already noticed a pattern: the model obeys what it can see. So design your repo so the relevant constraints are always nearby.

Put feature specs in docs/features/<feature>.mdc — what it does, why it exists, non-goals, edge cases, contracts, performance and security constraints. Put test cases in testcases/<feature>.mdc — user paths, negative cases, expected results. Drop a .cursorrules file (or equivalent) with style boundaries, forbidden patterns, and links to your canonical docs. The trick isn’t fancy rules. The trick is stable constraints.
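
The .cursorrules file (or whatever your tool calls it) doesn’t need to be clever. A sketch, with the paths matching the layout above:

```
# Always-on project rules

- Architecture lives in docs/architecture/structurizr/, decisions in docs/architecture/adr/.
  Do not introduce new services, queues, or data stores without a new ADR.
- Feature specs live in docs/features/<feature>.mdc, test cases in testcases/<feature>.mdc.
  If a request conflicts with the spec, flag the conflict instead of guessing.
- Only touch files explicitly listed in the task.
- Do not add dependencies without asking first.
- Never claim a change is done unless `make verify` passes.
```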

Now pair that with a deterministic gate. Agents are most dangerous when they can do a lot without friction. Instead, treat your repo like a controlled lab: make lint, make test, make typecheck, make migrate, make verify. Then tell the model:

“You are not allowed to claim success unless make verify passes.”

That turns LLM output into something measurable. Without this, you’re just reading prose. You can expose these commands via MCP so your tooling calls them consistently — but the key point is: the orchestrator is the source of truth, not the agent.
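
Here’s what that gate can look like as a Makefile. This sketch assumes a Go service purely for illustration; the commands behind each target are placeholders for your stack’s linter, type checker, test runner, and migration tooling, and the scripts are hypothetical (also note that make recipes are tab-indented):

```make
.PHONY: lint typecheck test migrate verify

lint:
	golangci-lint run ./...            # swap for your stack's linter

typecheck:
	go build ./...                     # in Go the compiler is the type checker; use mypy, tsc, etc. elsewhere

test:
	go test ./...                      # unit and contract tests

migrate:
	./scripts/migrate.sh               # hypothetical wrapper around your migration tool

verify: lint typecheck test
	./scripts/check-migrations.sh      # hypothetical: fail on schema drift or unapplied migrations
	@echo "verify: all gates passed"
```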

The Workflow That Actually Ships

Plan (human-led) — update or confirm C4 + ADRs for the change, write or extend testcases as user paths (example after these steps), define the acceptance criteria.

Implement (LLM-assisted, but bounded) — give the model the feature doc, relevant ADRs, relevant C4 snippet, the testcases, and explicit file paths it is allowed to touch. Tell it to propose a diff plan first, then implement in small patches, one concern at a time.

Verify (machine-led) — run formatter, lints, tests, migration checks, security scans. If it fails, the model fixes only what the failing command indicates. No wandering.

Review (human-led) — you check architecture adherence (does it match C4/ADR?), correctness of contracts, and test coverage for the declared paths. This is how you avoid “five edits later the whole thing implodes.”
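
To make the Plan step concrete, here’s a sketch of testcases/<feature>.mdc for a hypothetical invite-user feature, written as user paths. The feature, roles, and expected results are invented for illustration:

```
# Test cases: invite user (hypothetical feature)

## Happy path
1. Admin opens Settings -> Members and invites new.user@example.com as "Editor".
2. Expected: invite email sent, pending member row appears, audit log records the action.
3. new.user@example.com accepts the invite.
4. Expected: account created with the Editor role, pending row becomes active.

## Negative: duplicate invite
1. Admin invites an address that already has a pending invite.
2. Expected: no second email, UI shows "invite already pending", no duplicate member row.

## Negative: cross-tenant
1. Admin of tenant A invites a user, then the accept link is replayed against tenant B.
2. Expected: request rejected, no membership created in tenant B.
```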

A note on security: LLMs are not unsafe because they’re “dumb.” They’re unsafe because they’re confident. No secret material in prompts, assume all generated code is untrusted until tests pass, prefer narrow tools over agent autonomy, lock down file scope and permissions, and never let agents “refactor broadly” without a migration plan. The safest “agentic” setup is usually one agent, very constrained, with loud verification gates.

And a note on model selection: use your strongest reasoning model for architecture, tricky bugs, and ADR drafting. Use a fast coding model for mechanical changes, boilerplate, and migrations. Use a cheaper model for docs and test generation — but keep the acceptance criteria human-authored. Rule of thumb: if a mistake is expensive, pay for the best model and force verification. If a mistake is cheap, use speed.


If you’re trying to “swarm” your way around missing architecture, you’ll lose — because the model will re-invent your system every time you ask for a change.

But if you make architecture explicit (C4 + Structurizr), capture decisions (ADRs), keep specs and tests close to code, and force verification through a deterministic orchestrator — then LLMs stop being a slot machine and start being a leverage tool.

That’s what people mean when they say they “ship with AI.” They’re not magically one-shotting complexity. They’re instrumenting the work so the model can’t freestyle.

And if that sounds familiar — it should. In the first post, I called LLMs what they are: GAC — Glorified Auto-Complete. In the second, I argued the bottleneck was never intelligence — it was instrumentation.

This post is what instrumentation looks like when you stop talking about it and start doing it.