The agentic factory began with two commits. The first, on November 5, 2025, was tagged "Build: add all agent prompts, tools, and orchestrator." The second, a day later: "Build: create complete agentic factory template."
Six months and 266 commits later, it has 29 specialist agents, a 17-phase feature development workflow, 37 helper scripts, 23 slash commands, and five hard-coded approval gates. Customer projects ship through it at roughly 12-hour cycles for Complex-tier features. It absorbed an Opus 4.7 model migration without a single factory change.
This is how it grew up.
The Template Era (November 2025)
The original commits were scaffolding. The template came with agent prompts, a loose orchestrator, and a set of tools for agents to use. It had no versioning discipline or governance layer, and no distinction between factory tooling and customer projects. The template had ideas about what agentic workflows should look like but almost no accumulated wisdom about where they break.
Two months passed with minimal activity. The template sat while we built things with it and watched what went wrong.
The Claude Code Migration (January 2026)
On January 3, 2026, a burst of commits reshaped the template. The commits are easy to read in order:
- "Standardize agent file formats with YAML frontmatter"
- "Add Claude Code subagents for agentic factory workflow"
- "Clean up template: migrate to Claude Code subagents"
- "Add orchestrator skill for coordinating subagents"
- "Add Ralph loop and stop hooks for autonomous execution"
- "Enhance template based on official Anthropic plugins"
- "Remove legacy workflow files"
The migration had two big moves. First, agents became Claude Code subagents with YAML frontmatter rather than free-form prompts. This gave the orchestrator a standard way to spawn and coordinate them. Second, the Ralph loop got added alongside stop hooks, which meant the workflow could run autonomously for hours without human poking.
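As a rough illustration of the first move, the sketch below splits an agent file into its frontmatter and its prompt body, which is the kind of uniform metadata an orchestrator can read before spawning a subagent. The field names (`name`, `description`, `model`) and the sample agent are assumptions for illustration, not the factory's actual files, and the parser handles only flat `key: value` pairs rather than full YAML.

```python
from dataclasses import dataclass


@dataclass
class AgentSpec:
    """Metadata and prompt body read from one agent file."""
    meta: dict[str, str]
    prompt: str


def parse_agent_file(text: str) -> AgentSpec:
    """Split '---' delimited frontmatter from the prompt body.

    Handles only flat `key: value` lines; a real loader would use a YAML parser.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("agent file must start with a '---' frontmatter fence")
    try:
        end = lines.index("---", 1)  # closing fence
    except ValueError:
        raise ValueError("frontmatter is missing its closing '---' fence") from None

    meta: dict[str, str] = {}
    for line in lines[1:end]:
        if not line.strip():
            continue
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a key: value line: {line!r}")
        meta[key.strip()] = value.strip()

    prompt = "\n".join(lines[end + 1:]).strip()
    return AgentSpec(meta=meta, prompt=prompt)


# Hypothetical agent file, for illustration only.
SAMPLE = """---
name: analyst
description: Writes the PRD for a feature
model: opus
---
You are the analyst agent.
"""

spec = parse_agent_file(SAMPLE)
print(spec.meta["name"], spec.meta["model"])
```

The point of the standard format is that every agent exposes the same keys, so the coordinating code does not need per-agent special cases.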
A few days later, multi-model support landed: Opus for reasoning, Sonnet for code, and Haiku for cheaper, speed-focused work. The factory became aware of cost as a variable rather than treating every call as equivalent.
By the end of January, the template had roughly the shape it has today. The growth from here is about discipline, not structure.
The Gemini Years (February and Early March 2026)
The factory ran on Gemini through most of February and the first half of March. Versions moved from v2.0 through v2.9.0 during that period. Each version tightened something: alignment for production use, hardening with mandatory delegation, model standardization.
The v2.9.0 commit on March 14 added the Five Pillars of Premium Design: Visual, Technical, Typographic, Linguistic, and Accessible. This was the factory's first opinion about what good frontend output looks like, enforced by the frontend-developer agent. It would later become the Design Module, then Design Module v2 after three real customer sites revealed what the original had missed.
Then, two days later, the Gemini factory got torn out.
The v3 Pivot (March 16 to March 19, 2026)
On March 16, 2026, commit 2b77ee8 removed the Gemini factory entirely and the build went Claude Code-native. The same day, two features shipped: the native LLM eval engine and the Red-Team Engine for adversarial testing (both v3.2.0). Two days later came the Skill Importer for bringing in external skills (v3.3.0).
Then a sprint happened. Between March 17 and March 19, the factory went from v3.3.0 to v3.21.3. That is 18 minor versions in roughly 72 hours.
What landed during that sprint:
- Compliance Report Engine mapping to SOC2, NIST AI RMF, EU AI Act, HIPAA, ISO 42001 (v3.16.0)
- AgentGuard CLI unifying the governance and compliance tools (v3.17.0)
- Evidence Linker for keyword-matching compliance questionnaires (v3.17.0)
- Tier enforcement fixes and a delta compliance report (v3.17.1, v3.18.0)
- Red-team deterministic judge assertions (v3.19.0)
- Hallucination-under-pressure category with ASCII smuggling detection (v3.20.0)
- Agent file integrity verification with HMAC-signed registry (v3.21.0)
- Standalone project support (v3.21.1)
- Three-scope project tracking (v3.21.2)
- Session hygiene thresholds tuned for the 1M context window (v3.21.3)
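To make one item in the list above concrete, here is a minimal sketch of agent file integrity verification against an HMAC-signed registry. The registry layout (`digests` plus `signature`), the function names, and the key handling are assumptions for illustration; the source does not describe how the factory's registry is actually structured.

```python
import hashlib
import hmac
import json


def _payload(digests: dict[str, str]) -> bytes:
    """Canonical byte encoding of the digest table, so signatures are stable."""
    return json.dumps(digests, sort_keys=True).encode("utf-8")


def sign_registry(agent_files: dict[str, bytes], key: bytes) -> dict:
    """Record a SHA-256 digest per agent file and sign the whole table."""
    digests = {name: hashlib.sha256(data).hexdigest()
               for name, data in agent_files.items()}
    signature = hmac.new(key, _payload(digests), hashlib.sha256).hexdigest()
    return {"digests": digests, "signature": signature}


def verify_registry(agent_files: dict[str, bytes], registry: dict, key: bytes) -> list[str]:
    """Return names of files that are missing, unregistered, or modified.

    Raises ValueError if the registry itself fails its signature check.
    """
    expected = hmac.new(key, _payload(registry["digests"]), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, registry["signature"]):
        raise ValueError("registry signature mismatch")

    problems = []
    for name in sorted(set(agent_files) | set(registry["digests"])):
        data = agent_files.get(name)
        recorded = registry["digests"].get(name)
        if data is None or recorded is None:
            problems.append(name)  # file removed, or present but never registered
        elif hashlib.sha256(data).hexdigest() != recorded:
            problems.append(name)  # content changed since signing
    return problems
```

Signing the digest table, rather than each file separately, means an edited registry entry is caught by the same check that catches an edited agent file.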
The factory shipped in three days what a slower project ships in three quarters. Not because the team was fast. Because the structure was already there. The v2 era had built the bones. March 19 put meat on them.
The Customer Era (Late March and April 2026)
After v3.21.3 shipped, the factory started being used to build real things. Three luxury-adjacent sites went first. Opulent Villas came out of the gate. Exquisite Yachts and Coral Port Services followed. Each used the Five Pillars standard. Each came out looking tonally distinct despite the shared underlying rules.
This phase was the first one where the factory learned from its own output rather than from theory. On April 14, the Design Module v2 shipped. It wasn't driven by a new feature brief. It came from reading the three sites side-by-side and noticing what the original Five Pillars missed. It added three things: Material Metaphor as a pre-build commitment step, a Motion Matrix for tone-scoped timing, and an Anti-Convergence ledger to keep back-to-back projects from converging on the same visual patterns.
The retrospective memo for Design Module v2 said it plainly: "A one-data-point theory almost always over-specifies. The generalizable finding arrived at N=3."
This was also the era where the craftsmanship principle got enforcement teeth. The analyst agent stopped accepting PRDs with shortcut phrases like "good enough for v1" or "essentials first." Red-flag phrases became a structural refusal pattern. The orchestrator gained a pre-PRD self-audit requirement. Phase 4.5 got its own craftsmanship note. The message underneath all of it: easy scope reductions at Phase 1 cost 10x to unwind at Phase 7.
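A structural refusal pattern of this kind can be sketched as a plain text check that runs before a PRD is accepted. The phrase list below contains only the two phrases quoted above; the function names and the accept/reject shape are assumptions for illustration, not the analyst agent's actual prompt logic.

```python
import re

# Only the two phrases quoted in the text; a real list would presumably be longer.
RED_FLAG_PHRASES = ("good enough for v1", "essentials first")


def find_red_flags(prd_text: str, phrases=RED_FLAG_PHRASES) -> list[str]:
    """Return each red-flag phrase found in the PRD, ignoring case and line wraps."""
    normalized = re.sub(r"\s+", " ", prd_text).lower()
    return [phrase for phrase in phrases if phrase in normalized]


def review_prd(prd_text: str) -> tuple[bool, list[str]]:
    """Accept the PRD only if no red-flag phrase appears; report any hits."""
    hits = find_red_flags(prd_text)
    return (not hits, hits)


accepted, hits = review_prd("Ship basic search. Good enough\nfor V1, filters later.")
print(accepted, hits)
```

A check like this is deliberately blunt: it does not judge whether the scope cut is reasonable, it only forces the shortcut to be stated and defended instead of slipping through.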
Meanwhile, a 50-task feature called superpowers-cross-pollination shipped on April 9. It was a meta-feature that made all 29 agents reference each other's work. The retrospective for it is now load-bearing reading for any feature with 10 or more tasks. A few excerpts, taken verbatim:
- "Delegate when N > 5 files with the same mechanical pattern. Don't delegate when the edit requires judgment about content."
- "Cost scales linearly with agent prompt length, not with scenario count."
- "Architect task estimates need a 1.5x multiplier for wave-sequenced features."
- "Don't ship documented features with placeholder verification."
- "Don't work on main alongside other features."
Each of those is a rule of thumb that cost real time to learn.
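Two of the excerpts are numeric enough to write down directly. The sketch below encodes the delegation threshold and the estimate multiplier as small helpers; the function names and the idea of passing a `mechanical` flag are assumptions, since the retrospective states the rules in prose only.

```python
DELEGATION_FILE_THRESHOLD = 5    # "Delegate when N > 5 files"
WAVE_SEQUENCED_MULTIPLIER = 1.5  # "1.5x multiplier for wave-sequenced features"


def should_delegate(file_count: int, mechanical: bool) -> bool:
    """Delegate only mechanical, repeated edits that span more than five files.

    Edits that need judgment about content are never delegated.
    """
    return mechanical and file_count > DELEGATION_FILE_THRESHOLD


def adjusted_estimate(architect_estimate: float, wave_sequenced: bool) -> float:
    """Scale an architect's task estimate when the feature is wave-sequenced."""
    if wave_sequenced:
        return architect_estimate * WAVE_SEQUENCED_MULTIPLIER
    return architect_estimate


print(should_delegate(file_count=12, mechanical=True))
print(adjusted_estimate(8.0, wave_sequenced=True))
```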
The Opus 4.7 Test (April 16 and 17, 2026)
On April 16, the account got provisioned with Opus 4.7. The migration plan we'd written two days earlier listed five mitigation steps that would have taken weeks. We ran none of them.
Instead the factory shipped a Complex-tier feature (the peer-benchmark-page PDF page for SecureClear) end-to-end on default Opus 4.7 in roughly 12 hours, with 1,744 backend tests passing and zero factory changes. The three behavior risks Anthropic's release notes flagged never materialized, and the two issues that did surface were pre-existing prompt gaps that would have happened on 4.6 too.
The factory absorbed the model change. The full migration, risk by risk, is its own post.
What Makes It Durable
Reading the commit history back-to-back makes one pattern obvious. The factory doesn't survive because it's clever. It survives because it has many small gates, and each gate exists to catch a specific past failure.
Take the Phase 1 craftsmanship self-audit. It exists because an analyst once made four shortcut scope reductions in a single session, and the rule now fires before the PRD gets written. Phase 6.5 deep refactor has a different origin. Reviewers looking at file A can't catch smells in file B if they're thinking about A, so this gate spawns a fresh agent per file. DONE_WITH_CONCERNS came from a different problem still. Developers used to either silently fix out-of-scope issues or halt the wave entirely. The protocol gives them a third option: a surgical fix plus a flag for process refinement.
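The third-option idea behind DONE_WITH_CONCERNS can be sketched as a task status that lets a wave continue while still surfacing what the developer noticed. Only the DONE_WITH_CONCERNS name comes from the text; the other status names, the `TaskResult` shape, and the summary function are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class TaskStatus(Enum):
    DONE = "DONE"
    DONE_WITH_CONCERNS = "DONE_WITH_CONCERNS"  # surgical fix made, issue flagged
    HALTED = "HALTED"  # hypothetical name for "stop the wave"


@dataclass
class TaskResult:
    task_id: str
    status: TaskStatus
    concerns: list[str] = field(default_factory=list)


def summarize_wave(results: list[TaskResult]) -> tuple[bool, list[str]]:
    """Return (wave_may_continue, concerns flagged for process refinement)."""
    may_continue = all(r.status is not TaskStatus.HALTED for r in results)
    flagged = [
        f"{r.task_id}: {concern}"
        for r in results
        if r.status is TaskStatus.DONE_WITH_CONCERNS
        for concern in r.concerns
    ]
    return may_continue, flagged
```

The design choice is that a concern never blocks the wave by itself; it travels alongside a completed task so the out-of-scope issue is neither silently fixed nor silently ignored.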
Jaime Teevan, Microsoft's Chief Scientist, has a research program built on what she calls positive friction. The idea is that good AI design slows people down at the right moments to make them surface assumptions and keep thinking, rather than handing over an answer that removes the thinking entirely. In a June 2025 conversation with Scott Galloway, she put the cost of the alternative plainly.
"AI increases the metacognitive demand."
Jaime Teevan, Chief Scientist, Microsoft
The gates in our factory work on that principle. The friction isn't overhead. It's the product. Every gate earns its cost by catching one class of past failure.
When a model changes underneath a workflow like this, the workflow doesn't depend on the model being unusually smart. It depends on the model following structured prompts and respecting gates. Both properties are durable across Claude 4.6 and 4.7. A future 4.8 or 5.0 has no obvious reason to break them either.
What's Next
Two follow-ups are in flight right now.
One is instrumentation. Until this week, the factory tracked wall-clock time per feature but didn't capture token counts. agents.json has a field called estimatedTokens, but its values were manual guesses, not measurements. A new audit log event type (token_usage) and a per-feature aggregator shipped on April 17. Two or three feature cycles from now we'll have enough measured data to re-baseline the estimates properly.
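A per-feature aggregator over such events might look roughly like the sketch below. It assumes the audit log is JSON lines and that each token_usage event carries `feature`, `input_tokens`, and `output_tokens` fields; those field names and the log format are assumptions, and only the `token_usage` event type comes from the text.

```python
import json
from collections import defaultdict


def aggregate_token_usage(log_lines):
    """Sum measured tokens per feature from JSON-lines audit events.

    Events whose type is not 'token_usage' are ignored, as are blank lines.
    """
    totals = defaultdict(lambda: {"input_tokens": 0, "output_tokens": 0})
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("type") != "token_usage":
            continue
        bucket = totals[event["feature"]]
        bucket["input_tokens"] += event["input_tokens"]
        bucket["output_tokens"] += event["output_tokens"]
    return dict(totals)
```

Measured totals of this kind are what would eventually replace the hand-entered estimatedTokens values.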
The other is upstream. Claude Code's Glob and Grep tools time out after 20-30 seconds but don't consistently kill the spawned find subprocess, so orphan processes accumulate across long sessions. This week we shipped two defensive layers (SessionStart and Stop hooks), but the real fix belongs in Claude Code's tool runtime.
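The selection step of such a defensive hook can be sketched as a pure function over a process snapshot. This is not the factory's hook: the `ProcInfo` shape, the 30-second age cutoff, and the assumption that orphans appear reparented to PID 1 (typical on Unix, but not guaranteed under containers or subreapers) are all illustrative. Collecting the snapshot and actually terminating the processes are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcInfo:
    pid: int
    ppid: int
    command: str        # full command line
    age_seconds: float  # time since the process started


def find_orphans(procs, max_age_seconds=30.0, names=("find",)):
    """Return PIDs of long-running search subprocesses reparented to PID 1."""
    orphans = []
    for proc in procs:
        parts = proc.command.split()
        if not parts:
            continue
        executable = parts[0].rsplit("/", 1)[-1]  # strip any directory prefix
        if (executable in names
                and proc.ppid == 1
                and proc.age_seconds > max_age_seconds):
            orphans.append(proc.pid)
    return orphans
```

Keeping the selection logic separate from the kill step makes it possible to log what would be cleaned up before a hook is trusted to act on it.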
Six months in, the factory has moved from template to production system. It shipped a Complex feature on a new model with zero factory changes. The accumulated failures live in .claude/lessons.md, and retrospectives are part of the workflow now rather than a nice-to-have.
The next six months will tell us whether it scales to more teams and more model generations. Harder features are the real stress test. The bones are good. The discipline is the question.
Sources
- VKTR. "AI Reality Check: Microsoft's Chief Scientist on the Future of Work" (2025), reporting on Jaime Teevan's June 12, 2025 conversation with Scott Galloway.
- First-party records: the factory's git history, version tags, .claude/lessons.md, and the per-feature retrospectives cited inline. Commit hashes, PR numbers, and test counts trace to those records.