
Using Adversarial AI Debate to Plan Software


The standard way to use an LLM for planning is to ask it to plan. You describe the feature, it produces a document, you read the document, you fix what’s wrong. The document is always confident. The confidence is the problem.

A single model doesn’t argue with itself. It picks a direction early — the first plausible architecture that fits the prompt — and then builds downward from there, compounding its initial assumptions into a structure that looks coherent because it never encountered friction. Every section supports every other section. Nothing was stress-tested. The plan reads well, which is not the same as being correct.

Two plans, then a fight #

The process is simple enough. Two models plan the same feature independently, in parallel. They don’t see each other’s work. When both plans exist, each model reads the other’s and writes a critique. Then rebuttals. Then a convergence round where they identify agreements, flag where they were talking past each other, and propose a unified approach. Then a final consensus document.

Four rounds. About forty minutes if you’re running both models concurrently. The output is ten documents: two initial plans, two critiques, two rebuttals, two convergence proposals, a consensus spec, and a closing statement.

The interesting constraint is that arguments have to be backed by evidence. Not “I think this approach is better” but “this approach assumes method X exists on the Page class, and it doesn’t — here’s the actual API surface.” You tell both models this rule upfront: arguments backed by facts win. If you’re wrong, concede and fix it.
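
In rough Python, the whole thing is a loop with two participants. This is a sketch, not the skill itself: `ask(model, prompt)` stands in for however you actually invoke each model, and the prompts and file names are paraphrased placeholders.

from pathlib import Path

DOCS = Path("docs/debate")
EVIDENCE_RULE = (
    "Arguments backed by verifiable facts win. "
    "If the other plan proves you wrong on a point, concede and fix it."
)

def debate(feature_brief, ask):
    """Run the rounds. `ask(model, prompt)` returns that model's markdown."""
    DOCS.mkdir(parents=True, exist_ok=True)

    # Independent plans: neither model sees the other's work.
    outputs = {m: ask(m, f"{EVIDENCE_RULE}\n\nPlan this feature:\n{feature_brief}")
               for m in ("claude", "gemini")}
    for m, text in outputs.items():
        (DOCS / f"plan-{m}.md").write_text(text)

    # Critique -> rebuttal -> convergence, each round fed the previous round's outputs.
    for round_name in ("critique", "rebuttal", "convergence"):
        outputs = {
            mine: ask(mine, f"{EVIDENCE_RULE}\n\nRound: {round_name}\n\n"
                            f"Your previous output:\n{outputs[mine]}\n\n"
                            f"The other planner's output:\n{outputs[theirs]}")
            for mine, theirs in (("claude", "gemini"), ("gemini", "claude"))
        }
        for m, text in outputs.items():
            (DOCS / f"{round_name}-{m}.md").write_text(text)

    # Consensus: merge the two convergence proposals into one spec.
    consensus = ask("claude", "Merge these convergence proposals into a single "
                              "consensus spec:\n\n" + "\n\n---\n\n".join(outputs.values()))
    (DOCS / "consensus.md").write_text(consensus)
    return consensus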

Why it works (probably) #

The mechanism isn’t mysterious. A single model optimizes for internal coherence. Two models in opposition optimize for external validity. The critique rounds force each plan to encounter the assumptions it was built on, because the other model doesn’t share those assumptions and has no reason to be gentle about it.

There’s also something about the concession rule. Telling a model “if you’re wrong, say so” produces surprisingly honest reassessment. During the rebuttal phase you see sentences like “Gemini is correct that my timeline for this component was optimistic — extending from three weeks to four.” A model arguing with itself would never write that sentence. It doesn’t have a position to retreat from.

The convergence round is where the strange things happen. Both models have been fighting for two rounds. Now they have to agree. And the agreement is almost never “one plan was right.” It’s a third thing — pieces of both plans recombined in a way neither model would have produced alone because neither had the other’s critique as input to its creative process.

Planning is not a creative act. It’s a critical one. The thing you need from a plan is not that it’s brilliant but that it’s not broken, and the fastest way to find out if something is broken is to show it to someone who has a different idea about how it should work.

Then you listen to it over breakfast #

Ten markdown documents from a planning debate is a lot of reading. You’re not going to sit down and carefully study all ten. But you do eat breakfast.

The output of the debate feeds into a text-to-speech pipeline. A script agent reads all ten documents — plans, critiques, rebuttals, consensus — and composes an audio-friendly narration. Speaker labels for each debater, transitions between rounds, technical content rephrased so it makes sense without seeing the code blocks. Different voices for the narrator and each debater. The whole thing renders to an MP3.
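
The script itself is just an ordered list of speaker-labeled segments, mapped to voices at render time. A sketch of the shape, with placeholder voice names and a placeholder docs/ layout:

from dataclasses import dataclass
from pathlib import Path

@dataclass
class Segment:
    speaker: str  # "narrator", "claude", or "gemini"
    text: str     # rephrased for listening: no code blocks, no tables

# Which TTS voice reads which speaker. These names are placeholders.
VOICES = {"narrator": "af_heart", "claude": "am_adam", "gemini": "bf_emma"}

def debate_documents(docs_dir="docs/debate"):
    """Collect every markdown artifact the debate left behind."""
    return {p.stem: p.read_text() for p in sorted(Path(docs_dir).glob("*.md"))}

def as_script(segments):
    """Flatten the narration into the labeled text handed to the TTS stage."""
    return "\n\n".join(f"[{s.speaker}] {s.text}" for s in segments)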

Now the debate is a podcast episode. Thirty minutes of two AI models arguing about your architecture while you make coffee. You catch the parts that matter to you — the concessions, the convergence points, the risk analysis — and skip the parts that don’t, the same way you’d half-listen to any conversation and snap to attention when something interesting happens.

The full pipeline, for the visually inclined:

debate → audio → breakfast

This is not about efficiency, exactly. It’s about surface area. A ten-document planning artifact sitting in a docs/ folder has a good chance of never being read carefully. The same artifact playing through your kitchen speaker while you scramble eggs gets absorbed whether you mean to absorb it or not.

The skill system that makes this work #

Both stages — the debate and the audio conversion — are Claude Code skills. Skills are on-demand instruction sets that load into the model’s context only when triggered. The rest of the time they cost nothing.

~/.claude/skills/
├── debate-planner/
│   └── SKILL.md        # adversarial debate workflow
└── md-to-audio/
    └── SKILL.md        # TTS pipeline instructions

A skill is a directory with a SKILL.md file. The frontmatter declares when it activates:

---
name: debate-planner
description: Plan features using adversarial debate between AI agents.
  Use when the user asks to plan or architect a feature using debate,
  competing planners, or adversarial planning.
---

Say “plan this feature using debate” and the skill loads. Say “fix the login bug” and it doesn’t. Below the frontmatter goes the actual payload — workflow steps, prompts, file paths, code patterns, whatever the model needs to execute the workflow autonomously.

The skills don’t know about each other. The debate planner produces markdown files in docs/. The audio converter reads markdown files. The interface is the filesystem. You run one, then the other.

There’s a skill for writing skills. It goes exactly as many levels deep as you’d expect. But the easier path is the Claude Code skill documentation — the structure is minimal enough that the docs are all you need. A directory, a YAML frontmatter, and instructions written the way you’d brief someone who’s competent but has never seen your project.

The debate skill took longer to calibrate than to write. The round structure matters: both models need the previous round’s outputs as context, and the convergence round is where miscommunications surface, because the models make simultaneous but contradictory moves during the parallel rounds. The audio skill was faster. TTS pipelines are more deterministic than arguments.

Running it yourself #

Two models capable of long-form technical reasoning. We use Claude and Gemini: Claude Code acts as the referee, calling a Claude subagent and the Gemini CLI as a background task. Each debate round runs both in parallel. Feed Round N-1’s outputs as context to Round N.

Phase 1: Explore codebase
Phase 2: Two independent plans (parallel)
Phase 3: Four debate rounds (critique → rebuttal → convergence → consensus)
Phase 4: Audio script from debate artifacts
Phase 5: TTS render → MP3

For TTS, Kokoro runs locally on my ThinkBook at about 1.7x realtime on CPU. Piper is 10x faster, but you don’t want to hear it. No API calls, no cloud dependencies. The whole pipeline runs on your machine.
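
A minimal render loop, assuming the kokoro Python package’s KPipeline interface and soundfile for output; the lang_code and 24 kHz sample rate follow Kokoro’s published examples, and the MP3 step is left to ffmpeg:

import numpy as np
import soundfile as sf
from kokoro import KPipeline  # assumes the `kokoro` pip package

def render(segments, voices, out_wav="debate.wav"):
    """segments: list of (speaker, text); voices: speaker -> Kokoro voice name."""
    pipeline = KPipeline(lang_code="a")  # "a" = American English
    chunks = []
    for speaker, text in segments:
        # The pipeline yields audio chunk by chunk; collect them all.
        for _, _, audio in pipeline(text, voice=voices[speaker]):
            chunks.append(np.asarray(audio))
    sf.write(out_wav, np.concatenate(chunks), 24000)  # Kokoro outputs 24 kHz audio
    # Then, for example: ffmpeg -i debate.wav -b:a 96k debate.mp3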

Here’s what one of these sounds like — a debate about an MVP we planned recently, rendered to audio. The TTS module is still a work in progress, so a lot of valuable content is missing from the debate. I think in this one the TTS script writer actually hit a context limit and autocompacted its mind:

You were going to eat breakfast anyway.

Author
Heorot