Empirical Method

How testing revealed the design. Every gate passed left a lesson. Statistical results without context mislead. The design emerged from pressure, not speculation.

Empirical pressure before architectural complexity

Don't add sophistication until you've proven the simple version breaks. Every phase in this build was tested with real articles before the next phase began. Gates were not checklists — they were failure modes that had to be survived.

Gates Passed

Each gate represents a point where testing revealed something unexpected about how the agent actually behaves.

Phase 7
30 Articles Before the Critic
Before the Critic node was built, 30 articles were run through the basic analysis pipeline. The goal was to see what the model actually produces — not what we assumed it would produce.
30/30 processed | 0 pipeline failures
Lesson: The model's analysis quality varied dramatically by article type. Opinion pieces produced rich analysis. News summaries produced thin analysis. This variation was invisible until we ran enough articles. One or two test articles would have been misleading.
Phase 10
Classification Precision Gate
The classification node (domain, content_type, author_stance) had to correctly label articles before the recall phase could work. Wrong classification → wrong past briefs recalled → wrong analysis.
domain accuracy: high | author_stance: most variable
Lesson: author_stance was the hardest field for the model. "Critic" vs "analyst" vs "practitioner" — the boundaries were fuzzy. The fix was clearer documentation in the classification prompt with examples of each stance.
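One way to see which classification field is weakest is to score each field separately against hand-labeled articles. The sketch below is a minimal illustration, not the project's actual gate: the function name `per_field_accuracy`, the sample labels, and the values used for domain and content_type are all hypothetical. Only the three field names and the three stance labels come from the text above.

```python
from typing import Dict, List

FIELDS = ("domain", "content_type", "author_stance")

def per_field_accuracy(predicted: List[Dict[str, str]],
                       expected: List[Dict[str, str]]) -> Dict[str, float]:
    """Return the fraction of articles labeled correctly, per field."""
    if not expected or len(predicted) != len(expected):
        raise ValueError("need two non-empty label lists of equal length")
    return {
        field: sum(p[field] == e[field] for p, e in zip(predicted, expected)) / len(expected)
        for field in FIELDS
    }

# Illustrative hand labels (hypothetical, not the project's test data).
expected = [
    {"domain": "ai", "content_type": "opinion", "author_stance": "critic"},
    {"domain": "ai", "content_type": "news", "author_stance": "analyst"},
    {"domain": "ai", "content_type": "opinion", "author_stance": "practitioner"},
    {"domain": "ai", "content_type": "news", "author_stance": "analyst"},
]
# Illustrative model output: both fuzzy-boundary stances collapse to "analyst".
predicted = [
    {"domain": "ai", "content_type": "opinion", "author_stance": "analyst"},
    {"domain": "ai", "content_type": "news", "author_stance": "analyst"},
    {"domain": "ai", "content_type": "opinion", "author_stance": "analyst"},
    {"domain": "ai", "content_type": "news", "author_stance": "analyst"},
]

accuracy = per_field_accuracy(predicted, expected)
```

A single overall accuracy number would hide the pattern; the per-field breakdown is what points at author_stance.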
Phase 15
Isolation Testing: Rare Code Paths
"Correct evaluator behavior looks like not working." The Critic node was designed to catch specific error patterns. Testing it required injecting errors intentionally — because a working Critic on clean data produces zero output.
5/5 KNOWN_MISSES caught | 0 false positives
Lesson: Testing correctness requires breaking things on purpose. If you only test with clean data, you prove nothing. The Critic's silence on good data was the success signal — but only after we'd proven it could catch bad data.
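The two-sided check described here (silence on clean data, detection on injected errors) can be sketched as a small harness. Everything in it is an assumption made for illustration: `toy_critic` is a hypothetical stand-in for the real Critic node, and the briefs and error codes are invented examples, not the project's KNOWN_MISSES set.

```python
from typing import Callable, Dict, List, Tuple

Brief = Dict[str, object]

def toy_critic(brief: Brief) -> List[str]:
    """Hypothetical stand-in for the Critic node: returns flagged issue codes."""
    issues: List[str] = []
    if not brief.get("citations"):
        issues.append("missing_citations")
    if not str(brief.get("summary", "")).strip():
        issues.append("empty_summary")
    return issues

def run_isolation_gate(critic: Callable[[Brief], List[str]],
                       clean_briefs: List[Brief],
                       injected: List[Tuple[Brief, str]]) -> Dict[str, int]:
    """Count injected errors the critic catches and clean briefs it wrongly flags."""
    false_positives = sum(1 for brief in clean_briefs if critic(brief))
    caught = sum(1 for brief, expected in injected if expected in critic(brief))
    return {"caught": caught, "injected": len(injected), "false_positives": false_positives}

# Clean data: a correct critic should stay silent on these.
clean_briefs = [
    {"summary": "Argues for smaller models.", "citations": ["src-1"]},
    {"summary": "Reports benchmark results.", "citations": ["src-2", "src-3"]},
]
# Deliberately broken briefs, each paired with the issue it should trigger.
injected = [
    ({"summary": "Claims without support.", "citations": []}, "missing_citations"),
    ({"summary": "   ", "citations": ["src-4"]}, "empty_summary"),
]

result = run_isolation_gate(toy_critic, clean_briefs, injected)
```

The gate passes only when `caught` equals `injected` and `false_positives` is zero. Either number alone proves little: a critic that never fires also has zero false positives.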
Tool Decision Split Gate
The choose_next_action node needed to show it could pick both tools appropriately. A 100% stop_here rate would mean the search tool was useless. A 100% search_web rate would mean the agent couldn't recognize when an article was already its own source and needed no further search.
60% stop_here | 40% search_web
Lesson: The split emerged naturally from better docstrings. Earlier versions with vague docstrings showed 95%+ search_web — the model defaulted to "do something." Clear documentation of when to stop was the fix, not a routing rule.
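A split gate like this can be checked by counting the decisions the node actually made. The sketch below assumes the decisions are available as a plain list of tool names; the 10% floor is an arbitrary threshold chosen for illustration, since the text only says that a 100% rate for either tool would be a failure.

```python
from collections import Counter
from typing import Dict, List

TOOLS = ("stop_here", "search_web")

def tool_split(decisions: List[str]) -> Dict[str, float]:
    """Return the share of decisions that went to each tool."""
    if not decisions:
        raise ValueError("no decisions recorded")
    counts = Counter(decisions)
    return {tool: counts[tool] / len(decisions) for tool in TOOLS}

def passes_split_gate(split: Dict[str, float], floor: float = 0.1) -> bool:
    """Pass only if every tool is chosen at least `floor` of the time (assumed threshold)."""
    return all(rate >= floor for rate in split.values())

# Illustrative decision logs (hypothetical data shaped like the reported splits).
balanced = ["stop_here"] * 6 + ["search_web"] * 4
search_heavy = ["search_web"] * 19 + ["stop_here"] * 1

balanced_split = tool_split(balanced)
search_heavy_split = tool_split(search_heavy)
```

With this data the balanced log passes the gate and the search-heavy log, which mimics the 95% search_web behavior of the vague-docstring versions, fails it.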
Citation Verification Gate
Every claim in every brief had to trace to a source. The post-processing filter verified that each citation was valid — no fabricated sources, no orphaned claims.
43/43 FK-verified | 0 fabricated sources
Lesson: The model rarely fabricated sources when the schema demanded citations. The constraint created discipline. When the schema said "citations required," the model provided them. When earlier versions didn't require citations, the model didn't provide them.
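A post-processing filter of this kind can be sketched as a lookup of every cited ID against the set of sources the pipeline actually stored. The brief layout, the field names, and `verify_citations` itself are assumptions for illustration. The set of known IDs stands in for whatever foreign-key check the real pipeline performs.

```python
from typing import Dict, List, Set

def verify_citations(brief: Dict[str, list], known_source_ids: Set[str]) -> Dict[str, object]:
    """Flag claims with no citation and citations that match no stored source."""
    fabricated: List[str] = []
    orphaned: List[str] = []
    for claim in brief["claims"]:
        cited = claim.get("citations", [])
        if not cited:
            orphaned.append(claim["text"])  # claim traces to nothing
        fabricated.extend(c for c in cited if c not in known_source_ids)
    return {
        "fabricated": fabricated,
        "orphaned": orphaned,
        "ok": not fabricated and not orphaned,
    }

# Hypothetical stored sources and briefs.
known_source_ids = {"src-1", "src-2"}
good_brief = {"claims": [
    {"text": "Costs fell last year.", "citations": ["src-1"]},
    {"text": "Adoption doubled.", "citations": ["src-1", "src-2"]},
]}
bad_brief = {"claims": [
    {"text": "Experts agree.", "citations": []},
    {"text": "A study confirms it.", "citations": ["src-9"]},
]}

good_report = verify_citations(good_brief, known_source_ids)
bad_report = verify_citations(bad_brief, known_source_ids)
```

The filter only checks that a citation points at a real stored source. Whether the source actually supports the claim is a separate question that this sketch does not address.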
Phase 18
"169 runs" Was Gate Exhaust, Not Knowledge
A seemingly impressive statistic: 169 successful runs. But the number was gate exhaust, a byproduct of running the same small set of test articles through one gate after another, not accumulated knowledge. On its own, the number was meaningless.
169 total runs | ~12 unique articles
Lesson: Audit statistics before trusting them. "169 runs" sounded like broad validation. It was actually ~12 articles run through multiple phases. Run count is a vanity metric. Unique article count + failure rate is what matters.

Numbers Lie Without Context

The most important lesson from the empirical method: every statistic the pipeline produces needs context to be meaningful.

"169 runs" ≠ 169 articles tested

The pipeline ran 169 times. But each article passed through multiple phases — classification, recall, analysis, decision, grounding. One article could produce 5+ runs. The real coverage was ~12 unique articles. The headline number was gate exhaust, not knowledge breadth. Always ask: what does this number actually count?
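The audit amounts to counting distinct articles next to total runs. The sketch below assumes a run log where each entry records an article ID, a phase, and a success flag; that log shape and the name `audit_runs` are hypothetical, and the sample data is invented to show the effect at small scale.

```python
from typing import Dict, List

PHASES = ("classification", "recall", "analysis", "decision", "grounding")

def audit_runs(run_log: List[Dict[str, object]]) -> Dict[str, float]:
    """Report what a run count actually covers: distinct articles and failures."""
    if not run_log:
        raise ValueError("empty run log")
    total = len(run_log)
    unique_articles = len({run["article_id"] for run in run_log})
    failures = sum(1 for run in run_log if not run["ok"])
    return {
        "total_runs": total,
        "unique_articles": unique_articles,
        "runs_per_article": total / unique_articles,
        "failure_rate": failures / total,
    }

# Illustrative log: 3 articles, each passing through all 5 phases.
run_log = [
    {"article_id": f"article-{i}", "phase": phase, "ok": True}
    for i in range(3)
    for phase in PHASES
]

audit = audit_runs(run_log)
```

Here 15 runs turn out to cover 3 articles. Applied to the figures in the text, 169 runs over roughly 12 articles works out to about 14 runs per article.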

The method shapes the agent

You can't design an agent in your head and then test it. The testing IS the design process. Every gate revealed something that the spec didn't predict. The 60/40 tool split wasn't designed — it emerged. The classification precision gate showed that author_stance was the weak point. The 169 runs lesson showed that statistics need auditing. These are not bugs found — they are design decisions discovered.