Empirical Method

How testing revealed the design. Every gate the agent passed left a lesson, and statistical results without context mislead. The design emerged from pressure, not speculation.

Gates Passed

Each gate represents a point where testing revealed something unexpected about how the agent actually behaves.

✓

Phase 7

30 Articles Before the Critic

Before the Critic node was built, 30 articles were run through the basic analysis pipeline. The goal was to see what the model actually produces — not what we assumed it would produce.

30/30 processed, 0 pipeline failures

Lesson: The model's analysis quality varied dramatically by article type. Opinion pieces produced rich analysis. News summaries produced thin analysis. This variation was invisible until we ran enough articles. One or two test articles would have been misleading.
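One way to make this kind of variation visible is to group a per-article score by content type instead of reporting a single average. The sketch below is illustrative only: the scoring scale, the type labels, and the function name summarize_by_type are assumptions, not part of the original pipeline.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_type(results):
    """Group (content_type, score) pairs and summarize each group.

    `score` is any hypothetical per-article quality proxy, for example
    the number of distinct points a reviewer found in the analysis.
    """
    buckets = defaultdict(list)
    for content_type, score in results:
        buckets[content_type].append(score)
    return {
        content_type: {
            "n": len(scores),
            "mean": mean(scores),
            "min": min(scores),
            "max": max(scores),
        }
        for content_type, scores in buckets.items()
    }


# Made-up scores, only to show the shape of the output.
sample = [("opinion", 8), ("opinion", 6), ("news_summary", 2)]
summary = summarize_by_type(sample)
```

With only one or two articles, most buckets would be empty or hold a single score, which is why a small sample hides the spread.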

✓

Phase 10

Classification Precision Gate

The classification node (domain, content_type, author_stance) had to correctly label articles before the recall phase could work. Wrong classification → wrong past briefs recalled → wrong analysis.

domain accuracy: high; author_stance: most variable

Lesson: author_stance was the hardest field for the model. "Critic" vs "analyst" vs "practitioner" — the boundaries were fuzzy. The fix was clearer documentation in the classification prompt with examples of each stance.
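The idea of pinning down each stance with a definition and an example can be sketched as a small label schema. This is a minimal sketch under assumptions: the three stance names come from the text above, but the definitions, the example sentences, and the helper names (build_stance_guidance, parse_stance) are hypothetical.

```python
from enum import Enum


class AuthorStance(str, Enum):
    """Stance labels named in the text; the definitions are illustrative."""

    CRITIC = "critic"              # argues against a claim, product, or practice
    ANALYST = "analyst"            # weighs evidence from the outside
    PRACTITIONER = "practitioner"  # reports first-hand experience


# Hypothetical example sentences of the kind a prompt could include.
STANCE_EXAMPLES = {
    AuthorStance.CRITIC: "These benchmark numbers are misleading, and here is why.",
    AuthorStance.ANALYST: "Comparing the three approaches, the trade-offs are as follows.",
    AuthorStance.PRACTITIONER: "We ran this in production for six months.",
}


def build_stance_guidance() -> str:
    """Render one line per stance for inclusion in a classification prompt."""
    return "\n".join(
        f'- {stance.value}: e.g. "{STANCE_EXAMPLES[stance]}"'
        for stance in AuthorStance
    )


def parse_stance(raw: str) -> AuthorStance:
    """Normalize a model-produced label; raises ValueError if it is unknown."""
    return AuthorStance(raw.strip().lower())
```

Rejecting labels outside the enum keeps a fuzzy answer from silently flowing into the recall phase.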

✓

Phase 15

Isolation Testing: Rare Code Paths

"Correct evaluator behavior looks like not working." The Critic node was designed to catch specific error patterns. Testing it required injecting errors intentionally — because a working Critic on clean data produces zero output.

5/5 KNOWN_MISSES caught, 0 false positives

Lesson: Testing correctness requires breaking things on purpose. If you only test with clean data, you prove nothing. The Critic's silence on good data was the success signal — but only after we'd proven it could catch bad data.
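The pattern described here (inject a known error, confirm it is caught, then confirm silence on clean input) can be sketched as follows. Everything in this block is a stand-in: toy_critic is not the real Critic node, and the brief layout, the two injected errors, and run_isolation_suite are assumptions made for illustration.

```python
import copy


def toy_critic(brief):
    """Stand-in evaluator: returns a list of findings, empty when clean."""
    findings = []
    for i, claim in enumerate(brief["claims"]):
        if not claim.get("source_id"):
            findings.append(f"claim {i}: missing source")
        if not 0.0 <= claim.get("confidence", 0.0) <= 1.0:
            findings.append(f"claim {i}: confidence out of range")
    return findings


# Hypothetical clean input; a correct critic should say nothing about it.
CLEAN_BRIEF = {
    "claims": [
        {"text": "Latency dropped after the cache change.",
         "source_id": "s1", "confidence": 0.8},
    ]
}


def drop_source(brief):
    """Inject an error: remove the source from the first claim."""
    broken = copy.deepcopy(brief)
    broken["claims"][0]["source_id"] = None
    return broken


def inflate_confidence(brief):
    """Inject an error: push confidence outside the valid range."""
    broken = copy.deepcopy(brief)
    broken["claims"][0]["confidence"] = 1.7
    return broken


INJECTED_ERRORS = [drop_source, inflate_confidence]


def run_isolation_suite(critic, clean, mutators):
    """Return (errors caught, errors injected, findings on clean input)."""
    false_positives = len(critic(clean))
    caught = sum(1 for mutate in mutators if critic(mutate(clean)))
    return caught, len(mutators), false_positives


result = run_isolation_suite(toy_critic, CLEAN_BRIEF, INJECTED_ERRORS)
silent = run_isolation_suite(lambda brief: [], CLEAN_BRIEF, INJECTED_ERRORS)
```

A critic that never reports anything also scores zero false positives, which is why the caught count has to be checked first: in this sketch the always-silent critic catches 0 of 2 injected errors.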

✓

Tool Decision Split Gate

The choose_next_action node needed to show it could pick both tools appropriately. A 100% stop_here rate would mean the search tool was useless. A 100% search_web rate would mean the agent couldn't recognize when an article was its own source.

60% stop_here, 40% search_web

Lesson: The split emerged naturally from better docstrings. Earlier versions with vague docstrings showed 95%+ search_web — the model defaulted to "do something." Clear documentation of when to stop was the fix, not a routing rule.
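A rough sketch of what "documenting when to stop" and measuring the resulting split could look like. The tool names stop_here and search_web come from the text; the docstring wording, the return shapes, the 0.9 threshold, and the helpers decision_split and split_gate_passes are assumptions for illustration.

```python
from collections import Counter

TOOL_NAMES = ("stop_here", "search_web")


def stop_here(reason: str) -> dict:
    """Finish without searching.

    Use when the article is itself the primary source for its claims,
    so outside material would add nothing to the brief.
    """
    return {"action": "stop_here", "reason": reason}


def search_web(query: str) -> dict:
    """Look for outside material.

    Use when the article makes claims it does not support itself, or
    refers to events or data that it does not include.
    """
    return {"action": "search_web", "query": query}


def decision_split(decisions):
    """Return the share of each tool in a list of chosen tool names."""
    counts = Counter(decisions)
    total = len(decisions)
    return {name: counts[name] / total for name in TOOL_NAMES}


def split_gate_passes(decisions, max_share=0.9):
    """Pass only if both tools are used and neither dominates.

    max_share is an assumed threshold, not a value from the source.
    """
    if not decisions:
        return False
    split = decision_split(decisions)
    return all(0.0 < share <= max_share for share in split.values())


balanced = ["stop_here"] * 6 + ["search_web"] * 4
lopsided = ["stop_here"] * 1 + ["search_web"] * 19
```

The gate only measures the split. In the account above, the behavior changed because the docstrings changed, not because a rule forced a ratio.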

✓

Citation Verification Gate

Every claim in every brief had to trace to a source. The post-processing filter verified that each citation was valid — no fabricated sources, no orphaned claims.

43/43 FK-verified, 0 fabricated sources

Lesson: The model rarely fabricated sources when the schema demanded citations. The constraint created discipline. When the schema said "citations required," the model provided them. When earlier versions didn't require citations, the model didn't provide them.
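The text does not say how the verification was implemented. Reading "FK" as foreign key, one way to get the same guarantee is to let the database reject any claim whose source does not exist. The sketch below uses an in-memory SQLite database; the table layout and the helper names open_store and insert_claim are assumptions.

```python
import sqlite3


def open_store():
    """Create an in-memory store where every claim must reference a source."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute("CREATE TABLE sources (id TEXT PRIMARY KEY, url TEXT NOT NULL)")
    conn.execute(
        """CREATE TABLE claims (
               id INTEGER PRIMARY KEY,
               text TEXT NOT NULL,
               source_id TEXT NOT NULL REFERENCES sources(id)
           )"""
    )
    return conn


def insert_claim(conn, text, source_id):
    """Insert a claim; return False if its citation is missing or unknown."""
    try:
        conn.execute(
            "INSERT INTO claims (text, source_id) VALUES (?, ?)",
            (text, source_id),
        )
        return True
    except sqlite3.IntegrityError:
        return False


conn = open_store()
conn.execute("INSERT INTO sources (id, url) VALUES (?, ?)",
             ("s1", "https://example.com/a"))

ok = insert_claim(conn, "Claim backed by a known source.", "s1")
fabricated = insert_claim(conn, "Claim citing a source nobody stored.", "s99")
orphaned = insert_claim(conn, "Claim with no citation at all.", None)
stored = conn.execute("SELECT COUNT(*) FROM claims").fetchone()[0]
```

Here the fabricated citation fails the foreign-key check and the orphaned claim fails the NOT NULL constraint, so only the verified claim is stored.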

Phase 18

"169 runs" Was Gate Exhaustion, Not Knowledge

The statistic looked impressive: 169 successful runs. But the number reflected gate exhaustion, not knowledge accumulation: the pipeline simply ran until it had exhausted the test articles. On its own, the number was meaningless.

169 total runs, ~12 unique articles

Lesson: Audit statistics before trusting them. "169 runs" sounded like broad validation. It was actually ~12 articles run through multiple phases. Run count is a vanity metric. Unique article count and failure rate are what matter.
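The audit described here amounts to reporting unique articles and failure rate next to the raw run count. A minimal sketch, assuming a hypothetical Run record with article_id, phase, and succeeded fields; the sample data is synthetic and is not the project's run log.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    article_id: str
    phase: str
    succeeded: bool


def audit_runs(runs):
    """Summarize a run log so the raw count cannot stand alone."""
    total = len(runs)
    unique = len({run.article_id for run in runs})
    failures = sum(1 for run in runs if not run.succeeded)
    return {
        "total_runs": total,
        "unique_articles": unique,
        "runs_per_article": total / unique if unique else 0.0,
        "failure_rate": failures / total if total else 0.0,
    }


# Synthetic log: 3 articles, each pushed through 4 phases, all succeeding.
log = [
    Run(article_id=f"article-{a}", phase=f"phase-{p}", succeeded=True)
    for a in range(3)
    for p in range(4)
]
report = audit_runs(log)
```

In this synthetic log, 12 runs cover only 3 articles. A high runs_per_article value is the signal that a large run count mostly reflects repetition.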

Numbers Lie Without Context