The generation step works. Give an LLM a 400-line skill file, access to a financial data API, and a company ticker, and it’ll produce a 13-section initiation report — sourced figures, internal consistency checks, a variant perception thesis, the lot. I’ve now run the pipeline on Veeva, Palantir, Amazon, Snowflake, Thermo Fisher, Danaher and a few others. Structurally sound, directionally useful. Not perfect, but the analytical work is doing what I asked.

What doesn’t work is what comes after.

Each stage of the pipeline reads its skill file, calls the Quartr API, writes a markdown file. Closed-book draft. Audit pass. Open-book valuation. Earnings sentiment. Four files. The natural next step was to spawn a fifth subagent to merge the four into one report and convert to docx. Clear instruction: combine these four files into a single 13-section report. Should be a five-minute job.

It wasn’t.

The merge agent has no memory of my formatting preferences. No access to the tested document converter. No awareness that the section structure is defined in a separate skill file the agent can’t see. So it improvised. It reordered sections based on its own judgment. It wrote an ad-hoc Python converter that didn’t handle markdown bold syntax or table parsing. It paraphrased content during assembly — which subtly altered source-tagged figures, which is the one thing that absolutely cannot happen in financial work. Thermo Fisher came back with 795 paragraphs of unrendered bold markers. Danaher’s sections were renumbered. Two reports, same pipeline, minutes apart, different structures.

The Step That Needs the Least Intelligence

The fix was almost embarrassingly simple, which in hindsight is what told me I had been using the wrong tool for the job. I replaced the merge subagent with a Python script. Three hundred lines of deterministic code that reads the canonical section structure from the skill file, pattern-matches headers in each stage output, slots content into the right positions, downgrades internal headings so they don’t leak as top-level sections, runs a post-processor for known markdown edge cases, calls the tested Node.js converter, and verifies the output. Broken bold count: zero. Table count: positive. Heading structure: matches the canonical set exactly. Same input, same output, every time.
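To make the shape of that merge concrete, here is a minimal sketch of the slotting and heading-downgrade steps only. It is not the actual three-hundred-line script: the section titles in CANONICAL, the function name merge_stages, and the choice to raise on a missing section are all assumptions for illustration, and the real script reads its section list from the skill file rather than hard-coding it.

```python
import re

# Hypothetical canonical order; the real pipeline reads this from the skill file.
CANONICAL = ["Investment Thesis", "Business Overview", "Valuation"]

# A level-2 markdown heading on a single line ("### ..." does not match).
HEADER_RE = re.compile(r"^## +(.+?)[ \t]*$")


def merge_stages(stage_outputs, canonical=CANONICAL):
    """Slot stage content into canonical order, copying lines verbatim."""
    wanted = {title.lower(): title for title in canonical}
    slots = {title: [] for title in canonical}

    for text in stage_outputs:
        current = None
        for line in text.splitlines():
            match = HEADER_RE.match(line)
            if match and match.group(1).lower() in wanted:
                # A recognised section header: switch slots, drop the header line.
                current = wanted[match.group(1).lower()]
                continue
            if current is None:
                continue  # text before the first recognised section is skipped
            if match:
                # Unrecognised "## " heading: downgrade so it cannot leak
                # into the report as a peer-level section.
                line = "#" + line
            slots[current].append(line)

    missing = [title for title in canonical if not slots[title]]
    if missing:
        raise ValueError(f"no content found for sections: {missing}")

    parts = []
    for title in canonical:
        body = "\n".join(slots[title]).strip("\n")
        parts.append(f"## {title}\n\n{body}\n")
    return "\n".join(parts)
```

The point of the design is that nothing in it rewrites content. Lines are moved and, at most, gain one leading # character, so a source-tagged figure comes out exactly as it went in.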

The generation stages stayed as LLM subagents because reading filings and synthesising analysis is genuinely LLM work. The assembly step became code because mechanical work needs to be identical every run. I had been asking an LLM to do the kind of work where its variability is a bug, not a feature.

The Verification Tax I Should Have Paid Earlier

Every failure mode I encountered — a cross-line regex that matched newlines and corrupted a financial table, an annotation tag that broke the bold parser, Stage 3 valuation headings appearing as peer-level sections in the merged output — was caught not by me reading the document, but by automated verification checks I built after the fact.

grep "^## " on the final markdown tells you instantly if the section structure is clean. A three-line Python snippet counting ** in the docx paragraphs tells you if the converter worked. Both trivial to write. I should have built them from the start.
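A rough sketch of those two checks in one place, written against plain strings so it runs without any document library. The function names are mine, not the pipeline's. I am assuming the docx paragraph texts have already been extracted into a list; with the python-docx package that would be something like [p.text for p in Document(path).paragraphs], but any extractor would do.

```python
import re


def top_level_headings(markdown):
    """Same idea as grep '^## ': every level-2 heading, in document order."""
    # [ \t]* rather than \s* so the pattern can never run across a newline.
    return re.findall(r"^## (.+?)[ \t]*$", markdown, flags=re.MULTILINE)


def count_unrendered_bold(paragraphs):
    """Number of paragraphs that still contain a literal '**' marker."""
    return sum(1 for text in paragraphs if "**" in text)


def verify(markdown, paragraphs, canonical):
    """Return a list of human-readable problems; an empty list means clean."""
    problems = []
    if top_level_headings(markdown) != list(canonical):
        problems.append("heading structure does not match the canonical set")
    broken = count_unrendered_bold(paragraphs)
    if broken:
        problems.append(f"unrendered bold markers in {broken} paragraph(s)")
    return problems
```

Returning a list of problems instead of raising on the first one means a single run reports every check that failed, which is usually what you want when several things broke at once.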

The hours I spent tracing a \s regex that matched across line boundaries, and tracking down a triple-bold pattern that one company’s filings produced and no other company’s did, were the cost of delivering output without verification and discovering problems only by reading the final product. Verification scripts catch in two seconds what a careful human read catches in twenty minutes. And the careful human read often misses things the script wouldn’t.
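For anyone who hasn't been bitten by it: in Python's re module, \s matches newlines as well as spaces and tabs. The snippet below is an illustration of that class of bug, not the actual regex from my post-processor, and the table values are made up. A cleanup that looks like it only tidies spacing before a pipe character quietly merges two table rows.

```python
import re

# Illustrative two-row markdown table; the numbers are invented.
TABLE = "| Revenue | 2,155 |\n| EBIT    | 512 |\n"


def tidy_pipes_buggy(text):
    # \s+ also matches "\n", so the newline before the second row's
    # leading pipe is swallowed and the rows collapse into one line.
    return re.sub(r"\s+\|", " |", text)


def tidy_pipes_fixed(text):
    # [ \t]+ matches horizontal whitespace only, so row boundaries survive.
    return re.sub(r"[ \t]+\|", " |", text)


if __name__ == "__main__":
    print(repr(tidy_pipes_buggy(TABLE)))
    print(repr(tidy_pipes_fixed(TABLE)))
```

The buggy version returns a single line and the fixed version keeps both rows. A row-count check before and after post-processing would flag this immediately.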

Letting the Pipeline Read Its Own Failure Log

The pipeline now has a single entry point. A unique trigger phrase. Verbatim subagent prompts that leave no room for interpretation. Deterministic merge. Automated verification. And a failure log that persists across sessions, so the system accumulates knowledge instead of repeating mistakes.

Of all of those, the failure log mattered most, in a way I didn’t see coming. After one particularly bad run I asked Claude a simple question: under what conditions will this break? The response was a structured taxonomy — failure modes sorted by probability and severity, honest assessments of what’s solved, what’s mitigated, and what is still an accepted risk that the verification step will catch. I took that taxonomy and fed it back into the orchestrator itself as a mandatory pre-read. Before the system runs a single stage now, it reads its own failure log. The LLM that might break the pipeline starts every run by studying exactly how it has broken before.
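Mechanically, the pre-read is not much more than a file that survives between sessions and gets prepended to the orchestrator's instructions. The sketch below shows one way that could look. The JSON layout, the field names and the function names are assumptions for illustration, not the format my pipeline actually uses. The status values mirror the three buckets from the taxonomy: solved, mitigated, accepted risk.

```python
import json
from pathlib import Path


def load_failures(path):
    """Read the persisted failure log; a missing file means no known failures."""
    log = Path(path)
    if not log.exists():
        return []
    return json.loads(log.read_text(encoding="utf-8"))


def record_failure(path, stage, symptom, status):
    """Append one diagnosed failure so later sessions start with it."""
    entries = load_failures(path)
    entries.append({"stage": stage, "symptom": symptom, "status": status})
    Path(path).write_text(json.dumps(entries, indent=2), encoding="utf-8")


def build_orchestrator_prompt(base_prompt, failures):
    """Put the failure log ahead of the normal instructions as a pre-read."""
    if not failures:
        return base_prompt
    lines = ["Known failure modes (read before running any stage):"]
    for entry in failures:
        lines.append(f"- [{entry['status']}] {entry['stage']}: {entry['symptom']}")
    return "\n".join(lines) + "\n\n" + base_prompt
```

Typical use would be to call record_failure once a break has been diagnosed, then build every subsequent orchestrator prompt with build_orchestrator_prompt(base, load_failures(path)).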

An LLM’s self-awareness can be an engineering input. It knows its own failure modes better than I would have guessed. Ask the right question and you get a structured answer you can operationalise. Then the feedback loop closes: the system breaks, you diagnose it together, the diagnosis becomes part of the system’s own instructions, and the next run is harder to break in the same way.

This doesn’t make the pipeline infallible. A subagent might still format a header in a way the regex doesn’t expect. A future filing might produce a markdown pattern nobody’s seen before. But when something breaks now, I’ll know which stage broke, what the expected output was, and where in the code to look.

The pipeline runs daily for new tickers I’m researching. The failure log keeps growing. The verification scripts keep catching things before I see them. That’s where it is for now.