LLMs are good at generating analysis. Give one a 400-line skill file, access to a financial data API, and a company ticker, and it’ll produce a 13-section initiation report with sourced figures, internal consistency checks, and a variant perception thesis. I’ve now run this pipeline on over a dozen companies — Veeva, Palantir, Amazon, Snowflake, Thermo Fisher, Danaher — and the generation step works. It isn’t perfect, but it’s structurally sound and directionally useful.

What doesn’t work is what comes after: taking four separately generated files and assembling them into one coherent report.

The pipeline has four stages: a closed-book draft, an audit pass, an open-book valuation, and an earnings sentiment analysis. Each stage reads a skill file, calls the Quartr API, and writes a markdown file. The natural implementation was to spawn each stage as a subagent with the right model assignment and then spawn a fifth subagent to merge the four outputs and convert to a Word document. The merge agent received a clear instruction: “Combine these four files into a single report following the 13-section structure.” Sounds trivial. It isn’t.
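The stage layout can be sketched roughly as follows — the stage names, skill-file paths, and model labels here are illustrative placeholders, not the real configuration:

```python
# Hypothetical sketch of the four-stage layout; names, paths, and model
# labels are placeholders, not the actual configuration.
STAGES = [
    {"name": "closed_book_draft",   "skill": "skills/draft.md",     "model": "large"},
    {"name": "audit_pass",          "skill": "skills/audit.md",     "model": "large"},
    {"name": "open_book_valuation", "skill": "skills/valuation.md", "model": "large"},
    {"name": "earnings_sentiment",  "skill": "skills/sentiment.md", "model": "small"},
]

def run_stage(stage, ticker):
    """Each stage reads its skill file, calls the data API, and writes
    one markdown file that the merge step later consumes."""
    out_path = f"output/{ticker}_{stage['name']}.md"
    # ... spawn the subagent with stage["skill"] and stage["model"],
    # and have it write its analysis to out_path ...
    return out_path
```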

The merge agent has no memory of my formatting preferences, no access to the tested document converter, and no awareness that the section structure is defined in a separate skill file. So it improvises. It reorders sections based on its own judgment. It writes an ad-hoc Python converter that doesn’t handle markdown bold syntax or table parsing. It paraphrases content during assembly, subtly altering source-tagged figures. Thermo Fisher’s report came back with 795 paragraphs of unrendered bold markers. Danaher’s sections were completely renumbered. The two reports, produced by the same pipeline minutes apart, had different structures.

That’s the assembly problem: the step that requires the least intelligence is the step most damaged by LLM involvement.

The fix was almost embarrassingly simple. I replaced the merge agent with a Python script. Three hundred lines of deterministic code that reads the canonical section structure from the skill file, pattern-matches headers in each stage output, slots content into the right positions, downgrades internal headings so they don’t leak as top-level sections, runs a post-processor for known markdown edge cases, calls the tested Node.js converter, and verifies the output. Broken bold count must be zero. Table count must be positive. Heading structure must match the canonical set exactly. Same input, same output, every time.
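The core of that merge logic can be sketched in a few lines, assuming ## headers mark sections in the stage outputs; the real script reads the canonical list from the skill file and does considerably more post-processing:

```python
import re

def merge(stage_texts, canonical_sections):
    """Slot each stage's content under its canonical section and
    downgrade stray top-level headings so they can't leak."""
    merged = {title: [] for title in canonical_sections}
    for text in stage_texts:
        current = None
        for line in text.splitlines():
            m = re.match(r"^## (.+)$", line)
            if m and m.group(1).strip() in merged:
                current = m.group(1).strip()  # a canonical section header
                continue
            if current is not None:
                # Any other ##-level heading becomes ### so it never
                # appears as a peer of the canonical sections.
                merged[current].append(re.sub(r"^## ", "### ", line))
    return "\n\n".join(
        f"## {title}\n" + "\n".join(merged[title]).strip()
        for title in canonical_sections
    )
```

Because the section order comes from canonical_sections rather than from anything the subagents wrote, two reports produced minutes apart can no longer disagree about structure.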

The generation stages stayed as LLM subagents because that’s genuinely LLM work — reading filings, synthesizing analysis, building financial models. The assembly step became code because it’s mechanical work that needs to be identical every run.

Know which parts of your system need intelligence and which parts need consistency, and don’t use one where you need the other.

There’s a second lesson buried in the debugging. Every failure I encountered — a cross-line regex that corrupted a financial table by matching newlines, an annotation tag that broke the bold parser, Stage 3 valuation headings appearing as peer-level sections in the merged output — was caught not by me reading the document, but by automated verification checks. Running grep "^## " on the final markdown tells you instantly if the section structure is clean. A three-line Python snippet counting ** in the docx paragraphs tells you if the converter worked. Trivial to write. I should have built them from the start.
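Checks of that kind fit in a few lines of stdlib Python. A sketch, run here against the merged markdown rather than the docx, using an even count of ** as a cheap proxy for balanced bold markers:

```python
import re

def verify(markdown_text, canonical_sections):
    """Post-merge sanity checks: fail loudly instead of shipping a broken report."""
    headings = re.findall(r"^## (.+)$", markdown_text, flags=re.M)
    assert headings == canonical_sections, f"section drift: {headings}"
    # An odd number of ** markers means at least one bold span never closed.
    assert markdown_text.count("**") % 2 == 0, "unbalanced bold markers"
    # At least one markdown table row must have survived conversion.
    assert any(l.startswith("|") for l in markdown_text.splitlines()), "no tables"
```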

The debugging tax I paid in this session — hours tracing a \s regex that matched across line boundaries, hours finding a triple-bold pattern in one company’s filing that no other company produced — was the cost of delivering output without verification and discovering problems only when I looked at the final product.
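The \s trap is easy to reproduce: in Python regexes, \s matches newlines, so a whitespace cleanup written with one line in mind can silently fuse table rows. A minimal illustration, with invented table content:

```python
import re

table = "| Metric  | FY23 |\n| Revenue | 42.9 |"

# \s includes \n, so this "collapse runs of whitespace" cleanup
# fuses both rows onto one line and corrupts the table:
flattened = re.sub(r"\s+", " ", table)

# Matching only horizontal whitespace leaves the line structure intact:
safe = re.sub(r"[ \t]+", " ", table)
```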

The pipeline now has a single entry point, a unique trigger phrase, verbatim subagent prompts that leave no room for interpretation, a deterministic merge, and automated verification. It also has a failure log that persists across sessions, so the system accumulates knowledge instead of repeating mistakes.
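One simple way to implement a failure log like that is an append-only JSON-lines file; the field names and path here are hypothetical, not the actual format:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def record_failure(log_path, stage, symptom, fix):
    """Append one structured failure entry; the file survives across sessions."""
    entry = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "stage": stage,      # which pipeline stage broke
        "symptom": symptom,  # what the verification check reported
        "fix": fix,          # what resolved it (or "accepted risk")
    }
    with Path(log_path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")

def load_failures(log_path):
    """Read by the orchestrator as a mandatory pre-read before any stage runs."""
    p = Path(log_path)
    if not p.exists():
        return []
    return [json.loads(line) for line in p.read_text(encoding="utf-8").splitlines() if line]
```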

The most useful thing I did in this entire session wasn’t writing code or fixing regex. It was asking Claude a simple question: “Under what conditions will this break?” The response was a structured taxonomy of failure modes, sorted by probability and severity, with honest assessments of what’s solved, what’s mitigated, and what’s still an accepted risk. Then I took that taxonomy and fed it back into the orchestrator itself as a mandatory pre-read. Before the system runs a single stage, it now reads its own failure log. The LLM that might break the pipeline starts every run by studying exactly how and why it has broken before.

That’s the move I didn’t expect to matter this much. You can use an LLM’s self-awareness as an engineering input. It knows its own failure modes better than you’d guess, and if you ask the right question, it’ll give you a structured answer you can operationalize. The feedback loop closes: the system breaks, you diagnose it together, the diagnosis becomes part of the system’s own instructions, and the next run is harder to break in the same way.

The honest assessment: this eliminates the systematic failures and reduces the probabilistic ones to edge cases the verification step will catch. It doesn’t make the pipeline infallible. A subagent might still format a header in a way the regex doesn’t expect. A future company’s filings might produce a markdown pattern nobody’s seen before. But when something breaks, I’ll know which stage broke, what the expected output was, and where in the code to look. That’s the real product of this session — not a fixed report, but a system that fails gracefully and learns incrementally.