The question I kept asking was simpler than it sounds: can an LLM produce a research report I would actually trust?
Not a summary. Not a chatbot pulling from whatever it finds on the internet. A structured document with a thesis, financial analysis, and a valuation — the kind of thing a sell-side analyst would put their name on. And with one constraint that shaped everything that followed: if a number appears in the report, I need to know exactly where it came from.
That constraint, call it data integrity, is the entire design of what follows. Building toward it taught me something uncomfortable: when an LLM has no good answer, it will invent one. Not randomly, but plausibly. It draws on everything it was trained on, which includes a large share of the public internet, and produces a figure that looks right, is tagged correctly, and is wrong. Getting the workflow to label its own uncertainty honestly, to say "this came from training data, not from your documents", was the hardest part of the build. It was also the most rewarding.
This is also an evolving project. The sample reports above show that arc directly: the older ones are longer and rougher, the newer ones tighter and more auditable. Each run exposed a gap I hadn’t anticipated — a wrong P&L template for a bank, a fabricated figure in a cloud software report, a transcript section left with placeholder tags. Every gap became a rule. The workflow is better for each of them, and it will keep improving.
Getting the Documents Right
Before any analysis begins, the documents have to be right. Two paths to get there.
The first is an external MCP connector — tools like Quartr Pro or Capital IQ that pipe structured financial data directly into Claude. Clean, well-structured, fast.
The second is the IR scraper I’ve described elsewhere on this page. It pulls raw documents directly from a company’s investor relations website: 10-Ks, 10-Qs, earnings transcripts, proxy statements, all of it, organised locally by document type. If the company published it on their IR site, the scraper gets it.1
I use both depending on the situation. The scraper gives comprehensive coverage. The MCP connectors give structured data without the extraction work. The choice depends on what the company publishes and how well-maintained their IR site is.
Closed-Book First
Here is the decision that made the workflow trustworthy: Claude does not have internet access during the financial analysis phase.
Closed-book means the model works only from the documents I’ve provided. No browsing. No retrieval from sources I haven’t read. Every number in the draft report traces back to a specific filing I can open.
This is a deliberate constraint, not a limitation I’m working around. Every hallucination I’ve encountered in LLM-generated financial analysis has the same root cause: the model infers or fills in data it wasn’t given. Closing the book shuts the main route by which outside data gets in; the audit in Stage 2 exists to catch whatever the model still fills in from training data.
Some things genuinely need live data — the current stock price, the risk-free rate, market multiples for a peer comparison. Those get pulled separately, verified manually, and fed in at a specific stage. But the financial history, the segment analysis, the management commentary — all of that stays closed-book until the audit is complete.
The Process Has to Exist Before It Can Be Automated
There’s an old principle in software engineering: do one thing well. The Unix pipe philosophy. cat file | grep keyword | sort | uniq. The power isn’t in any single command; it’s in the composition — each stage does exactly one job, hands clean output to the next, and doesn’t try to do more than it should.
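To make the composition idea concrete, here is a minimal Python sketch of the same pipe. The function names and the sample text are invented for illustration; the point is only that each function does one job and hands its output to the next.

```python
from typing import Iterable, Iterator, List


def read_lines(text: str) -> Iterator[str]:
    """Like `cat`: emit the input one line at a time."""
    yield from text.splitlines()


def grep(lines: Iterable[str], keyword: str) -> Iterator[str]:
    """Like `grep keyword`: keep only lines containing the keyword."""
    return (line for line in lines if keyword in line)


def sort_unique(lines: Iterable[str]) -> List[str]:
    """Like `sort | uniq`: sorted lines with duplicates dropped."""
    return sorted(set(lines))


# Invented sample input, standing in for `file`.
sample = "revenue up\ncosts flat\nrevenue up\nrevenue guidance raised"

result = sort_unique(grep(read_lines(sample), "revenue"))
print(result)  # ['revenue guidance raised', 'revenue up']
```

None of the three functions knows about the others, which is what makes each one easy to test and replace.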
When I started building this workflow, I kept hitting the same wall. I couldn’t instruct Claude on Stage 2 until I had understood Stage 1 completely. And Stage 1 kept breaking down until I was honest about what Stage 0 was actually producing.
Building a stepwise way to generate a report forced me to define the process. And defining the process forced me to understand what I actually valued in research.
That turned out to be the most useful part of the exercise — not the workflow itself, but the thinking it forced.
The Five Stages
Stage 0: Document Preparation. Two paths depending on what’s available. When a well-populated IR folder exists — scraped from the company’s investor relations site — income statements, balance sheets, and cash flow statements are extracted into structured CSVs before Stage 1 begins. When that folder isn’t available, a financial data connector (Quartr Pro, Capital IQ) provides the structured numbers instead. Either way, Claude never constructs financial tables from raw PDFs. This is the work that makes Stage 1 possible.
Step 0B: Business Model Classification. Before any drafting begins, the workflow classifies the company into one of seven types: technology/SaaS, bank, insurer, REIT, marketplace, holding company, or mining and oil & gas. This isn’t cosmetic. Each type uses a different P&L template: the standard gross profit line is wrong by construction for a bank (expected credit losses, or ECL, belong above it) and meaningless for a REIT. Getting this classification right is what prevents Stage 1 from applying the wrong financial framework to the wrong business.
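One way to picture the classification step is as a lookup from business model type to P&L template. The sketch below is an assumption-heavy illustration: the seven type names come from the text, but the line items in each template are generic placeholders I have made up, not the workflow's actual templates, and only three types are filled in.

```python
from enum import Enum
from typing import Dict, List


class BusinessModel(Enum):
    TECH_SAAS = "technology/SaaS"
    BANK = "bank"
    INSURER = "insurer"
    REIT = "REIT"
    MARKETPLACE = "marketplace"
    HOLDING_COMPANY = "holding company"
    MINING_OIL_GAS = "mining and oil & gas"


# ASSUMPTION: illustrative line items only; the real templates are not
# given in the text. Remaining types are deliberately left undefined.
PNL_TEMPLATES: Dict[BusinessModel, List[str]] = {
    BusinessModel.TECH_SAAS: [
        "Revenue", "Cost of revenue", "Gross profit",
        "Operating expenses", "Operating income",
    ],
    BusinessModel.BANK: [
        "Net interest income", "Non-interest income",
        "Expected credit losses", "Operating expenses", "Pre-tax profit",
    ],
    BusinessModel.REIT: [
        "Rental revenue", "Property operating expenses",
        "Net operating income", "Funds from operations",
    ],
}


def template_for(model: BusinessModel) -> List[str]:
    """Return the template for a type, refusing to guess if none exists."""
    if model not in PNL_TEMPLATES:
        raise ValueError(f"no P&L template defined for {model.value}")
    return PNL_TEMPLATES[model]


print(template_for(BusinessModel.BANK))
```

Raising an error for an undefined type, rather than falling back to a generic template, keeps the wrong framework from being applied silently.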
Stage 1: Draft. Claude reads the structured financial data and the source documents and produces a first-pass research report. Closed-book throughout. The draft covers the business model, financial history, segment breakdown, and management commentary across recent quarters. No valuation at this stage: the numbers have to be right before any valuation is built on them. The business model type from Step 0B determines which P&L template gets used.
Stage 2: Audit. A second pass over the draft, with one question: does this report rely on anything not in the source documents? Anything that can’t be traced to a specific filing gets flagged or removed. This is the stage that enforces the no-hallucination contract. The framing matters: asking Claude to “review the report for accuracy” produces reassurances. Asking it to “find every claim in this report not directly supported by the attached documents” produces findings. After the audit, a deterministic Python script runs a final gate check — no LLM, no judgment, just verifiable rules: no bare source placeholders, no unfilled financial cells, source tag syntax intact, balance sheet reconciles within tolerance.
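A gate check of that kind might look roughly like the sketch below. It is not the actual script: the `[SRC: document, p.N]` tag format, the markers that count as an unfilled cell, the dictionary keys, and the 0.5% tolerance are all assumptions made for illustration, and the sample figures are invented.

```python
import re
from typing import Dict, List

# ASSUMPTION: source tags look like [SRC: <document>, p.<page>].
VALID_TAG = re.compile(r"\[SRC: [^\[\],]+, p\.\d+\]")
ANY_TAG = re.compile(r"\[SRC[^\]]*\]")
# ASSUMPTION: these strings mark a cell that was never filled in.
UNFILLED = {"", "TBD", "N/A"}


def gate_check(report_text: str, table: List[Dict[str, str]],
               balance_sheet: Dict[str, float],
               tolerance: float = 0.005) -> List[str]:
    """Return a list of failures; an empty list means the report passes."""
    failures: List[str] = []

    # Rules 1 and 2: no bare placeholders, tag syntax intact.
    for tag in ANY_TAG.findall(report_text):
        if tag == "[SRC]":
            failures.append("bare source placeholder: [SRC]")
        elif not VALID_TAG.fullmatch(tag):
            failures.append(f"malformed source tag: {tag}")

    # Rule 3: no unfilled financial cells.
    for row_number, row in enumerate(table, start=1):
        for column, value in row.items():
            if value.strip() in UNFILLED:
                failures.append(f"unfilled cell: row {row_number}, {column}")

    # Rule 4: assets must equal liabilities plus equity within tolerance.
    assets = balance_sheet["total_assets"]
    claims = balance_sheet["total_liabilities"] + balance_sheet["total_equity"]
    if abs(assets - claims) > tolerance * abs(assets):
        failures.append(f"balance sheet off by {assets - claims:.1f}")

    return failures


# Invented sample inputs, chosen so that three rules fail.
report = "Revenue grew [SRC: 10-K FY2023, p.45]. Margins fell [SRC]."
table = [{"metric": "Revenue", "FY2023": "2100"},
         {"metric": "Operating income", "FY2023": ""}]
sheet = {"total_assets": 1000.0, "total_liabilities": 600.0,
         "total_equity": 390.0}

for failure in gate_check(report, table, sheet):
    print(failure)
```

Because every rule is a plain comparison, the same inputs always produce the same verdict, which is the property the LLM audit cannot offer.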
Stage 3: Valuation (on request). The closed-book constraint lifts here. Stock price, risk-free rate, peer multiples — pulled explicitly, verified, fed in. Claude runs the valuation against the audited financial data from Stage 2. The model doesn’t find these numbers itself; I provide them. DCF, comparable companies analysis, whatever the situation calls for. This stage runs when I want it to — not every report needs a formal valuation.
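For the DCF case, the arithmetic itself is small once the inputs are supplied. The sketch below is a textbook discounted cash flow with a Gordon growth terminal value, not necessarily the form the workflow uses; the function name, parameters, and sample inputs are hypothetical, and every input is passed in explicitly rather than looked up.

```python
from typing import Sequence


def dcf_enterprise_value(free_cash_flows: Sequence[float],
                         discount_rate: float,
                         terminal_growth: float) -> float:
    """Present value of forecast cash flows plus a terminal value.

    free_cash_flows: forecast for years 1..n, supplied by the analyst.
    discount_rate and terminal_growth: decimals, e.g. 0.10 for 10%.
    """
    if not free_cash_flows:
        raise ValueError("at least one forecast cash flow is required")
    if discount_rate <= terminal_growth:
        raise ValueError("discount rate must exceed terminal growth")

    # Discount each forecast year back to today.
    pv_forecast = sum(
        cash_flow / (1 + discount_rate) ** year
        for year, cash_flow in enumerate(free_cash_flows, start=1)
    )

    # Gordon growth terminal value at year n, then discounted to today.
    n = len(free_cash_flows)
    terminal_value = (free_cash_flows[-1] * (1 + terminal_growth)
                      / (discount_rate - terminal_growth))
    pv_terminal = terminal_value / (1 + discount_rate) ** n

    return pv_forecast + pv_terminal


# Invented inputs: one year of cash flow, 10% discount rate, no growth.
value = dcf_enterprise_value([100.0], discount_rate=0.10, terminal_growth=0.0)
print(round(value, 2))  # 1000.0
```

Keeping the function free of any data lookup mirrors the rule in this stage: the model computes, and the person running it supplies and vouches for the inputs.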
Stage 4: Earnings Sentiment. The most recent earnings call transcript gets its own pass. Not to extract numbers — those are already in Stage 1 — but to read tone. What did management say about next quarter? What did they not say? How does the language compare to prior calls? Tone shifts in management commentary are often more informative than the numbers themselves.
What I Got Wrong
My first version tried to do all five stages in one prompt. The output was unusable. The mistake was thinking Claude’s capability was the constraint. It wasn’t. My problem specification was the constraint.
Stage 4 (earnings sentiment) was the specific failure point early on. I was asking Claude to analyse without giving it a thesis to test against. What I got back were accurate summaries — which I could have produced myself by reading more carefully. What I needed were comparisons: to my prior view, to what management had said four months ago, to where the numbers had moved.
The version I run today is the result of that learning. Each failure was specific.2
The Chaos of Parallelisation
At one point I tried a different architecture: run multiple stages simultaneously, let sub-agents handle different sections in parallel, then merge the outputs. Faster in theory. Cleaner in theory.
It fell apart immediately. Sub-agents don’t share context with each other or with the conversation that spawned them. Each one starts fresh — no memory of the documents, no awareness of what the other agents were doing, no knowledge of the constraints I had built up over months. Without context, they did what LLMs do when they have no good answer: they filled in the gaps from training data. Plausibly. Quietly. With the right formatting and the right source tags.
The Snowflake run was the clearest example. The sub-agent produced a report that looked complete: all the right sections, all the right labels. Revenue and gross profit were correct — those came early in the API response. Then I checked operating income. The model had filled in a figure close to breakeven. The actual number was a $1.4 billion loss. Every data point past the first two fields had been fabricated. Tagged correctly. Wrong in every detail.
The merge step made it worse. When you take outputs from parallel agents and try to stitch them together, you are merging documents that were each written without knowing what the other contained. Gaps get papered over. Inconsistencies get smoothed out. The final document looks coherent because it has been edited to be coherent — not because the underlying numbers are right.
The fix was to abandon parallelisation entirely. Every stage now runs sequentially in a single conversation, with full context carried forward. It is slower. It is also the only architecture that actually enforces the closed-book contract end to end. I write about sub-agents and orchestration in more detail in a separate piece.
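The sequential shape can be sketched in a few lines. In this illustration the stage functions are stand-ins for model calls and the context keys are hypothetical; what matters is that each stage receives everything produced before it and adds to it.

```python
from typing import Callable, Dict, List

Context = Dict[str, str]
Stage = Callable[[Context], Context]


def draft(context: Context) -> Context:
    # Stand-in for the Stage 1 model call; it needs the source documents.
    documents = context["documents"]
    return {**context, "draft": f"draft based on: {documents}"}


def audit(context: Context) -> Context:
    # Stand-in for the Stage 2 model call; it needs documents and draft.
    checked = f"audited [{context['draft']}] against [{context['documents']}]"
    return {**context, "audited": checked}


def run_sequentially(stages: List[Stage], initial: Context) -> Context:
    """Run stages in order, carrying the full context forward."""
    context = dict(initial)
    for stage in stages:
        context = stage(context)  # each stage sees all earlier outputs
    return context


final = run_sequentially([draft, audit],
                         {"documents": "10-K, 10-Q, transcript"})
print(sorted(final))  # ['audited', 'documents', 'draft']
```

Calling `audit({})` on its own raises a `KeyError`, which is the honest version of what a context-free sub-agent does: here the missing input is a loud failure, whereas a model given the same empty context fills the gap quietly.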
Where I’ve Landed
What the workflow currently produces — without a formal valuation — is a thorough closed-book fundamental analysis. The business model, what management says about where it’s going, the headwinds and tailwinds, how consistently they have delivered against what they promised. Earnings sentiment. The competitive ecosystem: who the customers are, who the partners are, what the company is up against. All of that, sourced and auditable.
Valuation sits at the edge of that. When I want it, I open the book, feed in the live inputs myself, and run it. Not every report needs it at the same time as the fundamental analysis.
Progress on this has been five steps forward, three steps back, four steps forward, three steps back. That’s not a complaint — each retreat taught something the advance couldn’t. But the most useful thing I didn’t anticipate going in has nothing to do with the workflow at all. It is what happens to my own reading when every report looks the same.
Sell-side research varies enormously in structure. Different banks frame the same company differently; different analysts lead with different things. A lot of the cognitive load in reading a research report is just orienting yourself to the format before you can get to the substance. When every report I produce follows the same structure — same sections, same ordering, same way of presenting the financials — that overhead disappears. I find I can move through companies significantly faster, because I’m processing the data, not the form. Pattern matching is doing work I used to spend attention on.
That was an emergent benefit. I’m still figuring out what to do with it.