Every public company has an investor relations website. None of them is designed for you.
They exist to satisfy regulatory disclosure requirements, to look credible to the analysts who visit once a year, and — not incidentally — to resist automated access. Most IR sites deploy the same arsenal: Cloudflare WAF, Akamai Bot Manager, PerimeterX, rate limiting, cookie consent walls, region gates. They do this to block scrapers. Which is precisely what I needed to build.
The question I started with: can a single scraper work on any company's investor relations website without needing site-specific code? I have 23 companies in my portfolio. I did not want 23 scrapers. I wanted one.
What Linnaeus Would Have Done
The naturalist's instinct is to classify before you act. Carl Linnaeus didn't describe every plant individually — he built a taxonomy. Kingdom, phylum, class, order, family, genus, species. Once the system existed, you didn't characterise a new organism from scratch; you placed it in the tree and inherited everything the tree already knew about its relatives.
The IR scraper works the same way. Instead of writing Tesla-specific code and Veeva-specific code and HDFC-specific code, the first problem to solve was: how many types of investor relations websites actually exist? The answer, after testing enough of them, is eight. WordPress. Q4 web platform. Drupal. GCS-Web. React/Next.js SPAs. Angular. Vue. And everything else.
Platform detection runs before any scraping begins. It reads the raw HTML for fingerprints — wp-content or wp-json for WordPress, __NEXT_DATA__ for Next.js, q4cdn.com in URL patterns. Ninety percent of the time, detection completes in under two seconds.
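In sketch form, detection can be little more than an ordered list of regex fingerprints applied to the raw HTML. The fingerprint strings below (wp-content, wp-json, __NEXT_DATA__, q4cdn.com) are the ones named above; the function name, the ordering, and the "generic" fallback are illustrative assumptions, not the scraper's actual API:

```python
import re

# Ordered fingerprint list: first match wins. Patterns come from the text;
# the ordering and fallback bucket are assumptions for illustration.
FINGERPRINTS = [
    ("wordpress", re.compile(r"wp-content|wp-json")),
    ("nextjs", re.compile(r"__NEXT_DATA__")),
    ("q4", re.compile(r"q4cdn\.com")),
]

def detect_platform(html: str) -> str:
    """Return the first platform whose fingerprint appears in the raw HTML."""
    for name, pattern in FINGERPRINTS:
        if pattern.search(html):
            return name
    return "generic"  # the 'everything else' bucket
```

Because this runs on the raw response body, it needs no JavaScript rendering, which is why detection can finish in a couple of seconds.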
Once you know which platform a site runs on, you know how to scrape it. The taxonomy is the strategy.
A Q4 site has predictable URL patterns and loads its documents via JavaScript. A React SPA has document URLs embedded in compiled bundle files or triggered by API calls that you can intercept. A WordPress site is almost trivially easy — everything is in the HTML. The platform fingerprint collapses an open-ended problem into one of eight known patterns.
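"The taxonomy is the strategy" reduces, in code, to a dispatch table from platform name to handler. Everything here — the handler name, the regex, the table shape — is a hypothetical illustration of the architecture, not the scraper's real interface:

```python
import re

def scrape_wordpress(html: str):
    # WordPress: document links sit directly in the served HTML,
    # so a static scan of href attributes is enough.
    return re.findall(r'href="([^"]+\.pdf)"', html)

# One handler per branch of the taxonomy; unlisted branches sketched as comments.
HANDLERS = {
    "wordpress": scrape_wordpress,
    # "q4":     scrape_q4,      # predictable URL patterns, JS-loaded documents
    # "nextjs": scrape_spa,     # bundle scanning / API-call interception
}

def scrape(platform: str, html: str):
    handler = HANDLERS.get(platform)
    if handler is None:
        raise NotImplementedError(f"no handler for platform {platform!r}")
    return handler(html)
```

A new platform type means one new entry in the table, not a new scraper.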
The Bot Problem — and the Barbell
The other ten percent is more interesting.
Some IR sites deploy serious anti-bot infrastructure. When you hit Akamai's Bot Manager or Cloudflare's challenge page, a naive scraper returns a 403 and stops. That's not good enough.
Nassim Taleb's barbell principle is about combining a very safe position with a very aggressive one, and eliminating the middle. Applied here: one end of the barbell is a lightweight requests session — no JavaScript rendering, minimal headers, fast and silent. It passes through most sites without triggering any defences. The other end is a full Playwright Chromium browser: real TLS fingerprint, full cookie handling, JavaScript execution, complete DOM rendering. Heavy, slow, but almost indistinguishable from a human.
Most documents get downloaded at the fast end. Documents that fail — 403s, CloudFront CDN blocks, authentication walls — go to the Playwright fallback automatically. You get the speed of a lightweight session for the easy cases and the power of a full browser for the hard ones. The wins compound: PB Fintech returned 720 documents with 99.8% download success. HDFC Bank returned 445 with 99.5%. Sea, a Q4 platform site, completed at 100%.
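A minimal sketch of the barbell, with both tiers injected as callables so the escalation logic stays visible (and testable) without a network. The set of status codes treated as block signals, and all the names, are assumptions:

```python
from typing import Callable, Tuple

Fetcher = Callable[[str], Tuple[int, bytes]]  # returns (status, body)

BLOCK_STATUSES = {403, 429, 503}  # typical WAF / rate-limit responses (assumed set)

def barbell_fetch(url: str, fast: Fetcher, heavy: Fetcher) -> Tuple[bytes, str]:
    """Try the lightweight tier first; escalate to the full browser on a block."""
    status, body = fast(url)          # e.g. a plain requests session
    if status == 200:
        return body, "fast"
    if status in BLOCK_STATUSES:
        status, body = heavy(url)     # e.g. a Playwright Chromium page
        if status == 200:
            return body, "heavy"
    raise RuntimeError(f"both tiers failed for {url} (last status {status})")
```

In production the `fast` callable would wrap a `requests.Session` and `heavy` a Playwright page; keeping them injectable keeps the barbell itself trivial to reason about.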
The scraper also handles four distinct WAF vendors — Akamai, Cloudflare, PerimeterX, DataDome — and escalates through a four-step browser strategy: Chromium headless → Chromium with HTTP/1.1 → system Chrome headless → system Chrome headed. When a French engineering-software company (Dassault Systèmes, 3DS) required a headed browser to pass Akamai's challenge, that information was saved to the company's profile. The next run starts headed.
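The four rungs can be expressed as data that a retry loop walks, with the winning rung persisted to the company profile so the next run skips straight to it. The config field names are illustrative assumptions:

```python
# The four escalation rungs from least to most human-looking.
# Field names are illustrative; the real profile schema is unknown.
ESCALATION = [
    {"engine": "chromium", "headless": True},
    {"engine": "chromium", "headless": True, "force_http1": True},
    {"engine": "chrome", "headless": True},
    {"engine": "chrome", "headless": False},  # headed: what 3DS required
]

def escalate(attempt, start: int = 0):
    """Walk the ladder from `start` (the rung saved in the company profile).

    `attempt` takes a rung config and returns True if the fetch succeeded.
    Returns the successful rung index (to persist), or None if all fail.
    """
    for rung in range(start, len(ESCALATION)):
        if attempt(ESCALATION[rung]):
            return rung
    return None
```

Storing the rung index rather than the raw config means the ladder can grow new rungs without invalidating saved profiles.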
What I Didn't Expect: Discovery Is Harder Than Downloading
Downloading is a solved problem once you have the URL. Discovery — finding all the URLs in the first place — is not.
Every IR site organises its content differently. Some use anchor tags pointing directly to PDFs. Some load document URLs through API calls triggered by clicking "Annual Reports." Some paginate their document listings across fifteen pages of ten results each. Some hide everything behind a year selector — you click "2024," the page refreshes, you click "2023," and so on back to the beginning. IndiaMART, with a single IR section, needed entirely different discovery logic from PDD Holdings, which runs on GCS-Web and required its own platform handler.
The discovery engine runs four phases: static HTML extraction, dynamic content expansion (clicking year selectors, expanding accordions, following "Load More" buttons), JavaScript bundle scanning for document URLs embedded in compiled React code, and fallbacks (sitemaps, common IR path patterns, CDN enumeration).
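The third phase — bundle scanning — can be sketched as a regex sweep over compiled JavaScript. The pattern below is a simplification (real bundles escape slashes, split URLs across string concatenations, and so on), and the extension list is an assumption:

```python
import re

# Find document URLs embedded as string literals in a compiled JS bundle.
# Non-capturing group keeps findall returning the full URL.
DOC_URL = re.compile(r'https?://[^\s"\'\\<>]+\.(?:pdf|xlsx?)', re.IGNORECASE)

def scan_bundle(js_text: str):
    """Return the unique document URLs found in a JavaScript bundle."""
    return sorted(set(DOC_URL.findall(js_text)))
```

The other three phases feed the same URL set, so discovery is a union over phase outputs, deduplicated before download.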
The Palantir run revealed a subtler problem: after discovery, 1,638 of 1,728 documents were not real documents. They were EDGAR filing viewer artifacts — phantom PDF links each with a UUID in the URL, each labelled exactly "PDF" or "XBRL." The classifier now filters any link where the text is a bare file-type label and the URL contains a UUID pattern. Real document links have descriptive text.
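The filter itself is compact: drop any link whose visible text is a bare file-type label and whose URL contains a UUID. The label set and the names below are assumptions; the two-condition rule is the one described above:

```python
import re

# Standard 8-4-4-4-12 hex UUID anywhere in the URL.
UUID = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}", re.I
)
# Bare file-type labels that real, descriptive links never consist of alone.
BARE_LABELS = {"pdf", "xbrl", "html"}

def is_phantom(link_text: str, url: str) -> bool:
    """True for EDGAR viewer artifacts: bare label + UUID in the URL."""
    return link_text.strip().lower() in BARE_LABELS and bool(UUID.search(url))
```

Requiring both conditions matters: a descriptive link to a UUID URL, or a bare "PDF" label on a normal URL, survives the filter.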
What I've Learned
The generic approach outperforms the specific one — but only after you've done enough classification work to understand what "generic" actually covers. Eight platform types isn't a universal solution. It's eight specific handlers that are each somewhat more general than the company-specific code they replaced.
Each company I tested taught me something that made the next one easier. The Palantir session taught me about section-button deduplication and EDGAR junk filtering. The Blackbuck session taught me about React SPAs that trigger downloads on page load rather than on click. The 3DS session taught me about Akamai's headed-mode requirement. The taxonomy compounds. Each new case either fits an existing branch or forces a new one.
The scraper now runs on 18 companies. Before I write a single line of analysis on a new company, the scraper goes first — fifteen minutes of automated discovery and download versus what used to be an afternoon of manual browsing across investor pages.
What I find most interesting is the inversion: the harder problem was never the downloading. It was the listening — understanding what each site was saying about how it organised its knowledge, and building a system general enough to hear it.