Thanks. I initially used a regular HTML parser as well, but I quickly ran into sites that wouldn't render without JavaScript. I'm therefore now using a regular browser controlled by Playwright to fetch the websites.
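For anyone following along, here is a rough sketch of what fetching a JavaScript-rendered page through Playwright can look like. It is not the commenter's actual code: the function name `fetch_rendered_html` and the choice of Chromium with a `networkidle` wait are assumptions made for illustration, and it needs the `playwright` package plus an installed browser.

```python
def fetch_rendered_html(url, timeout_ms=30000):
    """Load a page in headless Chromium and return the DOM after scripts ran.

    Hypothetical helper for illustration; requires `pip install playwright`
    followed by `playwright install chromium`.
    """
    # Imported lazily so this module still loads without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # "networkidle" waits until network activity settles, which is a
            # blunt but simple way to let client-side rendering finish.
            page.goto(url, wait_until="networkidle", timeout=timeout_ms)
            return page.content()
        finally:
            browser.close()
```

The returned HTML can then go through the same parser a static page would, which keeps the rest of the feed-building code unchanged.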
Care to name any sites? I've always managed to find workarounds for everything I've wanted to follow. Most websites want to be indexed by search engines, and Googlebot doesn't do JavaScript, so sometimes a forged user agent is all you need. Occasionally, finding the actual JSON file and parsing the info you need out of it does the job, and so on.
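To make the two workarounds above concrete, here is a minimal sketch using only the standard library. The user-agent string is the one Google documents for Googlebot; whether a given site serves different HTML for it has to be checked per site. The JSON shape (`items`, `title`, `url`) is purely hypothetical, since every site's payload differs.

```python
import json
import urllib.request

# Publicly documented Googlebot user-agent string. Sites may ignore it or
# verify the client by other means, so treat this as something to try.
GOOGLEBOT_UA = (
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
)


def build_request(url, user_agent=GOOGLEBOT_UA):
    """Return a urllib Request that sends the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10):
    """Fetch a URL with the forged user agent and return the decoded body."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def extract_items(json_text, list_key="items"):
    """Pull (title, link) pairs out of a hypothetical JSON payload.

    The key names are assumptions; adjust them to whatever the site's
    own JSON endpoint actually returns.
    """
    data = json.loads(json_text)
    return [(entry["title"], entry["url"]) for entry in data.get(list_key, [])]
```

The downside, as the reply below points out, is that the JSON route needs per-site key names, so it does not scale well if the goal is to add new feeds quickly.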
The User Agent trick is a good one that I should've tried, but I just checked and it didn't work for this one. Parsing actual JSON wasn't really an option, as I wanted to be able to quickly and easily add RSS feeds.
Possibly SEO is less of a concern for the type of website I initially made this for, i.e. Dutch real estate agents. Most people find their listings through funda.nl rather than through search engines; I was just hoping to see them listed before they got posted there.
Send me a message on Twitter or email me (hacker_news@ my domain) if you still want the URL of a failing website to play around with.
This changes from time to time, of course, but when last I investigated, around two years ago, consensus was that it mostly wouldn’t do JavaScript until you nudged it into doing so in some way that I forget, and that it was always slower to index/update if it needed to do JavaScript.
(For my part, I disable JavaScript by default for various reasons, mostly performance, and it’s decidedly uncommon for a general-internet site to be completely broken by it. Sites that get posted on HN are disproportionately JS-dependent, especially if they’re new.)