
What is the point of using LLMs for the scraping itself, instead of using them to generate the boring code for mimicking HTTP requests, CSS/XPath selectors, etc.?

I get that it may be interesting for small tasks combined with a browser extension, but for real scraping it just seems overkill and expensive.



It is potentially expensive, but here's a different take.

Instead of writing a bunch of selectors that break often, imagine just writing a paragraph telling the LLM to fetch the top 10 headlines and their links on a news site. Or to fetch the images, titles, and prices from a storefront.

It abstracts away a lot of fragile manual work.
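A minimal sketch of that idea: describe the extraction in plain language and ask the model for structured JSON, rather than maintaining selectors. The model call here is a stub standing in for any chat-completion API; the prompt wording and JSON schema are illustrative assumptions, not a specific product's interface.

```python
import json

def build_prompt(html: str) -> str:
    # The "paragraph" replacing the selectors: a natural-language spec
    # plus the raw page.
    return (
        "Extract the top 10 headlines and their links from this page. "
        'Return JSON: [{"title": ..., "url": ...}]\n\n' + html
    )

def call_llm(prompt: str) -> str:
    # Stub: a real version would send `prompt` to a model and return
    # its text response.
    return json.dumps([{"title": "Example headline",
                        "url": "https://example.com/a"}])

def scrape_headlines(html: str) -> list:
    return json.loads(call_llm(build_prompt(html)))
```

The trade-off discussed in this thread: you pay inference cost per page, but the "selector" never has to be hand-updated.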


I get that, and LLMs are expected to get better.

Today, would you build a scraper with current LLMs that randomly hallucinate? I wouldn't.

The idea of an LLM-powered scraper adapting its selectors every time the website owner updates the site is pretty cool, though.


At my job we are scraping using LLMs, for a 10M sector of the company. GPT-4 Turbo has not hallucinated once in 1.5 million API requests. We use it to parse and interpret data from webpages, which is something you wouldn't be able to do with a regular scraper. Not well, at least.


Bold claim. Did you review all 1.5 million requests?


I guess the claim is based on statistical sampling, at a high enough level to be sure that if there were hallucinations you would catch them? Or is there something else you're doing?

Do you have any workflow tools, etc., to find hallucinations? I've got a project in the backlog to build that kind of thing and would be interested in how you sort good results from bad.


In this case we had 1.5 million ground truths for testing purposes. We have now run it over 10 million, but I didn't want to claim it had zero hallucinations on those, since technically we can't say for sure. But considering the hallucination rate was 0% across 1.5 million requests when compared to ground truths, I'm fairly confident.
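The evaluation described amounts to an exact-match comparison against a labeled set. A minimal sketch, assuming record-level exact match as the hallucination criterion (the actual matching logic used isn't specified in the thread):

```python
def hallucination_rate(predictions: list, ground_truths: list) -> float:
    # A record counts as a hallucination if the extracted data disagrees
    # with the known-correct value in any way (exact-match assumption).
    assert len(predictions) == len(ground_truths)
    bad = sum(1 for p, g in zip(predictions, ground_truths) if p != g)
    return bad / len(ground_truths)
```

Extending a check like this from 1.5M labeled records to 10M unlabeled ones is exactly where the claim gets harder, which is the point being pressed below.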


How do you know that's true?


The 1.5 million was our test set. We had 1.5 million ground truths, and it didn't make up fake data for a single one.


That's not what I asked. I asked "How did you determine that it didn't make up/get information wrong for all 1.5m?"


I've written thousands of scrapers and, trust me, they don't break often.


Me too, but for adversaries that obfuscate and change their sites often to prevent scraping. It can happen depending on what you are looking at.


Well-written scrapers should be able to cope with site changes.


https://github.com/Skyvern-AI/skyvern

This is pretty much what we're building at Skyvern. The only problem is that inference cost is still a little too high for scraping, but we expect that to change within the next year.



