
Web scraping with GPT-4o: powerful but expensive

377 points by edublancas 9 months ago

45 comments

jumploops 9 months ago

We've had the best success by first converting the HTML to a simpler format (i.e. markdown) before passing it to the LLM.

There are a few ways to do this that we've tried, namely Extractus[0] and dom-to-semantic-markdown[1].

Internally we use Apify[2] and Firecrawl[3] for Magic Loops[4] that run in the cloud, both of which have options for simplifying pages built-in, but for our Chrome extension we use dom-to-semantic-markdown.

Similar to the article, we're currently exploring a user-assisted flow to generate XPaths for a given site, which we can then use to extract specific elements before hitting the LLM.

By simplifying the "problem" we've had decent success, even with GPT-4o mini.

[0] https://github.com/extractus

[1] https://github.com/romansky/dom-to-semantic-markdown

[2] https://apify.com/

[3] https://www.firecrawl.dev/

[4] https://magicloops.dev/
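A minimal sketch of that preprocessing step, assuming the Python markdownify package and the OpenAI SDK rather than the specific tools listed above; the model choice and prompt are illustrative:

```python
# Sketch: strip script/style noise, flatten to markdown, then ask the model.
from bs4 import BeautifulSoup
from markdownify import markdownify as md
from openai import OpenAI

client = OpenAI()

def extract_from_page(html: str, instructions: str) -> str:
    # Drop obvious non-content tags first, then convert the rest to markdown:
    # far fewer tokens than raw HTML.
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    markdown = md(str(soup))
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Extract the requested data and reply with JSON only."},
            {"role": "user", "content": f"{instructions}\n\n---\n\n{markdown}"},
        ],
    )
    return response.choices[0].message.content
```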
tom1337 9 months ago

OpenAI recently announced a Batch API [1] which allows you to prepare all prompts and then run them as a batch. This reduces costs since it's just 50% of the price. I used it a lot with GPT-4o mini in the past and was able to prompt 3000 items in less than 5 minutes. Could be great for non-realtime applications.

[1] https://platform.openai.com/docs/guides/batch
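A rough sketch of that flow with the OpenAI Python SDK; the page contents and prompt are placeholders, and the batch completes asynchronously (within 24 hours):

```python
# Sketch: submit many extraction prompts as one batch at roughly half the price.
import json
from openai import OpenAI

client = OpenAI()
pages = {"page-1": "<simplified page text>", "page-2": "<simplified page text>"}  # placeholders

# One request per line in a JSONL file.
with open("batch.jsonl", "w") as f:
    for page_id, text in pages.items():
        request = {
            "custom_id": page_id,
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",
                "messages": [{"role": "user", "content": f"Extract the items as JSON:\n\n{text}"}],
            },
        }
        f.write(json.dumps(request) + "\n")

batch_file = client.files.create(file=open("batch.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
# Poll client.batches.retrieve(batch.id) until status == "completed",
# then download the output file for the results.
```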
namukang 9 months ago

For structured content (e.g. lists of items, simple tables), you really don't need LLMs.

I recently built a web scraper to automatically work on any website [0] and built the initial version using AI, but I found that using heuristics based on element attributes and positioning ended up being faster, cheaper, and more accurate (no hallucinations!).

For most websites, the non-AI approach works incredibly well, so I'd make sure AI is really necessary (e.g. data is unstructured, need to derive or format the output based on the page data) before incorporating it.

[0] https://easyscraper.com
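A toy version of that kind of structural heuristic, assuming BeautifulSoup (this is illustrative, not the commenter's actual implementation): find the container whose direct children most often repeat the same tag and class, and treat each repeated child as one record. Real sites need more rules, but it shows the no-LLM approach.

```python
# Sketch: structural heuristic for list-like pages, no LLM involved.
from collections import Counter
from bs4 import BeautifulSoup

def extract_repeated_items(html: str) -> list[str]:
    soup = BeautifulSoup(html, "html.parser")
    best_items, best_count = [], 0
    for container in soup.find_all(True):
        children = container.find_all(True, recursive=False)
        # Signature = (tag name, class list) of each direct child.
        signatures = Counter((c.name, tuple(c.get("class", []))) for c in children)
        if not signatures:
            continue
        signature, count = signatures.most_common(1)[0]
        if count > best_count:
            best_count = count
            best_items = [c for c in children if (c.name, tuple(c.get("class", []))) == signature]
    return [item.get_text(" ", strip=True) for item in best_items]
```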
parhamn 9 months ago

Is there an "HTML reducer" out there? I've been considering writing one. If you take a page's source, it's going to be 90% garbage tokens -- random JS, ads, unnecessary properties, aggressive nesting for layout rendering, etc.

I feel like if you used a DOM parser to walk the tree and only kept nodes with text, the HTML structure, and the necessary tag properties (class/id only, maybe?), you'd have significant savings. Perhaps the XPath thing might work better too. You could even drop unnecessary symbols and represent it as an indented text file.

We use Readability for things like this, but you lose the DOM structure, and its quality degrades with JS-heavy websites and pages with actions like "continue reading" which expand the text.

What's the gold standard for something like this?
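A rough sketch of such a reducer with BeautifulSoup, assuming class/id are the only attributes worth keeping; Readability-style main-content detection is deliberately left out:

```python
# Sketch: strip a page down to text-bearing nodes plus id/class before LLM use.
from bs4 import BeautifulSoup, Comment

DROP_TAGS = ["script", "style", "noscript", "svg", "iframe"]
KEEP_ATTRS = {"id", "class"}

def reduce_html(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(DROP_TAGS):
        tag.decompose()
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()
    for tag in soup.find_all(True):
        # Keep only id/class so the model still sees some structural hints.
        tag.attrs = {k: v for k, v in tag.attrs.items() if k in KEEP_ATTRS}
    for tag in soup.find_all(True):
        # Drop layout-only elements that carry no text (guard against already-removed nodes).
        if not tag.decomposed and not tag.get_text(strip=True):
            tag.decompose()
    return str(soup)
```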
antirez 9 months ago

It's very surprising that the author of this post does 99% of the work and writing and then does not go forward with the other 1%: downloading ollama (or some other llama.cpp based engine) and testing how some decent local LLM works in this use case. Because maybe a 7B or 30B model will do great in this use case, and that's cheap enough to run: no GPT-4o needed.
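For anyone who wants to try exactly that, a minimal sketch against a local model via the ollama Python client (assumes `ollama pull llama3.1` has already been run; the prompt and field names are illustrative):

```python
# Sketch: run the same extraction prompt against a local model through Ollama.
import ollama

def extract_locally(page_text: str) -> str:
    response = ollama.chat(
        model="llama3.1",
        messages=[{
            "role": "user",
            "content": f"Extract every item (title, price) as a JSON list:\n\n{page_text}",
        }],
    )
    return response["message"]["content"]
```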
hubraumhugo 9 months ago

We've been working on AI-automated web scraping at Kadoa[0] and our early experiments were similar to those in the article. We started when only the expensive and slow GPT-3 was available, which pushed us to develop a cost-effective solution at scale.

Here is what we ended up with:

- Extraction: We use codegen to generate CSS selectors or XPath extraction code. Using an LLM for every data extraction would be expensive and slow, but using LLMs to generate the scraper code and subsequently adapt it to website modifications is highly efficient.

- Cleansing & transformation: We use small fine-tuned LLMs to clean and map data into the desired format.

- Validation: Unstructured data is a pain to validate. Among traditional data validation methods like reverse search, we use LLM-as-a-judge to evaluate the data quality.

We quickly realized that doing this for a few data sources with low complexity is one thing; doing it for thousands of websites in a reliable, scalable, and cost-efficient way is a whole different beast.

Combining traditional ETL engineering methods with small, well-evaluated LLM steps was the way to go for us.

[0] https://kadoa.com
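A stripped-down sketch of that codegen step (not Kadoa's actual implementation): ask the model once for extraction code against a sample page, review it, and reuse it for every page with the same template.

```python
# Sketch: generate scraper code once, then run it on thousands of similar pages.
from openai import OpenAI

client = OpenAI()

def generate_scraper_code(sample_html: str, fields: list[str]) -> str:
    prompt = (
        "Write a Python function scrape(html) using BeautifulSoup that returns a list of "
        f"dicts with keys {fields} for pages structured like this sample:\n\n{sample_html}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    # Review the generated code before executing it against real pages.
    return response.choices[0].message.content

code = generate_scraper_code("<ul><li class='job'><h2>Title</h2><span>Berlin</span></li></ul>",
                             ["title", "location"])
print(code)
```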
btbuildem 9 months ago

I've had good luck with giving it an example of the HTML I want scraped and asking for a BeautifulSoup code snippet. Generally the structure of what you want to scrape remains the same, and it's a tedious exercise coming up with the garbled string of nonsense that ends up parsing it.

Using an LLM for the actual parsing is simultaneously overkill while risking your results being polluted with hallucinations.
abhgh 9 months ago

As others have mentioned here, you might get better results cheaper (this probably wasn't the point of the article, so just FYI) if you preprocess the HTML first. I personally have had good results with trafilatura[1], which I don't see mentioned yet.

[1] https://trafilatura.readthedocs.io/en/latest/
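A minimal trafilatura example of that preprocessing step (the URL is a placeholder):

```python
# Sketch: keep only the main content of a page before passing it to a model.
import trafilatura

downloaded = trafilatura.fetch_url("https://example.com/some-article")
main_text = trafilatura.extract(downloaded, include_tables=True)
print(main_text)
```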
artembugara 9 months ago

Wow, that's one of the most orange tag-rich posts I've ever seen.

We're doing a lot of tests with GPT-4o at NewsCatcher. We have to crawl 100k+ news websites and then parse news content. Our rule-based model for extracting data from any article works pretty well, and we never could find a way to improve it with GPT.

"Crawling" is much more interesting. We need to know all the places where news articles can be published: sometimes 50+ sub-sections.

Interesting hack: I think many projects (including us) can get away with generating the code for extraction, since the per-website structure rarely changes.

So, we're looking for an LLM to generate code to parse HTML.

Happy to chat/share our findings if anyone is interested: artem [at] newscatcherapi.com
kcorbitt 9 months ago

Funnily enough, web scraping was actually the motivating use case that started my co-founder and me building what is now openpipe.ai. GPT-4 is really good at it, but extremely expensive. But it's actually pretty easy to distill its skill at scraping a specific class of site down to a fine-tuned model that's way cheaper and also really good at scraping that class of site reliably.
jasonthorsness 9 months ago

I also had good results with structured outputs, scraping news articles for city names from https://lite.cnn.com for the "in the news" list at https://weather.bingo -- code here: https://www.jasonthorsness.com/13

I've had problems with hallucinations though, even for something as simple as city names; also the model often ignores my prompt and returns country names -- I'm thinking of trying a two-stage scrape with one model checking the output of the other.
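A sketch of that structured-outputs pattern using the OpenAI SDK's Pydantic parsing helper; the schema and article text are illustrative, and a second pass could validate the first model's list as suggested above:

```python
# Sketch: constrain the extraction to a fixed schema so only city names come back.
from openai import OpenAI
from pydantic import BaseModel

class CityMentions(BaseModel):
    cities: list[str]

client = OpenAI()
article_text = "Flooding hit Houston on Monday while wildfires burned near Athens, Greece."

completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "List the city names mentioned in the article. Cities only, never countries."},
        {"role": "user", "content": article_text},
    ],
    response_format=CityMentions,
)
print(completion.choices[0].message.parsed.cities)  # e.g. ["Houston", "Athens"]
```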
marcell 9 months ago

I'm working on a Chrome extension to do web scraping using OpenAI, and I've been impressed by what ChatGPT can do. It can scrape complicated text/HTML, and usually returns the correct results.

It's very early still, but check it out at https://FetchFoxAI.com

One of the cool things is that you can scrape non-uniform pages easily. For example, I helped someone scrape auto dealer leads from different websites: https://youtu.be/QlWX83uHgHs . This would be a lot harder with a "traditional" scraper.
zulko 9 months ago

Same experience here. I've been building a classical music database [1] where historical and composer life events are scraped off Wikipedia by asking ChatGPT to extract lists of `[{event, year, location}, ...]` from biographies.

- Using chatgpt-mini was the only cheap option; it worked well (although I have a feeling it's dumbing down these days) and made it virtually free.

- Just extracting the webpage text from HTML with `BeautifulSoup(html).text` slashes the number of tokens (but can be risky when dealing with complex tables).

- At some point I needed to scrape ~10,000 pages that have the same format, and it was much more efficient speed-wise and price-wise to provide ChatGPT with the HTML once and say "write some Python code that extracts data", then apply that code to the 10,000 pages. I'm thinking a very smart GPT-based web parser could do that, with dynamically generated scraping methods.

- Finally, because this article mentions tables: pandas has a very nice feature, `read_html("http://the-website.com")`, that will detect and parse all the tables on a page. But the article does a good job of pointing at websites where the method would fail because the tables don't use `<table/>`.

[1] https://github.com/Zulko/composer-timelines
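A short sketch of the two preprocessing tricks from this comment; the URL and file name are placeholders, and `pd.read_html` needs lxml or html5lib installed:

```python
# Sketch: cheap token reduction with BeautifulSoup, and table detection with pandas.
import pandas as pd
from bs4 import BeautifulSoup

html = open("biography.html", encoding="utf-8").read()   # placeholder local page
plain_text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
print(plain_text[:500])

# read_html returns one DataFrame per <table> it finds on the page.
tables = pd.read_html("https://the-website.com")
print(len(tables), tables[0].head())
```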
ammario 9 months ago
To scale such an approach you could have the LLM generate JS to walk the DOM and extract content, caching the JS for each page.
tuktuktuk 9 months ago

Can you share how long it took for you to parse the HTML? I recently experimented with comparing different AI models, including GPT-4o, alongside Gemini and Claude to parse raw HTML: https://serpapi.com/blog/web-scraping-with-ai-parsing-html-to-structured-data/. Result is pretty interesting.
mjrbds 9 months ago

We've had lots of success with this at Rastro.sh - but the biggest unlock came when we used this as benchmark data to build scraping code. Sonnet 3.5 is able to do this. It reduced our cost and improved accuracy for our use case (extracting e-commerce products), as some of these models are not reliable at extracting lists of 50+ items.
simonw 9 months ago

GPT-4o mini is 33x cheaper than GPT-4o, or 66x cheaper in batch mode. But the article says:

> I also tried GPT-4o mini but yielded significantly worse results so I just continued my experiments with GPT-4o.

Would be interesting to compare with the other inexpensive top tier models, Claude 3 Haiku and Gemini 1.5 Flash.
wslh 9 months ago

Isn't ollama an answer to this? Or is there something inherent to OpenAI that makes it significantly better for web scraping?
godber 9 months ago
I would definitely approach this problem by having the LLM write code to scrape the page. That would address the cost and accuracy problems. And also give you testable code.
mfrye0 9 months ago

As others have mentioned, converting HTML to markdown works pretty well.

With that said, we've noticed that for some sites that have nested lists or tables, we get better results by reducing those elements to a simplified HTML instead of markdown - essentially providing context for when the structures start and stop.

It's also been helpful for chunking docs, to ensure that lists / tables aren't broken apart in different chunks.
luigi23 9 months ago
Why are scrapers so popular nowadays?
ozr 9 months ago

GPT-4 (and Claude) are definitely the top models out there, but Llama, even the 8B, is more than capable of handling extraction like this. I've pumped absurd batches through it via vLLM.

With serverless GPUs, the cost has been basically nothing.
FooBarWidget 9 months ago

Can anyone recommend an AI vision web browsing *automation* framework rather than just scraping? My use case: automate the monthly task of logging into a website and downloading the latest invoice PDF.
nsonha 9 months ago

Most discussion I found about this topic is about how to extract information. Is there any technique for extracting interactive elements? I reckon listing all of the inputs/controls would not be hard, but finding the corresponding labels/articles might be tricky.

Another thing I wonder: regarding text extraction, would it be a crazy idea to just snapshot the page and ask it to OCR and generate a bare-minimum HTML table layout? That way both the content and the spatial relationship of elements are maintained (not sure how useful, but I'd like to keep it anyway).
mmasu 9 months ago

As a PoC, we first took a screenshot of the page, cropped it to the part we needed, and then passed it to GPT. One of the things we do is compare prices of different suppliers for the same product (i.e. airline tickets), and sometimes we need to do it manually. While the approach could look expensive, it is in general cheaper than a real person, and enables the real person to do more meaningful work... so it's a win-win. I am looking forward to putting this in production, hopefully.
fvdessen 9 months ago

This looks super useful, but from what I've heard, if you try to do this at any meaningful scale your scrapers will be blocked by Cloudflare and the like.
kanzure 9 months ago

Instead of directly scraping with GPT-4o, what you could do is have GPT-4o write a script for a simple web scraper and then use a prompt loop when something breaks or goes wrong.

I have the same opinion about a man and his animals crossing a river on a boat: instead of spending tokens on trying to solve a word problem, have it create a constraint solver and then run that. Same thing.
sentinels 9 months ago

What people mentioned above is pretty much what they did at Octabear, and as an extension of the idea it's also what a lot of startup applicants did for other types of media: video scraping, podcast scraping, audio scraping, etc. [0]

[0] https://www.octabear.com/
mateuszbuda 9 months ago

I think that LLM costs, even for GPT-4o, are probably lower compared to the proxy costs usually required for web scraping at scale. The cost of residential/mobile proxies is a few $ per GB. If I were to process cleaned data obtained using 1 GB of residential/mobile proxy transfer, I wouldn't pay more for the LLM.
Havoc 9 months ago

Asking for XPaths is clever!

Plus you can probably use that until it fails (website changes) and then just re-scrape it with an LLM request.
danielvaughn 9 months ago

The author claims that attempting to retrieve XPaths with the LLM proved to be unreliable. I've been curious about this approach because it seems like the best "bang for your buck" with regards to cost. I bet if you experimented more, you could probably improve your results.
timsuchanek 9 months ago

This is also how we started a while ago. I agree that it's too expensive, hence we're working on making this scalable and cheaper now! We'll soon launch, but here we go: https://expand.ai
impure 9 months ago

I was thinking of adding a feature to my app that uses LLMs to extract XPaths to generate RSS feeds from sites that don't support it. The section on XPaths is unfortunate.
bilater 9 months ago

Not sure why the author didn't use 4o-mini: 4o for reasoning, but things like parsing/summarizing can be done by cheaper models with little loss in quality.
raybb 9 months ago
On this note, does anyone know how Cursor scrapes websites? Is it just fetching locally and then feeding the raw html or doing some type of preprocessing?
the_cat_kittles 9 months ago

Is it really so hard to look at a couple of XPaths in Chrome? Insane that people actually use an LLM when trying to do this for real. We're headed where automakers are now - just put in idiot lights; no one knows how to work on any parts anymore. Suit yourself, I guess.
kimoz 9 months ago

Is it possible to achieve good results using open source models for scraping?
LetsGetTechnicl 9 months ago

Surely you don't need an LLM for this
Gee101 9 months ago

A bit off topic, but great post title.
blackeyeblitzar 9 months ago

I just want something that can take all my bookmarks, log into all my subscriptions using my credentials, and archive all those articles. I can then feed them to an LLM of my choice to ask questions later. But having the raw archive is the important part. I don't know if there are any easy-to-use tools to do this though, especially with paywalled subscription-based websites.
webprofusion 9 months ago
Just run the model locally?
fsndz 9 months ago
useful for one shot cases, but not more for the moment imo.
lccerina 9 months ago
We are re-opening coal plants to do this? Every day a bit more disgusted by GenAI stuff
albert_e 9 months ago

Off-topic:

What are some good frameworks for web scraping and PDF document processing -- some public and some behind a login, some requiring multiple clicks before the sites display relevant data?

We need to ingest a wide variety of data sources for one solution. Very few of those sources supply data as an API / JSON.
LetsGetTechnicl 9 months ago

I'm starting to think that LLMs are a solution in need of a problem, like how crypto and the blockchain were. Have we not already solved web scraping?