Minifying HTML for GPT-4o: Remove all the HTML tags

149 points by edublancas 9 months ago

17 comments

I don't think that Mercury Prize table is a representative example, because each column has an obviously unique structure that the LLM can key in on: (year) (Single Artist/Album pair) (List of Artist/Album pairs) (image) (citation link)

I think a much better test would be something like "List of elements by atomic properties" [1], which has a lot of adjacent numbers in a similar range and overlapping first/last column types. However, the danger with that table is that the values might be easy for the LLM to infer just from the element names, since they're well-known physical constants. The table of countries by population density [2] or the list of largest cities [3] might be less predictable.

The test should be repeated with every available sorting function too, to see if that causes any new errors.

[1] https://en.wikipedia.org/wiki/List_of_elements_by_atomic_properties#Table[1]

[2] https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population_density#List

[3] https://en.wikipedia.org/wiki/List_of_largest_cities#List

cpursley 8 months ago

What I do is convert to Markdown; that way you still get some semantic structure. I even built an Elixir library for this: https://github.com/agoodway/html2markdown
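A minimal sketch of the HTML-to-Markdown idea, in Python with only the standard library. It is not the html2markdown Elixir library linked above; the names `MarkdownSketch` and `html_to_markdown` are made up for illustration, and it assumes you only care about headings, paragraphs, list items, and links. Everything else is reduced to its text, and `<script>`/`<style>` content is dropped.

```python
import re
from html.parser import HTMLParser


class MarkdownSketch(HTMLParser):
    """Hypothetical converter for a small HTML subset: h1-h6, p, ul/ol, li, a."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []   # output fragments, joined at the end
        self.href = None  # href of the <a> currently open, if any
        self.skip = 0     # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1
        elif tag in ("h1", "h2", "h3", "h4", "h5", "h6"):
            self.parts.append("\n\n" + "#" * int(tag[1]) + " ")
        elif tag == "p":
            self.parts.append("\n\n")
        elif tag in ("ul", "ol"):
            self.parts.append("\n")
        elif tag == "li":
            # Simplification: ordered lists are rendered as bullets too.
            self.parts.append("\n- ")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.parts.append("[")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip = max(0, self.skip - 1)
        elif tag == "a":
            self.parts.append(f"]({self.href or ''})")
            self.href = None

    def handle_data(self, data):
        if not self.skip:
            # Collapse whitespace runs but keep single spaces at the edges,
            # so text around inline tags does not get glued together.
            self.parts.append(re.sub(r"\s+", " ", data))


def html_to_markdown(html):
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    lines = [line.strip() for line in "".join(parser.parts).splitlines()]
    out = []
    for line in lines:
        # Keep a blank line only if the previous kept line was not blank.
        if line or (out and out[-1]):
            out.append(line)
    return "\n".join(out).strip()


if __name__ == "__main__":
    print(html_to_markdown('<h2>Hello</h2><p>Read the <a href="https://example.com">docs</a>.</p>'))
```

A real converter (turndown, html2markdown, and similar) also handles tables, emphasis, code blocks, and nested lists, which this sketch ignores.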

beepbooptheory 8 months ago

You step back and realize: we are thinking about how best to remove some symbols from documents that not a moment ago we were deciding certainly needed to be in there, all to feed a certain kind of symbol machine which has seen all the symbols before anyway, all so we don't pay as many cents for the symbols we know or think we need.

If I were not a human but some other kind of being suspended above this situation, with no skin in the game so to speak, it would all seem so terribly inefficient... But as a fleshy mortal I do understand how we got here.

yawnxyz 8 months ago

I found that reducing HTML down to Markdown using turndown or https://github.com/romansky/dom-to-semantic-markdown works well.

If you want the AI to be able to select stuff, give it cheerio or jQuery access to navigate through the HTML document.

If you need to give tags, classes, and ids to the LLM, I use an html-to-pug converter like https://www.npmjs.com/package/html2pug, which strips a lot of text and cuts costs. I don't think LLMs are particularly trained on Pug content though, so take this with a grain of salt.

ravedave5 8 months ago

ChatGPT is clearly trained on Wikipedia. Is there any concern about its knowledge from there polluting the responses? It seems like it would be better to test against data it didn't potentially already know.

CharlieDigital 8 months ago

I came to roughly the same conclusion a few months back and wrote a simple, containerized, open-source, general-purpose scraper for use with GPT, using Playwright in C# and TypeScript, that's fairly easy to deploy and use with GPT function calling [0]. My observation was that using `document.body.innerText` was sufficient for GPT to "understand" the page, and `document.body.innerText` preserves some whitespace in Firefox (and I think Chrome).

I use more or less this code as a starting point for a variety of use cases, and it seems to work just fine for mine (scraping and processing travel blogs, which tend to have pretty consistent layouts/structures).

Some variations can make this better by adding logic to look for the `main` content and ignore `nav` and `footer` (or variants thereof, whether using semantic tags or CSS selectors), taking only the `innerText` from the main container.

[0] https://github.com/CharlieDigital/playwright-scrape-api
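The "take only the main container and ignore `nav` and `footer`" variation can be sketched without a browser. The following is a rough Python approximation using only the standard library, not the Playwright code from the linked repository; `MainTextExtractor`, `extract_main_text`, and the tag sets are assumptions chosen for illustration, and the result is only loosely comparable to what a browser returns for `innerText`.

```python
import re
from html.parser import HTMLParser

# Assumed tag sets for this sketch; adjust for the pages you scrape.
IGNORED = {"nav", "footer", "script", "style"}
BLOCK = {"p", "div", "section", "article", "li", "br", "tr",
         "h1", "h2", "h3", "h4", "h5", "h6"}


class MainTextExtractor(HTMLParser):
    """Collects visible text, preferring <main> and skipping IGNORED subtrees."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.ignored_depth = 0  # > 0 while inside an IGNORED element
        self.main_depth = 0     # > 0 while inside <main>
        self.saw_main = False
        self.all_text = []      # text from the whole document
        self.main_text = []     # text found inside <main> only

    def _emit(self, text):
        if self.ignored_depth:
            return
        self.all_text.append(text)
        if self.main_depth:
            self.main_text.append(text)

    def handle_starttag(self, tag, attrs):
        if tag in IGNORED:
            self.ignored_depth += 1
        elif tag == "main":
            self.main_depth += 1
            self.saw_main = True
        elif tag in BLOCK:
            self._emit(" ")  # block boundary acts as a word separator

    def handle_endtag(self, tag):
        if tag in IGNORED:
            self.ignored_depth = max(0, self.ignored_depth - 1)
        elif tag == "main":
            self.main_depth = max(0, self.main_depth - 1)
        elif tag in BLOCK:
            self._emit(" ")

    def handle_data(self, data):
        self._emit(data)


def extract_main_text(html):
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    # Fall back to the whole document when the page has no <main>.
    chunks = parser.main_text if parser.saw_main else parser.all_text
    return re.sub(r"\s+", " ", "".join(chunks)).strip()
```

Falling back to the whole document when there is no `<main>` keeps the function usable on older layouts, at the cost of letting some boilerplate through.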

simplecto 8 months ago

One of my projects is a virtual agency of multiple LLMs for a variety of back-office services (copywriting, copy-editing, social media, job ads, etc).

We ingest your data wherever you point our crawlers and then clean it for use in RAGs or chained LLMs.

One library we like a lot is Trafilatura [1]. It does a great job of taking the full HTML page and returning the most semantically relevant parts.

It works well for LLM work as well as for generating embeddings for vectors and downstream things.

[1] https://trafilatura.readthedocs.io/en/latest/

longnguyen 8 months ago

I've been building an AI chat client and I use this exact technique to develop the "Web Browsing" plugin. Basically I use Function Calling to extract content from a web page and then pass it to the LLM.

There are a few optimizations we can make:

- strip all content in <script/> and <style/>
- use Readability.js for articles
- extract structured content from oEmbed

It works surprisingly well for me, even with gpt-4o-mini
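The first optimization in the list can be sketched as follows: drop `<script>` and `<style>` elements and keep the rest of the markup. This is an illustrative Python version using the standard library's `html.parser`, not the code from the chat client described here; `ScriptStyleStripper` and `strip_script_and_style` are hypothetical names. As a side effect of the approach, comments and the doctype are dropped as well.

```python
import html
from html.parser import HTMLParser

SKIPPED = ("script", "style")


class ScriptStyleStripper(HTMLParser):
    """Re-emits the document while leaving out <script> and <style> elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skip += 1
        elif not self.skip:
            # get_starttag_text() returns the start tag exactly as written.
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>: emit once, with no separate end tag.
        if not self.skip and tag not in SKIPPED:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            self.skip = max(0, self.skip - 1)
        elif not self.skip:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip:
            # Character references were decoded by the parser; re-escape them.
            self.out.append(html.escape(data, quote=False))


def strip_script_and_style(document):
    parser = ScriptStyleStripper()
    parser.feed(document)
    parser.close()
    return "".join(parser.out)
```

Keeping the remaining tags is a choice made for this sketch; if you only need text, you can discard the tags too and keep just the data.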

coddle-hark 8 months ago

Anecdotally, the same seems to apply to the output format as well. I’ve seen much better performance when instructing the model to output something like this:

<pre><code>
name=john,age=23
name=anna,age=26
</code></pre>

Rather than this:

<pre><code>
{
  matches: [
    { name: "john", age: 23 },
    { name: "anna", age: 26 }
  ]
}
</code></pre>
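If the model is asked for the compact line format, the caller has to parse it back into records. Here is a small Python sketch of that step; `parse_kv_lines` is a made-up name, and it assumes that neither keys nor values contain commas or newlines, which the comment above does not specify.

```python
def parse_kv_lines(text):
    """Parse lines like 'name=john,age=23' into a list of dicts.

    Assumption for this sketch: values never contain ',' or newlines.
    All values are returned as strings; convert types afterwards if needed.
    """
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        record = {}
        for pair in line.split(","):
            key, sep, value = pair.partition("=")
            if not sep:
                continue  # skip fragments without '='
            record[key.strip()] = value.strip()
        if record:
            records.append(record)
    return records


if __name__ == "__main__":
    print(parse_kv_lines("name=john,age=23\nname=anna,age=26"))
```

The trade-off is that this format has no quoting or nesting, so it only suits flat records with predictable values.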

giancarlostoro 8 months ago

I wonder if this is due to some template engines looking minimalist like that. I think maybe Pug? https://github.com/pugjs/pug?tab=readme-ov-file#syntax

It is whitespace sensitive though, but essentially looks like that. I doubt this is the only template engine like this though.

cj 8 months ago

Related article from 4 days ago (with comments on scraping, specifically discussing removing HTML tags): https://news.ycombinator.com/item?id=41428274

Edit: looks like it's actually the same author

cfcfcf 8 months ago

I’m curious. Scraping seems to come up a lot lately. What is everyone scraping? And why?

topaz0 8 months ago

Is .8 or .9 considered good enough accuracy for something as simple as this?

bbarnett 8 months ago

A simple |htmltotext works well here, I suspect. Why rewrite the thing from scratch? It even outputs formatted text if requested.

It's quite good, and certainly good enough for GPT input.

IncreasePosts 8 months ago

Isn't GPT-4o multimodal? Shouldn't I be able to just feed in an image of the rendered HTML, instead of doing work to strip tags out?

simonw 8 months ago

I built a CLI tool (and Python library) for this a while ago called strip-tags: https://github.com/simonw/strip-tags

By default it will strip all HTML tags and return just the text:

<pre><code>
curl 'https://simonwillison.net/' | strip-tags
</code></pre>

But you can also tell it you just want to get back the area of a page identified by one or more CSS selectors:

<pre><code>
curl 'https://simonwillison.net/' | strip-tags .quote
</code></pre>

Or you can ask it to keep specific tags if you think those might help provide extra context to the LLM:

<pre><code>
curl 'https://simonwillison.net/' | strip-tags .quote -t div -t blockquote
</code></pre>

Add "-m" to minify the output (basically stripping most whitespace).

Running this command:

<pre><code>
curl 'https://simonwillison.net/' | strip-tags .quote -t div -t blockquote -m
</code></pre>

Gives me back output that starts like this:

<pre><code>
<div class="quote segment"> <blockquote>history | tail -n 2000 | llm -s "Write aliases for my zshrc based on my terminal history. Only do this for most common features. Don't use any specific files or directories."</blockquote> — anjor # 3:01 pm / ai, generative-ai, llms, llm </div>
<div class="quote segment"> <blockquote>Art is notoriously hard to define, and so are the differences between good art and bad art. But let me offer a generalization: art is something that results from making a lot of choices. […] to oversimplify, we can imagine that a ten-thousand-word short story requires something on the order of ten thousand choices. When you give a generative-A.I. program a prompt, you are making very few choices; if you supply a hundred-word prompt, you have made on the order of a hundred choices. If an A.I. generates a ten-thousand-word story based on your prompt, it has to fill in for all of the choices that you are not making.</blockquote> — Ted Chiang # 10:09 pm / art, new-yorker, ai, generative-ai, ted-chiang </div>
</code></pre>

I also often use the https://r.jina.ai/ proxy: append a URL to it and it extracts the key content (using Puppeteer) and returns it converted to Markdown, e.g. https://r.jina.ai/https://simonwillison.net/2024/Sep/2/anatomy-of-a-textual-user-interface/

sergiotapia 8 months ago

In Elixir, I select the `<body>`, then remove all script and style tags, then extract the text.

This gives you roughly the innerText you'd get in a browser, which is nice and light to pass into LLMs.

<pre><code>
defp extract_inner_text(html) do
  html
  |> Floki.parse_document!()
  |> Floki.find("body")
  # Returning nil from the callback removes the node from the tree.
  |> Floki.traverse_and_update(fn
    {tag, _attrs, _children} = _node when tag in ["script", "style"] -> nil
    node -> node
  end)
  |> Floki.text(sep: " ")
  |> String.trim()
  # Collapse runs of whitespace into single spaces.
  |> String.replace(~r/\s+/, " ")
end
</code></pre>
