Show HN: Convert HTML DOM to semantic markdown for use in LLMs

146 points by leroman, 10 months ago

19 comments

mistercow, 10 months ago

This is cool. When dealing with tables, you might want to explore departing from markdown. I’ve found that LLMs tend to struggle with tables that have large numbers of columns containing similar data types. Correlating a row is easy enough, because the data is all together, but connecting a cell back to its column becomes a counting task, which appears to be pretty rough.

A trick I’ve found seems to work well is leaving some kind of id or coordinate marker on each column, and adding that to each cell. You could probably do that while still having valid markdown if you put the metadata in HTML comments, although it’s hard to say how an LLM will do at understanding that format.

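[Editor's note] A minimal sketch of what that column-marker trick could look like when emitting a Markdown table from a DOM table. The helper name and the `c1`, `c2`, ... ids are illustrative assumptions, not part of the library being discussed:

```typescript
// Illustrative sketch (not the library's code): give each column an id and repeat
// it in every cell inside an HTML comment, so the output stays valid Markdown while
// letting a model tie a cell back to its column without counting pipes.
function tableWithColumnMarkers(table: HTMLTableElement): string {
  const rows = Array.from(table.rows).map(row =>
    Array.from(row.cells).map(cell => cell.textContent?.trim() ?? '')
  );
  if (rows.length === 0) return '';

  const [header, ...body] = rows;
  const ids = header.map((_, i) => `c${i + 1}`);
  const line = (cells: string[]) =>
    `| ${cells.map((text, i) => `<!-- ${ids[i]} --> ${text}`).join(' | ')} |`;

  return [
    line(header),
    `| ${header.map(() => '---').join(' | ')} |`,
    ...body.map(line),
  ].join('\n');
}
```
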
gmaster1440, 10 months ago

> Semantic Clarity: Converts web content to a format more easily "understandable" for LLMs, enhancing their processing and reasoning capabilities.

Are there any data or benchmarks available that show what kind of text content LLMs understand best? Is it generally understood at this point that they "understand" markdown better than html?

DeveloperErrata, 10 months ago

It's neat to see this getting attention. I've used similar techniques in production RAG systems that query over big collections of HTML docs. In our case the primary motivator was higher token efficiency (i.e., representing the same semantic content with a smaller token count).

I've found that LLMs are often bad at understanding "standard" markdown tables (of which there are many different types, see: https://pandoc.org/chunkedhtml-demo/8.9-tables.html). In our case, we found the best results when keeping the HTML tables in HTML and only converting the non-table parts of the doc to markdown. We also strip the table tags of any attributes, with the exception of colspan and rowspan, which are semantically meaningful for more complicated HTML tables. I'd be curious whether there are LLM performance differences between the approach the author uses here (it seems to be based on repeating column names for each cell?) and just preserving the original HTML table structure.

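[Editor's note] A rough sketch of the "keep tables as HTML but strip the noise" step described above. The function name and the exact whitelist are assumptions based on the comment, not code from the library:

```typescript
// Sketch of the approach described above: keep the table as HTML but drop every
// attribute except colspan/rowspan, which carry real structural meaning for
// merged cells.
function stripTableAttributes(table: HTMLTableElement): void {
  const keep = new Set(['colspan', 'rowspan']);
  for (const el of [table, ...Array.from(table.querySelectorAll<HTMLElement>('*'))]) {
    for (const attr of Array.from(el.attributes)) {
      if (!keep.has(attr.name.toLowerCase())) {
        el.removeAttribute(attr.name);
      }
    }
  }
}
```
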
richardreeze, 10 months ago

This is really cool. I've already implemented it in one of my tools (I found it to work better than the Turndown/Readability combination I was previously using).

One request: it would be great if you also had an option for showing the page's schema (which is contained inside the HTML).

la_fayette, 10 months ago

The scoring approach seems interesting for extracting the main content of web pages. I am aware of the decades of research on that subject, with sophisticated image- or NLP-based approaches. Since this extraction is critical to the quality of the LLM response, it would be good to know how well it performs. For example, you could test it against a benchmark dataset (https://github.com/scrapinghub/article-extraction-benchmark). You could also provide the option to plug in another extraction algorithm, since other implementations are available... just some ideas for improvement.

gradientDissent, 10 months ago

Nice work. Main content extraction based on the `<main>` tag won't work with most web pages these days. Arc90 could help.

kartoolOz, 10 months ago

WebArena does this really well; it calls the result the "accessibility_tree": https://github.com/web-arena-x/webarena/blob/main/browser_env/processors.py#L47

nvartolomei, 10 months ago

While I was writing a tool for myself to summarise the top N posts from HN, Google Trends, and RSS feed subscriptions daily, I ran into the same problem.

The quick solution was to use Beautiful Soup and readability-lxml to try to get the main article contents and then send them to an LLM.

The results are OK when the markup is semantic. Often it is not. Then you have tables, images, weirdly positioned footnotes, etc.

I believe the best way to extract information the way it was intended to be presented is to screenshot the page and send it to a multimodal LLM for "interpretation". Has anyone experimented with that approach?

The aspirational goal for the tool is to be the Presidential Daily Brief, but for everyone.

KolenCh, 10 months ago

I am curious how it would compare to using pandoc with a readability algorithm, for example.

alexliu518, 10 months ago

Converting web pages to Markdown is a common requirement. I have found that Turndown does a good job, but it cannot handle all dynamic web page content. As far as I know, if you need to process dynamic web pages, you need targeted adaptation, such as Chrome extensions like Web2Markdown.

throwthrowuknow, 10 months ago

Thank you! I'm always looking for new options to use for archiving and ingesting web pages, and this looks great! Even better that it's an npm package!

nbbaier, 10 months ago

This is really cool! Any plans to add Deno support? This would be a great fit for environments like val.town[0], but they are based on a Deno runtime and I don't think this will work out of the box.

Also, when trying to run the Node example from your readme, I had to use `new dom.window.DOMParser()` instead of `dom.window.DOMParser`.

[0]: https://val.town

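[Editor's note] For anyone trying the same thing, this is roughly what that Node setup could look like with jsdom. The `convertHtmlToMarkdown` import and the `overrideDOMParser` option name are assumptions pieced together from this comment rather than verified against the current README:

```typescript
// Sketch of the Node.js usage described above, with the fix the comment mentions:
// pass an *instance* of jsdom's DOMParser rather than the class itself.
// The import name and option name are assumptions; check the package README.
import { JSDOM } from 'jsdom';
import { convertHtmlToMarkdown } from 'dom-to-semantic-markdown';

const dom = new JSDOM('<!DOCTYPE html>');
const html = '<article><h1>Hello</h1><p>Some <strong>content</strong>.</p></article>';

const markdown = convertHtmlToMarkdown(html, {
  overrideDOMParser: new dom.window.DOMParser(), // not dom.window.DOMParser
});
console.log(markdown);
```
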
KolenCh, 10 months ago

Has anyone compared the performance between HTML input and other formats? I did an informal comparison, and from a few tests it seems the HTML input is better. I thought having markdown input would be more efficient, but I'd like to see a more systematic comparison to see whether that is the case.

brightvegetable, 10 months ago

This is great, I was just in need of something like this. Thanks!

explosion-s, 10 months ago

How is this different from any other HTML-to-markdown library, like Showdown or Turndown? Are there any specific features that make it better for LLMs specifically, rather than just converting HTML to MD?

Layvier, 10 months ago

Nice, we have this exact use case for data extraction from scraped webpages. We've been using html-to-md, how does it compare to it?

Zetaphor, 10 months ago

A browser demo would be a nice addition to this readme.

DevX101, 10 months ago

Problem is, with modern websites, everything is a div and you can't necessarily infer semantic meaning from the DOM elements.

ianbicking, 10 months ago

This is a great idea! There's an exceedingly large amount of junk in a typical HTML page that an LLM can't use in any useful way.

A few thoughts:

1. URL Refification [sic] would only save tokens if a link is referred to many times, right? Otherwise it seems best to keep locality of reference. Though to the degree that URLs are opaque to the LLM, I suppose they could be turned into references without any destination in the source at all, and if the LLM refers to a ref link you just look up the real link in the mapping.

2. Several of the suggestions here could be alternate serializations of the AST, but it's not clear to me how abstract the AST is (especially since it's labelled as htmlToMarkdownAST). And now that I look at the source it's kind of abstract but not entirely: https://github.com/romansky/dom-to-semantic-markdown/blob/main/src/core/htmlToMarkdownAST.ts – when writing code like this I also find that keeping the AST fairly abstract helps with the implementation. (That said, you'll probably still be making something that is Markdown-ish, because you'll be preserving only the data Markdown is able to represent.)

3. With a more formal AST you could replace the big switch in https://github.com/romansky/dom-to-semantic-markdown/blob/main/src/core/markdownASTToString.ts with a class that can be subclassed to override how particular nodes are serialized (see the sketch after this comment).

4. But I can also imagine something where there's a node type like "markdown-literal", and to change the serialization someone could, say, go through and find all the type:"table" nodes, translate them into type:"markdown-literal", and then serialize the result.

5. A more advanced parsing might also turn things like headers into sections, and introduce more of a tree of nodes (I think the AST is flat currently?). I think it's likely that an LLM would follow `<header-name-slug>...</header-name-slug>` better than `# Header Name\n ....` (at least sometimes, as an option).

6. Even fancier: run it with some full renderer (not sure what the options are these days) and start using getComputedStyle() and heuristics based on bounding boxes and the like to infer even more structure.

7. Another use case that could be useful is to be able to "name" pieces of the document so the LLM can refer to them. The result doesn't have to be valid Markdown, really, just a unique identifier put in the right position. (In a sense this is what URL reification can do, but only for URLs?)

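[Editor's note] Point 3 might look something like the following. The node shapes and class names are invented for illustration and do not correspond to the library's actual AST types:

```typescript
// Hypothetical illustration of point 3: replace the big switch with a serializer
// class whose per-node methods can be overridden. The AST shape here is invented.
type MdNode =
  | { type: 'text'; content: string }
  | { type: 'heading'; level: number; content: string }
  | { type: 'table'; rows: string[][] };

class MarkdownSerializer {
  serialize(nodes: MdNode[]): string {
    return nodes.map(n => this.serializeNode(n)).join('\n\n');
  }

  protected serializeNode(node: MdNode): string {
    switch (node.type) {
      case 'text':
        return node.content;
      case 'heading':
        return `${'#'.repeat(node.level)} ${node.content}`;
      case 'table':
        return this.table(node);
    }
  }

  // Subclasses override just the pieces they want serialized differently.
  protected table(node: { rows: string[][] }): string {
    return node.rows.map(r => `| ${r.join(' | ')} |`).join('\n');
  }
}

// Example: a subclass that emits tables as raw HTML (point 4's "markdown-literal"
// idea could be handled in the same way).
class HtmlTableSerializer extends MarkdownSerializer {
  protected override table(node: { rows: string[][] }): string {
    const rows = node.rows
      .map(r => `<tr>${r.map(c => `<td>${c}</td>`).join('')}</tr>`)
      .join('');
    return `<table>${rows}</table>`;
  }
}
```
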