
Hey HN!

A ton of document parsing solutions have been coming out lately, each claiming SOTA with little evidence. A lot of these turned out to be LLM or LVM wrappers that hallucinate frequently on complex tables.

We just released RD-TableBench, an open benchmark to help teams evaluate extraction performance on complex tables. The benchmark covers a variety of challenging scenarios, including scanned tables, handwriting, language detection, merged cells, and more.

We employed an independent team of PhD-level human labelers who manually annotated 1000 complex table images from a diverse set of publicly available documents.

Alongside this, we're also releasing a new bioinformatics-inspired algorithm for grading table similarity. Would love to hear any feedback!

- Raunak
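The post does not describe how the grading algorithm works. Purely as an illustration of what a bioinformatics-style table comparison could look like, here is a minimal sketch that assumes a Needleman-Wunsch-style global alignment applied first to the cells within a row and then to the rows of the table. The function names, the zero gap penalty, and the use of `difflib` for cell similarity are all assumptions made for this example, not details taken from RD-TableBench.

```python
from difflib import SequenceMatcher

# Assumption: skipping a row or cell costs nothing; it simply earns no credit.
GAP = 0.0


def cell_score(a: str, b: str) -> float:
    """Similarity of two cell strings in [0, 1] (hypothetical choice of metric)."""
    return SequenceMatcher(None, a, b).ratio()


def align(seq_a, seq_b, score, gap=GAP):
    """Needleman-Wunsch global alignment; returns the best total score."""
    n, m = len(seq_a), len(seq_b)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1]),  # pair items
                dp[i - 1][j] + gap,  # item only in seq_a
                dp[i][j - 1] + gap,  # item only in seq_b
            )
    return dp[n][m]


def row_score(row_a, row_b) -> float:
    """Align the cells of two rows and normalize by the longer row."""
    longest = max(len(row_a), len(row_b))
    if longest == 0:
        return 1.0
    return align(row_a, row_b, cell_score) / longest


def table_similarity(truth, pred) -> float:
    """Align the rows of two tables and normalize by the longer table."""
    longest = max(len(truth), len(pred))
    if longest == 0:
        return 1.0
    return align(truth, pred, row_score) / longest


if __name__ == "__main__":
    truth = [["Item", "Total"], ["Widgets", "100"]]
    pred = [["Item", "Total"], ["Widgets", "108"]]
    print(table_similarity(truth, pred))
```

With these assumptions, identical tables score 1.0, a dropped row lowers the score in proportion to the table length, and a misread cell earns partial credit. Merged cells would need an extra representation step that this sketch leaves out.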

4 comments

gregw2 7 months ago

I have real-world bank statements that no PDF/AI extractor I've found can do a good job on.

(To summarize, the core challenge appears to be recognizing nested columnar layouts combined with odd line wrapping within those columns.)

Is there anyone I can submit a few example pages to for consideration in some benchmark?


adit_a 7 months ago

Part of the goal with releasing the dataset is to highlight how hard PDF parsing can be. Reducto models are SOTA, but they aren't perfect.

We constantly see alternatives show one ideal table to claim they're accurate. Being able to parse some tables is not hard.

What happens when it has merged cells, dense text, rotations, or no gridlines? Will your table outputs be the same when a user uploads a document twice?

Our team is relentlessly focused on solving for the true range of scenarios so our customers don't have to. Excited to share more about our next gen models soon.

michaefe 7 months ago

Not surprising to see Reducto at the top; it's by far the best option we've tried.

nparsan 7 months ago

This is great, but are there datasets for this already? I know PubTables is like 1M labeled data points. Also, how important are table schemas as a % of overall unstructured documents?



Rd-TableBench – Accurately evaluating table extraction
