TechEcho

RD-TableBench – Accurately evaluating table extraction

29 points by raunakchowdhuri 7 months ago

Hey HN!

A ton of document parsing solutions have been coming out lately, each claiming SOTA with little evidence. A lot of these turned out to be LLM or LVM wrappers that hallucinate frequently on complex tables.

We just released RD-TableBench, an open benchmark to help teams evaluate extraction performance for complex tables. The benchmark includes a variety of challenging scenarios including scanned tables, handwriting, language detection, merged cells, and more.

We employed an independent team of PhD-level human labelers who manually annotated 1000 complex table images from a diverse set of publicly available documents.

Alongside this, we are also releasing a new bioinformatics-inspired algorithm for grading table similarity. Would love to hear any feedback!

-Raunak
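The post does not say which algorithm the benchmark uses. As a rough illustration of what a bioinformatics-style table similarity measure can look like, here is a minimal sketch that assumes Needleman-Wunsch global alignment, applied first to the cells within a row and then to the rows of the table. The function names, the exact-match cell scoring, and the zero gap penalty are all assumptions made for illustration and are not taken from RD-TableBench.

```python
from typing import Callable, List, Sequence

Table = List[List[str]]


def align_score(a: Sequence, b: Sequence,
                sim: Callable[[object, object], float],
                gap: float = 0.0) -> float:
    """Needleman-Wunsch global alignment score between sequences a and b.

    `sim` scores a pair of aligned items; `gap` is the score added when an
    item is left unaligned (0.0 here, an assumption for this sketch).
    """
    n, m = len(a), len(b)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + gap
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + sim(a[i - 1], b[j - 1]),  # align the pair
                dp[i - 1][j] + gap,                          # skip item in a
                dp[i][j - 1] + gap,                          # skip item in b
            )
    return dp[n][m]


def cell_similarity(x: str, y: str) -> float:
    # Hypothetical cell scorer: exact match after trimming whitespace.
    return 1.0 if x.strip() == y.strip() else 0.0


def row_similarity(r1: Sequence[str], r2: Sequence[str]) -> float:
    # Align cells, then normalize by the longer row so missing cells cost score.
    longest = max(len(r1), len(r2))
    if longest == 0:
        return 1.0
    return align_score(r1, r2, cell_similarity) / longest


def table_similarity(truth: Table, pred: Table) -> float:
    # Align rows using row_similarity as the match score, normalized to [0, 1].
    longest = max(len(truth), len(pred))
    if longest == 0:
        return 1.0
    return align_score(truth, pred, row_similarity) / longest
```

Normalizing by the longer sequence means a prediction that drops or invents rows or cells scores below 1.0 even when everything it does contain is correct. A real grader would likely use a softer cell scorer (for example, an edit-distance ratio) so that small OCR errors are not counted as total misses.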

4 comments

gregw2 7 months ago
I have real-world bank statements that no PDF/AI extractor I've found can do a good job on.

(To summarize, the core challenge appears to be recognizing nested columnar layout formats combined with odd line wrapping within those columns.)

Is there anyone I can submit a few example pages to for consideration in some benchmark?
adit_a 7 months ago
Part of the goal with releasing the dataset is to highlight how hard PDF parsing can be. Reducto models are SOTA, but they aren't perfect.

We constantly see alternatives show one ideal table to claim they're accurate. Being able to parse some tables is not hard.

What happens when it has merged cells, dense text, rotations, or no gridlines? Will your table outputs be the same when a user uploads a document twice?

Our team is relentlessly focused on solving for the true range of scenarios so our customers don't have to. Excited to share more about our next gen models soon.
michaefe 7 months ago
Not surprising to see Reducto at the top; it's by far the best option we've tried.
nparsan 7 months ago
This is great, but are there datasets for this already? I know pubtables is like 1M labeled data points. Also how important are table schemas as a % of overall unstructured documents?