
Ask HN: What's the current best way to extract tables from PDFs?

4 points by shekhar101, over 1 year ago
I have a set of PDFs that are bank statements. The formats differ by bank, but the set of formats is limited (<15). What's the current best approach to extract tabular data from PDFs? I tried writing custom logic based on pdfplumber and similar libraries, but the result is very fragile and full of ad-hoc logic, and the maintenance burden is high. Are there small models that can run, preferably on CPUs alone, and that I could fine-tune for this task? Any guides or pointers for that? I see a lot of available models, but as someone with no ML background, I find them difficult to navigate.

2 comments

jonahbenton, over 1 year ago
Field report: the problem is subtle. I wrote code to do this for my own statements, rather than use CSVs, because the statement is a regulated document, which CSVs are not, and it has balances for validation, which CSVs also lack.

I wound up with a pipeline of pdftotext -> configurable regexes to capture the transactions within their respective sections (banks list credits and debits separately without indicating the sign in the amount field) -> BNF parser to turn transaction lines into data, then a check that start balance + transactions = end balance.

PITB but works well.

Over the winter I will be standing up a local model to see whether a sophisticated prompt can reliably accomplish the same.

Not going to base any workflow on my transaction data on hosted models.
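A minimal Python sketch of the pipeline described above, for illustration only and not the commenter's actual code. It assumes the text has already been extracted (for example with `pdftotext -layout`), and it swaps the BNF parser for a single regex with named groups. The sample statement text, section headings, `BANK_CONFIG`, and all function names are hypothetical.

```python
import re
from decimal import Decimal

# Hypothetical statement text, standing in for the output of a PDF-to-text step.
SAMPLE_TEXT = """\
Beginning balance   1,000.00
Deposits and other credits
01/03  PAYROLL ACME CORP        2,500.00
01/15  TRANSFER FROM SAVINGS      200.00
Withdrawals and other debits
01/05  RENT                     1,200.00
01/20  GROCERY STORE               85.50
Ending balance      2,414.50
"""

AMOUNT = r"(?P<amount>[\d,]+\.\d{2})"

# Per-bank configuration: every pattern here is an assumption about one layout.
# Supporting another bank would mean adding another config, not new logic.
BANK_CONFIG = {
    # Section heading -> sign applied to the unsigned amounts listed under it.
    "sections": [
        (re.compile(r"^Deposits and other credits"), Decimal(1)),
        (re.compile(r"^Withdrawals and other debits"), Decimal(-1)),
    ],
    "transaction": re.compile(r"^(?P<date>\d{2}/\d{2})\s+(?P<desc>.+?)\s+" + AMOUNT + r"$"),
    "start": re.compile(r"^Beginning balance\s+" + AMOUNT + r"$"),
    "end": re.compile(r"^Ending balance\s+" + AMOUNT + r"$"),
}


def to_decimal(text):
    """Convert an amount such as '1,200.00' to Decimal (exact, no float rounding)."""
    return Decimal(text.replace(",", ""))


def parse_statement(text, config):
    """Return (start_balance, transactions, end_balance) from extracted text."""
    sign = None  # sign of the section currently being read
    start = end = None
    transactions = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if (m := config["start"].match(line)):
            start = to_decimal(m["amount"])
            continue
        if (m := config["end"].match(line)):
            end = to_decimal(m["amount"])
            sign = None
            continue
        for pattern, section_sign in config["sections"]:
            if pattern.match(line):
                sign = section_sign
                break
        else:
            # Not a heading: treat it as a transaction if we are inside a section.
            m = config["transaction"].match(line)
            if m and sign is not None:
                transactions.append((m["date"], m["desc"], sign * to_decimal(m["amount"])))
    return start, transactions, end


def check_balance(start, transactions, end):
    """Validate that start balance + signed transactions equals end balance."""
    return start + sum((amt for _, _, amt in transactions), Decimal(0)) == end


start, txns, end = parse_statement(SAMPLE_TEXT, BANK_CONFIG)
print(check_balance(start, txns, end))
```

The balance check is what makes this approach maintainable: when a bank changes its layout and a regex silently stops matching, the totals no longer reconcile and the failure is visible immediately.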
andrewio, over 1 year ago
To extract tables from PDFs, you can use the following tools:

1. Tabula (https://tabula.technology): a free and open-source tool.
2. Parsio (https://parsio.io): uses pre-trained AI models for data extraction from PDFs, emails, and other formats.
3. Airparser (https://airparser.com): uses a GPT-based approach, similar to ChatGPT, for data extraction from PDFs, emails, and other formats.