Why extracting data from PDFs is still a nightmare

8 points by lxm 17 days ago

2 comments

cratermoon 17 days ago
"Unfortunately Mistral-OCR has still the usual VLM curse: with challenging manuscripts, it hallucinates completely."

My personal experience relates to the statement from the article, "According to several studies, approximately 80–90 percent of the world's organizational data is stored as unstructured data in documents, much of it locked away in formats that resist easy extraction."

In my experience, this means Microsoft Word and PowerPoint documents authored by people who put more focus on the appearance than the structure of the content. Take one of these documents and generate PDF from it and any hint of structure that existed is gone.

There was an article on HN not too long ago discussing this history of Word documents and the lack of structure, but I can't find the link. ETA: https://ia.net/topics/markdown-and-the-slow-fade-of-the-formatting-fetish
bediger4000 17 days ago
My credit union will only give out monthly statements as PDFs. For some reason, the PDF-to-text converters don't strip out rows or lines of text, but rather columns, so a semi-automated solution is out of the question. I've resorted to mouse-copying entries in Firefox, then pasting the text into a word processor to get rows of data I can work with.

The tellers are magnificently ignorant about this, as is the telephone helpline. To them, the PDF actually is the data the credit union uses. No other form of data exists, except possibly in an Excel spreadsheet, and they can't give data in that format. I blame the prevalence of Windows for this. Between the use of the file name "extension" to indicate the format of the file, hiding the "extension" in file browsers, the single-document-at-a-time orientation, and almost exclusive use of WYSIWYG systems like Word and Excel, it's pretty hard to understand that a difference between "the data" and "the formatting" exists.
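
For illustration only: if a converter really does emit a statement column by column, the rows can sometimes be rebuilt by lining the columns back up. The sketch below assumes, hypothetically, that each column arrives as a block of lines separated from the next block by a blank line. Real converter output may look quite different, and `columns_to_rows` and the sample values are made up for this example.

```python
import re
from itertools import zip_longest


def columns_to_rows(text, sep="\t"):
    """Regroup column-ordered text into row-ordered lines.

    Assumption (not from any specific converter): each column is a block
    of lines, and blocks are separated by one or more blank lines.
    """
    # Split on blank lines, then keep the non-empty cells of each block.
    blocks = [
        [line.strip() for line in block.splitlines() if line.strip()]
        for block in re.split(r"\n\s*\n", text)
    ]
    blocks = [b for b in blocks if b]
    # Take the i-th cell of every column to form row i; pad short columns.
    return [sep.join(cells) for cells in zip_longest(*blocks, fillvalue="")]


# Made-up sample: a date column, a description column, an amount column.
sample = "01/02\n01/05\n\nGROCERY\nPAYROLL\n\n-42.10\n1500.00"
rows = columns_to_rows(sample)
for row in rows:
    print(row)
```

Padding with `zip_longest` keeps a short column from silently dropping rows, but it cannot tell which cell is actually missing, so uneven columns still need a manual check.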