Why do we convert structured data to PDFs?

18 points by n0rlant1s, over 2 years ago
Company A has structured data. They put it into a PDF (making it unstructured) and send it to Company B. Company B now has to use PDF parsing software to turn it back into structured data.

Why?

6 comments

PaulHoule, over 2 years ago
Back in the day company A would send a paper document to company B and naturally somebody would have to retype it. PDF is great for that legacy workflow or anything where you need print output or screen output that exactly resembles print output.

PDF has facilities for tagging documents such that they can be reflowed like HTML, so they can be viewed on different sized screens. It is a boon for accessibility, but framing the discussion around accessibility, as opposed to a better experience for everyone (particularly automated tools), makes it a hard sell. (e.g. in politics there is the analogy of how we "can't have good things" because policies that are good for everyone get framed as policies that benefit a racial or other group perceived as a "special interest")

I spoke w/ Larry Masinter at Adobe and he told me Adobe would like people who want structured data in their PDF documents to simply attach files to the PDF. A scientific paper could contain a CSV file of the data, for instance, or a business document could contain a JSON or XML document.

Note that "structured" is not a panacea, because the structure might not be the same in the two organizations. For exchange of structured data to take place, the organizations have to agree on some ontology. That happens in some industries some of the time, but it isn't free, and when it is not in place people still have an excuse to continue using paper processes or processes that emulate paper processes.
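
As a rough illustration of the ontology point above, here is a minimal Python sketch of what "agreeing on the structure" can look like in practice. The field names, the sample invoice, and the `translate` helper are all hypothetical; nothing here comes from the thread or from any real exchange standard.

```python
# Hypothetical record as Company A stores it (field names are made up).
COMPANY_A_INVOICE = {"inv_no": "A-1001", "amt_cents": 125000, "ccy": "USD"}

# The agreed mapping from Company A's field names to Company B's.
# Negotiating this table is the part that "isn't free".
A_TO_B_FIELDS = {
    "inv_no": "invoice_id",
    "amt_cents": "total_minor_units",
    "ccy": "currency",
}


def translate(record: dict, field_map: dict) -> dict:
    """Rename fields per the agreed mapping; fail loudly on unmapped fields."""
    unmapped = set(record) - set(field_map)
    if unmapped:
        raise KeyError(f"no agreed mapping for fields: {sorted(unmapped)}")
    return {field_map[key]: value for key, value in record.items()}


company_b_invoice = translate(COMPANY_A_INVOICE, A_TO_B_FIELDS)
print(company_b_invoice)
```

Any field that was never negotiated raises an error instead of being silently dropped, which is roughly where a fallback to a paper-like process tends to appear.
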
aynyc, over 2 years ago
Many reasons. In finance, PDF reports are passed between companies instead of JSON/XML, etc., because:

1. PDF is considered tamper-proof. Obviously not true, but legal is OK with that.

2. PDF can be reviewed quickly by non-technical folks, and then parsed and stored in databases.

3. PDF is a flat file that can be archived easily per legal; other formats such as Word documents are used for that as well.

In a sense, PDF is what people want. Structured data is what machines want.
NWoodsman, over 2 years ago
There are too many variables and edge cases to parse data. Dozens of text encodings, mixed with dozens of markup languages, mixed with millions of uniquely preserved legacy datasets, add up to an exponential number of edge cases across the formats the world's data is currently stored in. And when you consider the high-power companies with financial investment in legacy data, as well as high-power companies protecting the proprietary rights and trademarks of their existing formats, the world has maximum incentive to stick with the status quo: a PostScript-generated PDF which, due to its legacy, happens to lack the structure you want.

On a more philosophical level, the PDF has a structure which is probably the most generalized structure across all domains: paragraphs of text on a page. Consider that most people barely know how to search a text file for a given word, and only a minuscule percentage of those people know how to query a SQL database. People simply do not have the time or resources to learn a separate domain (data structure design and interaction) apart from their own domain. In other words, there are very few people who understand, or even have the motivation to use, tools that provide an exponential return on their time (such as manipulating/filtering/working with structured data). Time passes uniformly, and you typically receive no reward other than more work for learning tools to improve your own workflow.

Software engineers have long noticed that we can successfully create "models", "view models", and "views" of data that achieve the separation of concerns that you are seeking. A PDF is nothing more than a "view" of data, which has passed through a professional who has created a "view model" of that data (he/she decided how best to organize the data on the page), and then you read the document and "parse" the data with your intellect. There is a lot of expertise and professionalism embedded in crafting paragraphs (or other graphical representations) that you can't discredit.

There are very few software options for treating generalized, domain-specific data in this three-step manner.
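
The model / view model / view framing above can be sketched in a few lines of Python. This is only a toy under stated assumptions: plain text lines stand in for a PDF page, and the sample rows, the line layout, and all function names are hypothetical.

```python
import re

# Model: the structured data Company A starts with (values are made up).
model = [
    {"item": "Widget", "qty": 3, "unit_price": 4.50},
    {"item": "Gadget", "qty": 1, "unit_price": 19.99},
]


def to_view_model(rows):
    """View model: decide what the reader should see (here, line totals)."""
    return [(r["item"], r["qty"], f"{r['qty'] * r['unit_price']:.2f}") for r in rows]


def render_view(view_model):
    """View: flat text lines, standing in for text laid out on a page."""
    return "\n".join(f"{item} x{qty} .... {total}" for item, qty, total in view_model)


# Company B has to guess the layout to get structure back out of the view.
LINE = re.compile(r"^(?P<item>.+) x(?P<qty>\d+) \.{4} (?P<total>\d+\.\d{2})$")


def parse_view(text):
    """Recover whatever structure survived rendering."""
    parsed = []
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            parsed.append(
                {"item": match["item"], "qty": int(match["qty"]), "total": float(match["total"])}
            )
    return parsed


page = render_view(to_view_model(model))
recovered = parse_view(page)
print(page)
print(recovered)
```

The recovered rows carry the totals that the view model chose to show but not the unit prices, since those never made it onto the "page". The parser also breaks as soon as the layout changes, which is the edge-case problem described at the top of the comment.
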
dredmorbius, over 2 years ago
Data exchange formats are a detail which can be, often are, and quite frankly *should* be specified in partnership and/or vendor contracts.

PDF-based interchange of structured data as part of an ongoing relationship ... seems to reflect poor business relationship management.

(And yes, there are all manner of organisations which fail to follow good practices, whether on grounds of competence or malice, but generally, this is how I'd suggest addressing this issue. I'd also strongly suggest checking to see if such a data exchange option is already available.)
temp12323984, over 2 years ago
Exactly the sales pitch of https://sento.io/. Their platform allows companies to send structured data directly to other companies (circumventing the error-prone and potentially labour-intensive ds > pdf > sd transformation).

Also interesting as a business case, because connecting one company to another requires one connector on each side. Adding another company only requires 1 (not 2 or 3) connector on the new company's side, since the existing connectors remain valid by hooking into the sento.io platform.

Note: I'm not affiliated, I just came across them a few months ago and this reminded me of them.
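
The connector arithmetic above can be checked with a short Python sketch. It assumes, as the comment does, one connector per side of each direct link and one connector per company when going through a shared platform; the function names are hypothetical and say nothing about how sento.io actually works.

```python
def point_to_point_total(companies: int) -> int:
    """Every company builds one connector per partner: n * (n - 1) overall."""
    return companies * (companies - 1)


def hub_total(companies: int) -> int:
    """Every company builds a single connector to the shared platform."""
    return companies


def new_connectors_point_to_point(existing: int) -> int:
    """A newcomer builds one connector per partner, and each partner builds one back."""
    return 2 * existing


def new_connectors_hub(existing: int) -> int:
    """Only the newcomer builds a connector; existing ones are untouched."""
    return 1


for n in (2, 3, 10):
    print(n, point_to_point_total(n), hub_total(n))
```

With direct links the total grows quadratically with the number of companies, while the platform model grows linearly, which is the business case the comment points to.
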
rhacker, over 2 years ago
Many businesses DON'T do that, and have adopted structured data transfers. I imagine you're working in an older industry like real estate?