There are too many variables and edge cases in parsing data. Dozens of text encodings, mixed with dozens of markup languages, mixed with millions of uniquely preserved legacy datasets, produce a combinatorial explosion of edge-case requirements across the world's stored data. And when you consider the powerful companies financially invested in legacy data, as well as the powerful companies protecting the proprietary rights and trademarks of their existing formats, the world has maximum incentive to stick with the status quo: a PostScript-generated PDF which, due to its legacy, happens to lack the structure you want.

On a more philosophical level, the PDF has what is probably the most generalized structure across all domains: paragraphs of text on a page. Consider that most people barely know how to search a text file for a given word, and a minuscule percentage of them know how to query a SQL database. People simply do not have the time or resources to learn a separate domain (data structure design and interaction) apart from their own. In other words, very few people understand, or even have the motivation to use, tools that provide an outsized return on their time, such as manipulating, filtering, and otherwise working with structured data. Time passes uniformly, and you typically receive no reward other than more work for learning tools that improve your own workflow.

Software engineers have long recognized that we can create "models", "view models", and "views" of data to achieve the separation of concerns you are seeking. A PDF is nothing more than a "view" of data that has passed through a professional who created a "view model" of it (he or she decided how best to organize the data onto the page); you then read the document and "parse" the data with your intellect. There is a lot of expertise and professionalism embedded in crafting paragraphs (and other graphical representations) that you can't discredit.

There are very few software options for treating generalized, domain-specific data in this three-step manner.
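To make that three-step pipeline concrete, here is a minimal sketch of the model → view model → view separation. All the names here (Invoice, InvoiceViewModel, toViewModel, renderView) are hypothetical, chosen only to illustrate the idea; a PDF is essentially the output of the last step, with the first two steps thrown away.

```typescript
// The "model": structured, machine-friendly data.
interface Invoice {
  id: string;
  issuedOn: Date;
  lineItems: { description: string; cents: number }[];
}

// The "view model": a professional's decisions about how to organize
// the data for presentation — ordering, formatting, labeling.
interface InvoiceViewModel {
  title: string;
  dateLabel: string;
  rows: string[];
  totalLabel: string;
}

function toViewModel(invoice: Invoice): InvoiceViewModel {
  const totalCents = invoice.lineItems.reduce((sum, item) => sum + item.cents, 0);
  return {
    title: `Invoice ${invoice.id}`,
    dateLabel: `Issued: ${invoice.issuedOn.toISOString().slice(0, 10)}`,
    // Presentation choices live here: column widths, currency formatting, etc.
    rows: invoice.lineItems.map(
      (item) => `${item.description.padEnd(20)} $${(item.cents / 100).toFixed(2)}`
    ),
    totalLabel: `Total: $${(totalCents / 100).toFixed(2)}`,
  };
}

// The "view": the flattened, human-readable artifact. This is the step a
// PDF represents, except the structure that produced it is discarded.
function renderView(vm: InvoiceViewModel): string {
  return [vm.title, vm.dateLabel, ...vm.rows, vm.totalLabel].join("\n");
}

const invoice: Invoice = {
  id: "2024-001",
  issuedOn: new Date("2024-01-15"),
  lineItems: [
    { description: "Consulting", cents: 150000 },
    { description: "Travel", cents: 32050 },
  ],
};

console.log(renderView(toViewModel(invoice)));
```

Reading a PDF amounts to reversing the last arrow by intellect alone: you recover the view model (and maybe the model) from the rendered view, because the earlier representations were never shipped alongside it.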