Layout analysis is the key.
Quite a bit of work has been going on recently in this area. Some papers of relevance:

- X. Zhong, J. Tang, and A. Jimeno Yepes, "PubLayNet: largest dataset ever for document layout analysis," Aug. 2019. Preprint: https://arxiv.org/abs/1908.07836 Code/data: https://github.com/ibm-aur-nlp/PubLayNet
- B. Pfitzmann, C. Auer, M. Dolfi, A. S. Nassar, and P. Staar, "DocLayNet: a large human-annotated dataset for document-layout analysis," Aug. 2022. Data: https://developer.ibm.com/exchanges/data/all/doclaynet/
- S. Appalaraju, B. Jasani, B. U. Kota, Y. Xie, and R. Manmatha, "DocFormer: end-to-end transformer for document understanding," in Proc. IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
The first is for scientific publications. From the abstract: "...the PubLayNet dataset for document layout analysis by automatically matching the XML representations and the content of over 1 million PDF articles that are publicly available on PubMed Central. The size of the dataset is comparable to established computer vision datasets, containing over 360 thousand document images, where typical document layout elements are annotated".

The second covers a broader range of document types. It contains 80K manually annotated pages drawn from diverse data sources to represent a wide variability in layouts. For each PDF page, the layout annotations provide labelled bounding boxes with a choice of 11 distinct classes. DocLayNet also provides a subset of double- and triple-annotated pages to determine inter-annotator agreement.
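If you want to poke at either dataset, both distribute their annotations as COCO-style JSON, so the labelled boxes are easy to inspect. A minimal sketch in Python (the file path is an assumption; point it at whichever annotation split you downloaded):

    import json
    from collections import Counter

    # Assumed path: adjust to wherever you unpacked the COCO annotations.
    with open("doclaynet/COCO/val.json") as f:
        coco = json.load(f)

    # Map category ids to class names (DocLayNet defines 11 classes,
    # e.g. "Text", "Title", "Table", "Picture").
    categories = {c["id"]: c["name"] for c in coco["categories"]}

    # Count annotated regions per class across the split.
    counts = Counter(categories[a["category_id"]] for a in coco["annotations"])
    for name, n in counts.most_common():
        print(f"{name}: {n}")

    # Each annotation is a labelled bounding box in [x, y, width, height]
    # pixel coordinates, linked to a page image via image_id.
    first = coco["annotations"][0]
    print(first["image_id"], categories[first["category_id"]], first["bbox"])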