Hi HN!
This is my pet project, written from scratch because there is so much to discover and learn in the process. The focus is on simplicity and incremental updates.
Progress is slow because I do not have much spare time to work on this, but I would love to hear some feedback.
Regards
Be careful with PDF! There are many ambiguities in the specification that are implemented differently across parsers, as well as malformations that almost all parsers silently accept without warning. It is very easy to accidentally produce so-called file format schizophrenia, where the same file is rendered differently by two parsers. For example, what if you have a PDF stream object whose declared length doesn't agree with the position of its `endstream` token? What if you have a PDF dictionary with duplicate keys? Do you use the value of the first key or the second? What if you have two valid PDFs concatenated one after the other? Do you render the first or the second? What if an object in the XREF table has an incorrect offset?<p>Shameless plug: I am one of the maintainers of PolyFile, which, among other things, can produce an interactive HTML hex editor with an annotated syntax tree for dozens of filetypes, including PDF. For PDF, it uses a dynamically instrumented version of the PDFminer parser. It sounds like it might satisfy your use case.<p><a href="https://github.com/trailofbits/polyfile" rel="nofollow">https://github.com/trailofbits/polyfile</a>
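To make the first of those ambiguities concrete, here is a minimal sketch that compares a stream's declared `/Length` with the number of bytes actually found before `endstream`. The function name `check_stream_length` is hypothetical, and the sketch assumes that `/Length` is a direct integer (not an indirect reference) and that the stream data does not itself contain the bytes `endstream`. It is not how any particular parser resolves the conflict; it only detects it.

```python
import re


def check_stream_length(obj_bytes):
    """Return (declared_length, actual_length) for the first stream found.

    Assumptions (hypothetical helper, not from any library):
    - /Length is a direct integer such as "/Length 5", not "/Length 12 0 R"
    - the stream data does not contain the bytes b"endstream"
    """
    # The \b plus negative lookahead rejects indirect references like "12 0 R".
    length_match = re.search(rb"/Length\s+(\d+)\b(?!\s+\d+\s+R)", obj_bytes)
    # The stream keyword must be followed by CRLF or LF before the data starts.
    start_match = re.search(rb"stream\r?\n", obj_bytes)
    if length_match is None or start_match is None:
        raise ValueError("no direct /Length or no stream keyword found")

    declared = int(length_match.group(1))
    data_start = start_match.end()
    end = obj_bytes.find(b"endstream", data_start)
    if end == -1:
        raise ValueError("no endstream token found")

    data = obj_bytes[data_start:end]
    # One end-of-line marker before endstream is not part of the data.
    for eol in (b"\r\n", b"\n", b"\r"):
        if data.endswith(eol):
            data = data[: -len(eol)]
            break
    return declared, len(data)


good = b"1 0 obj\n<< /Length 5 >>\nstream\nHello\nendstream\nendobj\n"
bad = b"1 0 obj\n<< /Length 3 >>\nstream\nHello\nendstream\nendobj\n"
print(check_stream_length(good))
print(check_stream_length(bad))
```

When the two numbers differ, one parser may trust `/Length` and another may scan for `endstream`, which is exactly how the same file ends up rendering differently.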
I never knew about the J number suffix in python: <a href="https://docs.python.org/3/reference/lexical_analysis.html#imaginary-literals" rel="nofollow">https://docs.python.org/3/reference/lexical_analysis.html#im...</a> which it would appear is used to represent references: <a href="https://github.com/desgeeko/pdfsyntax/blob/main/tests/test_parsing.py#L17" rel="nofollow">https://github.com/desgeeko/pdfsyntax/blob/main/tests/test_p...</a><p>I wish you good luck, this file format has tripped up many, <i>many</i> a developer. It blew up on a pdf I had lying around:<p><pre><code> ValueError: could not convert string to float: b'5.0.0'
104 0 obj <<
/Producer (pdfTeX-1.40.10)
/Creator (TeX)
/CreationDate (D:20131209161146-08'00')
/ModDate (D:20131209161146-08'00')
/Trapped /False
/PTEX.Fullbanner (This is pdfTeX, Version 3.1415926-1.40.10-2.2 (TeX Live 2009/Debian) kpathsea version 5.0.0)
>> endobj
</code></pre>
as it seems a string with nested parens jams up the parser
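For illustration, a PDF literal string may contain balanced, unescaped parentheses, so the parser has to track nesting depth instead of stopping at the first `)`. The sketch below shows that idea only; `parse_literal_string` is a hypothetical helper, not the project's actual code, and it leaves backslash escapes undecoded in the returned bytes.

```python
def parse_literal_string(data, pos=0):
    """Parse a PDF literal string that starts with b"(" at data[pos].

    Returns (raw_content, index_after_closing_paren). Escape sequences are
    copied through verbatim rather than decoded (a simplification).
    """
    if data[pos:pos + 1] != b"(":
        raise ValueError("literal string must start with '('")
    depth = 1
    i = pos + 1
    out = bytearray()
    while i < len(data):
        c = data[i:i + 1]
        if c == b"\\":
            # An escaped byte never changes the nesting depth.
            out += data[i:i + 2]
            i += 2
            continue
        if c == b"(":
            depth += 1
        elif c == b")":
            depth -= 1
            if depth == 0:
                return bytes(out), i + 1
        out += c
        i += 1
    raise ValueError("unterminated literal string")


banner = b"(This is pdfTeX (TeX Live 2009/Debian) kpathsea version 5.0.0)"
content, end = parse_literal_string(banner)
print(content)
```

With depth tracking, the trailing `5.0.0` stays inside the string instead of being handed to the number parser.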
This is really wonderful, thank you! It's great to see someone focusing on the internal structure of PDF files (the "Syntax" chapter of the spec), and doing things with a focus on browsing the internal structure etc. (I had a similar idea and did something in Rust/WASM back in May; let me see if I can dust it off and put it on GitHub. Edit: not very usable, but here FWIW: <a href="https://github.com/shreevatsa/pdf-explorer" rel="nofollow">https://github.com/shreevatsa/pdf-explorer</a>)<p>In particular, there are so many PDF libraries/tools that simply hide all the structure and try to provide an easy interface to the user, but they are always limited in various ways. Something like your project that focuses on parsing and browsing is really needed IMO.
I’ve done a bit of PDF wrangling in Python, so figured I’d describe the lay of the land.<p>PyPDF [1] is great for reading and writing PDF files, especially dealing with pages, but it’s not great for generating paths, shapes, graphics, etc.<p>However, reportlab [2] has a great API for generating those things, but is lacking in the file IO and page management department. But the content streams it generates can be plugged into PyPDF pretty easily.<p>Finally, there’s pdfplumber which does an amazing job of parsing tabular data from PDF structures, and pytesseract which can perform OCR on PDFs that are actually just image data rather than structured data.<p>There’s not really a one-stop-shop for PDFs, but some pretty good tools that can be combined to get the job done.<p>Will be curious to see how this project develops!<p>[1] <a href="https://pypi.org/project/PyPDF2/" rel="nofollow">https://pypi.org/project/PyPDF2/</a><p>[2] <a href="https://pypi.org/project/reportlab/" rel="nofollow">https://pypi.org/project/reportlab/</a>
As others have already written, there are many slightly invalid PDF files out there in the wild that many readers can display mostly fine and which your library should also be able to handle.<p>If you can, grab yourself a copy of the most recent PDF 2.0 specification since it contains much more information and is much more correct in terms of how to implement things. Also have a look at the errata at <a href="https://pdf-issues.pdfa.org/32000-2-2020/index.html" rel="nofollow">https://pdf-issues.pdfa.org/32000-2-2020/index.html</a>.<p>As I'm implementing a PDF library (in Ruby), I have started to collect some situations that arise in the wild but are not spec-compliant, see <a href="https://github.com/gettalong/annotated-pdf-spec" rel="nofollow">https://github.com/gettalong/annotated-pdf-spec</a>. That might help you in parsing some invalid PDFs
Thank you, this is much needed. The most reliable way of generating PDFs that I have used recently is:
- create a DOCX with the content and some template variable strings, like {{}}
- unpack the document, get to the text, and replace the variables
- use a DOCX->PDF Linux tool to generate the document.<p>Maybe this will be a good solution.
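The unpack-and-replace steps above can be sketched with the standard library alone, since a DOCX file is a ZIP archive whose main text lives in `word/document.xml`. The function name `fill_docx_template` and the `{{name}}` placeholder syntax are assumptions for illustration. The sketch also assumes each placeholder sits inside a single text run; Word often splits text across runs, in which case a plain string replace will miss it. The DOCX->PDF conversion step is left out.

```python
import io
import zipfile
from xml.sax.saxutils import escape


def fill_docx_template(docx_bytes, values):
    """Return new DOCX bytes with {{key}} placeholders replaced.

    Hypothetical helper: only word/document.xml is rewritten, and every
    placeholder is assumed to be contiguous in the XML.
    """
    out_buf = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as src, \
            zipfile.ZipFile(out_buf, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            payload = src.read(item.filename)
            if item.filename == "word/document.xml":
                text = payload.decode("utf-8")
                for key, value in values.items():
                    # Escape &, < and > so the result stays valid XML.
                    text = text.replace("{{" + key + "}}", escape(value))
                payload = text.encode("utf-8")
            # Every other archive member is copied through unchanged.
            dst.writestr(item, payload)
    return out_buf.getvalue()


# Build a tiny stand-in archive (not a complete, openable DOCX) for the demo.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as z:
    z.writestr("[Content_Types].xml", "<Types/>")
    z.writestr("word/document.xml", "<w:t>Hello {{name}}</w:t>")

filled = fill_docx_template(demo.getvalue(), {"name": "A & B"})
with zipfile.ZipFile(io.BytesIO(filled)) as z:
    print(z.read("word/document.xml").decode("utf-8"))
```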
Good luck, I once started to scratch the same itch to learn this file format, several years later I think I got about 30% of the way through!<p>More open source PDF code is good. If you can find a version of iText RUPS application from somewhere on the internet it's a useful tool for viewing the syntax / structure.
I once had to help an accountant friend fill in thousands of docx files and convert them to PDF. No open source tool does a proper conversion; it really sucked.
Good to see work in the PDF space. It’s still one of the most important formats. I would love to see more time invested in tools that can create PDF/A documents, which I believe to be the sane subset of PDF.
I desperately need to be able to display SVG files with gradients in PDFs, but no such library currently exists in Python as far as I know.<p>I would be willing to help make this happen, but I do not know much about the PDF format.
Neat. Another use case for which you might want to think about a sample is extracting data from filled PDF forms. (That use case is why I once had to write a PDF parser.)<p>Since you read and write, maybe also a use case of programmatically filling some form fields in an editable PDF form, such as pre-filling some of the fields for a particular Web site user in a dynamically modified PDF form they download. The source PDF form can then be hand-crafted and maintained separately, as people often want to do, rather than generated from scratch by your code.
I once had to parse a reference manual, provided as a PDF, and emit a mostly-usable CSV of its content.<p>That shit was <i>hard</i>. Writing PDF is one thing but there are some psychopathic PDFs out there when you scratch below the surface. People do .... well, you'll find out.
This reminds me of back in the day, when we got some properties and thought: PDF is a defined file format, every PDF has these values…<p>We were so naive and didn't know.
I wish you all the best! This space has a lot of stuff in it, and all of it is lacking in some aspect. And that's not an admonishment: PDF is such a complicated format that there will never be a library that doesn't come with asterisks. It's just a matter of picking the thing you want your library to focus on and be good at, and you can pretty easily be someone's favorite lib.
I had some luck with Camelot (<a href="https://camelot-py.readthedocs.io/en/master/" rel="nofollow">https://camelot-py.readthedocs.io/en/master/</a>). However, as many of the comments here say, PDF is a beast.
Hey desgeeko,<p>we have a Python PDF renderer left over from a past project; it might be somehow useful or inspirational…<p><a href="https://github.com/systori/bericht" rel="nofollow">https://github.com/systori/bericht</a>
there is a Perl library that does this, but it only supports pdf 1.5<p><a href="https://metacpan.org/pod/CAM::PDF" rel="nofollow">https://metacpan.org/pod/CAM::PDF</a><p>I have used it in the past.
Hey, I took a look at your GitHub page, and I'm wondering whether you can provide more information in the readme so I can better understand the value of the product.
Good luck! Really! I hate ReportLab!<p>I hate using ReportLab … reading its code is fascinating. Interesting seeing what 1990s Python code looked like.
In the field of PDF parsing, I think the most interesting project I have encountered is pdfquery[1], where the PDF is parsed as an XML tree and you can use XPath to query it. You might encounter rough edges when you put it into production, but the idea is like “how come I never thought of this”, because PDF has a tree-like structure and this should be a straightforward solution.<p>[1]: <a href="https://github.com/jcushman/pdfquery" rel="nofollow">https://github.com/jcushman/pdfquery</a>
Why is the state of the art in PDF parsing SO BAD? This is an incredibly common and important problem. Tika and fitz have very poor results. What is the reason that this is still so backwards?