Show HN: I am building a new Python library to read/write PDF files

279 points by desgeeko over 2 years ago

Hi HN! This is my pet project, written from scratch because there is so much to discover and learn in the process. The focus is on simplicity and incremental updates. Progress is slow because I do not have much spare time to work on this, but I would love to hear some feedback. Regards

29 comments

ESultanik over 2 years ago

Be careful with PDF! There are many ambiguities in the specification that are implemented differently between parsers, as well as implicitly accepted malformations that almost all parsers will silently accept without warning. It is very easy to accidentally produce so-called file format schizophrenia: when the same file is rendered differently between two parsers. For example, with PDF, what if you have a PDF object stream that has a length that doesn't agree with the position of its `endstream` token? What if you have a PDF dictionary with duplicate keys? Do you use the value of the first key or the second? What if you have two valid PDFs concatenated one after the other? Do you render the first or the second? What if an object in the XREF table has an incorrect offset?

Shameless plug: I am one of the maintainers of PolyFile, which, among other things, can produce an interactive HTML hex editor with an annotated syntax tree for dozens of filetypes, including PDF. For PDF, it uses a dynamically instrumented version of the PDFminer parser. It sounds like it might satisfy your use case.

https://github.com/trailofbits/polyfile
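
To make the first of these ambiguities concrete, here is a minimal sketch of a check for a stream whose declared /Length does not land on its `endstream` token. It is not taken from PolyFile or pdfsyntax; the name `find_length_mismatches` is made up for illustration, and it assumes a direct integer /Length with no nested dictionary or hex string after it in the stream dictionary. Stream data that happens to contain the bytes `endstream` will also skew the reported distance.

```python
import re

# Tail of a stream dictionary with a direct integer /Length, followed by the
# `stream` keyword and its end-of-line marker. Indirect lengths such as
# "/Length 12 0 R" are deliberately skipped (assumption of this sketch).
_STREAM_RE = re.compile(
    rb"/Length\s+(\d+)\b(?!\s+\d+\s+R)[^>]*>>\s*stream(?:\r\n|\n)"
)


def find_length_mismatches(data: bytes) -> list[tuple[int, int, int]]:
    """Return (data_offset, declared_length, distance_to_endstream) for each
    stream whose declared /Length does not end right before `endstream`."""
    mismatches = []
    for match in _STREAM_RE.finditer(data):
        declared = int(match.group(1))
        start = match.end()
        # Look at the bytes where the declared length says the data stops,
        # allowing for one end-of-line marker before the keyword.
        tail = data[start + declared:start + declared + 16]
        if tail.lstrip(b"\r\n").startswith(b"endstream"):
            continue
        actual = data.find(b"endstream", start)
        distance = actual - start if actual != -1 else -1
        mismatches.append((start, declared, distance))
    return mismatches
```

For each reported stream a parser has to choose between trusting /Length and scanning for `endstream`, and that choice is exactly where two parsers can diverge.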

mdaniel over 2 years ago

I never knew about the J number suffix in Python: https://docs.python.org/3/reference/lexical_analysis.html#imaginary-literals which it would appear is used to represent references: https://github.com/desgeeko/pdfsyntax/blob/main/tests/test_parsing.py#L17

I wish you good luck, this file format has tripped up many, many a developer. It blew up on a PDF I had lying around:

    ValueError: could not convert string to float: b'5.0.0'

    104 0 obj
    <<
    /Producer (pdfTeX-1.40.10)
    /Creator (TeX)
    /CreationDate (D:20131209161146-08'00')
    /ModDate (D:20131209161146-08'00')
    /Trapped /False
    /PTEX.Fullbanner (This is pdfTeX, Version 3.1415926-1.40.10-2.2 (TeX Live 2009/Debian) kpathsea version 5.0.0)
    >>
    endobj

as it seems a string with nested parens jams up the parser
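
The /PTEX.Fullbanner value above is a literal string containing balanced, unescaped parentheses, which the PDF syntax allows. A minimal sketch of reading such a string by tracking nesting depth follows. `read_literal_string` is a hypothetical helper, not pdfsyntax's actual code, and it leaves escape sequences undecoded.

```python
def read_literal_string(data: bytes, pos: int) -> tuple[bytes, int]:
    """Read a PDF literal string that starts with '(' at data[pos].

    Returns (raw_content, index_just_after_closing_paren). Escape sequences
    are skipped over but not decoded (simplification of this sketch).
    """
    if data[pos:pos + 1] != b"(":
        raise ValueError(f"expected '(' at position {pos}")
    depth = 0
    i = pos
    while i < len(data):
        byte = data[i]
        if byte == 0x5C:  # backslash: the next byte is escaped, skip both
            i += 2
            continue
        if byte == 0x28:  # '(' opens one more nesting level
            depth += 1
        elif byte == 0x29:  # ')' closes one level
            depth -= 1
            if depth == 0:
                return data[pos + 1:i], i + 1
        i += 1
    raise ValueError("unterminated literal string")


sample = (
    b"/PTEX.Fullbanner (This is pdfTeX, Version 3.1415926-1.40.10-2.2 "
    b"(TeX Live 2009/Debian) kpathsea version 5.0.0) >>"
)
content, end = read_literal_string(sample, sample.index(b"("))
```

With the depth counter, the inner `(TeX Live 2009/Debian)` no longer terminates the string early, so `5.0.0` stays inside the string instead of being handed to a number parser.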

svat over 2 years ago

This is really wonderful, thank you! It's great to see someone focusing on the internal structure of PDF files (the "Syntax" chapter of the spec), and doing things with a focus on browsing the internal structure etc. (I had a similar idea and did something in Rust/WASM back in May; let me see if I can dust it off and put it on GitHub. Edit: not very usable, but here FWIW: https://github.com/shreevatsa/pdf-explorer)

In particular, there are so many PDF libraries/tools that simply hide all the structure and try to provide an easy interface to the user, but they are always limited in various ways. Something like your project that focuses on parsing and browsing is really needed IMO.

programmarchy over 2 years ago

I’ve done a bit of PDF wrangling in Python, so figured I’d describe the lay of the land.

PyPDF [1] is great for reading and writing PDF files, especially dealing with pages, but it’s not great for generating paths, shapes, graphics, etc.

However, reportlab [2] has a great API for generating those things, but is lacking in the file IO and page management department. But the content streams it generates can be plugged into PyPDF pretty easily.

Finally, there’s pdfplumber which does an amazing job of parsing tabular data from PDF structures, and pytesseract which can perform OCR on PDFs that are actually just image data rather than structured data.

There’s not really a one-stop-shop for PDFs, but some pretty good tools that can be combined to get the job done.

Will be curious to see how this project develops!

[1] https://pypi.org/project/PyPDF2/
[2] https://pypi.org/project/reportlab/

gettalong over 2 years ago

As others have already written, there are many slightly invalid PDF files out there in the wild that many readers can display mostly fine and which your library should also be able to handle.

If you can, grab yourself a copy of the most recent PDF 2.0 specification since it contains much more information and is much more correct in terms of how to implement things. Also have a look at the errata at https://pdf-issues.pdfa.org/32000-2-2020/index.html.

As I'm implementing a PDF library (in Ruby), I have started to collect some situations that arise in the wild but are not spec-compliant, see https://github.com/gettalong/annotated-pdf-spec. That might help you in parsing some invalid PDFs.

Eatcats over 2 years ago

Thank you, this is much needed. The most reliable way of generating PDFs that I have used recently is:

- create a DOCX with the content and some template variable strings, like {{}}
- unpack the document, get at the text, and replace the placeholders
- use a DOCX->PDF Linux tool to generate the document

Maybe this will be a good solution.
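
The unpack-and-replace step of this workflow can be sketched with the standard library alone, since a DOCX file is a ZIP archive whose main text lives in `word/document.xml`. The function name `fill_docx_template` and the `{{name}}` placeholder convention are assumptions for illustration. The sketch also assumes each placeholder sits contiguously in the XML, which Word does not guarantee because it may split text across runs. The DOCX->PDF conversion step is left out, as the tool is not named.

```python
import io
import zipfile
from xml.sax.saxutils import escape


def fill_docx_template(template: bytes, values: dict[str, str]) -> bytes:
    """Return a copy of a DOCX archive with {{name}} placeholders replaced
    in word/document.xml. All other archive members are copied unchanged."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(template)) as src:
        with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
            for item in src.infolist():
                payload = src.read(item.filename)
                if item.filename == "word/document.xml":
                    text = payload.decode("utf-8")
                    for name, value in values.items():
                        # Escape &, < and > so the result stays valid XML.
                        text = text.replace("{{" + name + "}}", escape(value))
                    payload = text.encode("utf-8")
                dst.writestr(item.filename, payload)
    return out.getvalue()
```

Calling `fill_docx_template(open("template.docx", "rb").read(), {"name": "Alice"})` would return the bytes of a filled copy, where `template.docx` is a placeholder file name.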

UglyToad over 2 years ago

Good luck, I once started to scratch the same itch to learn this file format, several years later I think I got about 30% of the way through!

More open source PDF code is good. If you can find a version of the iText RUPS application somewhere on the internet, it's a useful tool for viewing the syntax / structure.

99112000 over 2 years ago

I once had to help an accountant friend fill in thousands of DOCX files and convert them to PDF. No open source tool does a proper conversion; it really sucked.

jl6 over 2 years ago

Good to see work in the PDF space. It’s still one of the most important formats. I would love to see more time invested in tools that can create PDF/A documents, which I believe to be the sane subset of PDF.

password4321 over 2 years ago

Is there a list of open source PDF libraries for various languages?

And related: the best tools to generate PDFs from HTML.

truemotive over 2 years ago

Please, for the love of all that is holy, run away!

strangus over 2 years ago

You brave soul, I wish you luck.

scoofy over 2 years ago

I desperately need to be able to display SVG files with gradients in PDFs, but no library for this currently exists in Python as far as I know.

I would be willing to help make this happen, but I do not know much about the PDF format.

neilv over 2 years ago

Neat. Another use case for which you might want to think about a sample is extracting data from filled PDF forms. (That use case is why I once had to write a PDF parser.)

Since you read and write, maybe also a use case of programmatically filling some form fields in an editable PDF form, such as pre-filling some of the fields for a particular Web site user in a dynamically modified PDF form they download. But the source PDF form can be hand-crafted and maintained separately, like people often want to do, not generated from scratch by your code.

RantyDave over 2 years ago

I once had to parse a reference manual, provided as a PDF, and emit a mostly-usable CSV of its content.

That shit was hard. Writing PDF is one thing but there are some psychopathic PDFs out there when you scratch below the surface. People do .... well, you'll find out.

larsonnn over 2 years ago

This reminds me of back in the day when we got some properties and thought: PDF is a defined file format, every PDF has these values… We were so naive and didn’t know.

Spivak over 2 years ago

I wish you all the best! This space has a lot of stuff in it, and all of it is lacking in some aspect. And that’s not an admonishment: PDF is such a complicated format that there will never be a library that doesn’t come with asterisks. It’s just a matter of picking the thing you want your library to focus on and be good at, and you can pretty easily be someone’s favorite lib.

cafard over 2 years ago

I had some luck with Camelot (https://camelot-py.readthedocs.io/en/master/). However, as many of the comments here say, PDF is a beast.

elmcrest over 2 years ago

Hey desgeeko, from a past project we’ve left a Python PDF renderer. It might be somehow useful or inspirational…

https://github.com/systori/bericht

tmaly over 2 years ago

There is a Perl library that does this, but it only supports PDF 1.5: https://metacpan.org/pod/CAM::PDF

I have used it in the past.

Silencerxyz over 2 years ago

Hey, I took a look at your GitHub page, and I'm wondering whether you can provide more information in the README so I can better understand the value of the product.

cochne over 2 years ago

You’re in for it! I highly recommend checking out mupdf, it was one of the more pleasant Python libraries I dealt with for this purpose.

pyuser583 over 2 years ago

Good luck! Really! I hate ReportLab!

I hate using ReportLab … reading its code is fascinating. Interesting seeing what 1990s Python code looked like.

yupis over 2 years ago

Is it possible to directly edit the text?

voz_ over 2 years ago

Wish you luck.

jonathansa over 2 years ago

Awesome work

chazeon over 2 years ago

In the field of PDF parsing, I think the most interesting project I encountered is pdfquery [1], where the PDF is parsed as an XML tree and you can use XPath to query it. You might encounter rough edges when you put it into production work, but the idea is like “how come I never thought of this”, because PDF has a tree-like structure and this should be a straightforward solution.

[1]: https://github.com/jcushman/pdfquery
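
To illustrate the general idea of querying a parsed PDF as an XML tree, here is a small sketch using only the standard library's `xml.etree.ElementTree`, which supports a subset of XPath. It does not use pdfquery itself: the XML below is hand-written, and the element and attribute names (`page`, `textline`, `number`, `x0`, `y0`) are invented for the example and may not match what pdfquery produces.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML rendering of a two-page document (names are made up).
SAMPLE = """
<pdf>
  <page number="1">
    <textline x0="72" y0="700">Invoice number</textline>
    <textline x0="300" y0="700">INV-0042</textline>
  </page>
  <page number="2">
    <textline x0="72" y0="650">Total</textline>
  </page>
</pdf>
""".strip()

root = ET.fromstring(SAMPLE)


def lines_on_page(tree_root: ET.Element, number: int) -> list[str]:
    """Return the text of every textline on the given page, in order."""
    path = f"./page[@number='{number}']/textline"
    return [element.text for element in tree_root.findall(path)]


first_page = lines_on_page(root, 1)
```

Once the layout is exposed as attributes, questions like "which text sits on page 1" become one-line path expressions instead of hand-written tree walks.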

jeremynixon over 2 years ago

Why is the state of the art in PDF parsing SO BAD? This is an incredibly common and important problem. Tika and fitz have very poor results. What is the reason that this is still so backwards?

sarahhenry over 2 years ago

Amazing Python guide, thank you. How long have you been working as a developer?