Hi,
I've just finished rebuilding this function and added a lot of new features: info, page index, minimap, inverted index, ...
I think it may be useful for inspection, debugging or just as a learning resource showcasing the PDF file format.
This is a pet project and I would be happy to receive some feedback!
Regards
Many moons ago I was tasked with extracting data from a bunch of PDFs. I made a tool to visualise how characters were laid out on the page, along with the bounding boxes of all the elements. In the end the project was a complete failure, and several people were upset with me for not delivering what I was supposed to. Today, given what LLMs can do to extract data from PDFs, I would 100% go the route of using AI to extract the data they wanted. Back then that did not exist yet.
That's pretty cool! I would have used it a lot at my previous job if it had existed back then. In my ideal world it would work somewhat like https://lapo.it/asn1js/ -- you drop a file and it does everything locally.
I've used iText RUPS (free) for a while to debug PDFs (as I have the "privilege" of working on code that extracts data from PDFs...). It looks like your introspection stuff might be a bit stronger, which would be great. I'll give it a whirl.
Is the UI tooling that does the visualization a library? I really like the UI format and would love to use it for breaking down and debugging video byte streams too. EDIT: Oh, it's actually reasonably simple, great use of CSS! https://github.com/desgeeko/pdfsyntax/blob/main/docs/simple_text_string.html
On a similar note, why hasn't PDF been replaced? There are XPS, DjVu and XHTML (EPUB), but they all seem to target a different use case (a packaged HTML file). What I want is a simple document format that allows embedding other files and metadata without Adobe's bloat. I should be able to hyperlink within pages, change the font size etc. without text overflowing, and still print consistently.
I’ve been shopping for something that does a per-byte description of the content of visual media formats (jpeg, png, avi, mp4, etc). Anyone know of one?
This is really cool! I've spent the last few years debugging lots of PDFs while working on DocSpring, so I'm always looking for new tools to make this easier. Thanks for working on pdfsyntax!
It does not have any dependency on a PDF parsing library, correct? That's a cool way to learn the file format and to be able to work around weird PDF files.
But what was the motivation for not using a library to do the PDF parsing work? Is it that there is none available? Nice work!
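To give a feel for what "no parsing library" can look like, here is a minimal sketch in plain Python. It is not how pdfsyntax works internally; the function names `find_startxref` and `object_headers` and the toy `SAMPLE` bytes are made up for illustration. It relies only on two facts about the format: a `startxref` keyword near the end of the file gives the byte offset of the cross-reference section, and indirect objects begin with an `N G obj` header.

```python
import re
from typing import List, Tuple


def find_startxref(data: bytes) -> int:
    """Return the byte offset written after the last 'startxref' keyword."""
    # The keyword sits near the end of the file, shortly before '%%EOF',
    # so only the last kilobyte is searched.
    tail = data[-1024:]
    matches = re.findall(rb"startxref\s+(\d+)", tail)
    if not matches:
        raise ValueError("no startxref keyword found near the end of the file")
    return int(matches[-1])


def object_headers(data: bytes) -> List[Tuple[int, int, int]]:
    """Naively list (object number, generation, byte offset) for 'N G obj' headers.

    Assumption: headers start right after a '\\n'. A real parser follows the
    xref table instead, because stream data can contain look-alike bytes.
    """
    pattern = re.compile(rb"^(\d+) (\d+) obj\b", re.MULTILINE)
    return [(int(m.group(1)), int(m.group(2)), m.start())
            for m in pattern.finditer(data)]


# Hand-made toy fragment (hypothetical, not a complete valid PDF).
BODY = b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
XREF_OFFSET = len(BODY)  # the xref section starts right after the body
SAMPLE = (
    BODY
    + b"xref\n0 2\n0000000000 65535 f \n0000000009 00000 n \n"
    + b"trailer\n<< /Size 2 /Root 1 0 R >>\n"
    + b"startxref\n%d\n%%%%EOF\n" % XREF_OFFSET
)

if __name__ == "__main__":
    offset = find_startxref(SAMPLE)
    print("xref section starts at byte", offset)
    print("bytes at that offset:", SAMPLE[offset:offset + 4])
    print("object headers:", object_headers(SAMPLE))
```

Even this much is enough to poke at a weird file by hand, which is probably part of the appeal of skipping a library.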
If you're interested in manipulating PDFs, I've found QPDF [0] to be a useful tool. Its "QDF mode" lays out the objects in a form where you can directly edit them, and it can automatically fix up the xref table afterwards. It can also convert to and from a JSON format that you can manipulate with your own scripts. [0] https://github.com/qpdf/qpdf, https://qpdf.readthedocs.io/en/stable/
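A rough sketch of that QDF round trip, driven from Python. It assumes the `qpdf` and `fix-qdf` executables are installed and on your PATH; the helper names and file paths are hypothetical. The `--qdf` and `--object-streams=disable` options and the `fix-qdf` companion program are described in the QPDF manual, but check the documentation for your installed version before relying on this.

```python
import shutil
import subprocess
from typing import List


def qdf_command(src: str, dst: str) -> List[str]:
    """Build the qpdf call that rewrites src into editable QDF form at dst."""
    # --object-streams=disable keeps every object visible as plain text.
    return ["qpdf", "--qdf", "--object-streams=disable", src, dst]


def fix_qdf_command(edited: str) -> List[str]:
    """Build the fix-qdf call that repairs offsets in a hand-edited QDF file."""
    return ["fix-qdf", edited]


def to_qdf(src: str, dst: str) -> None:
    """Run qpdf to produce a QDF file (raises if qpdf is missing or fails)."""
    if shutil.which("qpdf") is None:
        raise RuntimeError("qpdf not found on PATH")
    # Note: check=True also raises when qpdf exits non-zero for warnings only.
    subprocess.run(qdf_command(src, dst), check=True)


def fix_qdf(edited: str, fixed: str) -> None:
    """Run fix-qdf on an edited QDF file and save its stdout to 'fixed'."""
    if shutil.which("fix-qdf") is None:
        raise RuntimeError("fix-qdf not found on PATH")
    with open(fixed, "wb") as out:
        subprocess.run(fix_qdf_command(edited), stdout=out, check=True)
```

Typical use would be `to_qdf("in.pdf", "editable.pdf")`, then edit `editable.pdf` in a text editor, then `fix_qdf("editable.pdf", "fixed.pdf")`.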