TechEcho

18 comments

codetrotter 3 months ago

Many moons ago I was tasked with extracting data from a bunch of PDFs. I made a tool to visualise how characters were laid out on the page and the bounding boxes of all the elements. The project was in the end a complete failure and several people were upset at me for not delivering what I was supposed to. In present day, with the capabilities that are now available with LLMs to extract data from PDFs, I 100% would go the route of utilising AI to extract the data they wanted. Back then that did not yet exist.

Muromec 3 months ago

That's pretty cool! I would have used it a lot at my previous job if it existed back then. In my ideal world it should work somewhat like https://lapo.it/asn1js/ -- you drop a file and it does all the stuff locally.

swsieber 3 months ago

I've used the iText RUPS (free) for a while for debugging PDFs (as I have the "privilege" to work on code that extracts data from PDFs...). It looks like your introspection stuff might be a bit stronger, which would be great. I'll take it for a whirl.

est 3 months ago

I remember there was a similar project on GitHub that allows visualizing any type of binary data by a given schema. There was a TCP/IP example IIRC.

SSLy 3 months ago

Damn, this is also convenient for forensics and finding watermarks.

tyilo 3 months ago

Looks nice. Would be better if all of the PDF's bytes were shown. Seems like `endobj` and `xref` are not shown.

tekkk 3 months ago

This would be really nice as a browser library. Could just drag and drop a file and see its insides. But impressive nonetheless.

nonrandomstring 3 months ago

Well done. This is a very useful security previewing tool. PDFs are a menace.

kevmo314 3 months ago

Is the UI tooling that does the visualization a library? I really like the UI format, would love to use this for breaking down and debugging video byte streams too. EDIT: Oh it's actually reasonably simple, great use of CSS! https://github.com/desgeeko/pdfsyntax/blob/main/docs/simple_text_string.html

nabaraz 3 months ago

On a similar note, why hasn't PDF been replaced? There are XPS, DjVu and XHTML (EPUB) but they all seem to be targeting a different use case (a packaged HTML file). What I want is a simple document format that allows embedding other files and metadata without Adobe's bloat. I should be able to hyperlink within pages, change font size etc. without text overflowing, and be able to print in a consistent manner.

escapecharacter 3 months ago

I’ve been shopping for something that does a per-byte description of the content of visual media formats (jpeg, png, avi, mp4, etc). Anyone know of one?

flsw 3 months ago

related: https://news.ycombinator.com/item?id=41377960

nathan_f77 3 months ago

This is really cool! I've spent the last few years debugging lots of PDFs while working on DocSpring, so I'm always looking for new tools to make this easier. Thanks for working on pdfsyntax!

acabajoe 3 months ago

Kudos for making this self-hosted. So very much appreciated!

adelpozo 3 months ago

It does not have any dependency on a PDF parsing library, correct? That's a cool way to learn the file format and be able to work around weird PDF files. But what was the motivation to not use a library to do the PDF parsing work? Is it the case that there is none available? Nice work!

xeon06 3 months ago

Wow, I've been doing some PDF parsing at work and this is going to come in SO handy.

disqard 3 months ago

This looks amazingly useful! Thank You For Making And Sharing!

LegionMammal978 3 months ago

If you're interested in manipulating PDFs, I've found QPDF [0] to be a useful tool. Its "QDF mode" lays out the objects in a form where you can directly edit them, and it can automatically fix up the xref table afterwards. It can also convert to and from a JSON format that you can manipulate with your own scripts. [0] https://github.com/qpdf/qpdf, https://qpdf.readthedocs.io/en/stable/

Show HN: HTML visualization of a PDF file's internal structure
