Show HN: HTML visualization of a PDF file's internal structure

451 points by desgeeko 3 months ago
Hi, I've just finished a rebuild of this function and added a lot of new features: info, page index, minimap, inverted index, and more. I think it may be useful for inspection, debugging, or just as a learning resource showcasing the PDF file format. This is a pet project and I would be happy to receive some feedback! Regards

18 comments

codetrotter 3 months ago
Many moons ago I was tasked with extracting data from a bunch of PDFs. I made a tool to visualise how characters were laid out on the page and bounding boxes of all the elements.

The project was in the end a complete failure and several people were upset at me for not delivering what I was supposed to.

In present day, with the capabilities that are now available with LLMs to extract data from PDFs, I 100% would go the route of utilising AI to extract the data they wanted. Back then that did not yet exist.
Muromec 3 months ago
That's pretty cool! I would have used it a lot at my previous job if it existed back then. In my ideal world it should work somewhat like https://lapo.it/asn1js/ -- you drop a file and it does all the stuff locally.
swsieber 3 months ago
I've used iText RUPS (free) for a while for debugging PDFs (as I have the "privilege" to work on code that extracts data from PDFs...). It looks like your introspection stuff might be a bit stronger, which would be great. I'll take it for a whirl.
est 3 months ago
I remember there was a similar project on GitHub that allows visualizing any type of binary data by a given schema. There was a TCP/IP example IIRC.
SSLy 3 months ago
Damn, this is also convenient for forensics and finding watermarks.
tyilo 3 months ago
Looks nice.

Would be better if all of the PDF's bytes were shown. It seems like `endobj` and `xref` are not shown.
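For readers unfamiliar with the keywords mentioned here, the following is a minimal Python sketch of how the structural keywords of a classic PDF layout (`obj`, `endobj`, `xref`, `trailer`, `startxref`) could be located by byte offset. It is not pdfsyntax's code: the function name `find_structure_keywords` and the `SAMPLE` bytes are hypothetical, and a plain regex scan like this can report false positives inside stream data.

```python
import re

# Structural keywords of the classic PDF file layout.
# Longer alternatives come first; the lookarounds stop "obj" from
# matching inside "endobj" and "xref" from matching inside "startxref".
KEYWORD_RE = re.compile(
    rb"(?<![A-Za-z])(startxref|endobj|xref|trailer|obj)(?![A-Za-z])"
)


def find_structure_keywords(data: bytes) -> list[tuple[int, str]]:
    """Return (byte_offset, keyword) pairs in file order."""
    return [
        (m.start(), m.group(1).decode("ascii"))
        for m in KEYWORD_RE.finditer(data)
    ]


# Hypothetical, hand-written one-object file used only for illustration.
SAMPLE = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"xref\n0 2\n"
    b"0000000000 65535 f \n"
    b"0000000009 00000 n \n"
    b"trailer\n<< /Size 2 /Root 1 0 R >>\n"
    b"startxref\n45\n%%EOF\n"
)

if __name__ == "__main__":
    for offset, keyword in find_structure_keywords(SAMPLE):
        print(f"{offset:6d}  {keyword}")
```

A viewer that wants to show every byte could use offsets like these to account for the spans between object bodies.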
tekkk 3 months ago
This would be really nice as a browser library. You could just drag and drop a file and see its insides. But impressive nonetheless.
nonrandomstring 3 months ago
Well done. This is a very useful security previewing tool. PDFs are a menace.
kevmo314 3 months ago
Is the UI tooling that does the visualization a library? I really like the UI format, would love to use this for breaking down and debugging video byte streams too.

EDIT: Oh it's actually reasonably simple, great use of CSS! https://github.com/desgeeko/pdfsyntax/blob/main/docs/simple_text_string.html
nabaraz 3 months ago
On a similar note, why hasn't PDF been replaced? There are XPS, DjVu and XHTML (EPUB), but they all seem to be targeting a different use case (a packaged HTML file).

What I want is a simple document format that allows embedding other files and metadata without Adobe's bloat. I should be able to hyperlink within pages, change the font size etc. without text overflowing, and print in a consistent manner.
escapecharacter 3 months ago
I’ve been shopping for something that does a per-byte description of the content of visual media formats (jpeg, png, avi, mp4, etc). Anyone know of one?
flsw 3 months ago
related: https://news.ycombinator.com/item?id=41377960
nathan_f77 3 months ago
This is really cool! I've spent the last few years debugging lots of PDFs while working on DocSpring, so I'm always looking for new tools to make this easier. Thanks for working on pdfsyntax!
acabajoe 3 months ago
Kudos for making this self-hosted. So very much appreciated!
adelpozo 3 months ago
It does not have any dependency on a PDF parsing library, correct? That's a cool way to learn the file format and be able to work around weird PDF files. But what was the motivation to not use a library to do the PDF parsing work? Is it the case that there is none available? Nice work!
xeon06 3 months ago
Wow, I've been doing some PDF parsing at work and this is going to come in SO handy.
disqard 3 months ago
This looks amazingly useful!

Thank You For Making And Sharing!
LegionMammal978 3 months ago
If you're interested in manipulating PDFs, I've found QPDF [0] to be a useful tool. Its "QDF mode" lays out the objects in a form where you can directly edit them, and it can automatically fix up the xref table afterwards. It can also convert to and from a JSON format that you can manipulate with your own scripts.

[0] https://github.com/qpdf/qpdf, https://qpdf.readthedocs.io/en/stable/
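To illustrate what "fixing up the xref table" involves, here is a minimal Python sketch that recomputes the byte offset of every `N G obj` header and emits a classic cross-reference section. It is not QPDF's implementation: `build_xref_table` is a hypothetical helper, it assumes object headers start at the beginning of a line, that object numbers run contiguously from 1, and that no stream data happens to look like an object header.

```python
import re

# Matches an indirect object header such as b"12 0 obj" at the start of a line.
OBJ_HEADER_RE = re.compile(rb"(?m)^(\d+) (\d+) obj\b")


def build_xref_table(body: bytes) -> bytes:
    """Build a classic xref section for the objects found in `body`.

    Simplified on purpose: gaps in the object numbering raise an error
    instead of being written as a linked list of free entries.
    """
    offsets = {
        int(m.group(1)): (m.start(), int(m.group(2)))
        for m in OBJ_HEADER_RE.finditer(body)
    }
    if not offsets:
        raise ValueError("no indirect objects found")

    size = max(offsets) + 1
    lines = [b"xref", b"0 %d" % size]
    # Object 0 is always the head of the free list.
    lines.append(b"0000000000 65535 f ")
    for num in range(1, size):
        if num not in offsets:
            raise ValueError(f"object {num} is missing (gaps are not handled)")
        offset, generation = offsets[num]
        # 10-digit offset, 5-digit generation, "n" for in use.
        # With the trailing space and newline each entry is 20 bytes.
        lines.append(b"%010d %05d n " % (offset, generation))
    return b"\n".join(lines) + b"\n"


if __name__ == "__main__":
    sample = b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    print(build_xref_table(sample).decode("ascii"), end="")
```

After hand-editing object bodies, every offset after the edit shifts, which is why a tool that regenerates this table is convenient.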