Diff-pdf: tool to visually compare two PDFs

589 points by Olshansky 11 months ago

27 comments

simonw 11 months ago

This inspired me to have Claude 3.5 Sonnet knock out a quick web page prototype for me, using PDF.js to load and render the PDFs to canvas elements and then display visual diffs between their pages.

Two prompts:

<pre><code>Build a tool where I can drag and drop on two PDF files and it uses PDF.js to turn each of their pages into canvas elements and then displays those pages side by side with a third image that highlights any differences between them, if any differences exist

rewrite that code to not use React at all</code></pre>

Here's the result: https://tools.simonwillison.net/compare-pdfs

It actually works quite well! Screenshot here: https://gist.github.com/simonw/9d7cbe02d448812f48070e7de13a5ae5?permalink_comment_id=5109044#gistcomment-5109044
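
A minimal sketch of the "third image that highlights any differences" step, written in Python rather than the browser JavaScript the linked tool uses. It is not the generated code from that page. It assumes both pages have already been rasterized to same-sized grids of RGB pixels; the names `diff_image` and `threshold` are hypothetical.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B), each channel 0-255
Image = List[List[Pixel]]      # rows of pixels


def diff_image(a: Image, b: Image, threshold: int = 0) -> Tuple[Image, int]:
    """Return (highlight image, number of differing pixels).

    A pixel counts as different when any channel differs by more than
    `threshold`. Differing pixels are painted red; unchanged pixels are
    faded toward white so the red stands out.
    """
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("pages must be rasterized to the same dimensions")

    highlight: Image = []
    changed = 0
    for row_a, row_b in zip(a, b):
        out_row = []
        for pa, pb in zip(row_a, row_b):
            delta = max(abs(ca - cb) for ca, cb in zip(pa, pb))
            if delta > threshold:
                out_row.append((255, 0, 0))
                changed += 1
            else:
                out_row.append(tuple((c + 255) // 2 for c in pa))
        highlight.append(out_row)
    return highlight, changed
```

A non-zero `threshold` is one way to tolerate small anti-aliasing differences between renderers, at the cost of missing very faint changes.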

tomwheeler 11 months ago

In a previous job, I had to validate the output of an unreliable production publishing system, so I tested dozens of PDF comparison tools available at the time. The best I found was called Delta Walker. It was proprietary, commercial, Mac-only software, but it was reasonably inexpensive, accurate, and handled long PDFs with lots of graphics well.

I remember evaluating this diff-pdf tool and finding that it fell short in some way, although it's been so long that I don't recall the specifics. Most of the tools I tested failed to identify changes or reported false positives. I also remember being disappointed, since this one was open source and could easily be scripted.

ydant 11 months ago

Related - this might be helpful to someone.

ImageMagick can do a visual PDF compare:

<pre><code>magick compare -density "$DENSITY" -background white "$1[0]" "$2[0]" "$TMP"</code></pre>

(density = 100, $1 and $2 are the filenames to compare, $TMP the output file)

You need to do some work to support multiple pages, so I use this script: https://gist.github.com/mbafford/7e6f3bef20fc220f68e467589bb6a8aa

This also uses `imgcat` to show the difference directly in the terminal.

You can also use ImageMagick to get a perceptual hash difference using something like:

<pre><code>convert -metric phash "$1" null: "$2" -compose Difference -layers composite -format '%[fx:mean]\n' info:</code></pre>

I use the fact you can configure git to use custom diff tools and take advantage of this with the following in my .gitconfig:

<pre><code>[diff "pdf"]
    command = ~/bin/git-diff-pdf</code></pre>

And in my .gitattributes I enable the above with:

<pre><code>*.pdf binary diff=pdf</code></pre>

~/bin/git-diff-pdf does a diff of the output of `pdftotext -layout` (from poppler) and also runs pdf-compare-phash.

To use this custom diff with `git show`, you need to add an extra argument (`git show --ext-diff`), but it uses it automatically if running `git diff`.
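
For readers wondering what a `~/bin/git-diff-pdf` driver could look like, here is a rough Python sketch of only the text half (the `pdftotext -layout` diff). It is not the commenter's script and does not include the perceptual-hash step. It relies on git calling an external diff command with seven arguments (path, old-file, old-hex, old-mode, new-file, new-hex, new-mode) and assumes poppler's `pdftotext` is on the PATH; the helper names are hypothetical.

```python
#!/usr/bin/env python3
import difflib
import subprocess
import sys
from typing import List


def pdf_to_text(path: str) -> List[str]:
    """Extract layout-preserving text via `pdftotext -layout FILE -`."""
    if path == "/dev/null":  # git passes /dev/null for added or deleted files
        return []
    result = subprocess.run(
        ["pdftotext", "-layout", path, "-"],  # "-" sends the text to stdout
        check=True, capture_output=True, text=True,
    )
    return result.stdout.splitlines()


def text_diff(old: List[str], new: List[str], name: str) -> List[str]:
    """Unified diff of two extracted texts, labeled with the tracked path."""
    return list(difflib.unified_diff(
        old, new, fromfile=f"a/{name}", tofile=f"b/{name}", lineterm=""))


def main(argv: List[str]) -> int:
    # argv: script path old-file old-hex old-mode new-file new-hex new-mode
    path, old_file, new_file = argv[1], argv[2], argv[5]
    for line in text_diff(pdf_to_text(old_file), pdf_to_text(new_file), path):
        print(line)
    return 0


# Only act when invoked with the seven arguments git supplies.
if __name__ == "__main__" and len(sys.argv) == 8:
    sys.exit(main(sys.argv))
```

Saved as an executable file at the path named in the .gitconfig entry, a script like this would be picked up by `git diff` for files matching the `diff=pdf` attribute.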

thibaut_barrere 11 months ago

I have been using this in a CI pipeline to maintain a business-critical PDF generation (healthcare) app (started circa 2010 I think). Here are the RSpec helpers I'm using: https://gist.github.com/thbar/d1ce2afef68bf6089aeae8d9ddc05ddf

The code contains git-stored reference PDFs, and the test suite regenerates them and asserts that nothing has changed.

It has helped a lot to audit visual changes or PDF library upgrades!
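
The same reference-PDF check can be sketched outside RSpec. The Python below is not the linked helper code; it is a minimal illustration that shells out to `diff-pdf` and relies on its documented behavior of exiting with 0 when the PDFs match and 1 when they differ. Treating any other exit code as an error is an assumption, and `pdfs_match` and its `run` parameter are hypothetical names.

```python
import subprocess
from typing import Callable, List, Optional


def pdfs_match(
    reference: str,
    candidate: str,
    diff_out: Optional[str] = None,
    run: Callable[[List[str]], "subprocess.CompletedProcess"] = subprocess.run,
) -> bool:
    """Return True when diff-pdf reports no visual differences.

    If `diff_out` is given, diff-pdf also writes a PDF highlighting the
    differences to that path (its --output-diff option).
    """
    cmd = ["diff-pdf"]
    if diff_out:
        cmd.append(f"--output-diff={diff_out}")
    cmd += [reference, candidate]

    result = run(cmd)
    if result.returncode not in (0, 1):
        # Assumption: anything other than 0/1 means the tool itself failed.
        raise RuntimeError(f"diff-pdf failed with exit code {result.returncode}")
    return result.returncode == 0
```

In a test suite this might be used as `assert pdfs_match("spec/fixtures/invoice.pdf", "tmp/invoice.pdf", "tmp/invoice-diff.pdf")`, with the file names here being placeholders.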

poidos 11 months ago

Reminds me of the tool Bob Nystrom wrote to help himself out when working on the physical edition of Crafting Interpreters: https://journal.stuffwithstuff.com/2020/04/05/crafting-crafting-interpreters/

Whole article is worth reading, but if you want the relevant bits search for “I wrote a Dart script that would take a PDF of the book”.

jaustin 11 months ago

We've been using this in the Micro:bit Educational Foundation (microbit.org) to fill a gap in hardware design tooling, and get visual diffs of our schematics and gerbers during PCB design iterations. It's kinda wild that's what we ended up doing, but if you want to be sure your radio layout didn't change at all when you're making a minor revision to a different part of the board, visual diffs are perfect.

That said, next project we want to try something more integrated with EDA tools. If anyone else has followed this path, we'd love to know.

mikeyinternews 11 months ago

You can do this with Beyond Compare (it's not free, but not very expensive either): https://www.scootersoftware.com/

smartmic 11 months ago

I like this tool better: https://www.qtrac.eu/diffpdf.html

It shows the differences in the GUI side by side instead of overlaid.

rawbert 11 months ago

We regularly use this tool in our team to compare PDFs we obtain from third-party services, which might have changed after code changes on our side. Big thanks to the author <3

canistel 11 months ago

Interestingly, GitHub thinks the project is 46% shell, due to the fairly huge wxwin.m4.

deckar01 11 months ago

I wrote a pixel-based visual diffing algorithm long ago that was intended for a CI tool that finds all of the UI changes in a PR. I broke the layout of a page I didn’t even know existed as an intern at Inkling and have had this idea in my head ever since.

https://github.com/deckar01/narcis

crocal 11 months ago

I will just chime in to mention Draftable (https://www.draftable.com/compare). It really works well. It’s not so easy to have a visually comfortable diff of two PDFs.

ck_one 11 months ago

Can anyone recommend a method to deduplicate PDFs? The hash is often different but the content and metadata are 99.99% the same.
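
One possible approach, sketched in Python under stated assumptions: hash the extracted text instead of the raw file bytes, so that differences in metadata or encoding do not change the fingerprint. The sketch assumes the text of each PDF has already been extracted (for example with `pdftotext`), and the function names are hypothetical. It only catches PDFs whose text is identical after whitespace and case normalization; files that differ by even a single character of text would need a similarity measure instead of an exact hash.

```python
import hashlib
import re
from collections import defaultdict
from typing import Dict, List


def fingerprint(text: str) -> str:
    """SHA-256 of the text with whitespace collapsed and case folded."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def group_duplicates(texts: Dict[str, str]) -> List[List[str]]:
    """Map of file name -> extracted text, returned as groups of duplicates."""
    groups = defaultdict(list)
    for name, text in texts.items():
        groups[fingerprint(text)].append(name)
    # Keep only fingerprints shared by more than one file.
    return [sorted(names) for names in groups.values() if len(names) > 1]
```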

strangus 11 months ago

https://10052.ai has a tool that will visually compare documents (PDFs, doc, image, etc.) and cluster them together. It works amazingly well.

sva_ 11 months ago

Coincidentally I downloaded and tried using this just a while ago. I was trying to see if it can identify an Elsevier fingerprint between two PDFs. It can't; it only compares visible things.

I used vbindiff instead.

akasakahakada 11 months ago

Use this to compare university textbook editions 8 and 9 before buying.

redman25 11 months ago

I created a similar in-browser version a while back with Mozilla's pdf-js. The diff rendering is all run client side.

https://www.parepdf.com

The diff-pdf project was my inspiration but I wanted to create a version that was distributable to non-programmers.

TacticalCoder 11 months ago

This reminds me of a book author who posted here IIRC. He had a little tool allowing him to quickly compare two revisions of his book, for example to make sure that fixing typos didn't wreak havoc. I remember his tool would show in red what had changed on page thumbnails.

atum47 11 months ago

Back when I was writing my final paper I faced a similar issue: I needed to de-duplicate a bunch of PDFs, so I came up with a simple solution.

https://github.com/victorqribeiro/dtf

fwn 11 months ago

I really like the overlay view and that it is not cloud based. Will try to test it at work.

I rely heavily on PDF comparison via PDF-XChange Editor, which is accurate for text, but often has trouble highlighting visual changes correctly.

riedel 11 months ago

I always used DiffPDF, only to read on their website:

> in the view of the EU’s Cyber Resilience Act and an abundance of caution, we have withdrawn all our free software [1]

Good to see post-cyberresilience alternatives :)

PDF diffs are really great for versioning/comparing PCB designs. (The only real use case I had 15 yrs back.)

[1] http://www.qtrac.eu/diffpdf-foss.html

mycall 10 months ago

Of course, Adobe Compare does this too.

https://www.adobe.com/acrobat/features/compare-pdfs.html

npack 11 months ago

https://onlinetextcompare.com/pdf lets you compare text between two PDF files locally within the browser.

jgalt212 11 months ago

Thanks. I'll give this a shot to see if any counterparties try to sneak in any last second changes to the executable version of the doc.

asah 11 months ago

Crazy, I'd have thought that modern multi-modal LLMs can do this, but when I tried Gemini, ChatGPT-4o and Claude they all pooped out:

- Gemini at first only diff'd the text, and then when pushed it identified the items in the images and then hallucinated the differences between the versions. It could not produce an image output.
- Claude only diff'd the text and refused to believe that there were images in the PDFs.
- ChatGPT attempted to write and execute Python code for this, which errored out.

downboots 11 months ago

Maybe this could be used to generate PDFs using LaTeX and use the diff as a distance metric to optimize.
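
To make "the diff as a distance metric" concrete, one simple candidate is the mean absolute pixel difference between two rendered pages, scaled to the range 0 to 1. The Python sketch below is only an illustration of that idea, not something diff-pdf provides. It assumes both pages have been rasterized to same-sized grayscale grids with values 0-255, and `page_distance` is a hypothetical name.

```python
from typing import Sequence


def page_distance(a: Sequence[Sequence[int]], b: Sequence[Sequence[int]]) -> float:
    """Mean absolute grayscale difference between two pages, in [0, 1].

    0.0 means the rasters are identical; 1.0 means every pixel flipped
    between pure black and pure white.
    """
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("pages must be rasterized to the same dimensions")

    total = 0
    count = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            total += abs(pa - pb)
            count += 1
    return total / (255 * count) if count else 0.0
```

An optimizer could then adjust LaTeX parameters to minimize this value against a target page, although a raw pixel metric like this changes abruptly when content shifts by even one pixel.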

Levitating 11 months ago

No screenshots?
