So you want to modify the text of a PDF by hand (2020)

325 points by mutant_glofish over 1 year ago

29 comments

blincoln over 1 year ago

The PDF specification is wild. My current favourite trivia is that it supports all of Photoshop's layer blend modes for rendering overlapping elements.[1] My second-favourite is that it supports appended content that modifies earlier content, so one should always look for forensic evidence in all distinct versions represented in a given file.[2]

It's also a fun example of the futility of DRM. The spec includes password-based encryption, and allows for different "owner" and "user" passwords. There's a bitfield with options for things like "prevent printing", "prevent copying text", and so forth,[3] but because reading the document necessarily involves decrypting it, one can use the "user" password to open an encrypted PDF in a non-compliant tool,[4] then save the unencrypted version to get an editable equivalent.

[1] "More than just transparency" section of <a href="https://blog.adobe.com/en/publish/2022/01/31/20-years-of-transparency-in-pdf" rel="nofollow noreferrer">https://blog.adobe.com/en/publish/2022/01/31/20-years-of-tra...</a>

[2] <a href="https://blog.didierstevens.com/2008/05/07/solving-a-little-pdf-puzzle/" rel="nofollow noreferrer">https://blog.didierstevens.com/2008/05/07/solving-a-little-p...</a>

[3] Page 61 of <a href="https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf" rel="nofollow noreferrer">https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandard...</a>

[4] For example, a script that uses the pypdf library.
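To make the permissions bitfield concrete, here is a minimal sketch of decoding the /P value from an encryption dictionary. The bit meanings are the ones I recall from the user access permissions table in PDF 32000-1:2008 (bits are numbered from 1, and a set bit grants the permission); only four of the defined bits are covered, and the sample value -44 is made up for illustration.

```python
# Bit positions are 1-indexed in the spec, so "bit 3" is the mask 1 << 2.
# Assumption: this is a partial list recalled from PDF 32000-1:2008.
PERMISSION_BITS = {
    3: "print",
    4: "modify contents",
    5: "copy or extract text and graphics",
    6: "add or modify annotations",
}


def decode_permissions(p: int) -> dict[str, bool]:
    """Map a signed 32-bit /P value to a {permission: allowed} dict."""
    unsigned = p & 0xFFFFFFFF  # /P is written as a signed integer
    return {
        name: bool(unsigned & (1 << (bit - 1)))
        for bit, name in PERMISSION_BITS.items()
    }


# Hypothetical /P value, chosen only for the example.
perms = decode_permissions(-44)
for name, allowed in perms.items():
    print(f"{name}: {'allowed' if allowed else 'denied'}")
```

Nothing in the file enforces these flags; a viewer reads them and decides whether to honour them, which is the weakness described above.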

aidos over 1 year ago

This topic comes up periodically, as most people think PDFs are some impenetrable binary format, but they’re really not.

They are a graph of objects of different types. The types themselves are well described in the official spec (I’m a sadist, I read it for fun).

My advice is always to convert the PDF to a version without compressed data, like the author here has. My tool of choice is mutool (mutool clean -d in.pdf out.pdf). Then just have a rummage. You’ll be surprised by how much you can follow.

In the article the author missed a step where you look at the page object to see the resources. That’s where the mapping from the font name used in the content stream to the underlying object is made.

There’s also another important bit missing: most fonts are subset into the PDF, i.e. only the glyphs that are needed are kept in the font. I think that’s often where the re-encoding happens. ToUnicode is maintained to allow you to copy text (or search in a PDF). It’s a nice-to-have for users (in my experience it’s normally there and correct, though).
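As a rough illustration of the page-object step, the sketch below pulls the font name to object number mapping out of an uncompressed page object. The page object is a hand-written sample, `font_map` is a hypothetical helper, and the regex only handles an inline /Font dictionary; real files may keep /Resources in a separate indirect object or inherit it from a parent /Pages node, so treat this as a reading aid, not a parser.

```python
import re

# Hand-written sample of an uncompressed page object (an assumption,
# not taken from the article's file).
page_object = b"""
3 0 obj
<< /Type /Page /Parent 2 0 R
   /Resources << /Font << /F1 5 0 R /F2 7 0 R >> >>
   /Contents 4 0 R >>
endobj
"""


def font_map(obj: bytes) -> dict[str, int]:
    """Return {font resource name: object number} from an inline /Font dict."""
    font_dict = re.search(rb"/Font\s*<<(.*?)>>", obj, re.S)
    if font_dict is None:
        return {}
    # Each entry looks like "/F1 5 0 R": name, object number, generation, R.
    pairs = re.findall(rb"/(\w+)\s+(\d+)\s+\d+\s+R", font_dict.group(1))
    return {name.decode("ascii"): int(number) for name, number in pairs}


fonts = font_map(page_object)
print(fonts)
```

With that mapping in hand, a `/F1 12 Tf` in the content stream can be followed to object 5, which is where the font's encoding and any ToUnicode entry would be found.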

enriquto over 1 year ago

You can do this:<pre><code>
 pdf2ps a.pdf   # convert to PostScript "a.ps"
 vim a.ps       # edit PostScript by hand
 ps2pdf a.ps    # convert back to PDF
</code></pre>
Some complex PDFs (with embedded JavaScript, animations, etc.) fail to work correctly after this back and forth. Yet for "plain" documents this works all right. You can easily remove watermarks, change some words and numbers, etc. Spacing is harder to modify. Of course you need to know some PostScript.

ks2048 over 1 year ago

This seems to be missing an important point: at the end of a PDF is a table (the "cross-reference" table) that stores the BYTE OFFSET of each object in the file.

If you modify things within the file, these offsets will typically change and the file will be corrupt. It looks like in this article, maybe they were only interested in changing one number to another, so none of the positions change.

But, generally, adding, removing, or modifying things in the middle of the file requires recomputing the xref table, so it becomes much easier to use a library than to edit the text directly.
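A small sketch of why the offsets matter: it writes a tiny PDF with a classic cross-reference table, then shows that inserting a single byte in the middle leaves a stale entry. `build_pdf` and `xref_offsets` are hypothetical helpers written for this example, the three objects are a bare skeleton (a page with no content), and files that use cross-reference streams instead of a table are not covered.

```python
import re


def build_pdf(objects: list[bytes]) -> bytes:
    """Serialise object bodies and append a classic xref table and trailer."""
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset where "N 0 obj" starts
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"
    xref_start = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry 0: head of the free list
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_start
    return bytes(out)


def xref_offsets(pdf: bytes) -> list[int]:
    """Read the in-use entries back out of the xref table."""
    return [int(entry) for entry in re.findall(rb"(\d{10}) \d{5} n", pdf)]


objects = [
    b"<< /Type /Catalog /Pages 2 0 R >>",
    b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
    b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 200 200] >>",
]
pdf = build_pdf(objects)
offsets = xref_offsets(pdf)

# Insert one harmless space inside object 2: everything after it shifts.
edited = pdf.replace(b"/Count 1", b"/Count  1")
still_valid = edited[offsets[2]:].startswith(b"3 0 obj")
print("object 3 entry still valid after edit:", still_valid)
```

A same-length replacement would leave every offset intact, which is presumably why the in-place number change in the article did not break the file.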

jl6 over 1 year ago

This seems to be missing an important step in the use of qpdf’s --qdf mode: after you’ve finished editing, you need to run the file through the fix-qdf utility to recalculate all the object offsets and rebuild the cross-reference table that lives at the end of the file (unless you only change bytes in-place rather than adding or removing bytes).

My top 3 fun PDF facts:

1) Although PDF documents are typically 8-bit binary files, you can make one that is valid UTF-8 “plain text”, even including images, through the use of the ASCII85 filter.[0]

2) PDF allows an incredible variety of freaky features (3D objects, JavaScript, movies in an embedded Flash object, invisible annotations…). PDF/A is a much saner, safer subset.

3) The PDF spec allows you to write widgets (e.g. form controls) using “rich text”, which is a subset of XHTML and CSS, but this feature is very sparsely supported outside the official Adobe Reader.

[0] For example: <a href="https://lab6.com/2" rel="nofollow noreferrer">https://lab6.com/2</a>
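The ASCII85 trick can be tried with Python's standard library. This is a minimal sketch, assuming the stream's dictionary would declare /Filter /ASCII85Decode; `ascii85_stream_data` is a hypothetical helper, and the sample bytes stand in for real image data.

```python
import base64


def ascii85_stream_data(raw: bytes) -> bytes:
    """Encode raw bytes for a stream that declares /Filter /ASCII85Decode."""
    # "~>" is the end-of-data marker the filter expects.
    return base64.a85encode(raw) + b"~>"


raw = bytes(range(256))  # stand-in for arbitrary binary data
encoded = ascii85_stream_data(raw)

# Every byte of the encoded form is printable ASCII, hence valid UTF-8.
print(encoded[:40].decode("ascii"))

# Round trip: strip the marker and decode.
decoded = base64.a85decode(encoded[:-2])
```

The cost is size: every 4 input bytes become 5 output characters, so the stream grows by about 25%.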

desgeeko over 1 year ago

If you want to continue this journey and learn more about PDF, you can read the anatomy of a file I documented recently: <a href="https://pdfsyntax.dev/introduction_pdf_syntax.html" rel="nofollow noreferrer">https://pdfsyntax.dev/introduction_pdf_syntax.html</a>

miki123211 over 1 year ago

What people often miss about PDF is that it's closer to an image format in some ways than to a Word document. Word documents, PDFs and images are in document editing what DAW projects, MIDI files and MP3 files are in music and what Java source code, JVM bytecode and pure x86 machine code are in software.

The primary purpose of a PDF file is to tell you what to display (or print), with perfect clarity, in far fewer bytes than an actual image would take. It exploits the fact that the document creator knows about patterns in the document structure that, if expressed properly, make the document much more compressible than anything that an actual image compression algorithm could accomplish. For example, if you have access to the actual font, it's better to say "put these characters at these coordinates with that much spacing between them" than to include every occurrence of every character as a part of the image, hoping that the compression algorithm notices and compresses away the repetitions. Things like what character is part of what word, or even what Unicode codepoint is mapped to which font glyph, are basically unimportant if all you're after is efficiently transferring the image of a document.

If you have an editable document, you care a lot more about the semantics of the content, not just about its presentation. It matters to you whether a particular break in the text is supposed to be multiple spaces, the next column in a table or just a weird page layout caused by an image being present. If you have some text at the bottom of each page, you care whether that text was put there by the document author multiple times, or whether it was entered once and set as a footer. If you add a new paragraph and have to change page layout, it matters to you that the last paragraph on this page is a footnote and should not be moved to the next one. If a section heading moves to another page, you care about the fact that the table of contents should update automatically and isn't just some text that the author has manually entered. If you're a printer or screen, you care about none of these things; you just print or display whatever you're told to print or display. For a PDF, footnotes, section headings, footers or tables of contents don't have to be special, they can just be text with some meaningless formatting applied to it. This is why making PDF work for any purpose which isn't displaying or printing is never going to be 100% accurate. Of course, there are efforts to remedy this, and a PDF-creating program is free to include any metadata it sees fit, but it's by no means required to do so.

This isn't necessarily the mental model that the PDF authors had in mind, but it's a useful way to look at PDF and understand why it is the way it is.
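For a sense of what "put these characters at these coordinates with that much spacing" looks like in practice, here is a minimal sketch that emits the text-showing part of a content stream. `show_text` and `escape_pdf_string` are hypothetical helpers; the operators (BT, Tf, Tc, Td, Tj, ET) are the standard text operators, while the font name F1 and the coordinates are placeholders, and the text is assumed to be plain ASCII in a font with a standard encoding.

```python
def escape_pdf_string(text: str) -> str:
    """Escape the characters that are special inside a (literal string)."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def show_text(font: str, size: float, x: float, y: float,
              text: str, char_spacing: float = 0) -> str:
    """Build the operators that draw `text` at (x, y) in the given font."""
    return "\n".join([
        "BT",                                 # begin text object
        f"/{font} {size:g} Tf",               # select font resource and size
        f"{char_spacing:g} Tc",               # extra space between characters
        f"{x:g} {y:g} Td",                    # move to the start position
        f"({escape_pdf_string(text)}) Tj",    # show the string
        "ET",                                 # end text object
    ])


stream = show_text("F1", 12, 72, 720, "Total (EUR)")
print(stream)
```

Nothing in those six lines says whether the string is a heading, a table cell or a footer, which is the point being made above.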

eschaton over 1 year ago

Anybody trying to do this is missing the point of PDF: It’s a page-description format and therefore only represents the marks on a page, not document structure.

One should not attempt to edit a PDF; one should edit the document from which the PDF is generated.

jaystraw over 1 year ago

20 years ago, I worked as the plate person at a newspaper: we had two million dollar Kodak plate "printers" (printer is the wrong word, but the emulsion on the plates could be hit by UV light and dissolved in a chemical bath, iirc). Regularly, the Kodaks would fail, and my boss would go into the PostScript (or maybe EPS) files manually, change a header or some other malformed bit that came from the layout software that sent us the files, and all would be well again (our giant German offset web press ran Linux, btw).

I think his name was Bill. He took me, a 17 year old, to a Sigur Ros concert. Great dude. Wow, two stories that don't involve PDFs!

nathan_f77 over 1 year ago

Great post. I've spent a lot of time reading through the PDF specification over the last ~5 years while building DocSpring [1], and I still feel like I've barely scratched the surface. qpdf is a great tool. One of my other favorites is RUPS [2], which really lets you dig into the structure of a PDF.

[1] <a href="https://docspring.com" rel="nofollow noreferrer">https://docspring.com</a>

[2] <a href="https://github.com/itext/i7j-rups">https://github.com/itext/i7j-rups</a>

seszett over 1 year ago

Although this is an interesting dive into the PDF format, just opening the PDF in LibreOffice or Inkscape usually works fine to modify its text.

LispSporks22 over 1 year ago

As I recall, words aren’t even necessarily made up of contiguous characters. Especially true in OCRed documents in PDF.

yboris over 1 year ago

Semi-related(?) - I created a repository to convert PDF to JPG and back to PDF: <a href="https://github.com/whyboris/PDF-to-JPG-to-PDF">https://github.com/whyboris/PDF-to-JPG-to-PDF</a>

A government form had fields that needed to be filled out but weren't editable, and editing the PDF was impossible (password protection). This was my solution.

jordann over 1 year ago

If you don't mind using Java, you can use the open source Apache PDFBox library: <a href="https://pdfbox.apache.org/" rel="nofollow noreferrer">https://pdfbox.apache.org/</a>

It's relatively performant and it's a mature and supported codebase that can accomplish most PDF tasks.

Const-me over 1 year ago

> I didn't see an obvious open-source tool that lets you dig into PDF internals

That’s a matter of the toolset. I program in C#, and I have good experience with this open source library: <a href="https://www.nuget.org/packages/iTextSharp-LGPL/" rel="nofollow noreferrer">https://www.nuget.org/packages/iTextSharp-LGPL/</a> It’s a decade old by now, but PDF ain’t exactly a new format. That library is not terribly bad for many practical use cases. It's particularly good when you only need to create documents as opposed to editing them, because for that use case you’d want to use an old version of the format anyway, for optimal compatibility.

schlowmo over 1 year ago

PDF is such a weird format. Not so long ago I had to write some Java code for manipulating PDFs: find a string, remove it and place an image at the former string position. I should have known better when I thought "Well, how hard can that be?"

What followed was a deep dive down the rabbit hole, and a lot of fiddling with the same tools the author of this gist is using, trying to make sense of it all.

The final solution worked better than I thought while at the same time feeling incredibly wrong.

I'm very thankful for all the (probably painful) work that went into those open source PDF tools.

crtified over 1 year ago

This brings back horrible memories of working with large complex maps back in the 2000s. Having various CAD and GIS applications generate messy, inefficient spaghetti-coded PDF outputs, then bouncing those PDFs around the Adobe apps of the time to add effects and other prettifications not available in the mapping apps.

It would reach the point where things would start to break, and... "good times were had, by all".

lucascacho over 1 year ago

Every time I read about the hardships of interacting with the PDF format, I gain more respect for Photopea, which has full PDF editing support.

firexcy over 1 year ago

My understanding is that the PDF syntax essentially imitates physical printing in that it instructs the reader software to leave something at a given coordinate on a defined media with supplied resources. Thus it's easily portable but barely mutable.

pmontra over 1 year ago

Some small PDF files are saved as uncompressed text. Invoices are a typical example.

This means that we can open those files, read them as one single string and match the expected text in unit tests. I've got a few projects doing that and it was fine.

If the text is compressed, pipe its content to qpdf first.
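A minimal sketch of that kind of test, assuming the text sits uncompressed in a single string operand. `pdf_contains_text` is a hypothetical helper and the invoice bytes are a hand-built fragment, not a complete PDF; text drawn with subset fonts, hex strings or kerned TJ arrays would not match this way.

```python
def pdf_contains_text(pdf_bytes: bytes, expected: str) -> bool:
    """Return True if `expected` appears verbatim in the raw file bytes."""
    return expected.encode("latin-1") in pdf_bytes


# Hand-built fragment standing in for an uncompressed invoice PDF.
content = b"BT /F1 12 Tf 72 720 Td (Invoice 2023-001) Tj ET"
invoice = (
    b"%PDF-1.4\n"
    b"4 0 obj\n"
    + b"<< /Length %d >>\nstream\n" % len(content)
    + content
    + b"\nendstream\nendobj\n"
)

found = pdf_contains_text(invoice, "Invoice 2023-001")
missing = pdf_contains_text(invoice, "Invoice 2023-002")
print(found, missing)
```

In a real test the bytes would come from `open(path, "rb").read()` on the generated file rather than from a literal.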

mondaymusings over 1 year ago

1. The PDF format is wildly overcapable compared to the majority of actual use (viewing text, tables and images).

2. The number of user devices with unpatched PDF readers is likely large.

3. The system of paywalled scientific knowledge drives millions of students and researchers to get their science PDFs from the scihub and libgen pirate sites, hosted in former Soviet countries, sometimes over http (not https).

These three facts combine into a huge vulnerability space.

On the flip side, a sane and open PDF replacement format that also offered reduced file size could gain many users quickly by convincing scihub and libgen to convert and offer their files in the new format to cut costs and shorten download time, with reduced vulnerability as a positive externality.

tomalbrc over 1 year ago

I had been using Apple's Preview.app to open "encrypted" or protected PDFs for quite a while, until it stopped working (Big Sur).

herbst over 1 year ago

Just a heads up: you can edit PDFs in GIMP. AFAIK it just embeds a huge image in the end, but it's an easy way to add a signature or something.

rogeliodh over 1 year ago

LibreOffice can open and edit PDFs. Last time I tried it was really good. Not sure what limitations are there.

Alifatisk over 1 year ago

Is there any tool that competes with Adobe Acrobat? For example, the censoring tool is rarely found anywhere else.

yair99dd over 1 year ago

Inkscape 1.2 multipage support is great for editing graphics and text on PDFs.

dustypotato over 1 year ago

PSA: if you want to sign a PDF, Firefox does it easily. Works like magic.

elyobo over 1 year ago

"want" is probably a misleading term here

aleden over 1 year ago

I'm surprised no one has mentioned qpdf. <a href="https://qpdf.readthedocs.io/en/stable/overview.html" rel="nofollow noreferrer">https://qpdf.readthedocs.io/en/stable/overview.html</a>

It turns a PDF (typically everything in it is compressed binary blobs) into a mixed binary/ASCII file (which itself is a PDF) that can be edited with vim.
