Sadly, most scientific-article PDFs don't have good embedded metadata [0]. That, together with the DOI issue, makes this tool not very useful (at least for the journals I read).<p>I would also have implemented this as a simpler shell script calling exiftool [1] and pdftotext [2], but hey, it's fun to have a Python-based implementation :)<p>[0] <a href="http://rossmounce.co.uk/2012/12/31/pdf-metadata-why-so-poor/" rel="nofollow">http://rossmounce.co.uk/2012/12/31/pdf-metadata-why-so-poor/</a>
[1] <a href="http://www.sno.phy.queensu.ca/~phil/exiftool/" rel="nofollow">http://www.sno.phy.queensu.ca/~phil/exiftool/</a>
[2] <a href="http://poppler.freedesktop.org/" rel="nofollow">http://poppler.freedesktop.org/</a>
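For illustration, the "pdftotext plus a bit of scripting" idea above could be sketched in Python roughly as follows. This assumes `pdftotext` from Poppler is on the PATH; the function names and both regular expressions are my own assumptions (they will miss or over-match some references) and are not how pdfx works.

```python
import re
import subprocess

# Rough patterns; both are assumptions and will not catch every reference.
URL_RE = re.compile(r"https?://[^\s<>\"')\]]+")
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")


def pdf_to_text(path):
    """Return the text of a PDF by running `pdftotext PATH -` (text to stdout)."""
    result = subprocess.run(
        ["pdftotext", path, "-"], check=True, capture_output=True, text=True
    )
    return result.stdout


def find_references(text):
    """Return sorted, de-duplicated URLs and DOIs found in plain text."""
    # Trailing sentence punctuation is stripped because it is rarely part of the link.
    urls = {u.rstrip(".,;") for u in URL_RE.findall(text)}
    dois = {d.rstrip(".,;") for d in DOI_RE.findall(text)}
    return sorted(urls), sorted(dois)
```

Something like `find_references(pdf_to_text("paper.pdf"))` would then give two lists to work with.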
On a related note, these past couple of weeks I've found myself wanting to import several years' worth of accumulated PDFs into a BibTeX file. This has involved metadata extraction, text scraping, querying Google Scholar (good, but rate-limited) and CrossRef (no limit, but not as accurate).<p>I've written a very rough guide to the approaches I've taken so far at <a href="http://chriswarbo.net/essays/pdf-tools.html" rel="nofollow">http://chriswarbo.net/essays/pdf-tools.html</a> , with a bunch of links to external tools, some NixOS package definitions, commandline snippets and descriptions of Emacs macros.<p>Not quite the same problem as the author's, but the tools and scripts I've been using can do similar things :)
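As a rough idea of what the CrossRef lookup step can look like, here is a minimal Python sketch against the public CrossRef REST API (`api.crossref.org/works` with the `query.bibliographic` parameter). The helper names are hypothetical, and taking the first hit as the best match is an assumption; as noted above, the results are not always accurate, so the match should be checked by hand.

```python
import json
import urllib.parse
import urllib.request

CROSSREF_WORKS = "https://api.crossref.org/works"


def crossref_query_url(citation, rows=1):
    """Build a CrossRef works query URL for a free-text citation string."""
    params = {"query.bibliographic": citation, "rows": rows}
    return CROSSREF_WORKS + "?" + urllib.parse.urlencode(params)


def top_hit(response):
    """Pick the DOI and title of the first item in a decoded CrossRef response."""
    items = response.get("message", {}).get("items", [])
    if not items:
        return None
    first = items[0]
    titles = first.get("title") or [""]  # CrossRef returns titles as a list
    return {"doi": first.get("DOI"), "title": titles[0]}


def lookup(citation):
    """Query CrossRef over the network and return the first match, if any."""
    with urllib.request.urlopen(crossref_query_url(citation), timeout=30) as resp:
        return top_hit(json.load(resp))
```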
This is really neat! For work, I've found myself from time to time exploring the tech around PDFs. I find this tech strangely fascinating. It's like a shim on top of something old and ugly that enables integration with much more modern systems.<p>Some quick feedback (and a shameless plug):<p>The CLI should output JSON. It would be nice to combine it with a command-line JSON processor such as jq [0].<p>Shameless plug: I've been working on a PDF CLI aimed at making it easier to programmatically fill out PDF forms: <a href="https://github.com/adelevie/pdfq" rel="nofollow">https://github.com/adelevie/pdfq</a>. It provides an interface and some wrappers on top of the main PDF form-filling tool, pdftk. For example, you can get JSON out of a PDF form like this:<p><pre><code> pdftk hello.pdf dump_data_fields | pdfq
</code></pre>
Or you can generate FDF from a JSON file:<p><pre><code> cat hello.json | pdfq json_to_fdf
</code></pre>
You can also fill a PDF without touching any FDF code:<p><pre><code> pdfq set foo bar input.pdf output.pdf
</code></pre>
[0] <a href="https://stedolan.github.io/jq/" rel="nofollow">https://stedolan.github.io/jq/</a>
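To make the "JSON out of a PDF form" step concrete, here is a small Python sketch that turns `pdftk ... dump_data_fields` output (records separated by `---`, one `Key: value` pair per line) into JSON that jq could consume. It is only an illustration of the idea and not pdfq's implementation; the function names are made up, and collecting repeated keys into a list is my own choice.

```python
import json


def parse_fields(dump):
    """Parse `pdftk ... dump_data_fields` style text into a list of dicts."""
    fields = []
    current = {}
    for line in dump.splitlines():
        line = line.strip()
        if line == "---":
            # A separator starts a new record; keep the previous one if non-empty.
            if current:
                fields.append(current)
            current = {}
            continue
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "Key: value" line
        value = value.strip()
        if key in current:
            # Repeated keys (e.g. FieldStateOption) are collected in a list.
            previous = current[key]
            current[key] = previous + [value] if isinstance(previous, list) else [previous, value]
        else:
            current[key] = value
    if current:
        fields.append(current)
    return fields


def fields_to_json(dump):
    """Return the parsed fields as a JSON string."""
    return json.dumps(parse_fields(dump), indent=2)
```

Feeding it the text from standard input and printing `fields_to_json(...)` would give output that can be piped straight into jq.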
Slightly off-topic, pardon me. But does anyone have any good tips on how to remove PDF security?<p>Let me clarify why: I frequently come across datasheets (e.g. for flash memory ICs) that have security enabled for some strange reason. Nothing secret, just plainly downloaded from the Internet. I can open and print them, but not highlight or add remarks.<p>The existing solutions I've found so far are inadequate, since they are typically either 'download this obscure-sounding executable', 'upload and convert on this sketchy possibly-malware-injecting website', or resort to printing the entire thing to a new PDF document (e.g. via PDF creator), which makes the text un-highlightable.<p>I don't mind anything involving hex-editing, some node.js or Python lib, or chanting and dancing, as long as it gets the job done.<p>I just want to be able to highlight and copy text :(
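One option that runs locally is the open-source qpdf tool, whose `--decrypt` option writes an unencrypted copy of a file. A minimal Python wrapper might look like the sketch below. It assumes qpdf is installed and that the PDF opens without a password (as described above); the function names are hypothetical.

```python
import shutil
import subprocess


def qpdf_decrypt_command(src, dst):
    """Return the argument list for `qpdf --decrypt SRC DST`."""
    return ["qpdf", "--decrypt", src, dst]


def remove_restrictions(src, dst):
    """Write an unencrypted copy of src to dst using a locally installed qpdf."""
    if shutil.which("qpdf") is None:
        raise RuntimeError("qpdf not found on PATH")
    result = subprocess.run(qpdf_decrypt_command(src, dst))
    # qpdf exits with 0 on success and 3 when it succeeded with warnings.
    if result.returncode not in (0, 3):
        raise RuntimeError("qpdf failed with exit code %d" % result.returncode)
```

Because the text layer is preserved, the copy should stay highlightable, unlike the print-to-PDF route.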
Nice work Chris .. it doesn't work on all my PDFs, though:<p><pre><code> j@w1x8-dev:~/Documents/PDF Documents {}
$ pdfx xhyve\ –\ Lightweight\ Virtualization\ on\ OS\ X\ Based\ on\ bhyve\ _\ pagetable.pdf
Traceback (most recent call last):
File "/usr/local/bin/pdfx", line 9, in <module>
load_entry_point('pdfx==1.0.1', 'console_scripts', 'pdfx')()
File "build/bdist.macosx-10.10-x86_64/egg/pdfx/cli.py", line 66, in main
File "build/bdist.macosx-10.10-x86_64/egg/pdfx/__init__.py", line 137, in __init__
AttributeError: 'NoneType' object has no attribute 'items'
j@w1x8-dev:~/Documents/PDF Documents {}
</code></pre>
If you want some sample PDFs on which it is borked, just let me know .. in the meantime I'm using pdf_scraper for most of these ..
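The traceback above suggests that something expected to be a dict was `None` at the point where `.items()` was called. Without looking at the pdfx source I can only guess at the cause, but the usual defensive pattern for this kind of error looks like the sketch below. The function and argument names are hypothetical and not taken from pdfx.

```python
def metadata_items(metadata):
    """Return (key, value) pairs, treating a missing metadata dict as empty."""
    # `metadata or {}` substitutes an empty dict when the lookup returned None.
    return list((metadata or {}).items())
```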
While nice, it really only works when the reference has a direct link to the PDF, whereas the majority of citations in the sciences use a DOI.<p>DOI traversal would be required.
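As a rough sketch of what DOI traversal could involve: normalise the DOI into a `https://doi.org/` URL and follow the resolver's redirects. The helper names below are my own assumptions, and the resolved URL is typically the publisher's landing page rather than the PDF itself, so a further publisher-specific step would still be needed.

```python
import re
import urllib.request

# Strips a leading "doi:" label or an existing (dx.)doi.org resolver prefix.
DOI_PREFIX_RE = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)


def doi_url(doi):
    """Normalise a DOI string into a https://doi.org/ resolver URL."""
    bare = DOI_PREFIX_RE.sub("", doi.strip())
    return "https://doi.org/" + bare


def resolve(doi):
    """Follow the resolver's redirects and return the final landing URL."""
    request = urllib.request.Request(
        doi_url(doi), headers={"User-Agent": "doi-sketch/0.1"}
    )
    with urllib.request.urlopen(request, timeout=30) as resp:
        return resp.geturl()
```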
Glad if this tool/lib is useful to some. I'm happy to answer any and all questions!<p>A Kivy [1] based cross-platform GUI would be a nice addition at some point.<p>[1] <a href="http://kivy.org" rel="nofollow">http://kivy.org</a>
I've installed pdfx and got the same help output as in the demo. But when I tried to download the 17 PDF files from the example, the following error message appeared at the end:<p>ERROR 2: len() takes exactly one argument (2 given)<p>What does this mean?
Related: the Apache Tika 1.11 release<p><a href="http://mail-archives.apache.org/mod_mbox/www-announce/201510.mbox/%3CD2530D01.51642%25mattmann%40apache.org%3E" rel="nofollow">http://mail-archives.apache.org/mod_mbox/www-announce/201510...</a>