TechEcho

9 comments

rossmounce over 9 years ago

Most PDF scientific articles sadly don't have good embedded metadata [0], so this & the DOI issue make this not very useful (at least for the journals I read).

I also would have implemented this with a simpler shell script calling exiftool [1] and pdftotext [2], but hey; fun to have a python-based implementation :)

[0] http://rossmounce.co.uk/2012/12/31/pdf-metadata-why-so-poor/
[1] http://www.sno.phy.queensu.ca/~phil/exiftool/
[2] http://poppler.freedesktop.org/
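
A minimal sketch of the exiftool + pdftotext approach mentioned above, written in Python rather than shell so it matches pdfx's language. It is not how pdfx itself works. It assumes `exiftool` and `pdftotext` are on the PATH; the helper names and the URL pattern are illustrative, not taken from any of the tools.

```python
import json
import re
import subprocess

# Deliberately simple URL pattern (an assumption for this sketch);
# URLs wrapped across lines in a PDF will need more care.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_urls(text):
    """Return the unique URLs found in text, in order of first appearance."""
    seen = []
    for match in URL_PATTERN.findall(text):
        url = match.rstrip(".,;:")  # drop trailing sentence punctuation
        if url not in seen:
            seen.append(url)
    return seen


def read_metadata(pdf_path):
    """Ask exiftool for the metadata as JSON (needs exiftool installed)."""
    out = subprocess.run(
        ["exiftool", "-json", pdf_path],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)[0]  # exiftool emits a list, one object per file


def read_text(pdf_path):
    """Dump the text layer to stdout with pdftotext (needs poppler-utils)."""
    return subprocess.run(
        ["pdftotext", pdf_path, "-"],
        check=True, capture_output=True, text=True,
    ).stdout
```

Something like `extract_urls(read_text("paper.pdf"))` would then list the links in a paper, with `paper.pdf` standing in for a real file.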

chriswarbo over 9 years ago

On a related note, these past couple of weeks I've found myself wanting to import several years' worth of accumulated PDFs into a BibTeX file. This has involved metadata extraction, text scraping, querying Google Scholar (good, but rate-limited) and CrossRef (no limit, but not as accurate).

I've written a very rough guide to the approaches I've taken so far at http://chriswarbo.net/essays/pdf-tools.html , with a bunch of links to external tools, some NixOS package definitions, commandline snippets and descriptions of Emacs macros.

Not quite the same problem as the author's, but the tools and scripts I've been using can do similar things :)
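
For the CrossRef step, a lookup might look roughly like the sketch below. It assumes the public CrossRef REST API at `api.crossref.org` and its `query.bibliographic` parameter; the function names are made up for illustration and are not from the guide linked above. As noted, the top match is not always the right paper, so the result should be checked.

```python
import json
import urllib.parse
import urllib.request

CROSSREF_WORKS = "https://api.crossref.org/works"


def crossref_query_url(citation, rows=1):
    """Build a CrossRef search URL for a free-text citation string."""
    params = {"query.bibliographic": citation, "rows": rows}
    return CROSSREF_WORKS + "?" + urllib.parse.urlencode(params)


def best_doi(citation):
    """Return the DOI of the top-ranked match, or None (needs network)."""
    url = crossref_query_url(citation)
    with urllib.request.urlopen(url, timeout=30) as resp:
        items = json.load(resp)["message"]["items"]
    return items[0]["DOI"] if items else None
```

The citation string can be whatever was scraped from the PDF, for example a title plus first author and year.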

adelevie over 9 years ago

This is really neat! For work, I've found myself from time to time exploring the tech around PDFs. I find this tech strangely fascinating. It's like a shim on top of something old and ugly that enables integration with much more modern systems.

Some quick feedback (and a shameless plug):

The CLI should output JSON. It would be nice to combine it with a CLI JSON parser such as jq [0].

Shameless plug: I've been working on a PDF CLI aimed at making it easier to programmatically fill out PDF forms: https://github.com/adelevie/pdfq. It provides an interface and some wrappers on top of the main PDF form-filling tool, pdftk. For example, you can get JSON out of a PDF form like this:

    pdftk hello.pdf dump_data_fields | pdfq

Or you can generate FDF from a JSON file:

    cat hello.json | pdfq json_to_fdf

You can also fill a PDF without touching any FDF code:

    pdfq set foo bar input.pdf output.pdf

[0] https://stedolan.github.io/jq/
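
To illustrate the kind of conversion involved in the first command, here is a rough Python sketch that turns `pdftk ... dump_data_fields` text into JSON-ready dicts. This is not pdfq's code. It assumes the usual pdftk layout of `---` separators followed by `Key: value` lines, and `parse_data_fields` is a hypothetical name.

```python
import json


def parse_data_fields(dump):
    """Parse pdftk dump_data_fields text into a list of dicts, one per field."""
    fields = []
    current = None
    for line in dump.splitlines():
        line = line.strip()
        if line == "---":
            # A separator line starts a new form field.
            current = {}
            fields.append(current)
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            # Note: repeated keys (e.g. several FieldStateOption lines)
            # overwrite each other in this simplified version.
            current[key.strip()] = value.strip()
    return fields


def to_json(dump):
    """Serialize the parsed fields so they can be piped into jq."""
    return json.dumps(parse_data_fields(dump), indent=2)
```

Printing `to_json(...)` on the pdftk output gives something jq can filter, which is the sort of pipeline suggested for pdfx above.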

retSava over 9 years ago

Slightly off-topic, pardon me. But, does someone have any good tips on how to remove pdf security?

Let me clarify why; I frequently come across datasheets (eg to flash memory ICs) that have security enabled for some strange reason. Nothing secret, just plainly downloaded from the Internet. I can open, and print, but not highlight or add remarks.

Existing solutions I've found so far are inadequate since they typically are either 'download this obscure-sounding executable', 'upload and convert on this sketchy possibly-malware-injecting-website', or resort to printing the entire thing to a new pdf document (eg via PDF creator) - but this makes text un-highlightable.

I don't mind anything involving hex-editing, some node.js or python-lib, or chanting and dancing, as long as it gets the job done.

I just want to be able to highlight and copy text :(

fit2rule over 9 years ago

Nice work Chris .. doesn't work on all my PDFs, though:

    j@w1x8-dev:~/Documents/PDF Documents {} $ pdfx xhyve\ –\ Lightweight\ Virtualization\ on\ OS\ X\ Based\ on\ bhyve\ _\ pagetable.pdf
    Traceback (most recent call last):
      File "/usr/local/bin/pdfx", line 9, in <module>
        load_entry_point('pdfx==1.0.1', 'console_scripts', 'pdfx')()
      File "build/bdist.macosx-10.10-x86_64/egg/pdfx/cli.py", line 66, in main
      File "build/bdist.macosx-10.10-x86_64/egg/pdfx/__init__.py", line 137, in __init__
    AttributeError: 'NoneType' object has no attribute 'items'
    j@w1x8-dev:~/Documents/PDF Documents {}

If you want some sample PDFs on which it is borked, just let me know .. in the meantime I'm using pdf_scraper for most of these ..

aroch over 9 years ago

While nice, it really only works when the reference has a direct link to the PDF, while the majority of citations in the sciences use the DOI. DOI traversal would be required.
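
A rough sketch of where DOI traversal could start: pull DOI-shaped strings out of the reference text and turn them into `https://doi.org/` resolver links. The pattern below is a commonly used heuristic rather than a complete DOI grammar, and the function names are hypothetical. Note that the resolver usually leads to a publisher landing page, not straight to a PDF, so a further publisher-specific step would still be needed.

```python
import re
import urllib.parse

# Heuristic DOI pattern (an assumption for this sketch): "10.", a 4-9 digit
# registrant code, a slash, then a suffix of common DOI characters.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Z0-9]+", re.IGNORECASE)


def extract_dois(text):
    """Return the unique DOIs found in text, in order of first appearance."""
    dois = []
    for match in DOI_PATTERN.findall(text):
        doi = match.rstrip(".,;")  # drop trailing sentence punctuation
        if doi not in dois:
            dois.append(doi)
    return dois


def doi_to_url(doi):
    """Build the resolver URL for a DOI, percent-encoding unusual characters."""
    return "https://doi.org/" + urllib.parse.quote(doi, safe="/")
```

Each resolver URL could then be fetched and followed through its redirects to see where the publisher sends it.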

metachris over 9 years ago

Glad if this tool/lib is useful to some. I'm happy to answer any and all questions!

A Kivy [1] based cross-platform GUI would be a nice addition at some point.

[1] http://kivy.org

afenc over 9 years ago

I've installed pdfx and saw the help info as in the demo. But when I tried to download the 17 PDF files from the example, the following error message appeared at the end:

    ERROR 2: len() takes exactly one argument (2 given)

What does this mean?

based2 over 9 years ago

Related: Apache Tika 1.11 release

http://mail-archives.apache.org/mod_mbox/www-announce/201510.mbox/%3CD2530D01.51642%25mattmann%40apache.org%3E

Show HN: PDFx – Extract Metadata and URLs from PDFs, and Download Referenced PDFs
