Show HN: PDFx – Extract Metadata and URLs from PDFs, and Download Referenced PDFs

93 points by metachris over 9 years ago

9 comments

rossmounce over 9 years ago
Most PDF scientific articles sadly don't have good embedded metadata [0], so this & the DOI issue make this not very useful (at least for the journals I read).

I also would have implemented this with a simpler shell script calling exiftool [1] and pdftotext [2], but hey; fun to have a python-based implementation :)

[0] http://rossmounce.co.uk/2012/12/31/pdf-metadata-why-so-poor/
[1] http://www.sno.phy.queensu.ca/~phil/exiftool/
[2] http://poppler.freedesktop.org/

chriswarbo over 9 years ago
On a related note, these past couple of weeks I've found myself wanting to import several years' worth of accumulated PDFs into a BibTeX file. This has involved metadata extraction, text scraping, querying Google Scholar (good, but rate-limited) and CrossRef (no limit, but not as accurate).

I've written a very rough guide to the approaches I've taken so far at http://chriswarbo.net/essays/pdf-tools.html , with a bunch of links to external tools, some NixOS package definitions, commandline snippets and descriptions of Emacs macros.

Not quite the same problem as the author's, but the tools and scripts I've been using can do similar things :)

adelevie over 9 years ago
This is really neat! For work, I've found myself from time to time exploring the tech around PDFs. I find this tech strangely fascinating. It's like a shim on top of something old and ugly that enables integration with much more modern systems.

Some quick feedback (and a shameless plug):

The CLI should output JSON. It would be nice to combine it with a CLI JSON parser such as jq [0].

Shameless plug: I've been working on a PDF CLI aimed at making it easier to programmatically fill out PDF forms: https://github.com/adelevie/pdfq. It provides an interface and some wrappers on top of the main PDF form-filling tool, pdftk. For example, you can get JSON out of a PDF form like this:

    pdftk hello.pdf dump_data_fields | pdfq

Or you can generate FDF from a JSON file:

    cat hello.json | pdfq json_to_fdf

You can also fill a PDF without touching any FDF code:

    pdfq set foo bar input.pdf output.pdf

[0] https://stedolan.github.io/jq/
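
To illustrate the kind of JSON output suggested above, here is a minimal Python sketch that turns text in the style of `pdftk ... dump_data_fields` output into JSON. It is not pdfq's actual implementation: the function name `fields_to_json` and the sample data are hypothetical, and it assumes records are separated by `---` lines and made up of `Key: Value` lines.

```python
import json


def fields_to_json(dump_text):
    """Convert `Key: Value` records separated by `---` lines into a JSON string."""
    records = []
    current = {}
    for line in dump_text.splitlines():
        line = line.strip()
        if line == "---":
            # A separator starts a new record; keep the previous one if non-empty.
            if current:
                records.append(current)
            current = {}
        elif ": " in line:
            # Split on the first ": " only, so values may contain colons.
            # Note: a repeated key within one record overwrites the earlier value.
            key, value = line.split(": ", 1)
            current[key] = value
    if current:
        records.append(current)
    return json.dumps(records, indent=2)


# Hypothetical sample input, made up for illustration.
SAMPLE = """---
FieldType: Text
FieldName: foo
FieldValue: bar
---
FieldType: Text
FieldName: baz
FieldValue: qux
"""

if __name__ == "__main__":
    print(fields_to_json(SAMPLE))
```

Output in this shape can be piped straight into a tool such as jq, which is the main appeal of emitting JSON from a CLI.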

retSava over 9 years ago
Slightly off-topic, pardon me. But, does someone have any good tips on how to remove pdf security?

Let me clarify why; I frequently come across datasheets (eg to flash memory ICs) that have security enabled for some strange reason. Nothing secret, just plainly downloaded from the Internet. I can open, and print, but not highlight or add remarks.

Existing solutions I've found so far are inadequate since they typically are either 'download this obscure-sounding executable', 'upload and convert on this sketchy possibly-malware-injecting-website', or resort to printing the entire thing to a new pdf document (eg via PDF creator) - but this makes text un-highlightable.

I don't mind anything involving hex-editing, some node.js or python-lib, or chanting and dancing, as long as it gets the job done.

I just want to be able to highlight and copy text :(

fit2rule over 9 years ago
Nice work Chris .. doesn't work on all my PDFs, though:

    j@w1x8-dev:~/Documents/PDF Documents {} $ pdfx xhyve\ –\ Lightweight\ Virtualization\ on\ OS\ X\ Based\ on\ bhyve\ _\ pagetable.pdf
    Traceback (most recent call last):
      File "/usr/local/bin/pdfx", line 9, in <module>
        load_entry_point('pdfx==1.0.1', 'console_scripts', 'pdfx')()
      File "build/bdist.macosx-10.10-x86_64/egg/pdfx/cli.py", line 66, in main
      File "build/bdist.macosx-10.10-x86_64/egg/pdfx/__init__.py", line 137, in __init__
    AttributeError: 'NoneType' object has no attribute 'items'
    j@w1x8-dev:~/Documents/PDF Documents {}

If you want some sample PDFs on which it is borked, just let me know .. in the meantime I'm using pdf_scraper for most of these ..
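
The traceback above ends in `AttributeError: 'NoneType' object has no attribute 'items'`, which is what Python raises when `.items()` is called on `None`. The pdfx source is not shown in this thread, so the following is only a sketch of the general defensive pattern. The `collect_metadata` helper is hypothetical, and it assumes the failing value is an optional metadata dictionary.

```python
def collect_metadata(info):
    """Copy key/value pairs from `info`, treating None as 'no metadata'.

    Hypothetical helper for illustration; not taken from pdfx.
    """
    metadata = {}
    # `info or {}` substitutes an empty dict when `info` is None,
    # so `.items()` is never called on None.
    for key, value in (info or {}).items():
        metadata[key] = value
    return metadata


if __name__ == "__main__":
    print(collect_metadata(None))
    print(collect_metadata({"Title": "xhyve"}))
```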

aroch over 9 years ago
While nice, it really only works when the reference has a direct link to the PDF, while the majority of citations in the sciences use the DOI.

DOI traversal would be required.
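
As a rough illustration of where DOI traversal could start, here is a sketch that finds DOI-like strings in extracted text and turns them into resolver URLs. The regular expression is a common heuristic rather than a complete DOI grammar, and the names `DOI_PATTERN` and `doi_urls` are hypothetical, not part of PDFx.

```python
import re

# Heuristic: "10.", a 4-9 digit registrant prefix, "/", then a suffix
# running up to whitespace, a quote, or an angle bracket.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")


def doi_urls(text):
    """Return de-duplicated https://doi.org/ URLs for DOI-like strings in `text`."""
    urls = []
    for match in DOI_PATTERN.finditer(text):
        # Trailing punctuation usually belongs to the sentence, not the DOI.
        doi = match.group(0).rstrip(".,;)")
        url = "https://doi.org/" + doi
        if url not in urls:
            urls.append(url)
    return urls


if __name__ == "__main__":
    # Made-up DOIs, for illustration only.
    print(doi_urls("See doi:10.1000/xyz123, and also 10.1234/abc.def."))
```

Getting from the resulting `https://doi.org/` link to an actual PDF would still need publisher-specific handling, which is the harder part of the problem raised here.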

metachris over 9 years ago
Glad if this tool/lib is useful to some. I'm happy to answer any and all questions!

A Kivy [1] based cross-platform GUI would be a nice addition at some point.

[1] http://kivy.org

afenc over 9 years ago
I've installed pdfx and saw the help info as in the demo. But when I tried to download the 17 PDF files from the example, the following error message appeared at the end:

ERROR 2: len() takes exactly one argument (2 given)

What does this mean??
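
For context, Python raises a `TypeError` with this kind of message when the built-in `len()` is called with two arguments instead of one. Where that happens inside pdfx is not shown in this thread, so the snippet below only reproduces the class of error, using a hypothetical `describe_len_error` helper.

```python
def describe_len_error():
    """Call len() with two arguments on purpose and return the error text."""
    try:
        # len() accepts exactly one argument, so this call always fails.
        len([1, 2], [3])
    except TypeError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(describe_len_error())
```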

based2 over 9 years ago
Related: Apache Tika 1.11 release

http://mail-archives.apache.org/mod_mbox/www-announce/201510.mbox/%3CD2530D01.51642%25mattmann%40apache.org%3E