PgPDF: Pdf Type and Functions for Postgres

97 points by fforflo, 6 months ago

12 comments

nathanwallace, 6 months ago
Readers may also enjoy Steampipe [1], an open source tool for live-querying 140+ services with SQL (AWS, GitHub, CSV, Kubernetes, and others). It uses Postgres Foreign Data Wrappers under the hood and supports joins with other tables. (Disclaimer: I'm a lead on the project.)

[1] https://github.com/turbot/steampipe

branko_d, 6 months ago
Following the links, I find...

    pgPDF: The actual PDF parsing is done by poppler.
    Poppler is a PDF rendering library based on the xpdf-3.0 code base.
    Xpdf is based on XpdfWidget/Qt™, by Glyph & Cog.
    XpdfWidget is based on the same proven code used in Glyph & Cog's XpdfViewer library.
    The XpdfViewer® library / ActiveX control provides a PDF file viewer component for use in Windows applications.

Quite the rabbit hole!

Any licensing complications? Is it cross-platform? XpdfViewer seems to be proprietary and Windows-only.

xrd, 6 months ago
This is fun. It would be interesting to add the ability to query references inside the page, like images. That could be modeled as a foreign key relationship to the page. I'm using some Python libraries to do that, and everything is wrapped in try/except blocks because PDFs are a mess. I wonder how poppler handles those kinds of files.
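
A minimal sketch of the foreign-key idea, for illustration only. The table and column names (`pdf_pages`, `pdf_images`, and so on) are hypothetical and not part of pgPDF, and SQLite via Python's standard library stands in for Postgres so the example runs on its own.

```python
import sqlite3

# Hypothetical schema: every name below is illustrative, not pgPDF's.
SCHEMA = """
CREATE TABLE pdf_pages (
    page_id     INTEGER PRIMARY KEY,
    document    TEXT    NOT NULL,
    page_number INTEGER NOT NULL,
    UNIQUE (document, page_number)
);
CREATE TABLE pdf_images (
    image_id INTEGER PRIMARY KEY,
    page_id  INTEGER NOT NULL REFERENCES pdf_pages (page_id),
    width    INTEGER NOT NULL,
    height   INTEGER NOT NULL
);
"""


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with the page/image schema."""
    conn = sqlite3.connect(":memory:")
    # SQLite does not enforce foreign keys unless asked to.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def images_per_page(conn: sqlite3.Connection, document: str):
    """Return (page_number, image_count) rows for one document."""
    return conn.execute(
        """
        SELECT p.page_number, COUNT(i.image_id)
        FROM pdf_pages AS p
        LEFT JOIN pdf_images AS i ON i.page_id = p.page_id
        WHERE p.document = ?
        GROUP BY p.page_number
        ORDER BY p.page_number
        """,
        (document,),
    ).fetchall()
```

With this shape, an image row cannot exist without its page, and per-page questions become ordinary joins.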

fforflo, 6 months ago
Interesting: I posted this a few days back, and certainly not "10 hours ago". Who was kind enough to re-surface this? Thanks :)

Some clarifications on a few comments I see downstream:

The motivating example was to easily support Full-Text Search (FTS) on PDFs with SQL only (see blog post https://tselai.com/full-text-search-pdf-postgres ). You can treat `pdf` as an alias for `text` and do everything you can do with `text`.

On the next iteration, I made `pdf` a type (a typical varlena object of bytes) to avoid hitting disk all the time. The file is loaded from the disk only once (if it's a valid PDF). One can store the `pdf` type (blob of bytes) as a standard Postgres type and use that for subsequent calls. Postgres will do its magic as usual. There is a potential next step of storing the parsed document just to save some time from re-parsing the bytes, but I deemed it a premature optimization.

aargh_aargh, 6 months ago
For one minute I thought: what a stupid idea, wrong level of abstraction. Now I think I might actually use this in an analysis setting for convenience. I guess I'll quickly find out what kinds of timeouts I'll run into once I ask for the titles of 10k documents.

fernandohur, 6 months ago
The Postgres ecosystem keeps impressing me with its creativity.

ok123456, 6 months ago
It would be neat to see this as a TOAST type in Postgres, where the PDF was kept in a data structure with the PDF parsed. It would be relatively straightforward to perform searches and index/reindex deep into the documents.

dennisy, 6 months ago
Whilst this is cool, why would we want to push this logic into the DB?

It seems cleaner to keep this in the service layer, use any PDF parsing library, and store the parsed files in a schema of our own.

LunaSea, 6 months ago
Interesting!

I wonder what the use case is compared to extracting this information in the programming language and then storing it alongside the PDF in separate table columns?
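
For comparison, the service-layer alternative raised in the two comments above might look roughly like the sketch below. Everything here is an assumption made for illustration: `parse_pdf` is a hypothetical stub standing in for a real PDF parsing library (it just treats the bytes as text whose first line is the title), the `documents` table is invented, and SQLite stands in for Postgres so the example is self-contained.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class ParsedPdf:
    title: str
    body: str


def parse_pdf(raw: bytes) -> ParsedPdf:
    """Hypothetical stand-in for a real PDF parsing library.

    It treats the input as plain text whose first line is the title, only so
    the example runs without third-party dependencies.
    """
    text = raw.decode("utf-8", errors="replace")
    title, _, body = text.partition("\n")
    return ParsedPdf(title=title.strip(), body=body.strip())


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with one invented `documents` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents ("
        " id INTEGER PRIMARY KEY,"
        " raw BLOB NOT NULL,"    # original file bytes
        " title TEXT NOT NULL,"  # extracted once, in the application
        " body TEXT NOT NULL)"   # extracted once, in the application
    )
    return conn


def store(conn: sqlite3.Connection, raw: bytes) -> int:
    """Parse in the application, then store the file next to its extracted fields."""
    parsed = parse_pdf(raw)
    cur = conn.execute(
        "INSERT INTO documents (raw, title, body) VALUES (?, ?, ?)",
        (raw, parsed.title, parsed.body),
    )
    return cur.lastrowid


def search_titles(conn: sqlite3.Connection, term: str) -> list[str]:
    """Return titles of documents whose extracted body contains `term`."""
    rows = conn.execute(
        "SELECT title FROM documents WHERE body LIKE ? ORDER BY id",
        (f"%{term}%",),
    )
    return [title for (title,) in rows]
```

The trade-off in this shape is that the application has to keep the extracted columns in sync with the stored file, whereas a database-side type derives them from the bytes on demand.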

ape4, 6 months ago
Slightly related: are PDFs natively compressed? They would probably compress well since they're often mostly text, which would save space in the database.
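
Some background on the compression question: PDF content streams are commonly stored with the FlateDecode filter, which is zlib/deflate, and Postgres can also compress large values itself through TOAST. The sketch below is only a rough illustration of why compressing already-deflated data gains little; it uses synthetic text rather than a real PDF, and the helper names and word list are made up for the example.

```python
import random
import zlib


def make_text(n_words: int = 5000, seed: int = 0) -> bytes:
    """Build synthetic 'mostly text' data from a small, made-up vocabulary."""
    rng = random.Random(seed)
    vocab = [
        "postgres", "pdf", "table", "index", "query", "page", "text",
        "search", "type", "function", "storage", "bytes", "document",
    ]
    return " ".join(rng.choice(vocab) for _ in range(n_words)).encode("ascii")


def deflate_sizes(data: bytes) -> tuple[int, int, int]:
    """Return sizes of the raw data, one deflate pass, and a second pass."""
    once = zlib.compress(data)
    twice = zlib.compress(once)  # compressing the already-compressed bytes
    return len(data), len(once), len(twice)


if __name__ == "__main__":
    raw, once, twice = deflate_sizes(make_text())
    print(f"raw={raw} once={once} twice={twice}")
```

The first pass shrinks the text substantially, while the second pass gains essentially nothing, so a file whose streams are already deflated is unlikely to shrink much further in the database. How much a given PDF benefits depends on how it was produced.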

skwee357, 6 months ago
I'm having difficulty understanding what the use case for this is.

anonu, 6 months ago
Now create a DBeaver extension to view the type.