科技回声

12 comments

nathanwallace 6 months ago

Readers may also enjoy Steampipe [1], an open source tool to live query 140+ services with SQL (e.g. AWS, GitHub, CSV, Kubernetes). It uses Postgres Foreign Data Wrappers under the hood and supports joins with other tables. (Disclaimer: I'm a lead on the project.)

[1] https://github.com/turbot/steampipe

branko_d 6 months ago

Following the links, I find...

  pgPDF: The actual PDF parsing is done by poppler.
  Poppler is a PDF rendering library based on the xpdf-3.0 code base.
  Xpdf is based on XpdfWidget/Qt™, by Glyph & Cog.
  XpdfWidget is based on the same proven code used in Glyph & Cog's XpdfViewer library.
  The XpdfViewer® library / ActiveX control provides a PDF file viewer component for use in Windows applications.

Quite the rabbit hole!

Any licensing complications? Is it cross-platform? XpdfViewer seems to be proprietary and Windows-only.

xrd 6 months ago

This is fun. It would be interesting to add the ability to query references inside the page, like images. That could be modeled as a foreign key relationship to the page. I'm using some Python libraries to do that and everything is wrapped in try/except blocks because PDFs are a mess. I wonder how poppler handles those kinds of files.

fforflo 6 months ago

Interesting: I posted this a few days back, and certainly not "10 hours ago". Who was kind enough to re-surface this? Thanks :)

Some clarifications on a few comments I see downstream:

The motivating example was to easily support Full-Text Search (FTS) on PDFs with SQL only (see blog post https://tselai.com/full-text-search-pdf-postgres ). You can treat `pdf` as an alias for `text` and do everything possible.

On the next iteration, I made `pdf` a type (typical varlena object of bytes) to avoid hitting disk all the time. The file is loaded from the disk only once (if it's a valid pdf). One can store the `pdf` type (blob of bytes) as a standard Postgres type and use that for subsequent calls. Postgres will do its magic as usual.

There is a potential next step of storing the parsed document just to save some time from re-parsing the bytes, but I deemed it a premature optimization.
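
To make the "FTS on PDFs with SQL only" idea concrete, here is a minimal sketch of the kind of query the comment seems to describe, written as a small Python helper that returns a parameterized statement for a Postgres driver. The table name `documents`, the column name `doc`, and the helper `build_pdf_search` are hypothetical, and the `doc::text` cast is an assumption based on the remark that `pdf` can be treated as an alias for `text`. `to_tsvector`, `plainto_tsquery`, and `@@` are standard Postgres text search.

```python
from typing import Tuple

# Hypothetical table and column names, chosen only for this sketch.
TABLE = "documents"
PDF_COLUMN = "doc"  # assumed to be declared with the extension's `pdf` type


def build_pdf_search(term: str) -> Tuple[str, Tuple[str, ...]]:
    """Return a parameterized query plus its parameters for a driver's execute()."""
    sql = (
        f"SELECT id FROM {TABLE} "
        # Assumption: casting the pdf value to text yields the document's text.
        f"WHERE to_tsvector('english', {PDF_COLUMN}::text) "
        "@@ plainto_tsquery('english', %s)"
    )
    # The search term travels as a bound parameter, never as inline SQL.
    return sql, (term,)


query, params = build_pdf_search("foreign data wrapper")
print(query)
print(params)
```

The search term is passed as a parameter rather than interpolated, so the same statement can be reused for any user-supplied term.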

aargh_aargh 6 months ago

For one minute I thought - what a stupid idea, wrong level of abstraction. Now I think I might actually use this in an analysis setting for convenience. I guess I'll quickly find out what kinds of timeouts I'll run into once I ask for the titles of 10k documents.

fernandohur 6 months ago

The Postgres ecosystem keeps impressing me with its creativity.

ok123456 6 months ago

It would be neat to see this as a TOAST type in Postgres, where the PDF was kept in a data structure with the PDF parsed. It would be relatively straightforward to perform searches and index/reindex deep into the documents.

dennisy 6 months ago

Whilst this is cool, why would we want to push this logic into the DB?

It seems cleaner to keep this in the service layer and use any PDF parsing library and subsequent schema to store the parsed files.

LunaSea 6 months ago

Interesting!

I wonder what the use case is compared to extracting this information in the programming language and then storing it alongside the PDF in separate table columns?

ape4 6 months ago

Slightly related - Are PDFs natively compressed? They would probably compress well since they're often mostly text. Saving space in the database.
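
On the compression question: many PDF writers already compress content streams with the Flate (deflate) filter, so a second round of compression on the stored bytes tends to gain little. The sketch below illustrates only that general effect with Python's `zlib`. It does not parse a real PDF, and the filler text, vocabulary, and sizes are arbitrary assumptions.

```python
import random
import zlib

# Build word-like filler text as a stand-in for a page's text content.
# The vocabulary and sizes are arbitrary choices for this sketch.
rng = random.Random(0)
vocabulary = ["postgres", "pdf", "query", "index", "page", "text", "table", "search"]
raw = " ".join(rng.choice(vocabulary) for _ in range(5000)).encode("ascii")

# First pass: roughly what a PDF writer does when it deflates a stream.
flate_stream = zlib.compress(raw)

# Second pass: compressing the already-compressed bytes again,
# as a database or filesystem layer would be doing.
recompressed = zlib.compress(flate_stream)

first_ratio = len(flate_stream) / len(raw)
second_ratio = len(recompressed) / len(flate_stream)

print(f"raw text:          {len(raw)} bytes")
print(f"after one pass:    {len(flate_stream)} bytes (ratio {first_ratio:.2f})")
print(f"after second pass: {len(recompressed)} bytes (ratio {second_ratio:.2f})")
```

How much a given PDF shrinks in practice depends on how it was produced: files with uncompressed streams can still compress well, while files dominated by compressed streams or images usually will not.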

skwee357 6 months ago

I'm having difficulty understanding what the use case for this is.

anonu 6 months ago

Now create a DBeaver extension to view the type.

PgPDF: Pdf Type and Functions for Postgres
