This is really neat. It looks like it's using a similar trick to https://phiresky.github.io/blog/2021/hosting-sqlite-databases-on-github-pages/ from last year - querying SQLite databases hosted as static files using the HTTP Range header to fetch back just the pages from the database file that are needed by a specific query.
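For anyone curious what the trick looks like on the wire, here's a minimal sketch (the URL is a placeholder): a single ranged GET pulls down just the 100-byte SQLite header, which is enough to read the page size stored big-endian at offset 16.

  # Sketch: read just the 100-byte SQLite header from a statically hosted
  # database file with an HTTP Range request (the URL is a placeholder).
  import urllib.request

  url = "https://example.com/static/mydb.sqlite3"
  req = urllib.request.Request(url, headers={"Range": "bytes=0-99"})
  with urllib.request.urlopen(req) as resp:
      header = resp.read()

  assert header[:16] == b"SQLite format 3\x00"
  # The page size is a 2-byte big-endian value at offset 16 (1 means 65536).
  page_size = int.from_bytes(header[16:18], "big")
  page_size = 65536 if page_size == 1 else page_size
  print("page size:", page_size)
  # A VFS built on this idea issues one such ranged GET per page it needs.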
I made a similar SQLite VFS in Go[1]. I really love the VFS API. HTTP range queries are about the mildest thing you can do with it. The VFS implementations that use non-traditional backing stores are more fun. Like, say, using DynamoDB as a read/write backing store for your SQLite database[2].

[1]: https://github.com/psanford/sqlite3vfshttp

[2]: https://github.com/psanford/donutdb
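For a feel of the hook points involved, here's a rough, untested Python sketch of the API's shape using APSW's VFS wrapper (the class names are made up; the passthrough xRead is where an HTTP-range or DynamoDB backend would plug in):

  # Rough, untested sketch of the shape of APSW's VFS API; the class names are
  # made up, and the passthrough xRead is the hook where an HTTP-range or
  # DynamoDB backend would go.
  import apsw

  class RangeVFS(apsw.VFS):
      def __init__(self, vfsname="rangevfs", basevfs=""):
          self.base = basevfs
          apsw.VFS.__init__(self, vfsname, self.base)

      def xOpen(self, name, flags):
          return RangeVFSFile(self.base, name, flags)

  class RangeVFSFile(apsw.VFSFile):
      def __init__(self, basevfs, name, flags):
          apsw.VFSFile.__init__(self, basevfs, name, flags)

      def xRead(self, amount, offset):
          # A read-only HTTP backend would issue "Range: bytes=offset-..." here
          # instead of delegating to the underlying local file.
          return super().xRead(amount, offset)

  # Usage: RangeVFS(); apsw.Connection("mydb.sqlite3", vfs="rangevfs")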
Something really neat about SQLite is how self-contained the database file is. You can take, say, a large book or log output and copy it into a table with full-text search. Optimize the DB (VACUUM) to get rid of empty space, then ship that out to S3 or something. Now you can fetch it down and have indexed and FTS access to the contents.

It's a cool way to store data, with indexes into it, in a way that makes it as easy to use as copying around and opening a file; no need for dumps, restores, or server processes.
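A minimal sketch of that workflow, assuming an SQLite build with FTS5 and boto3 (file, bucket, and key names are placeholders):

  # Sketch: build a self-contained, full-text-searchable SQLite file and ship
  # it to S3 (assumes an FTS5-enabled build; file/bucket/key are placeholders).
  import sqlite3
  import boto3

  con = sqlite3.connect("book.sqlite3")
  con.execute("CREATE VIRTUAL TABLE pages USING fts5(body)")
  with open("book.txt") as f:
      con.executemany("INSERT INTO pages (body) VALUES (?)",
                      ((line,) for line in f))
  con.commit()
  con.execute("VACUUM")  # compact the file before shipping it
  con.close()

  boto3.client("s3").upload_file("book.sqlite3", "my-bucket", "book.sqlite3")

  # Later, anywhere: fetch (or range-read) the file and query it directly,
  # e.g. SELECT rowid FROM pages WHERE pages MATCH 'some phrase';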
This is cool. It would be interesting to see some stats, like how many separate requests are needed for a single SELECT statement, since S3 charges per GET even on a small range, I believe. Obviously that would depend on the query and various DB settings like page_size. Also wondering if this does aggressive local caching for the same reason.
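One knob that directly affects the GET count is the page size, which can be raised before compacting the file; a small sketch (how much it actually helps depends on the query and indexes):

  # Sketch: raise the page size before VACUUM so each ranged GET returns more
  # data per request (whether that is a net win depends on the workload).
  import sqlite3

  con = sqlite3.connect("mydb.sqlite3")
  con.execute("PRAGMA page_size = 65536")  # maximum page size
  con.execute("VACUUM")                    # rebuilds the file at the new size
  print(con.execute("PRAGMA page_size").fetchone()[0])
  con.close()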
This is cool, I'll probably end up using this (we store sqlite snapshots in S3).

It'd be very handy if this were packaged as a CLI that worked like `sqlite3`, so you could get a SQL REPL by running:

  s3sqlite mybucket/mydb.sqlite3
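A rough sketch of what such a wrapper could look like; everything here is hypothetical, and for simplicity it just downloads the whole file rather than going through the S3-backed VFS:

  #!/usr/bin/env python3
  # Hypothetical sketch of an "s3sqlite BUCKET/KEY" style SQL REPL wrapper.
  # For simplicity it downloads the whole file; the real thing would open it
  # through the S3-backed VFS instead.
  import os
  import sqlite3
  import sys
  import tempfile

  import boto3

  def open_s3_connection(bucket, key):
      fd, path = tempfile.mkstemp(suffix=".sqlite3")
      os.close(fd)
      boto3.client("s3").download_file(bucket, key, path)
      return sqlite3.connect(path)

  def main():
      bucket, _, key = sys.argv[1].partition("/")
      con = open_s3_connection(bucket, key)
      while True:
          try:
              sql = input("sqlite> ")
          except EOFError:
              break
          for row in con.execute(sql):
              print(row)

  if __name__ == "__main__":
      main()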
I feel like people are pushing SQLite beyond its limits with this stuff. If you're going to do something like this, wouldn't just a normal client/server option like PostgreSQL be way more performant?

I think SQLite is fantastic, but this just smacks of scope creep.
This would be quite convenient when dealing with MBTiles files hosted on S3 or similar. There are efforts such as PMTiles[1] that let us use HTTP Range queries to retrieve tiles directly from S3, but we still have to convert MBTiles to PMTiles first.

More importantly, OSM is considering providing MBTiles directly (besides PBF)[2].

[1]: https://protomaps.com/docs/pmtiles/

[2]: https://github.com/openstreetmap/operations/issues/565
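For context, an MBTiles file is just a SQLite database with a well-known tiles table, which is why serving it straight off S3 with ranged reads is attractive; a minimal local read looks like this (path and coordinates are placeholders, and note the TMS row flip):

  # Sketch: MBTiles is just SQLite with a "tiles" table, so the same ranged
  # GET trick applies without converting to PMTiles (path/coords are examples).
  import sqlite3

  con = sqlite3.connect("tiles.mbtiles")
  z, x, y = 14, 8185, 5448          # XYZ tile coordinates
  tms_y = (2 ** z - 1) - y          # MBTiles stores rows in TMS order
  row = con.execute(
      "SELECT tile_data FROM tiles "
      "WHERE zoom_level=? AND tile_column=? AND tile_row=?",
      (z, x, tms_y),
  ).fetchone()
  tile = row[0] if row else None    # PNG or gzipped PBF blob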
I admire people for their inventiveness; however, surely attempting to square-peg-round-hole something like this into AWS S3 is only going to end up biting them in the backside?

S3 is delivered over a stateless protocol (HTTP) and AWS makes no promises about any given request being fulfilled (the user is merely invited to retry). The safety of your data is then further dependent on what storage tier you chose.

There just seem to be so many footgun opportunities here. Why not just use the right tool for the job? (i.e. a hosted DB or compute).
This is nice, but before you go to the trouble of using it, consider just putting your CSV, JSON, or Parquet data file in S3 and querying it with S3 Select: https://docs.aws.amazon.com/AmazonS3/latest/userguide/selecting-content-from-objects.html

I haven't tried this myself, but Parquet + S3 Select should be especially good for analytical queries.
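For reference, a minimal boto3 sketch of an S3 Select call against a CSV with a header row (bucket, key, and column names are placeholders):

  # Sketch: query a CSV object in place with S3 Select (bucket, key, and
  # column names are placeholders).
  import boto3

  s3 = boto3.client("s3")
  resp = s3.select_object_content(
      Bucket="my-bucket",
      Key="cities.csv",
      ExpressionType="SQL",
      Expression="SELECT s.city FROM s3object s "
                 "WHERE CAST(s.population AS INT) > 1000000",
      InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
      OutputSerialization={"CSV": {}},
  )
  for event in resp["Payload"]:
      if "Records" in event:
          print(event["Records"]["Payload"].decode(), end="")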
If I need to query the DB repeatedly (N times per second), isn't it more efficient (lower latency and AWS cost) to just download the DB and run queries locally?
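For comparison, the whole-file approach is just a couple of calls (bucket, key, and table names are placeholders):

  # Sketch: one GET for the whole file, then repeated local queries cost no
  # further S3 requests (bucket, key, and table are placeholders).
  import sqlite3
  import boto3

  boto3.client("s3").download_file("my-bucket", "mydb.sqlite3", "/tmp/mydb.sqlite3")
  con = sqlite3.connect("/tmp/mydb.sqlite3")
  print(con.execute("SELECT count(*) FROM items").fetchone()[0])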
I was thinking it would be nice if there were a generic VFS library for C that offered stdio-like APIs (fread, fprintf, etc.) in order to do tricks like this beyond SQLite, without having to make major changes porting existing codebases over to different APIs.

It occurs to me that because a FILE * is an opaque pointer, it ought to be possible for a libc to provide this functionality: the FILE struct could contain function pointers to the underlying implementations for writing, flushing, etc.

This ought to be backwards compatible with existing FILE *-returning APIs (fopen, fdopen, popen), while allowing new backing stores without having to rewrite (or even recompile) existing code.
I read of someone using AWS EFS for SQLite; EFS provides NFS functionality, if I recall correctly.

However, I like the idea of using a local store with Litestream replicating to S3, and updating to the latest DB data from S3 when using my application on another device or computer. Sort of like GitHub.

Tailscale was looking into this option, instead of Postgres and other DBs.
<a href="https://tailscale.com/blog/database-for-2022/" rel="nofollow">https://tailscale.com/blog/database-for-2022/</a><p>Cheers
I implemented a similar concept for Python using APSW and boto3:

https://gist.github.com/Q726kbXuN/56a095d0828c69625fa44c23118e5221

It's not really useful for deep dives into data, though it has gotten me out of a few jams that would have otherwise required a real database just to pull out a value or two from a large dataset.
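The core call behind that kind of APSW + boto3 glue is presumably a ranged GetObject; a hedged sketch (bucket and key are placeholders):

  # Sketch: the ranged read an APSW VFS's xRead() presumably maps to with
  # boto3 (bucket and key are placeholders).
  import boto3

  s3 = boto3.client("s3")

  def read_range(bucket, key, offset, amount):
      resp = s3.get_object(Bucket=bucket, Key=key,
                           Range=f"bytes={offset}-{offset + amount - 1}")
      return resp["Body"].read()

  first_page = read_range("my-bucket", "big.sqlite3", 0, 4096)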
Nifty, reminds me of https://github.com/backtrace-labs/verneuil or possibly https://litestream.io/
A comparison to S3 Select would make sense, since S3 Select also lets you leverage range scans, use compact formats like compressed Parquet, and would presumably have a similar fundamental cost (i.e. GET requests + data transfer).