This is really neat. It looks like it's using a similar trick to https://phiresky.github.io/blog/2021/hosting-sqlite-databases-on-github-pages/ from last year - querying SQLite databases hosted as static files using the HTTP Range header to fetch back just the pages from the database file that are needed by a specific query.
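For anyone curious what the trick looks like on the wire, here's a minimal sketch (the URL is a placeholder): a single ranged GET pulls down just the 100-byte SQLite header, which is enough to read the page size stored big-endian at offset 16.

  # Sketch: read just the 100-byte SQLite header from a statically hosted
  # database file with an HTTP Range request (the URL is a placeholder).
  import urllib.request

  url = "https://example.com/static/mydb.sqlite3"
  req = urllib.request.Request(url, headers={"Range": "bytes=0-99"})
  with urllib.request.urlopen(req) as resp:
      header = resp.read()

  assert header[:16] == b"SQLite format 3\x00"
  # The page size is a 2-byte big-endian value at offset 16 (1 means 65536).
  page_size = int.from_bytes(header[16:18], "big")
  page_size = 65536 if page_size == 1 else page_size
  print("page size:", page_size)
  # A VFS built on this idea issues one such ranged GET per page it needs.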
I made a similar SQLite VFS in Go[1]. I really love the VFS API. HTTP range queries are about the mildest thing you can do with it. The VFS implementations that use non-traditional backing stores are more fun. Like, say, using DynamoDB as a read/write backing store for your SQLite database[2].

[1]: https://github.com/psanford/sqlite3vfshttp

[2]: https://github.com/psanford/donutdb
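For a feel of the hook points involved, here's a rough, untested Python sketch of the API's shape using APSW's VFS wrapper (the class names are made up; the passthrough xRead is where an HTTP-range or DynamoDB backend would plug in):

  # Rough, untested sketch of the shape of APSW's VFS API; the class names are
  # made up, and the passthrough xRead is the hook where an HTTP-range or
  # DynamoDB backend would go.
  import apsw

  class RangeVFS(apsw.VFS):
      def __init__(self, vfsname="rangevfs", basevfs=""):
          self.base = basevfs
          apsw.VFS.__init__(self, vfsname, self.base)

      def xOpen(self, name, flags):
          return RangeVFSFile(self.base, name, flags)

  class RangeVFSFile(apsw.VFSFile):
      def __init__(self, basevfs, name, flags):
          apsw.VFSFile.__init__(self, basevfs, name, flags)

      def xRead(self, amount, offset):
          # A read-only HTTP backend would issue "Range: bytes=offset-..." here
          # instead of delegating to the underlying local file.
          return super().xRead(amount, offset)

  # Usage: RangeVFS(); apsw.Connection("mydb.sqlite3", vfs="rangevfs")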
Something really neat about SQLite is how self-contained the database file is. You can take, say, a large book or log output and copy it into a table with full-text search. Optimize the DB (VACUUM) to get rid of empty space, then ship that out to S3 or something. Now you can fetch it down and have indexed and FTS access to the contents.

It's a cool way to store data, with indexes into it, in a way that makes it as easy to use as copying around and opening a file; no need for dumps, restores, or server processes.
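A minimal sketch of that workflow, assuming an SQLite build with FTS5 and boto3 (file, bucket, and key names are placeholders):

  # Sketch: build a self-contained, full-text-searchable SQLite file and ship
  # it to S3 (assumes an FTS5-enabled build; file/bucket/key are placeholders).
  import sqlite3
  import boto3

  con = sqlite3.connect("book.sqlite3")
  con.execute("CREATE VIRTUAL TABLE pages USING fts5(body)")
  with open("book.txt") as f:
      con.executemany("INSERT INTO pages (body) VALUES (?)",
                      ((line,) for line in f))
  con.commit()
  con.execute("VACUUM")  # compact the file before shipping it
  con.close()

  boto3.client("s3").upload_file("book.sqlite3", "my-bucket", "book.sqlite3")

  # Later, anywhere: fetch (or range-read) the file and query it directly,
  # e.g. SELECT rowid FROM pages WHERE pages MATCH 'some phrase';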
This is cool. It would be interesting to see some stats, like how many separate requests are needed for a single SELECT statement, since S3 charges per GET even on a small range, I believe. Obviously that would depend on the query and various DB settings like page_size. Also wondering if this does aggressive local caching for the same reason.
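One knob that directly affects the GET count is the page size, which can be raised before compacting the file; a small sketch (how much it actually helps depends on the query and indexes):

  # Sketch: raise the page size before VACUUM so each ranged GET returns more
  # data per request (whether that is a net win depends on the workload).
  import sqlite3

  con = sqlite3.connect("mydb.sqlite3")
  con.execute("PRAGMA page_size = 65536")  # maximum page size
  con.execute("VACUUM")                    # rebuilds the file at the new size
  print(con.execute("PRAGMA page_size").fetchone()[0])
  con.close()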
This is cool, I'll probably end up using this (we store sqlite snapshots in S3).

It'd be very handy if this were packaged as a CLI that worked like `sqlite3`, so you could get a SQL REPL by running:

  s3sqlite mybucket/mydb.sqlite3
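A rough sketch of what such a wrapper could look like; everything here is hypothetical, and for simplicity it just downloads the whole file rather than going through the S3-backed VFS:

  #!/usr/bin/env python3
  # Hypothetical sketch of an "s3sqlite BUCKET/KEY" style SQL REPL wrapper.
  # For simplicity it downloads the whole file; the real thing would open it
  # through the S3-backed VFS instead.
  import os
  import sqlite3
  import sys
  import tempfile

  import boto3

  def open_s3_connection(bucket, key):
      fd, path = tempfile.mkstemp(suffix=".sqlite3")
      os.close(fd)
      boto3.client("s3").download_file(bucket, key, path)
      return sqlite3.connect(path)

  def main():
      bucket, _, key = sys.argv[1].partition("/")
      con = open_s3_connection(bucket, key)
      while True:
          try:
              sql = input("sqlite> ")
          except EOFError:
              break
          for row in con.execute(sql):
              print(row)

  if __name__ == "__main__":
      main()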
I feel like people are pushing SQLite beyond its limits with this stuff. If you're going to do something like this, wouldn't just a normal client/server option like PostgreSQL be way more performant?

I think SQLite is fantastic, but this just smacks of scope creep.
This would be quite convenient when dealing with MBTiles files hosted on S3 or similar. There are efforts such as PMTiles[1] that let us use HTTP Range queries to retrieve tiles directly from S3, but we still have to convert MBTiles to PMTiles first.

More importantly, OSM is considering providing MBTiles directly (besides PBF)[2].

[1]: https://protomaps.com/docs/pmtiles/

[2]: https://github.com/openstreetmap/operations/issues/565
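For context, an MBTiles file is just a SQLite database with a well-known tiles table, which is why serving it straight off S3 with ranged reads is attractive; a minimal local read looks like this (path and coordinates are placeholders, and note the TMS row flip):

  # Sketch: MBTiles is just SQLite with a "tiles" table, so the same ranged
  # GET trick applies without converting to PMTiles (path/coords are examples).
  import sqlite3

  con = sqlite3.connect("tiles.mbtiles")
  z, x, y = 14, 8185, 5448          # XYZ tile coordinates
  tms_y = (2 ** z - 1) - y          # MBTiles stores rows in TMS order
  row = con.execute(
      "SELECT tile_data FROM tiles "
      "WHERE zoom_level=? AND tile_column=? AND tile_row=?",
      (z, x, tms_y),
  ).fetchone()
  tile = row[0] if row else None    # PNG or gzipped PBF blob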
I admire people for their inventiveness; however, surely attempting to square-peg-round-hole something like this into AWS S3 is only going to end up biting them in the backside?

S3 is delivered over a stateless protocol (HTTP) and AWS makes no promises about any given request being fulfilled (the user is merely invited to retry). The safety of your data is then further dependent on what storage tier you chose.

There just seem to be so many footgun opportunities here. Why not just use the right tool for the job? (i.e. a hosted DB or compute).
This is nice, but before you go to the trouble of using it, consider just putting your CSV, JSON, or Parquet data file in S3 and querying it with S3 Select: https://docs.aws.amazon.com/AmazonS3/latest/userguide/selecting-content-from-objects.html

I haven't tried this myself, but Parquet + S3 Select should be especially good for analytical queries.
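For reference, a minimal boto3 sketch of an S3 Select call against a CSV with a header row (bucket, key, and column names are placeholders):

  # Sketch: query a CSV object in place with S3 Select (bucket, key, and
  # column names are placeholders).
  import boto3

  s3 = boto3.client("s3")
  resp = s3.select_object_content(
      Bucket="my-bucket",
      Key="cities.csv",
      ExpressionType="SQL",
      Expression="SELECT s.city FROM s3object s "
                 "WHERE CAST(s.population AS INT) > 1000000",
      InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
      OutputSerialization={"CSV": {}},
  )
  for event in resp["Payload"]:
      if "Records" in event:
          print(event["Records"]["Payload"].decode(), end="")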
If I need to query the DB repeatedly (N times per second), isn't it more efficient (lower latency and AWS cost) to just download the DB and run queries locally?
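For comparison, the whole-file approach is just a couple of calls (bucket, key, and table names are placeholders):

  # Sketch: one GET for the whole file, then repeated local queries cost no
  # further S3 requests (bucket, key, and table are placeholders).
  import sqlite3
  import boto3

  boto3.client("s3").download_file("my-bucket", "mydb.sqlite3", "/tmp/mydb.sqlite3")
  con = sqlite3.connect("/tmp/mydb.sqlite3")
  print(con.execute("SELECT count(*) FROM items").fetchone()[0])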
I was thinking it would be nice if there were a generic VFS library for C that offered stdio-like APIs (fread, fprintf, etc.) in order to do tricks like this beyond SQLite, without having to make major changes porting existing codebases over to different APIs.

It occurs to me that because a FILE * is an opaque pointer, it ought to be possible for a libc to provide this functionality: the FILE struct could contain function pointers to the underlying implementations for writing, flushing, etc.

This ought to be backwards compatible with existing FILE *-returning APIs (fopen, fdopen, popen), while allowing new backing stores without having to rewrite (or even recompile) existing code.
I read of someone using AWS EFS for SQLite; EFS provides NFS functionality, if I recall correctly.

However, I like the idea of using a local store with Litestream replicating to S3, and updating to the latest DB data from S3 when using my application on another device or computer. Sort of like GitHub.

Tailscale was looking into this option, instead of Postgres and other DBs.
<a href="https://tailscale.com/blog/database-for-2022/" rel="nofollow">https://tailscale.com/blog/database-for-2022/</a><p>Cheers
I implemented a similar concept for Python using APSW and boto3:

https://gist.github.com/Q726kbXuN/56a095d0828c69625fa44c23118e5221

It's not really useful for deep dives into data, though it has gotten me out of a few jams that would have otherwise required a real database just to pull out a value or two from a large dataset.
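The core call behind that kind of APSW + boto3 glue is presumably a ranged GetObject; a hedged sketch (bucket and key are placeholders):

  # Sketch: the ranged read an APSW VFS's xRead() presumably maps to with
  # boto3 (bucket and key are placeholders).
  import boto3

  s3 = boto3.client("s3")

  def read_range(bucket, key, offset, amount):
      resp = s3.get_object(Bucket=bucket, Key=key,
                           Range=f"bytes={offset}-{offset + amount - 1}")
      return resp["Body"].read()

  first_page = read_range("my-bucket", "big.sqlite3", 0, 4096)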
Nifty, reminds me of https://github.com/backtrace-labs/verneuil or possibly https://litestream.io/
A comparison to S3 Select would make sense, since S3 Select also lets you leverage range scans, use compact formats like compressed Parquet, and would presumably have a similar fundamental cost (i.e. GET requests + data transfer).