Hello HN!

I built stanchion to scratch my own itch. I have personal applications running on SQLite where I want to store log and metric data (basically forever) but don't want to add complexity by bringing in a new database or a whole separate server. Some of these applications run on a Raspberry Pi, where storage and compute are limited, but I still want data warehouse-like capabilities.

I envision stanchion being used in similar scenarios: on phones and resource-limited devices, in applications that are already using SQLite. I know there are alternatives like DuckDB (which is very cool), but I want stanchion to be "good enough" that it is useful without having to add a whole new database technology.

If you think stanchion may be a good fit for your use case and you are interested in contributing, please test it and provide feedback by opening issues for any bugs, difficulties, or missing features you would need! Ideas are also welcome in this thread or as a GitHub issue. Of course, stars are always appreciated as well. The CONTRIBUTING doc in the repository has more details.

- Dan
This is an awesome project. I love SQLite extensions and I think they have a ton of uses. Giant props to Dan here; I haven't seen many SQLite extensions written in Zig, and I'm learning a ton reading through the source code.

The column-oriented data is stored in large BLOBs inside regular SQLite tables. It uses the SQLite incremental BLOB I/O API [0] to incrementally read/write data in a column-oriented way.

However, this project (and other SQLite extensions) will eventually hit a limit in SQLite's virtual table API. When you create a virtual table, you can perform a number of optimizations on queries. For example, SQLite will tell your virtual table implementation which WHERE clauses apply to the virtual table, any ORDER BYs, which columns are SELECT'ed, and other limited information. This allows extension developers to do things like predicate and projection pushdown to make queries faster.

Unfortunately, it doesn't offer many ways to make analytical queries faster. For example, no matter what you do, a `SELECT COUNT(*) FROM my_vtab` will always iterate through every single row in your virtual table to determine a count. There's no "shortcut" to provide top-level counts. Same with other aggregate functions like SUM() or AVG(): SQLite will perform a full scan and do the calculation itself (see the sketch below).

So for this project, while column-oriented datasets could make analytical queries like that much faster, the SQLite API does limit you quite a bit. I'm sure there are workarounds (custom UDFs or exposing other query systems), but they would be hefty to add.

That being said, I still love this project! I'd really love to see if there's any size benefit to this, and I will definitely contribute more when I get a chance. Great job, Dan!

[0] https://www.sqlite.org/c3ref/blob_open.html
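You can see the full-scan behavior for yourself with any virtual table. Here's a minimal sketch using the built-in FTS5 module (the table and data are made up for illustration, and the exact plan text varies by SQLite version):

```sql
-- Any virtual table works for this demo; FTS5 ships with most SQLite builds.
CREATE VIRTUAL TABLE docs USING fts5(body);
INSERT INTO docs (body) VALUES ('hello'), ('world');

-- SQLite cannot push the aggregate down into the virtual table, so the
-- plan reports a scan of every row, something like:
--   SCAN docs VIRTUAL TABLE INDEX 0:
EXPLAIN QUERY PLAN SELECT COUNT(*) FROM docs;
```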
Note that DuckDB can ingest SQLite tables. If you need OLAP on your SQLite system today, instead of whenever this is stable, consider pairing it with DuckDB. In the simplest case, if you don't need amazing latency and have small data (<1 GB), I would copy the whole table from SQLite into an in-memory or temp-file DuckDB, do your OLAP queries, then throw it away. For larger datasets, you can incrementally replicate the SQLite table into DuckDB by adding a logical clock column (or something similar) to the SQLite table, and then copying rows where logical_clock > last_duckdb_change before running your next DuckDB query (see the sketch below).

We're currently doing a bake-off between a few databases for small datasets (<10M rows) with dynamic schemas, and we have pretty comparable read latencies between DuckDB and SQLite for our workflow, so you might be able to get away with switching wholesale from SQLite to DuckDB if you don't do a lot of updates.
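Here's a rough sketch of that incremental replication, run from inside DuckDB using its sqlite extension; the table and column names are hypothetical:

```sql
-- The sqlite extension provides sqlite_scan(file, table).
INSTALL sqlite;
LOAD sqlite;

-- One-time setup: copy the schema only (LIMIT 0 brings over columns, no rows).
CREATE TABLE metrics AS
SELECT * FROM sqlite_scan('app.db', 'metrics') LIMIT 0;

-- Before each DuckDB query: pull only the rows we haven't seen yet.
INSERT INTO metrics
SELECT *
FROM sqlite_scan('app.db', 'metrics')
WHERE logical_clock > (SELECT coalesce(max(logical_clock), 0) FROM metrics);
```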
Interesting project! Thank you for open-sourcing and sharing it. I agree that local and embedded analytics are an increasing trend; I see it too.

A couple of questions:

* I'm curious what the difficulties were in the implementation. I suspect it is quite a challenge to implement this support within the current SQLite architecture, and I would be curious to know which parts were tricky and what design trade-offs you were faced with.

* Aside from ease of use (install an extension, no need for a separate analytical database system), I wonder if there are additional benefits users can anticipate from a single-system architecture vs. running an embedded OLAP store like DuckDB or clickhouse-local / chDB side-by-side with SQLite. Do you anticipate performance or resource-efficiency gains, for instance?

* I am also curious what the main difficulty with bringing in a separate analytical database is, assuming it natively integrates with SQLite. I may be biased, but I doubt anything can approach the performance of native column-oriented systems, so I'm curious what the tipping point might be for using this extension vs. using an embedded OLAP store in practice.

Btw, I would love for you or someone in the community to benchmark Stanchion on ClickBench and submit results! (https://github.com/ClickHouse/ClickBench/)

Disclaimer: I work on ClickHouse.
It seems like there are a lot of extensions being built for SQLite. I would like to use these extensions, but I am skeptical about their support over time. I like SQLite for how freakin' stable it is. How do people feel about SQLite extensions?
This is great. I love using DuckDB to do initial analytics on new datasets we get at work. The issue I run into a lot is that our sources can give us data containing non-UTF-8 characters, which DuckDB won't read, so I spend a fair bit of time just trying to find the few non-UTF-8 characters in a multi-GB file.

SQLite does not seem to care about the bogus characters, so I would be really interested in using Stanchion and SQLite to see if it speeds up my exploration.

Also, I do not know how I had never heard about SQLite extensions before now. I feel like I just found out that narwhals are actually real.
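One possible workaround on the DuckDB side, assuming a reasonably recent version: the CSV reader's ignore_errors option skips rows the parser rejects, which as far as I can tell includes lines with invalid UTF-8, so you can load the clean rows first and hunt down the bad bytes separately. The file name here is hypothetical:

```sql
-- Load everything the parser accepts; rejected lines are silently skipped.
SELECT *
FROM read_csv('dirty.csv', ignore_errors = true);
```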
This is particularly interesting to me for Android/iOS. I can't even picture the use case where there'd be enough data on the device for the row-based format to be a bottleneck, but maybe some case that involves many, many aggregations.
If you are into alternative storage engines for SQLite, there is also an LSM (log-structured merge-tree) extension in the main repository that is neither announced nor documented but seems to work. It's based on the SQLite 4 project.

https://github.com/sqlite/sqlite/tree/master/ext/lsm1

https://www.charlesleifer.com/blog/lsm-key-value-storage-in-sqlite3/
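For the curious, usage looks roughly like this from the sqlite3 shell, going by the header comments in ext/lsm1/lsm_vtab.c. Treat the argument layout (LSM file, key column name, key type, value columns) as an assumption and check the source for your SQLite version:

```sql
-- Assumes you compiled ext/lsm1 into a loadable library named lsm1.
.load ./lsm1

-- Arguments (per lsm_vtab.c): LSM file, key column, key type, value column(s).
CREATE VIRTUAL TABLE kv USING lsm1('kv.lsm', key, TEXT, value);

INSERT INTO kv (key, value) VALUES ('greeting', 'hello');
SELECT value FROM kv WHERE key = 'greeting';
```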
This sounds like an excellent project, and I hope it continues to a production release.

I do have one qualm, though. SQLite is written in C, and if I were writing plugins for it, I would seriously consider myself bound to using C for them, regardless of the merits of the language. I can easily imagine trying to build an extended SQLite library where one plugin is written in Zig, another in Rust, and perhaps a third in Hare. A fourth plugin might be written in C but built using Meson. Yet another plugin written in C is built with Bazel. And here I come, writing my plugin in Chibi Scheme!

Eventually, the toolchain needed to build the library and all its plugins becomes unmanageable.

I would strongly recommend that people who write plugins for programs or libraries give serious consideration to using the programming language(s) and build system(s) of the parent software.
Is this really an important feature for DB users here?
I built my own hobby database system that is column-oriented (using the metadata tagging system I invented for a file system replacement I built). It does lightning-fast analytics and even outperforms SQLite in a bunch of other operations. I posted a benchmark video on my YouTube channel and made a free beta download available, but the response has been lukewarm at best.

https://YouTube.com/@didgetsdemos
<a href="https://www.didgets.com" rel="nofollow">https://www.didgets.com</a>
Can you ELI5 how you made this possible? I see the README asking to download a binary and then also mentioning a SQLite extension. I have never had experience with SQLite extensions, so I'm not quite sure how they work. I would appreciate an explanation of how this solution works under the hood.
The "Data Storage Internals" section[1] of the README sounds to me like it has its own column-oriented format for these tables, at least that's how I'm reading the part about segments. Is that the case? If so, have you tried using Apache Arrow or Parquet to see how they compare?<p>[1] <a href="https://github.com/dgllghr/stanchion#data-storage-internals">https://github.com/dgllghr/stanchion#data-storage-internals</a>
Question: why did you choose LGPL-3.0? For many, one of the most attractive features of SQLite is its license (or should I say, lack thereof).

I realise some people view public domain as legally problematic. I think the best answer for that is public-domain-equivalent licenses such as 0BSD [0] or MIT-0 [1]: technically still copyrighted, but effectively not, since they let you do *anything* under zero conditions. (There are other, possibly more well-known options such as Unlicense or WTFPL or CC0; however, those tend to upset lawyers more than 0BSD and MIT-0 do.)

Of course, it is your work, and you are free to license it however you like. Still, some potential users are going to be put off by the licensing.

[0] https://opensource.org/license/0bsd/

[1] https://opensource.org/license/mit-0/