InfluxDB creator here. I've actually been working on this one myself and am excited to answer any questions. The project is InfluxDB IOx (short for iron oxide, pronounced eye-ox). Among other things, it's an in-memory columnar database with object storage as the persistence layer.
InfluxData is arguably playing catch-up with Thanos, Cortex, and other scale-out Prometheus backends for the metrics use case. Given that, I wonder why they decided to write a new storage backend from scratch instead of building on the work Thanos and Cortex have done. Those two competing projects successfully share a lot of code that lets all data be stored in object storage like S3.

https://grafana.com/blog/2020/07/29/how-blocks-storage-in-cortex-reduces-operational-complexity-for-running-prometheus-at-massive-scale/
What a ride. You're close to releasing Influx 2.0 without a clear migration strategy for your customers, and then you think it's a good idea to announce yet another storage rewrite? Why should customers stick with you when you have a track record of shipping half-baked software, rewriting it, and leaving people out in the cold?
I'm using Apache Arrow to store CSV-like data and it works amazingly well.

The datasets I work with contain a few billion records with 5-10 fields each, almost all 64-bit longs. I originally started with CSV and soon switched to a binary format with fixed-size records (mmap'd), which gave great performance improvements, but the flexibility, the size gains from columnar compression, and the much better performance of Arrow for queries that span a single column (or a small number of them) won me over.

For anyone who has to process even a few million records locally, I would highly recommend it.
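For a concrete feel of what that looks like, here is a minimal sketch with the Rust `arrow` crate (column names and values are invented for illustration; the same idea applies in the other Arrow implementations). Each field is stored as a contiguous array, so an aggregate over one column never touches the others.

```rust
use std::sync::Arc;

use arrow::array::Int64Array;
use arrow::compute;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

fn main() -> Result<(), arrow::error::ArrowError> {
    // Two 64-bit columns, mirroring the "mostly i64 fields" case above.
    let schema = Arc::new(Schema::new(vec![
        Field::new("ts", DataType::Int64, false),
        Field::new("value", DataType::Int64, false),
    ]));

    // Columnar layout: each field is its own contiguous buffer, not a row struct.
    let ts = Int64Array::from(vec![1_600_000_000i64, 1_600_000_001, 1_600_000_002]);
    let value = Int64Array::from(vec![10i64, 20, 30]);

    let batch = RecordBatch::try_new(schema, vec![Arc::new(ts), Arc::new(value)])?;

    // A query that only needs "value" only scans that column's buffer.
    let value_col = batch
        .column(1)
        .as_any()
        .downcast_ref::<Int64Array>()
        .expect("value column is Int64");
    println!("sum(value) = {:?}", compute::sum(value_col));

    Ok(())
}
```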
"Columnar databases aren’t new, so why are we building yet another one? We weren’t able to find one in open source that was optimized for time series."<p>This is the direction that QuestDB (www.questdb.io) has taken: columnar database, partitions by time and open source (apache 2.0). It is written is zero-GC java and c++, leveraging SIMD instructions. The live demo has been shown to HN recently, with sub second queries for 1.6 billion rows: <a href="https://news.ycombinator.com/item?id=23616878" rel="nofollow">https://news.ycombinator.com/item?id=23616878</a><p>NB: I am a co-founder of questdb.
Another SQL engine on the data lake that heavily uses Arrow is Dremio.

https://www.dremio.com/webinars/apache-arrow-calcite-parquet-relational-cache/

https://github.com/dremio/dremio-oss

If you have Parquet on S3, using an engine like Dremio (or any engine based on Arrow) can give you some impressive performance; see the DataFusion sketch after the links below. Key innovations in OSS data analytics / data lakes:

Arrow - columnar in-memory format;
Gandiva - LLVM-based execution kernel;
Arrow Flight - wire protocol based on Arrow;
Project Nessie - a Git-like workflow for data lakes

https://arrow.apache.org/
https://arrow.apache.org/docs/format/Flight.html
https://arrow.apache.org/blog/2018/12/05/gandiva-donation/
https://github.com/projectnessie/nessie
I didn't see it in the post, but a huge amount of this is due to Andy Grove's work on the Rust implementations of Apache Arrow and DataFusion.

I imagine he's happy where he is, but I hope there's some opportunity for InfluxDB to give credit and support for his great work.
How does InfluxDB compare to TimescaleDB? My understanding is that the use case is pretty similar (time series/metrics), are they good at different things?
There is already a general-purpose, work-in-progress OLAP project written in Rust.

https://tensorbase.io/

1. TensorBase is highly hackable. If you know Rust and C, you can control everything. This is obviously not the case for Apache Arrow and DataFusion (built on top of Arrow).

2. TensorBase uses whole-stage JIT optimization, which is faster (in complex cases possibly hugely so) than what Gandiva does. An expression-based compute kernel is far from providing top performance for an OLAP/big-data system.

3. TensorBase keeps some kinds of OLTP in mind (although at this early stage it is still OLAP-only). Users don't really hold separate OLTP and OLAP viewpoints; they just want all their queries to be as fast as possible.

4. TensorBase is APL v2 licensed. Enjoy hacking on it yourself!

PS: A recent write-up about TensorBase (including comparisons with some other query engines and projects written in Rust) can be found in this presentation: https://tensorbase.io/2020/11/08/rustfest2020.html

Disclaimer: I am the author of TensorBase.
Just built InfluxDB IOx from source [1] and compared data ingestion performance against VictoriaMetrics using the Billy tool [2] on a laptop with an Intel i5-8265U CPU (4 cores) and 32GB RAM. Results for 1M measurements are as follows:

- InfluxDB IOx: 600K rows/sec

- VictoriaMetrics: 4M rows/sec

I.e. VictoriaMetrics outperforms InfluxDB IOx by more than 6x in this benchmark. I hope InfluxDB IOx performance will improve over time, since it is written in Rust.

[1] https://github.com/influxdata/influxdb_iox

[2] https://github.com/VictoriaMetrics/billy
The new direction is really promising.
However, we at ScyllaDB (/me am a co-founder) already meet most of the requirements. We use C++20 with an advanced shard-per-core architecture, we have an open file format, and you can easily import/export the files. One can use Scylla as a general-purpose DB and also run KairosDB on top for time-series-specific needs.

Recently we had an MSc project to add Parquet support, which is a very good direction; couldn't agree more.
How does a DataFusion-based engine like this compare to timely/differential dataflow (Naiad)?

PS: I'm a rookie in this whole domain, so any pointers would be really helpful.
Very cool! I'm curious how far Influx will move toward being a general-purpose columnar database system outside of typical time-series workloads, i.e. a general-purpose OLAP DB for analytical "data science" workloads.

Will there be any type of transactional guarantees (ACID) using MVCC or similar?

Is the execution engine vectorised?
Key takeaway:

> As an added bonus, within the Rust set of Apache Arrow tools is DataFusion, a Rust native SQL query engine for Apache Arrow. Given that we’re building with DataFusion as the core, this means that InfluxDB IOx will support a subset of SQL out of the box
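To make the "SQL over Arrow" point concrete, here is a minimal sketch of generic DataFusion usage (not the IOx API; table and column names are made up): register an in-memory Arrow RecordBatch as a table and query it with plain SQL.

```rust
use std::sync::Arc;

use datafusion::arrow::array::{Float64Array, StringArray};
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::datasource::MemTable;
use datafusion::error::Result;
use datafusion::prelude::SessionContext;

#[tokio::main]
async fn main() -> Result<()> {
    // An in-memory Arrow batch standing in for data the engine already holds.
    let schema = Arc::new(Schema::new(vec![
        Field::new("host", DataType::Utf8, false),
        Field::new("usage", DataType::Float64, false),
    ]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![
            Arc::new(StringArray::from(vec!["a", "a", "b"])),
            Arc::new(Float64Array::from(vec![0.5, 0.7, 0.9])),
        ],
    )?;

    // Expose the batch to the SQL engine as a table.
    let table = MemTable::try_new(schema, vec![vec![batch]])?;
    let ctx = SessionContext::new();
    ctx.register_table("cpu", Arc::new(table))?;

    // Plain SQL over Arrow data, no storage layer involved.
    let df = ctx.sql("SELECT host, max(usage) FROM cpu GROUP BY host").await?;
    df.show().await?;

    Ok(())
}
```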