InfluxDB creator here. I've actually been working on this one myself and am excited to answer any questions. The project is InfluxDB IOx (short for iron oxide, pronounced eye-ox). Among other things, it's an in-memory columnar database with object storage as the persistence layer.
InfluxData is arguably playing catch-up with Thanos, Cortex, and other scale-out Prometheus backends for the metrics use case. Given that, I wonder why they decided to write a new storage backend from scratch instead of building on the work Thanos and Cortex have done. Those two competing projects successfully share a lot of code that lets all data be stored in object storage like S3.

https://grafana.com/blog/2020/07/29/how-blocks-storage-in-cortex-reduces-operational-complexity-for-running-prometheus-at-massive-scale/
What a ride. You're close to releasing Influx 2.0 without a clear migration strategy for your customers, and then you think it's a good idea to announce yet another storage rewrite? Why should customers stick with you when you have a track record of shipping half-baked software, rewriting it, and leaving people out in the cold?
I'm using Apache Arrow to store CSV-like data and it works amazingly well.

The datasets I work with contain a few billion records with 5-10 fields each, almost all 64-bit longs. I originally started with CSV and soon switched to a binary format with fixed-size records (mmap'd), which gave great performance improvements, but the flexibility, the size gains from columnar compression, and the much better performance of Arrow for queries that span a single column (or a small number of them) won me over.

For anyone who has to process even a few million records locally, I would highly recommend it.
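For a concrete feel of what that looks like, here is a minimal sketch with the Rust `arrow` crate (column names and values are invented for illustration; the same idea applies in the other Arrow implementations). Each field is stored as a contiguous array, so an aggregate over one column never touches the others.

```rust
use std::sync::Arc;

use arrow::array::Int64Array;
use arrow::compute;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

fn main() -> Result<(), arrow::error::ArrowError> {
    // Two 64-bit columns, mirroring the "mostly i64 fields" case above.
    let schema = Arc::new(Schema::new(vec![
        Field::new("ts", DataType::Int64, false),
        Field::new("value", DataType::Int64, false),
    ]));

    // Columnar layout: each field is its own contiguous buffer, not a row struct.
    let ts = Int64Array::from(vec![1_600_000_000i64, 1_600_000_001, 1_600_000_002]);
    let value = Int64Array::from(vec![10i64, 20, 30]);

    let batch = RecordBatch::try_new(schema, vec![Arc::new(ts), Arc::new(value)])?;

    // A query that only needs "value" only scans that column's buffer.
    let value_col = batch
        .column(1)
        .as_any()
        .downcast_ref::<Int64Array>()
        .expect("value column is Int64");
    println!("sum(value) = {:?}", compute::sum(value_col));

    Ok(())
}
```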
"Columnar databases aren’t new, so why are we building yet another one? We weren’t able to find one in open source that was optimized for time series."<p>This is the direction that QuestDB (www.questdb.io) has taken: columnar database, partitions by time and open source (apache 2.0). It is written is zero-GC java and c++, leveraging SIMD instructions. The live demo has been shown to HN recently, with sub second queries for 1.6 billion rows: <a href="https://news.ycombinator.com/item?id=23616878" rel="nofollow">https://news.ycombinator.com/item?id=23616878</a><p>NB: I am a co-founder of questdb.
Another SQL engine on the data lake that heavily uses Arrow is Dremio.

https://www.dremio.com/webinars/apache-arrow-calcite-parquet-relational-cache/

https://github.com/dremio/dremio-oss

If you have Parquet on S3, using an engine like Dremio (or any engine based on Arrow) can give you some impressive performance; see the DataFusion sketch after the links below. Key innovations in OSS data analytics / data lakes:

Arrow - columnar in-memory format;
Gandiva - LLVM-based execution kernel;
Arrow Flight - wire protocol based on Arrow;
Project Nessie - a Git-like workflow for data lakes

https://arrow.apache.org/
https://arrow.apache.org/docs/format/Flight.html
https://arrow.apache.org/blog/2018/12/05/gandiva-donation/
https://github.com/projectnessie/nessie
I didn't see it in the post, but a huge amount of this is due to Andy Grove's work on the Rust implementations of Apache Arrow and DataFusion.

I imagine he's happy where he is, but I hope there's some opportunity for InfluxDB to give credit and support for his great work.
How does InfluxDB compare to TimescaleDB? My understanding is that the use case is pretty similar (time series/metrics), are they good at different things?
There is already a general-purpose, work-in-progress OLAP project written in Rust.

https://tensorbase.io/

1. TensorBase is highly hackable. If you know Rust and C, you can control everything. This is obviously not the case for Apache Arrow and DataFusion (built on top of Arrow).

2. TensorBase uses whole-stage JIT optimization, which is faster (in complex cases possibly hugely so) than what Gandiva does. An expression-based compute kernel is far from providing top performance for an OLAP/big-data system.

3. TensorBase keeps some kinds of OLTP in mind (although at this early stage it is still OLAP-only). Users don't really hold separate OLTP and OLAP viewpoints; they just want all their queries to be as fast as possible.

4. TensorBase is APL v2 licensed. Enjoy hacking on it yourself!

PS: A recent write-up about TensorBase (including comparisons with some other query engines and projects written in Rust) can be found in this presentation: https://tensorbase.io/2020/11/08/rustfest2020.html

Disclaimer: I am the author of TensorBase.
Just built InfluxDB IOx from source [1] and compared data ingestion performance against VictoriaMetrics using the Billy tool [2] on a laptop with an Intel i5-8265U CPU (4 cores) and 32GB RAM. Results for 1M measurements are as follows:

- InfluxDB IOx: 600K rows/sec

- VictoriaMetrics: 4M rows/sec

I.e. VictoriaMetrics outperforms InfluxDB IOx by more than 6x in this benchmark. I hope InfluxDB IOx performance will improve over time, since it is written in Rust.

[1] https://github.com/influxdata/influxdb_iox

[2] https://github.com/VictoriaMetrics/billy
The new direction is really promising.
However, we at ScyllaDB (/me am a co-founder) already meet most of the requirements. We use C++20 with an advanced shard-per-core architecture, we have an open file format, and you can easily import/export the files. One can use Scylla as a general-purpose DB and also run KairosDB on top for time-series-specific needs.

Recently we had an MSc project to add Parquet support, which is a very good direction; couldn't agree more.
How does a DataFusion-based engine like this compare to timely/differential dataflow (Naiad)?

PS: I'm a rookie in this whole domain, so any pointers would be really helpful.
Very cool! I'm curious how far Influx will move toward being a general-purpose columnar database system outside of typical time-series workloads, i.e. a general-purpose OLAP DB for analytical "data science" workloads.

Will there be any type of transactional guarantees (ACID) using MVCC or similar?

Is the execution engine vectorised?
Key takeaway:

> As an added bonus, within the Rust set of Apache Arrow tools is DataFusion, a Rust native SQL query engine for Apache Arrow. Given that we’re building with DataFusion as the core, this means that InfluxDB IOx will support a subset of SQL out of the box
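To make the "SQL over Arrow" point concrete, here is a minimal sketch of generic DataFusion usage (not the IOx API; table and column names are made up): register an in-memory Arrow RecordBatch as a table and query it with plain SQL.

```rust
use std::sync::Arc;

use datafusion::arrow::array::{Float64Array, StringArray};
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::datasource::MemTable;
use datafusion::error::Result;
use datafusion::prelude::SessionContext;

#[tokio::main]
async fn main() -> Result<()> {
    // An in-memory Arrow batch standing in for data the engine already holds.
    let schema = Arc::new(Schema::new(vec![
        Field::new("host", DataType::Utf8, false),
        Field::new("usage", DataType::Float64, false),
    ]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![
            Arc::new(StringArray::from(vec!["a", "a", "b"])),
            Arc::new(Float64Array::from(vec![0.5, 0.7, 0.9])),
        ],
    )?;

    // Expose the batch to the SQL engine as a table.
    let table = MemTable::try_new(schema, vec![vec![batch]])?;
    let ctx = SessionContext::new();
    ctx.register_table("cpu", Arc::new(table))?;

    // Plain SQL over Arrow data, no storage layer involved.
    let df = ctx.sql("SELECT host, max(usage) FROM cpu GROUP BY host").await?;
    df.show().await?;

    Ok(())
}
```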