TechEcho
InfluxDB is betting on Rust and Apache Arrow for next-gen data store

205 points by mhall119 over 4 years ago

16 comments

pauldix over 4 years ago
InfluxDB creator here. I've actually been working on this one myself and am excited to answer any questions. The project is InfluxDB IOx (short for iron oxide, pronounced eye-ox). Among other things, it's an in-memory columnar database with object storage as the persistence layer.
sciurus over 4 years ago
InfluxData is arguably playing catch-up with Thanos, Cortex, and other scale-out Prometheus backends for the metrics use case. Given that, I wonder why they decided to write a new storage backend from scratch instead of building on the work Thanos and Cortex have done. Those two competing projects are successfully sharing a lot of code that allows all data to be stored in object storage like S3.

https://grafana.com/blog/2020/07/29/how-blocks-storage-in-cortex-reduces-operational-complexity-for-running-prometheus-at-massive-scale/
candiddevmike over 4 years ago
What a ride. You're close to releasing Influx 2.0 without a clear migration strategy for your customers, and then you think it's a good idea to announce yet another storage rewrite? Why should customers stick with you guys when you have a track record of shipping half-baked software, rewriting it, and leaving people out in the cold?
thamer over 4 years ago
I'm using Apache Arrow to store CSV-like data and it works amazingly well.

The datasets I work with contain a few billion records with 5-10 fields each, almost all 64-bit longs. I originally started with CSV and soon switched to a binary format with fixed-size records (mmap'd), which gave great performance improvements. But the flexibility, the size gains from columnar compression, and the much greater performance of Arrow for queries that span a single column or a small number of them won me over.

For anyone who has to process even a few million records locally, I would highly recommend it.
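To illustrate why single-column queries favor a columnar layout, here is a minimal sketch in plain Python. It does not use the Arrow API and says nothing about compression; the field names, record shape, and row count are assumptions made up for the example. It packs the same records once as fixed-size rows and once as per-field arrays, then sums one field each way.

```python
import struct
from array import array

# Hypothetical dataset: each record is three 64-bit signed integers.
FIELDS = ("timestamp", "user_id", "value")
RECORD = struct.Struct("<3q")  # one fixed-size row: 3 little-endian int64s

rows = [(1_600_000_000 + i, i % 7, i * 10) for i in range(1000)]

# Row-oriented layout: records packed back to back, like a fixed-size binary file.
row_buf = b"".join(RECORD.pack(*r) for r in rows)

# Column-oriented layout: one contiguous int64 array per field.
columns = {name: array("q", (r[i] for r in rows)) for i, name in enumerate(FIELDS)}


def sum_value_rows(buf):
    """Sum the 'value' field by decoding every whole record."""
    return sum(rec[2] for rec in RECORD.iter_unpack(buf))


def sum_value_columns(cols):
    """Sum the 'value' field by reading only that field's array."""
    return sum(cols["value"])


# Bytes each approach has to read to answer the same single-column query.
row_bytes_touched = len(row_buf)
col_bytes_touched = columns["value"].itemsize * len(columns["value"])
```

With three equally sized fields, the row-oriented scan reads three times as many bytes as the columnar one for the same answer, and the gap widens as records get more fields.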
nhourcard over 4 years ago
"Columnar databases aren’t new, so why are we building yet another one? We weren’t able to find one in open source that was optimized for time series."

This is the direction that QuestDB (www.questdb.io) has taken: a columnar database, partitioned by time, and open source (Apache 2.0). It is written in zero-GC Java and C++, leveraging SIMD instructions. The live demo was shown to HN recently, with sub-second queries over 1.6 billion rows: https://news.ycombinator.com/item?id=23616878

NB: I am a co-founder of QuestDB.
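As a rough sketch of what partitioning a columnar store by time buys you, the plain-Python example below groups rows into per-day partitions so that a time-range query scans only the partitions that overlap the range. It is not QuestDB's or InfluxDB IOx's implementation; the day-sized partitions, the function names, and the sample data are all assumptions for illustration.

```python
from collections import defaultdict

SECONDS_PER_DAY = 86_400


def partition_key(ts):
    """Map a Unix timestamp (seconds) to its day-sized partition."""
    return ts // SECONDS_PER_DAY


def build_partitions(rows):
    """Store (ts, value) rows as per-day partitions, each holding two columns."""
    parts = defaultdict(lambda: {"ts": [], "value": []})
    for ts, value in rows:
        part = parts[partition_key(ts)]
        part["ts"].append(ts)
        part["value"].append(value)
    return parts


def range_sum(parts, start, end):
    """Sum values with start <= ts < end, scanning only overlapping partitions.

    Returns (total, number_of_partitions_scanned).
    """
    total = 0
    scanned = 0
    for day in range(partition_key(start), partition_key(end - 1) + 1):
        part = parts.get(day)  # .get avoids creating empty partitions
        if part is None:
            continue
        scanned += 1
        total += sum(v for t, v in zip(part["ts"], part["value"]) if start <= t < end)
    return total, scanned


# Hypothetical data: ten days of hourly rows, each with value 1.
rows = [(hour * 3600, 1) for hour in range(240)]
parts = build_partitions(rows)

# Query exactly the third day; only one of the ten partitions is scanned.
total, scanned = range_sum(parts, 2 * SECONDS_PER_DAY, 3 * SECONDS_PER_DAY)
```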
gizmodo59 over 4 years ago
Another SQL engine for data lakes that makes heavy use of Arrow is Dremio.

https://www.dremio.com/webinars/apache-arrow-calcite-parquet-relational-cache/

https://github.com/dremio/dremio-oss

If you have Parquet on S3, using an engine like Dremio (or any engine based on Arrow) can give you some impressive performance. Key innovations in OSS for data analytics/data lakes:

- Arrow: columnar in-memory format
- Gandiva: LLVM-based execution kernel
- Arrow Flight: wire protocol based on Arrow
- Project Nessie: a Git-like workflow for data lakes

https://arrow.apache.org/
https://arrow.apache.org/docs/format/Flight.html
https://arrow.apache.org/blog/2018/12/05/gandiva-donation/
https://github.com/projectnessie/nessie
jdub over 4 years ago
I didn't see it in the post, but a huge amount of this is due to Andy Grove's work on Rust implementations of Apache Arrow and DataFusion.

I imagine he's happy where he is, but I hope there's some opportunity for InfluxDB to give credit and support for his great work.
hnarn over 4 years ago
How does InfluxDB compare to TimescaleDB? My understanding is that the use case is pretty similar (time series/metrics). Are they good at different things?
jinmingjian over 4 years ago
There is already a general-purpose, work-in-progress OLAP project written in Rust.

https://tensorbase.io/

1. TensorBase is highly hackable. If you know Rust and C, you can control everything. This is obviously not the case for Apache Arrow and DataFusion (which sits on top of Arrow).

2. TensorBase uses whole-stage JIT optimization, which is faster (in complex cases possibly hugely so) than what Gandiva does. An expression-based computing kernel is far from providing top performance for an OLAP-like big data system.

3. TensorBase keeps some kinds of OLTP in mind (although at this early stage it is still OLAP). Users do not really think in terms of OLTP or OLAP. They just want all their queries to be as fast as possible.

4. TensorBase is now licensed under APL v2. Enjoy hacking on it yourself!

PS: A recent write-up about TensorBase, including comparisons with some other query engines and Rust projects, can be found in this presentation: https://tensorbase.io/2020/11/08/rustfest2020.html

Disclaimer: I am the author of TensorBase.
valyala over 4 years ago
I just built InfluxDB IOx from source [1] and compared its data ingestion performance to VictoriaMetrics using the Billy tool [2] on a laptop with an Intel i5-8265U CPU (4 cores) and 32GB RAM. Results for 1M measurements are the following:

- InfluxDB IOx: 600K rows/sec
- VictoriaMetrics: 4M rows/sec

I.e. VictoriaMetrics outperforms InfluxDB IOx by more than 6x in this benchmark. I hope InfluxDB IOx performance will improve over time, since it is written in Rust.

[1] https://github.com/influxdata/influxdb_iox

[2] https://github.com/VictoriaMetrics/billy
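For reference, the "more than 6x" figure follows directly from the two ingestion rates quoted above. A small sketch of the arithmetic (the function name is made up for the example; the rates are the ones reported in the comment):

```python
def speedup(fast_rows_per_sec, slow_rows_per_sec):
    """Return how many times faster the first ingestion rate is than the second."""
    return fast_rows_per_sec / slow_rows_per_sec


# Ingestion rates as reported in the comment above.
iox_rate = 600_000               # InfluxDB IOx, rows/sec
victoriametrics_rate = 4_000_000  # VictoriaMetrics, rows/sec

ratio = speedup(victoriametrics_rate, iox_rate)  # roughly 6.67
```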
thekozmo over 4 years ago
The new direction is really promising. However, we at ScyllaDB (I am a co-founder) already meet most of the requirements. We use C++20 with an advanced shard-per-core design, and we have an open format for the files, so you can easily import/export them. One can use Scylla as a general DB and also run KairosDB for time-series-specific workloads if needed.

We currently have an MSc project to add Parquet, which is a very good direction. Couldn't agree more.
devj over 4 years ago
How does a DataFusion-based engine like this compare to timely/differential dataflow (Naiad)?

PS: I'm a rookie in this whole domain, so any pointers would be really helpful.
jaymebb over 4 years ago
Very cool! I'm curious how far Influx will move toward being a general-purpose columnar database system outside of typical time series workloads, i.e. becoming more of a general-purpose OLAP DB for analytical "data science" workloads.

Will there be any type of transactional guarantees (ACID) using MVCC or similar?

Is the execution engine vectorised?
est over 4 years ago
Key takeaway:

> As an added bonus, within the Rust set of Apache Arrow tools is DataFusion, a Rust native SQL query engine for Apache Arrow. Given that we’re building with DataFusion as the core, this means that InfluxDB IOx will support a subset of SQL out of the box
la6471 over 4 years ago
What is the difference between Apache Spark and Apache Arrow?
ncmncm over 4 years ago
Layoffs coming, then?

(If you don't know, don't guess.)