
How should you build a high-performance column store for the 2020s?

164 points | by deafcalculus | over 7 years ago

15 comments

manigandham · over 7 years ago
Most of these techniques are already in production:

Microsoft SQL Server has columnstore indexes and can even combine them with its in-memory tables. MemSQL has been doing this for years and v6 is incredibly fast; it also combines in-memory row-stores. ClickHouse is very good if you don't mind more operations work. MariaDB has the ColumnStore storage engine, Postgres has the cstore_fdw extension. Vertica, Greenplum, Druid, etc. EventQL was an interesting project but is abandoned now.

AWS Redshift, Azure SQL Data Warehouse, Snowflake, and Google BigQuery are the hosted options, with BQ being the most advanced thanks to its vertical integration.

If you want to operationalize Apache Arrow today, Dremio is built around it and works similarly to Apache Drill and Spark to run distributed queries and joins across data sources.
lima · over 7 years ago
Yandex's recently open-sourced ClickHouse[1] column store does some of these.

It relies heavily on compression, data locality, and SIMD instructions, and supports external dictionaries for lookups.

[1]: https://clickhouse.yandex/
xoogler_thr · over 7 years ago
This already exists, in Google BigQuery. It uses darn near every trick in the book, and some that aren't in the book. Source: shipped it.
elvinyung · over 7 years ago
I think one interesting project in the near future could be to try to build a column-oriented storage engine that's "good enough" for both OLAP and OLTP workloads.

The main precedent here is Spanner's *Ressi* storage engine, which, according to the most recent paper [1], uses a PAX-like format for on-disk data (blocks are arranged row-oriented, but values within a block are column-oriented, so somewhat like Parquet), combined with a traditional log-structured merge tree for writes and point queries.

[1] https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/46103.pdf
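The PAX idea mentioned above (row-oriented blocks, column-oriented values within each block) can be sketched in a few lines; the block size and sample data here are invented for illustration:

```python
from itertools import islice

def rows_to_pax_blocks(rows, block_size=4):
    """Split a stream of row tuples into PAX-style blocks: each block
    holds at most `block_size` rows, stored as one array per column."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, block_size))
        if not chunk:
            break
        # Transpose the row chunk into per-column "minipages".
        yield [list(col) for col in zip(*chunk)]

rows = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e")]
blocks = list(rows_to_pax_blocks(rows, block_size=4))
# First block holds 4 rows as two column arrays:
#   [[1, 2, 3, 4], ['a', 'b', 'c', 'd']]
```

A scan of one column touches only that column's array inside each block, while a point lookup still finds all of a row's values inside a single block, which is the OLAP/OLTP compromise the comment describes.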
bboreham · over 7 years ago
I wonder if the 2020s column store would outperform kdb, which was written in the 1990s with a UI from the 1950s.
dustingetz · over 7 years ago
The datastore of the 2020s will be designed around an immutable log, because it permits both strong consistency and horizontal scaling (like git).

Once you're both distributed and consistent, the problems today's stores are architected around go away. Your distributed queries can index the immutable log however they like: column-oriented, row-oriented, documents, time-oriented, graphs. Immutability means you can do all of it, *as a library in your application process*.

http://www.datomic.com/ - it's what you get when Facebook's graph datastore has a baby with immutability.
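The architecture described above, one immutable log with many independently derived views, can be sketched minimally; the class and view names are invented for illustration:

```python
class ImmutableLog:
    """Append-only event log; readers derive any index they like from it."""
    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)
        return len(self._events) - 1  # log offset, never reused or mutated

    def replay(self):
        return iter(self._events)

# Two consumers build different views from the same log of (key, value) facts.
log = ImmutableLog()
log.append(("color", "red"))
log.append(("size", 10))
log.append(("color", "blue"))

row_view = list(log.replay())    # time-ordered, row-like view
col_view = {}                    # column-oriented view
for key, value in log.replay():
    col_view.setdefault(key, []).append(value)
# col_view == {"color": ["red", "blue"], "size": [10]}
```

Because the log never changes, both views can be rebuilt at any time in the application process, which is the "index it however you like, as a library" point the comment makes.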
twotwotwo · over 7 years ago
If I had to guess new capabilities chips will add in the 2020s, hardware-accelerated compression or compact encoding is near the top of the list. That could be anything from branch-free instructions to read/write varints to fully self-contained (un)packers you just point at some data and run. I'm *most* interested in something so fast as to be worth considering as a replacement for fast software algos, or for use in places we don't think about compressing at all now, though hardware-accelerated zlib would obviously have applications too.

Some existing stabs in this direction: some Samsung SoCs had a simple "MComp" memory compressor (https://github.com/XileForce/Vindicator-S6-Unified/blob/master/drivers/memory/exynos-mcomp.h), the new Qualcomm Centriq chips use memory compression (https://3s81si1s5ygj3mzby34dq6qf-wpengine.netdna-ssl.com/wp-content/uploads/2017/08/qualcomm-amberwing-memory-compression.jpg), and some IBM POWER CPUs have dedicated circuitry for memory compression (http://ibmsystemsmag.com/aix/administrator/lpar/ame-intro/). There's also hardware zlib offload, like Intel QuickAssist.

I'd expect more of this in the future because 1) space-is-speed is just a fundamental thing we deal with in computing, 2) chips keep getting faster relative to RAM, and 3) you already see lots of special-purpose instructions being added (crypto, strings, fancier vector stuff...) as it gets more expensive to improve general-purpose IPC. Maybe there's some additional value given the arrival of 3D XPoint and (probably) other future NVRAMs (it would help you fit more on them without spending more time compressing than writing), but regardless, the trends seem to point to compression assists being interesting.

One reason I could turn out wrong is if the general-purpose facilities we have make software the best place to write compressors anyway, i.e., fast software packers get so good it becomes difficult to justify hardware assists. General-purpose low- and medium-compression algos like LZ4 and Zstd run pretty fast already, and we have even faster memory compressors (WKdm, Density). Of course, that's on big Intel cores; maybe special-purpose compressor hardware will continue to mostly be more interesting alongside smaller cores.
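As a concrete reference point for what those hypothetical varint instructions would replace, here is the standard LEB128-style software loop (a minimal sketch, not the branch-free form the comment imagines hardware providing):

```python
def varint_encode(n):
    """Encode a non-negative int as LEB128 bytes: 7 data bits per byte,
    continuation bit set on every byte except the last."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def varint_decode(buf):
    """Decode one varint from buf; returns (value, bytes_consumed)."""
    value = shift = 0
    for i, byte in enumerate(buf):
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, i + 1
        shift += 7
    raise ValueError("truncated varint")

assert varint_encode(300) == b"\xac\x02"
assert varint_decode(b"\xac\x02") == (300, 2)
```

The data-dependent branch per byte is exactly what makes this loop a plausible target for a dedicated instruction.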
jaffee · over 7 years ago
Pilosa (https://github.com/pilosa/pilosa), which is mentioned, is actually open source, and a relatively readable Go codebase if anyone is interested in what "an entire data engine on top of bit-vectors" looks like.
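The "data engine on top of bit-vectors" idea boils down to representing each (attribute, value) pair as a bitmap over row IDs, so queries become bitwise operations. A tiny sketch (Pilosa itself uses compressed roaring bitmaps; plain Python ints stand in here, and the class name is invented):

```python
class BitmapIndex:
    """Map each value to a bitmap of row IDs, stored as a Python int."""
    def __init__(self):
        self.bitmaps = {}

    def set(self, value, row_id):
        self.bitmaps[value] = self.bitmaps.get(value, 0) | (1 << row_id)

    def rows(self, value):
        bits, out = self.bitmaps.get(value, 0), []
        while bits:
            low = bits & -bits               # isolate lowest set bit
            out.append(low.bit_length() - 1)  # its row ID
            bits ^= low
        return out

color, size = BitmapIndex(), BitmapIndex()
color.set("red", 0); color.set("red", 2); color.set("blue", 1)
size.set("large", 1); size.set("large", 2)

# "color = red AND size = large" is a single bitwise AND of two bitmaps.
matches = color.bitmaps["red"] & size.bitmaps["large"]  # row 2 only
```

Intersections, unions, and counts over millions of rows reduce to word-wide bit operations, which is why the whole engine can be built on this one primitive.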
gopalv · over 7 years ago
(man, I'd love to go work on this for three years, without worrying about a "customer" or "backwards compatibility")

> That is, if you have N distinct values, you can store them using ceil(log(N)/log(2)) bits

Ideally you don't need the ceil per value: with a low cardinality like 5 distinct items, it looks like you need 3 bits each, but you can store them in 2.4 bits each (just pack 10 values into 24 bits instead of 30).

Getting distinct and repeated values by tearing apart data so that you can use these algorithms is something I could use some papers to refer to.

For instance, here's[1] what we're trying to do with double-encoding loops, but it still suffers from the problem of a car moving from location 0.3 -> 0.2.

[1] - http://bit.ly/2zt70iL
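The fractional-bit packing above works by treating a group of values as digits of a single base-N number: ten base-5 digits need at most ceil(10 * log2(5)) = 24 bits, not 30. A minimal sketch:

```python
def pack_base_n(values, base):
    """Pack values in range(base) into one integer by treating them
    as digits of a base-`base` number (most significant first)."""
    packed = 0
    for v in values:
        assert 0 <= v < base
        packed = packed * base + v
    return packed

def unpack_base_n(packed, base, count):
    """Recover `count` digits from a packed integer."""
    out = []
    for _ in range(count):
        packed, digit = divmod(packed, base)
        out.append(digit)
    return out[::-1]  # digits come out least-significant first

vals = [4, 0, 3, 1, 2, 2, 4, 0, 1, 3]
packed = pack_base_n(vals, 5)
assert packed.bit_length() <= 24      # 10 base-5 digits fit in 24 bits
assert unpack_base_n(packed, 5, 10) == vals
```

The trade-off is that extracting one digit now costs a division rather than a shift and mask, which is why per-value ceil(log2 N) packing is still common in practice.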
alberth · over 7 years ago
Are column-store databases still relevant on SSD/NVMe?

I ask because on a physical medium like a hard disk, storing data in column orientation can make a significant improvement to read operations.

But with SSD/NVMe, you don't have to worry about the inherent slowness of the physical platters in a hard disk.
rwmj · over 7 years ago
Possibly naive question, but isn't an index (in a classical relational database) the same as a column store?
argimenes · over 7 years ago
Not Invented Here, huh?
misterHN · over 7 years ago
Put data in text files, ASCII printable characters, one data point per line.

Put data files in a directory.

Name data files after columns.

Use a ".data" filename extension for data files.

Write a tool to create index files (append ".index" to the name of the input text file) that map record number to byte offset in the data file.

If data files are all < 4GB, use a 32-bit unsigned integer to represent the byte offset in the index file.

Each index file is a packed array of 32-bit integers.

Write a tool to create length files (".length") that count the number of entries in a data file.

Generate .length files for all data files.

Use mmap to access index files.

Use C for all of the above.

This is for variable-length data values. Not every column will have these, making the .index files redundant in that case; the .index files should not be created then, and program logic should support both uniform-length and nonuniform-length value access. The reason to prefer two access modes is to keep data from the .index files out of the cache when it is redundant.

When all of this is done, the next thing to do is write a tool to test the cache characteristics of your processor by implementing sorting algorithms and testing their performance. Unless you are using a GPU (why?), all data your algorithm touches will go through every level of the cache hierarchy, forcing other data out. If possible, use a tool that reports hardware diagnostics. These tools may be provided by the processor vendor.

Now, there is a trend to give the programmer control over cache behavior: https://stackoverflow.com/questions/9544094/how-to-mark-some-memory-ranges-as-non-cacheable-from-c

I don't know if this is worth exploring or a wild goose chase. It may improve performance for some tasks, but it sounds a little strange for the programmer to tell the computer how to use the cache... shouldn't the operating system do this?

Anyway, that's a start.
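The .data/.index layout described above can be sketched briefly; the comment prescribes C and mmap, but Python with plain file seeks keeps the sketch short, and the function names here are invented:

```python
import os
import struct
import tempfile

def write_column(dirname, column, values):
    """Write one variable-length column: values in <column>.data (one per
    line), <column>.data.index (packed uint32 byte offsets), and
    <column>.data.length (record count)."""
    data_path = os.path.join(dirname, column + ".data")
    offsets, pos = [], 0
    with open(data_path, "wb") as f:
        for v in values:
            offsets.append(pos)
            line = (v + "\n").encode("ascii")
            f.write(line)
            pos += len(line)
    with open(data_path + ".index", "wb") as f:
        f.write(struct.pack("<%dI" % len(offsets), *offsets))
    with open(data_path + ".length", "w") as f:
        f.write(str(len(values)))

def read_value(dirname, column, record):
    """Look up the record's byte offset in the index, then read its line."""
    data_path = os.path.join(dirname, column + ".data")
    with open(data_path + ".index", "rb") as f:
        f.seek(record * 4)  # each offset is a 4-byte little-endian uint32
        (offset,) = struct.unpack("<I", f.read(4))
    with open(data_path, "rb") as f:
        f.seek(offset)
        return f.readline().decode("ascii").rstrip("\n")

d = tempfile.mkdtemp()
write_column(d, "city", ["tokyo", "ulaanbaatar", "oslo"])
assert read_value(d, "city", 1) == "ulaanbaatar"
```

A real implementation following the comment would mmap the index file instead of seeking, and skip the index entirely for fixed-width columns, where record number times value width gives the offset directly.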
dogruck · over 7 years ago
It sure comes across as arrogant for Prof. Abadi to remark:

> I assume that the Arrow developers will eventually read my 2006 paper on compression in column-stores and expand their compression options to include other schemes which can be operated on directly (such as run-length-encoding and bit-vector compression).

In this blog post, I don't agree with:

> Almost every single major data-processing platform that has emerged in the last decade has been either open source.

That's somewhat true by definition. OTOH, I also know most financial firms use proprietary solutions (which leverage open source components).
WilsonPhillips · over 7 years ago
I think incorporating a blockchain element could prove an interesting way to implement this in practice.