Whenever I look at large aggregation benchmarks like this, I try to estimate cycles/value or, better, cycles/byte.

Take this query:

<pre><code> SELECT cab_type,
        count(*)
 FROM trips
 GROUP BY cab_type;
</code></pre>
This is just counting occurrences of distinct values from a bag of 1.1B total values.

He's got 8 cores @ 2.7GHz, which presumably can clock up for short bursts at least a bit even when they're all running all out. Let's say 3B cycles/core/second. So in .134 seconds (the best measured time) he's burning ~3.2B cycles to aggregate 1.1B values, or about 3 cycles/value.

While that's ridiculously efficient for a traditional row-oriented database, for a columnar scheme, as I'm sure OmniSciDB is using, it's less efficient than I might have expected.

Presumably the number of distinct cab types is relatively small, and you could dictionary-encode all possible values in a byte at worst. I'd expect opportunities both for computationally friendly compact encoding ("yellow" is presumably a dominant outlier and could make RLE quite profitable) and for SIMD data-parallel approaches that should let you roll through 4, 8, or 16 values in a cycle or two. Even adding LZ4 should only cost you about a cycle per byte.

That's not to denigrate OmniSciDB: they're already several orders of magnitude better than traditional database solutions, and plumbing all the way down from high-level SQL to bit-twiddling SIMD is no small feat. It's more that there's substantial headroom to make systems like this even faster, at least until you hit the memory bandwidth wall.
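To make the arithmetic concrete, here is a minimal C sketch of the kind of inner loop this query reduces to once cab_type is dictionary-encoded to a single byte. The row count, the one-byte dictionary, and the sub-histogram trick are all my own illustrative assumptions, not a description of what OmniSciDB actually does:

<pre><code>/* GROUP BY cab_type / COUNT(*) over a dictionary-encoded column is just a
 * histogram over the codes. Hypothetical sizes and data, for illustration. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_ROWS  (1u << 24)   /* stand-in for the 1.1B-row column        */
#define DICT_SIZE 256          /* assumes one-byte dictionary codes       */
#define LANES     4            /* independent sub-histograms (see below)  */

int main(void) {
    uint8_t *cab_type = malloc(NUM_ROWS);
    if (!cab_type) return 1;

    /* Fake data: code 0 dominates, the way "yellow" does in the real set. */
    for (uint32_t i = 0; i < NUM_ROWS; i++)
        cab_type[i] = (i % 10 == 0) ? 1 : 0;

    /* With a single counts[] array, back-to-back increments of the dominant
     * bucket serialize on store-to-load forwarding; spreading the counts
     * across a few independent copies keeps the loop closer to ~1 cycle
     * per value. */
    static uint64_t counts[LANES][DICT_SIZE];
    for (uint32_t i = 0; i < NUM_ROWS; i++)
        counts[i % LANES][cab_type[i]]++;

    for (int c = 0; c < DICT_SIZE; c++) {
        uint64_t total = 0;
        for (int l = 0; l < LANES; l++)
            total += counts[l][c];
        if (total)
            printf("code %d: %llu rows\n", c, (unsigned long long)total);
    }

    free(cab_type);
    return 0;
}
</code></pre>

A plain scalar loop like this should already land in roughly the 1-2 cycles/value range on a modern core; layering RLE or explicit SIMD on top is where the headroom below 3 cycles/value would come from.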
For those wanting to try it for themselves, we recently released a preview of our full stack for Mac (containing both OmniSciDB and our Immerse frontend for interactive visual analytics), available for free here: https://www.omnisci.com/mac-preview. This is a bit of an experiment for us, so we'd love your feedback! Note that the Mac preview doesn't yet have the scalable rendering capabilities our platform is known for, but stay tuned.

You can also install the open source version of OmniSciDB, either via tar/deb/rpm/Docker for Linux (https://www.omnisci.com/platform/downloads/open-source) or by following the build instructions for Mac in our git repo: https://github.com/omnisci/omniscidb (we hope to have standalone builds for Mac up soon). You can also run a Dockerized version on your Mac, but as a disclaimer, the performance, particularly around storage access, lags behind a bare-metal install.
It would be great to understand why OmniSciDB does so well on this benchmark but seems to do far less well on others.

The ClickHouse team was (obviously!) very interested in Mark's result and tried out OmniSciDB on the standard analytics benchmark that CH uses to check performance. Results are here:

https://presentations.clickhouse.tech/original_website/benchmark.html#[%2210000000%22,[%22ClickHouse%22,%22OmniSci%22],[%220%22,%221%22,%222%22]]

Anyway, really intriguing results from Mark. Looking forward to learning more about the source of the differences.

Disclaimer: I work at Altinity, which supports ClickHouse.

Edit: Fixed bad link
I'm almost entirely sure Litwintschik is misinformed about the GPUs in his laptop.

Yes, he does have the Intel GPU he mentioned, but if he paid $200 to upgrade the GPU as he claims, he would also have a dedicated AMD Radeon Pro 5500M 8GB.
I once contacted the author to ask him to check out my open source package and benchmark it, and he mentioned that he actually charges for the benchmarking exercise. So yeah.
> The GPU won't be used by OmniSciDB in this benchmark but for the record it's an Intel UHD Graphics 630 with 1,536 MB of GPU RAM. This GPU was a $200 upgrade over the stock GPU Apple ships with this notebook. Nonetheless, it won't have a material impact on this benchmark.

He lost me here... I get that it doesn't matter, but come on: if you don't know that your computer has a GPU other than the integrated graphics (which you admit you paid more to upgrade), then what are you really doing...
<pre><code> COPY trips
 FROM '/Users/mark/taxi_csv/*.gz'
 WITH (HEADER='false');
</code></pre>
> The above managed to complete in 31 minutes and 40 seconds. The resulting import produced 294 GB of data in OmniSciDB's internal format.

I'm really curious how a simple import (no indexes or data typing) into SQLite would compare. But I don't have 700GB of free SSD space to spare.
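For what it's worth, a "simple import" along those lines would look roughly like the C sketch below: one table, no indexes, loose typing, and one transaction around a prepared INSERT. The schema, file name, and synthetic rows are placeholders (a real run would stream the decompressed CSVs, or just use the sqlite3 shell's .mode csv / .import), so treat this as a sketch rather than a benchmark harness:

<pre><code>/* Hypothetical sketch of a bare-bones SQLite import: no indexes,
 * TEXT/REAL affinity only, everything inside a single transaction. */
#include <sqlite3.h>
#include <stdio.h>

int main(void) {
    sqlite3 *db;
    if (sqlite3_open("trips.db", &db) != SQLITE_OK) {
        fprintf(stderr, "open failed: %s\n", sqlite3_errmsg(db));
        return 1;
    }

    sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS trips (cab_type TEXT, trip_distance REAL);"
        "BEGIN;",
        NULL, NULL, NULL);

    sqlite3_stmt *ins = NULL;
    sqlite3_prepare_v2(db, "INSERT INTO trips VALUES (?1, ?2);", -1, &ins, NULL);

    /* A real import would parse rows out of the gzipped CSVs here. */
    for (int i = 0; i < 1000; i++) {
        sqlite3_bind_text(ins, 1, (i % 10 == 0) ? "green" : "yellow",
                          -1, SQLITE_STATIC);
        sqlite3_bind_double(ins, 2, 2.5);
        sqlite3_step(ins);
        sqlite3_reset(ins);
    }

    sqlite3_finalize(ins);
    sqlite3_exec(db, "COMMIT;", NULL, NULL, NULL);
    sqlite3_close(db);
    return 0;
}
</code></pre>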
For a list of benchmarks by the same author, see:

https://tech.marksblogg.com/benchmarks.html

Caveat: these benchmarks only test the simplest of operations, like aggregation (GROUP BY, COUNT, AVG) and sorts (ORDER BY). No JOINs or window operations are performed, and even basic filtering (WHERE) doesn't seem to have been tested. YMMV.
Side note: you can see a cool demo of OmniSci at https://www.aidemoshowcase.com/2020/07/08/omnisci/
I wonder why nobody uses the -C and --strip options of tar. They handle that stuff automatically.

<pre><code> mkdir -p application
tar --strip=1 -C application -xf archive.tar
</code></pre>
For example, installing NodeJS from an archive when combined with curl:

<pre><code> OS=$(uname -s)-x64
VER=12.16.1
curl -#L "https://nodejs.org/dist/v${VER}/node-v${VER}-${OS,,}.tar.gz" \
| sudo tar --strip=1 -xzC /usr/local
</code></pre>
IMHO that's probably the reason why most apps have proprietary installers :)
> https://tech.marksblogg.com/omnisci-macos-macbookpro-mbp.html

Is this what a keyword-stuffed URL looks like? This is terrible! It does nothing at all to communicate semantic meaning about the post's content.
The data does not fit in RAM, so I guess in the end it's about file formats and minimizing disk access; that's why some of the competitors' benchmark results are terrible, no?