Whenever I look at large aggregation benchmarks like this, I try to estimate cycles/value, or better, cycles/byte.

Take this query:

    SELECT cab_type,
           count(*)
    FROM trips
    GROUP BY cab_type;
This is just counting occurrences of distinct values from a bag of ~1.1B total values.

He's got 8 cores @ 2.7GHz, which presumably can clock up for short bursts at least a bit even when they're all running all out. Let's say 3B cycles/core/second. So in 0.134 seconds (the best measured time) he's burning ~3.2B cycles to aggregate 1.1B values, or about 3 cycles/value.

While that's ridiculously efficient for a traditional row-oriented database, for a columnar scheme, which I'm sure OmniSciDB is using, it's less efficient than I might have expected.

Presumably the number of distinct cab types is quite small, so you could dictionary-encode every possible value in a byte at worst. I'd expect opportunities both for computationally friendly compact encodings ("yellow" presumably dominates, which could make RLE quite profitable) and for SIMD data-parallel approaches that should let you roll through 4, 8, or 16 values in a cycle or two. (Rough sketches of both below.)

Even adding LZ4 on top should only cost you about a cycle per byte (third sketch below).

That's not to denigrate OmniSciDB: they're already several orders of magnitude better than traditional database solutions, and plumbing all the way down from high-level SQL to bit-twiddling SIMD is no small feat. It's more that there's substantial headroom to make systems like this even faster, at least until you hit the memory bandwidth wall.
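To make the SIMD point concrete, here's a minimal sketch (emphatically not OmniSciDB's actual kernel; the dictionary codes and function name are invented) of counting one dictionary-encoded cab_type with AVX2. It compares 32 one-byte codes per instruction, turns the matches into a bitmask, and popcounts it:

    // Count how many bytes in codes[0..n) equal `target`, e.g. a
    // hypothetical dictionary code for "yellow". Compile with -mavx2.
    #include <immintrin.h>
    #include <cstdint>
    #include <cstddef>

    std::size_t count_code(const std::uint8_t* codes, std::size_t n,
                           std::uint8_t target) {
        std::size_t count = 0;
        const __m256i needle = _mm256_set1_epi8(static_cast<char>(target));
        std::size_t i = 0;
        for (; i + 32 <= n; i += 32) {
            __m256i chunk = _mm256_loadu_si256(
                reinterpret_cast<const __m256i*>(codes + i));
            __m256i eq = _mm256_cmpeq_epi8(chunk, needle); // 0xFF where equal
            auto mask = static_cast<std::uint32_t>(
                _mm256_movemask_epi8(eq));  // one bit per matching code
            count += _mm_popcnt_u32(mask);
        }
        for (; i < n; ++i)  // scalar tail for the last < 32 codes
            count += (codes[i] == target);
        return count;
    }

That inner loop is a handful of instructions per 32 values, so well under a cycle per value before you hit memory bandwidth; with only a few distinct cab types, one such pass per type (or a scalar 256-bin histogram) covers the whole GROUP BY.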
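The RLE angle is even simpler: if a run-length-encoded column stores (code, length) pairs, the count collapses to one addition per run rather than one per row. A sketch, assuming a hypothetical run layout:

    #include <array>
    #include <cstdint>
    #include <cstddef>

    struct Run {
        std::uint8_t  code;    // dictionary code for cab_type
        std::uint32_t length;  // consecutive rows sharing that code
    };

    // count(*) GROUP BY cab_type over an RLE column: sum run lengths per code.
    std::array<std::uint64_t, 256> count_by_code(const Run* runs,
                                                 std::size_t n_runs) {
        std::array<std::uint64_t, 256> counts{};  // zero-initialized
        for (std::size_t i = 0; i < n_runs; ++i)
            counts[runs[i].code] += runs[i].length;
        return counts;
    }

If "yellow" really dominates, 1.1B rows could collapse to comparatively few runs, and the query only ever touches those.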
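And the LZ4 point, as a sketch using the stock lz4 C API (the per-chunk layout and the count_code() from the first sketch are my assumptions): decompress a chunk of codes into a scratch buffer, then run the same counting kernel over it.

    #include <lz4.h>
    #include <vector>
    #include <cstdint>
    #include <cstddef>

    std::size_t count_code(const std::uint8_t*, std::size_t, std::uint8_t);

    std::size_t count_code_lz4(const char* compressed, int compressed_size,
                               int raw_size, std::uint8_t target) {
        std::vector<char> scratch(raw_size);
        int n = LZ4_decompress_safe(compressed, scratch.data(),
                                    compressed_size, raw_size);
        if (n < 0) return 0;  // corrupt chunk; real code would surface this
        return count_code(
            reinterpret_cast<const std::uint8_t*>(scratch.data()),
            static_cast<std::size_t>(n), target);
    }

That adds roughly a cycle per decompressed byte on top of the scan, in exchange for pulling far fewer bytes through the memory bus.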