1.1B Taxi Rides Using OmniSciDB and a MacBook Pro

205 points by tmostak, almost 5 years ago

15 comments

twoodfin, almost 5 years ago
Whenever I look at large aggregation benchmarks like this, I try to estimate cycles/value or, better, cycles/byte.

Take this query:

    SELECT cab_type, count(*) FROM trips GROUP BY cab_type;

This is just counting occurrences of distinct values from a bag of 1.1B total values.

He's got 8 cores @ 2.7GHz, which presumably can clock up for short bursts at least a bit even when they're all running all out. Let's say 3B cycles/core/second. So in .134 seconds (the best measured time) he's burning ~3.2B cycles to aggregate 1.1B values, or about 3 cycles/value.

While that's ridiculously efficient for a traditional row-oriented database, for a columnar scheme, as I'm sure OmniSciDB is using, it's less efficient than I might have expected.

Presumably the number of distinct cab types is relatively small, and you could dictionary-encode all possible values in a byte at worst. I'd expect opportunities both for computationally friendly compact encoding ("yellow" is presumably a dominant outlier and could make RLE quite profitable) and for SIMD data-parallel approaches that should let you roll through 4, 8, or 16 values in a cycle or two.

Even adding LZ4 should only cost you about a cycle a byte.

That's not to denigrate OmniSciDB: they're already several orders of magnitude better than traditional database solutions, and plumbing all the way down from high-level SQL to bit-twiddling SIMD is no small feat. It's more that there's substantial headroom to make systems like this even faster, at least until you hit the memory bandwidth wall.
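The estimate above can be reproduced in a few lines, along with a toy version of the RLE idea. This is only a sketch: the function names are made up for illustration, the 3B cycles/core/second figure is the comment's own guess, and nothing here reflects how OmniSciDB actually stores or scans the column.

```python
from collections import Counter
from itertools import groupby


def cycles_per_value(cores, cycles_per_core_per_second, seconds, values):
    """Back-of-envelope cost: total cycles available divided by values processed."""
    return cores * cycles_per_core_per_second * seconds / values


def rle_encode(column):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]


def count_by_value(runs):
    """GROUP BY + COUNT over an RLE column: one addition per run, not per row."""
    counts = Counter()
    for value, run_length in runs:
        counts[value] += run_length
    return counts


# Figures from the comment: 8 cores, ~3B cycles/core/second, 0.134 s, 1.1B rows.
estimate = cycles_per_value(8, 3e9, 0.134, 1.1e9)
print(f"{estimate:.2f} cycles/value")

# Toy column, hypothetical data: long runs of one dominant value compress well.
runs = rle_encode(["yellow"] * 5 + ["green"] * 2 + ["yellow"] * 3)
print(runs, dict(count_by_value(runs)))
```

The point of the second half is that the aggregation cost scales with the number of runs rather than the number of rows, which is why a dominant value like "yellow" would make RLE attractive if the data happens to be clustered.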
tmostak, almost 5 years ago
For those wanting to try it for themselves, we recently released a preview of our full stack for Mac (containing both OmniSciDB and our Immerse frontend for interactive visual analytics), available for free here: https://www.omnisci.com/mac-preview. This is a bit of an experiment for us, so we'd love your feedback! Note that the Mac preview doesn't yet have the scalable rendering capabilities our platform is known for, but stay tuned.

You can also install the open source version of OmniSciDB, either via tar/deb/rpm/Docker for Linux (https://www.omnisci.com/platform/downloads/open-source) or by following the build instructions for Mac in our git repo: https://github.com/omnisci/omniscidb (hopefully we will have standalone builds for Mac up soon). You can also run a Dockerized version on your Mac, but as a disclaimer, the performance, particularly around storage access, lags a bare metal install.
hodgesrm, almost 5 years ago
It would be great to understand why OmniSciDB does so well on this benchmark but seems to do far less well on others.

The ClickHouse team was (obviously!) very interested in Mark's result and tried out OmniSciDB on the standard analytics benchmark that CH uses to check performance. Results are here: https://presentations.clickhouse.tech/original_website/benchmark.html#[%2210000000%22,[%22ClickHouse%22,%22OmniSci%22],[%220%22,%221%22,%222%22]]

Anyway, really intriguing results from Mark. Looking forward to learning more about the source of the differences.

Disclaimer: I work at Altinity, which supports ClickHouse.

Edit: Fixed bad link
skavi, almost 5 years ago
I’m almost entirely sure Litwintschik is misinformed with regard to the GPUs in his laptop.

Yes, he does have the Intel GPU he mentioned, but if he paid $200 to upgrade the GPU as he claims, he would also have a dedicated AMD Radeon Pro 5500M 8GB.
xiaodai, almost 5 years ago
I once contacted the author to check out my open source package and benchmark it and he mentioned that he actually charges for the benchmarking exercise. So yeah.
angryyellowman, almost 5 years ago
> The GPU won't be used by OmniSciDB in this benchmark but for the record it's an Intel UHD Graphics 630 with 1,536 MB of GPU RAM. This GPU was a $200 upgrade over the stock GPU Apple ships with this notebook. Nonetheless, it won't have a material impact on this benchmark.

He lost me here... I get that it doesn't matter, but come on, if you don't know that your computer has a GPU other than the integrated graphics (that you admit you paid more to upgrade) then what are you really doing...
danso, almost 5 years ago
    COPY trips FROM '/Users/mark/taxi_csv/*.gz' WITH (HEADER='false');

> The above managed to complete in 31 minutes and 40 seconds. The resulting import produced 294 GB of data in OmniSciDB's internal format.

I’m really curious how a simple import (no indexes or data typing) into SQLite would compare. But I don’t have 700GB of free SSD space to spare.
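For anyone who wants to try the SQLite comparison, a minimal sketch of an untyped, index-free import of one gzipped, header-less CSV file might look like the following. The helper name `import_gz_csv` and the generated column names (`c0`, `c1`, ...) are hypothetical, the column count must be supplied by the caller, and no claim is made here about how fast this would be on the full dataset.

```python
import csv
import gzip
import sqlite3


def import_gz_csv(conn, table, gz_path, n_columns):
    """Stream one gzipped, header-less CSV file into an untyped SQLite table."""
    # Columns declared without a type, so SQLite stores each value as given.
    columns = ", ".join(f"c{i}" for i in range(n_columns))
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({columns})")

    placeholders = ", ".join("?" * n_columns)
    with gzip.open(gz_path, "rt", newline="") as handle:
        # executemany pulls rows lazily from csv.reader, so the file is
        # never held in memory all at once.
        conn.executemany(
            f"INSERT INTO {table} VALUES ({placeholders})",
            csv.reader(handle),
        )
    conn.commit()
```

To mirror the `COPY` statement above, one would call this once per file matched by `glob.glob('/Users/mark/taxi_csv/*.gz')` against a single on-disk database connection.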
wenc, almost 5 years ago
Here is a list of benchmarks by the same author:

https://tech.marksblogg.com/benchmarks.html

Caveat: these benchmarks only test the simplest of operations, like aggregation (GROUP BY, COUNT, AVG) and sorts (ORDER BY). No JOINs or window operations are performed. Even basic filtering (WHERE) doesn't seem to have been tested. YMMV.
rexreed, almost 5 years ago
Side note: you can see a cool demo of OmniSci at https://www.aidemoshowcase.com/2020/07/08/omnisci/
pvtmert, almost 5 years ago
I wonder why nobody uses the -C and --strip options of tar. They handle that stuff automatically.

    mkdir -p application
    tar --strip=1 -C application -xf archive.tar

For example, installing NodeJS from an archive when composed with curl:

    OS=$(uname -s)-x64
    VER=12.16.1
    curl -#L "https://nodejs.org/dist/v${VER}/node-v${VER}-${OS,,}.tar.gz" \
      | sudo tar --strip=1 -xzC /usr/local

IMHO that's probably the reason why most apps have proprietary installers :)
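For scripts that can't rely on the tar binary, roughly the same effect as `tar --strip=1 -C application -xf archive.tar` can be sketched with Python's standard `tarfile` module. The helper name `extract_stripped` is made up for illustration, and this simplified version does not rewrite hard-link targets inside the archive, so treat it as a starting point only.

```python
import os
import tarfile
from pathlib import PurePosixPath


def extract_stripped(archive_path, dest, strip=1):
    """Extract an archive into dest, dropping the first `strip` path components."""
    os.makedirs(dest, exist_ok=True)  # equivalent of `mkdir -p application`
    with tarfile.open(archive_path) as tar:
        members = []
        for member in tar.getmembers():
            parts = PurePosixPath(member.name).parts
            if len(parts) <= strip:
                continue  # nothing left after stripping, e.g. the top-level dir
            member.name = str(PurePosixPath(*parts[strip:]))
            members.append(member)
        tar.extractall(dest, members=members)
```

Calling `extract_stripped("archive.tar", "application")` would place `pkg/bin/tool` from the archive at `application/bin/tool`.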
zX41ZdbW, almost 5 years ago
Follow-up: https://news.ycombinator.com/item?id=23990844
coolgeek, almost 5 years ago
> https://tech.marksblogg.com/omnisci-macos-macbookpro-mbp.html

Is this what a keyword-stuffed URL looks like? This is terrible! This does nothing at all to communicate semantic meaning about the post's content.
popotamonga, almost 5 years ago
The data does not fit in RAM, so I guess in the end it's about file formats and minimizing disk access. That's why some of the competition's benchmarks are terrible, no?
antb123, almost 5 years ago
I wonder how it compares with a similar pandas instance?
innocenat, almost 5 years ago
Someone should put 'analytic' in the title. It makes no sense to me right now.