Breaking the trillion-rows-per-second barrier with MemSQL

123 points by navinsylvester about 7 years ago

11 comments

JVerstry about 7 years ago
I have doubts about the "We're actually spending some time on every row" claim. Saving data in a columnstore often comes with metadata stored at the page level, such as min, max, and count values for the data in that page. These values are used to filter and optimize the flow of data processed for a query (I mean 'skipped' here). If you run a 'count' query 10 times, it's very unlikely the DB will count rows 10 times. It will rely on the page's existing metadata when available (i.e., already computed). The tests described in the post are misleading IMHO.

EDIT: This comes on top of the fact that DBs can cache query results too. Moreover, the post does not say whether they implemented clustered or filtered indexes on the columns in question, nor does it explain how partitioning was done. All of this has a big impact on execution time.
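The page-level metadata shortcut the comment describes can be sketched roughly like this. The structures and names below are hypothetical illustrations of the general columnstore technique, not MemSQL's actual internals:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical per-page metadata as kept by a columnstore engine:
 * many engines store something like this per segment/row group. */
typedef struct {
    int64_t min;        /* smallest value in the page  */
    int64_t max;        /* largest value in the page   */
    int64_t row_count;  /* number of rows in the page  */
} page_meta;

/* COUNT(*) WHERE value BETWEEN lo AND hi, answered from metadata
 * where possible. A page whose [min, max] range lies entirely inside
 * the predicate is counted without reading a single row; a page
 * entirely outside it is skipped. Only boundary pages would need a
 * real scan (elided here). */
int64_t count_between(const page_meta *pages, size_t n_pages,
                      int64_t lo, int64_t hi) {
    int64_t total = 0;
    for (size_t i = 0; i < n_pages; i++) {
        if (pages[i].min >= lo && pages[i].max <= hi) {
            total += pages[i].row_count;   /* metadata only, no row access */
        } else if (pages[i].max < lo || pages[i].min > hi) {
            /* page eliminated entirely */
        } else {
            /* boundary page: would require scanning the actual rows */
        }
    }
    return total;
}
```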
nmstoker about 7 years ago
Impressive, although reading that they used 448 cores on super-expensive Intel lab machines takes the edge off a little.
ddorian43 about 7 years ago
So, some questions:

1. Aren't (3) vectorization and (4) SIMD the same thing?

2. I don't see the data size before/after compression?

3. How much RAM does each server have?

4. How do all cores work for all queries? Is the data sharded by core on each machine, or can each core work on whatever data?

5. What's a comparable open-source tool to this? The only one I can think of is snappydata.
danbruc about 7 years ago
Meh. They used 448 cores to count the frequency of bit patterns of some small length in a probably more or less contiguous block of memory. They had 57,756,221,440 total rows, which comes to 128,920,138 rows per core. If the data set contained 256 or fewer distinct stock symbols, then the task boils down to finding the byte histogram of a 123 MiB block of memory. My several-years-old laptop does this with the most straightforward C# implementation in 170 ms. That is less than a factor of 4 away from their 45.1 ms, and given that AVX-512 can probably process 64 bytes at a time, we should have quite a bit of room to spare for all the other steps involved in processing the query.

Don't get me wrong, in some sense it is really impressive that we reached that level of processing power and that this database engine can optimize that query down to counting bytes and generating highly performant code to do so, but as an indicator that this database can process trillions of rows per second it is just a publicity stunt. Sure, it can do it with this setup and this query, but don't be too surprised if you don't get anywhere near that with other queries.
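The "most straightforward" byte histogram the comment refers to (theirs was in C#) looks roughly like this; a minimal sketch in C of the same idea, with illustrative names, and of course different timings than the numbers quoted above:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Straightforward byte histogram: one pass over the buffer, one
 * counter per possible byte value. With <= 256 distinct stock
 * symbols encoded as one byte each, a COUNT(*) ... GROUP BY symbol
 * over the block reduces to exactly this loop. */
void byte_histogram(const uint8_t *data, size_t len, uint64_t counts[256]) {
    memset(counts, 0, 256 * sizeof(uint64_t));
    for (size_t i = 0; i < len; i++) {
        counts[data[i]]++;
    }
}
```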
jnordwick about 7 years ago
And of course, how does it compare to kdb? It seems less expensive, but it also lacks the advanced query language.

The last tests I saw for kdb were the 1.1 billion taxi rides benchmark:

http://tech.marksblogg.com/billion-nyc-taxi-kdb.html

There it basically outperformed every other CPU-based system on slightly more complex queries.

Any comparisons planned?
thinkMOAR about 7 years ago
"When you deliver response time that drops down to about a quarter of a second, results seem to be instantaneous to users."

I don't think everybody agrees with this statement.
n0tme about 7 years ago
If all this data just fits in memory, then what is surprising about the speed?
paulsutter about 7 years ago
How fast is data import? Loading into RAM? (For example, booting up a cluster for an existing imported database on AWS.)

Working with some datasets with hundreds of billions of short rows; curious to give it a try.
tabtab about 7 years ago
The speed of light is a hard limit. I don't believe there is any free lunch[1], only trade-offs to manage. I'm skeptical of any claim that implies free or easy speed without potentially significant trade-offs.

If you can live with somewhat out-of-date and/or out-of-sync data, you can throw mass parallelism at big read-only queries to get speed. The trade-offs are often best tuned from a domain perspective, such that it's not really a technology problem, although technology may make certain tunings/trade-offs easier to manage.

[1] Faster hardware may give us incremental improvements, but the speed of light probably prevents any trade-off-free breakthroughs.
amelius about 7 years ago
How fast can it sort those rows?
EGreg about 7 years ago
Who from HN would need this, and why?

Serious question. I would like to know different real use cases from people on HN, given our backgrounds.