Performance in Big Data Land: Every CPU cycle matters

66 points | by ptothek2 | over 9 years ago

12 comments

zmmmmm · over 9 years ago
So this person says every CPU cycle matters, and then immediately takes the single CPU cycle and multiplies it by billions, the scale of their data.

So no, it's not "every" CPU cycle; it's the ones that scale with the highest dimension of your data that matter. Which is the same old story we have always had: save your energy for optimising the parts that matter, because the ones that matter probably matter orders of magnitude more than the ones that don't.
GeneralMayhem · over 9 years ago
> If AUTOCOMMIT = ON (the JDBC driver default), each statement is treated as a complete transaction. When a statement completes, changes are automatically committed to the database. When AUTOCOMMIT = OFF, the transaction continues until COMMIT or ROLLBACK is run manually. Locks are kept on objects for the duration of the transaction.

This made me cringe. Whether a series of operations takes place in one transaction or many isn't something you can just turn on and off depending on what looks more expensive!

The article ended up suggesting more transactionality, which is generally good (although the reason given is not the important one, namely "you're less likely to have all your data completely ruined"), but if you make the process distributed and aren't careful about sharding, you may end up trading average-case cost in network load for much worse worst-case cost due to lock contention and transaction failures.

Optimizing database access patterns at scale is *hard*, and blithely making major changes to things that impact correctness is not the way to do it.
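To make the knob concrete, here is a minimal JDBC sketch of the AUTOCOMMIT = OFF pattern the article quotes, batching many inserts into one explicit transaction. The connection URL, credentials, table, and row count are hypothetical placeholders; the setAutoCommit/commit/rollback calls are the standard java.sql API:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class OneTransactionBatch {
    public static void main(String[] args) throws SQLException {
        // Hypothetical URL/credentials/table; autocommit semantics
        // are the same for any JDBC driver.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:vertica://host:5433/db", "user", "pass")) {
            conn.setAutoCommit(false); // OFF: statements accumulate in one transaction
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO events (id, val) VALUES (?, ?)")) {
                for (int i = 0; i < 10_000; i++) {
                    ps.setInt(1, i);
                    ps.setInt(2, i * 2);
                    ps.addBatch();   // no per-statement commit overhead
                }
                ps.executeBatch();
                conn.commit();       // locks taken by the inserts are held until here
            } catch (SQLException e) {
                conn.rollback();     // all-or-nothing: one failure undoes the whole batch
                throw e;
            }
        }
    }
}
```

This is exactly the correctness trade-off the parent warns about: the batch is now atomic, but every lock it takes is held until the final commit().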
jrbancel · over 9 years ago
Is 100 billion (on the order of a few TB) Big Data?

In my experience, CPU is rarely the big issue when dealing with a lot of data (I am talking about tens of PB per day). IO is the main problem, and designing systems that move the least amount of data is the real challenge.
gtrubetskoy · over 9 years ago
CPU is probably not the best example, but the point is very valid: at 100B scale, anything is large.

We humans are not very good at appreciating orders of magnitude. I usually explain it this way: if it takes you 1 hour to process 1M records, then 10M will take 10 hours, 100M will take 4.2 days, and 10B will take over a year.
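The arithmetic here is plain linear extrapolation; a small sketch reproducing it (the 1-hour-per-million rate is the commenter's assumption):

```java
public class LinearScaling {
    public static void main(String[] args) {
        final double hoursPerMillion = 1.0; // assumed rate: 1 hour per 1M records
        long[] counts = {1_000_000L, 10_000_000L, 100_000_000L, 10_000_000_000L};
        for (long n : counts) {
            double hours = hoursPerMillion * n / 1_000_000.0;
            System.out.printf("%,14d records -> %,10.1f hours (%.1f days)%n",
                    n, hours, hours / 24.0);
        }
    }
}
```

Running it gives 100 hours (4.2 days) for 100M records and 10,000 hours, roughly 417 days, for the 10B case.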
brendangregg · over 9 years ago
I hope later posts in this series explore Linux perf_events or flame graphs, which is the origin of the (unattributed) background image (http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html). :)
syed99 · over 9 years ago
"Different data types will force Vertica to use a different number of CPU cycles to process a data point." At the end of the day, that performance bump comes down to the data itself; sometimes the saving in CPU cycles won't be as significant as expected.

Would love to see whether the performance bump remains significant on a much larger and more complex data set.
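One way to see why type width matters at this scale: narrower values mean fewer bytes streamed through memory and cache per row, which is where many of the cycles go. A back-of-the-envelope sketch of raw column footprints at 100B rows (real Vertica storage is encoded and compressed, so treat these numbers as illustrative only):

```java
public class ColumnFootprint {
    public static void main(String[] args) {
        final long rows = 100_000_000_000L; // 100B rows, the scale in the article
        int[] bytesPerValue = {1, 2, 4, 8}; // e.g. 1-byte through 8-byte integer widths
        for (int w : bytesPerValue) {
            double tib = (double) rows * w / (1L << 40); // raw, uncompressed size
            System.out.printf("%d-byte values: %.2f TiB of raw column data to scan%n",
                    w, tib);
        }
    }
}
```

Going from 8-byte to 2-byte values cuts the raw column from about 0.73 TiB to 0.18 TiB per scan.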
martin_ · over 9 years ago
Am I misunderstanding something? If one CPU cycle accounts for 27 seconds, then the savings of 10 seconds suggest we saved one half of a CPU cycle per iteration? Or do the queries not touch every row?

Optimizing data types and minimizing locks seem like general optimization tips; I was hoping for more advanced techniques for 100B rows.
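For what it's worth, the implied arithmetic can be made explicit. A sketch, assuming a single core touching every row exactly once, with the ~3.7 GHz clock back-derived from the quoted 27-second figure:

```java
public class CycleMath {
    public static void main(String[] args) {
        final long rows = 100_000_000_000L; // 100B rows
        final double hz = 3.7e9;            // assumed clock, implied by "1 cycle = 27 s"
        double secondsPerCyclePerRow = rows / hz;                 // ~27 s for 1 cycle/row
        double cyclesSavedPerRow = 10.0 / secondsPerCyclePerRow;  // a 10 s saving
        System.out.printf("1 cycle/row over %,d rows takes ~%.0f s on one core%n",
                rows, secondsPerCyclePerRow);
        System.out.printf("so a 10 s saving is ~%.2f cycles/row%n", cyclesSavedPerRow);
    }
}
```

This comes out to about 0.37 cycles per row, so "one half" is in the right ballpark only if every row is touched once on a single core.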
shulu · over 9 years ago
Totally agree with "every CPU cycle matters". It might be easier to save CPU cycles by reducing I/O, exploiting data locality (within datacenter racks), or using better serialization (binary, columnar, or indexed).

Reducing locking and using shorter data types seem inadequate for the "Big Data" scene.
andmarios · over 9 years ago
Then why is big data land dominated by JVM-based frameworks?
sargun · over 9 years ago
I might suggest a new definition for "Big Data": data whose size is greater than what fits in one machine's memory.
yummyfajitas · over 9 years ago
Is this really about CPU rather than disk? I don't see anywhere where he attempted to control for disk IO by padding the integers.

In fact, since Vertica is column oriented, I don't think you can pad things easily.
mobiuscog · over 9 years ago
I'm struggling to understand why they wouldn't just use non-locking selects instead of turning auto-commit off.

Does auto-commit add *additional* lock overhead for some reason?