Running Awk in parallel to process 256M records

35 points by ketanmaheshwari, over 3 years ago

4 comments

winrid, over 3 years ago
Late night semi-related rant while I can't sleep.

I worked at one place where we had this big distributed system for processing about 1M table rows (recalculate some stuff with the latest code to see if the latest code has regressions).

I joined a couple of years after launch, and it took months to get it working okay and have good visibility on it.

It took about eight hours to run; eventually we got it down to three. The actual calculations only took like a second per object, so with 24 or so VMs you get the 8 hours. Sometimes it would take too long and the cron would seed the next batch of items in the queue without checking if it was empty, resulting in a process that never finished!

You're probably thinking, just add more nodes! Scale horizontally! Well, we were on Kubernetes. Except we weren't allowed to use k8s. We had to use this abstraction provided by devops AROUND k8s. This framework had some limitations.

Also, simply scaling horizontally would take down the production DB due to the number of connections, instead of, say, using multithreading and reusing connections.

I had a solution that ran through all the data on my MacBook with GNU parallel in less than an hour, but I could never convince the architects to let me deploy it. :)

So, distributed stuff can be really nice. But if you're having trouble building the simple version done well, probably don't make it distributed yet. I might still have PTSD from "hey can you run the Thing on your laptop again? The Thing in prod won't finish for another 9 hours."
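For illustration only, here is a minimal Python sketch of the "multithreading and reusing connections" idea mentioned above. It is not the commenter's actual solution (that used GNU parallel). `FakeConnection`, `recalculate`, and the worker count are hypothetical stand-ins; the point is just that a fixed pool of workers, each holding one connection, caps the number of connections no matter how many rows are processed.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeConnection:
    """Hypothetical stand-in for a DB connection; counts how many get opened."""

    opened = 0
    _lock = threading.Lock()

    def __init__(self):
        with FakeConnection._lock:
            FakeConnection.opened += 1

    def recalculate(self, row_id):
        # Placeholder for the real per-row calculation (an assumption).
        return row_id * 2


_local = threading.local()


def get_connection():
    """Return this thread's connection, opening it only on first use."""
    if not hasattr(_local, "conn"):
        _local.conn = FakeConnection()
    return _local.conn


def process(row_id):
    return get_connection().recalculate(row_id)


def run(row_ids, workers=8):
    # The pool size bounds the number of threads, and so the connections.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process, row_ids))


results = run(range(1000), workers=8)
print(len(results), "rows processed using", FakeConnection.opened, "connections")
```

With 1,000 rows and 8 workers, at most 8 connections are ever opened, whereas one connection per job or per node would scale the connection count with the amount of parallelism.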
eggy, over 3 years ago
AWK seems to be having a renaissance, and I wonder if it is only because Perl sort of lost favor for a while to Python and others, while Perl 6 being renamed Raku added further confusion. Using options like 'perl -pie' gives you a nice subsystem to perform your AWK-like operations, and you have a lot more in Perl to back it up if needed. I am seeing AWK pop up here and in other forums, but maybe I am just focused on it for now.
asicsp, over 3 years ago
Previous discussion: https://news.ycombinator.com/item?id=23394024

See also "Command-line Tools can be 235x Faster than your Hadoop Cluster": https://news.ycombinator.com/item?id=22188877
nafizh, over 3 years ago
You can definitely use awk (I use it myself). But let's not pretend it's readable for anyone after the original writer. It has a single purpose, and that is to get the text munging task in front of you done as quickly as possible.