
Running Awk in parallel to process 256M records

358 points · by ketanmaheshwari · almost 5 years ago

22 comments

tetha · almost 5 years ago
Hm. I'm fully aware that I'm currently turning into a bearded DBA. And I may be just misreading the article and I probably don't understand the article.

But, I started being somewhat confused by something:

> Fortunately, I had access to a large-memory (24 T) SGI system with 512-core Intel Xeon (2.5GHz) CPUs. All the IO is memory (/dev/shm) bound ie. the data is read from and written to /dev/shm.

> The total data size is 329GB.

At first glance, that's an awful lot of hardware for a ... decently sized but not awfully large dataset. We're dealing with datasets that size at 32G or 64G of RAM, just a wee bit less.

The article presents a lot more AWK knowledge than I have. I'm impressed by that. I acknowledge that.

But I'd probably put all of that into a postgres instance, compute indexes and rely on automated query optimization and parallelization from there. Maybe tinker with pgstorm to offload huge index operations to a GPU. A lot of the shown scripting would be done by postgres, the parallelization is done automatically based on indexes, while eliminating the string serializations.

I do agree with the underlying sentiment of "We don't need hadoop". I'm impressed that AWK goes so far. I'd still recommend postgres in this case as a first solution. Maybe I just work with too many silly people at the moment.
tobias2014 · almost 5 years ago
I'm sure that with tools like MPI-Bash [1] and more generally libcircle [2] many embarrassingly parallelizable problems can easily be tackled with standard *nix tools.

[1] https://github.com/lanl/MPI-Bash

[2] http://hpc.github.io/libcircle/
tannhaeuser · almost 5 years ago
It's odd that TFA has this focus on performance but doesn't mention *which* awk implementation was used; at least I haven't found any mention of it. There are 3-4 implementations in mainstream use: nawk (the one true awk, an ancient version of which is installed on macOS by default), mawk (installed on e.g. Ubuntu by default), gawk (on RHEL by default, last I checked), or busybox awk. Tip: mawk is much faster than the others, and to get performance out of gawk you should use LANG=C (and also because of crashes with complex regexps in Unicode locales in some versions of gawk 3 and 4).
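A minimal sketch of the locale tip, using hypothetical throwaway data (the effect is most pronounced with gawk, whose regex engine is multibyte-aware in UTF-8 locales; substitute `gawk` for `awk` below to measure it on your system):

```shell
# Build a throwaway two-column file (hypothetical test data)
seq 1000000 | awk '{print $1, $1 * 2}' > /tmp/sample.txt

# Count rows whose second field ends in "42". Prefixing LC_ALL=C forces the
# C locale, which lets gawk skip multibyte character handling while matching.
LC_ALL=C awk '$2 ~ /42$/ { n++ } END { print n }' /tmp/sample.txt
# prints 20000
```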
hidiegomariani · almost 5 years ago
This somewhat reminds me of Taco Bell programming: http://widgetsandshit.com/teddziuba/2010/10/taco-bell-programming.html
Upvoter33 · almost 5 years ago
I've always wanted to build a parallel awk. And call it pawk. And have an O'Reilly book about it. With a chicken on the cover. pawk, pawk, pawk! This is a true story, sadly.
svnpenn · almost 5 years ago
    !($1 in a) && FILENAME ~ /aminer/ { print }

This uses a regular expression. As regex is not actually needed in this case, you might be able to get better performance with something like this:

    !($1 in a) && index(FILENAME, "aminer") != 0 { print }
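A quick sanity check that the two filters select the same lines (the filenames and data here are hypothetical, not from the article; with no dedup array populated, `!($1 in a)` is always true):

```shell
# Two hypothetical input files; only the one named like "aminer" should match
mkdir -p /tmp/awkdemo
printf 'x 1\ny 2\nz 3\n' > /tmp/awkdemo/aminer_part1.txt
printf 'x 9\nw 4\n'      > /tmp/awkdemo/other.txt
cd /tmp/awkdemo

# Regex version and index() version should print identical output
awk '!($1 in a) && FILENAME ~ /aminer/ { print }' aminer_part1.txt other.txt
awk '!($1 in a) && index(FILENAME, "aminer") != 0 { print }' aminer_part1.txt other.txt
```

Both commands print only the three lines from `aminer_part1.txt`.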
co_dh · almost 5 years ago
I like the idea of using AWK for this. But you could give kdb/q a try. 250M rows is nothing for kdb, and it seems you can afford the license.
mjcohen · almost 5 years ago
gawk has been my go-to text processing program for many years. I have written a number of multi-thousand-line programs in it. I always use the lint option; it catches many of my errors. One of these programs had to read a 300,000,000-byte file into a single string so it could be searched. The file was in 32-byte lines. At first, I read each line in and appended it to the result string, but that took way too long, since the result string was reallocated each time it was appended to. So I read in about 1000 of the 32-byte lines, appending them to a local string. This 32000-byte string was then appended to the result, so the expensive append was only done about 10000 times. Worked fine.
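The buffering trick described above can be sketched in awk. This is a small-scale, hypothetical reconstruction (1000 short lines instead of the commenter's ~300 MB file), but the chunking structure is the same:

```shell
# Generate 1000 fixed-width lines of hypothetical test data
awk 'BEGIN { for (i = 1; i <= 1000; i++) printf "line%04d\n", i }' > /tmp/lines.txt

# Appending every line directly to one result string forces a reallocation per
# append (quadratic overall). Buffering ~100 lines into a chunk first, then
# appending the chunk, cuts the number of large reallocations by ~100x.
awk '
{
    buf = buf $0
    if (++n % 100 == 0) { result = result buf; buf = "" }
}
END {
    result = result buf              # flush the final partial chunk
    print length(result)             # 1000 lines x 8 chars = 8000
}' /tmp/lines.txt
```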
FDSGSG · almost 5 years ago
Spending *minutes* on these tasks on hardware like this is pretty silly. awk is fine if these are just one-off scripts where development time is the priority; otherwise you're wasting tons of compute time.

Querying things like these on such a small dataset should take seconds, not minutes.
ineedasername · almost 5 years ago
Definitely an under-appreciated tool. Very useful for one-off tasks that would take a fair bit longer to code in something like python.
Zeebrommer · almost 5 years ago
I am often impressed by the things that can be done with these old-school UNIX tools. I'm trying to learn a few of them, and the most difficult part is these very implicit syntax constructions. How is the naive observer to know that in bash `$(something)` is command substitution, but in a Makefile `$(something)` is just a normal variable? With `awk`, `sed` and friends it gets even worse, of course.

Is the proper answer 'just learn it'? Are these tools one of those things (like musical instruments or painting) where the initial learning phase is tedious and frustrating, but the potential is basically limitless?
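The bash/Makefile contrast above can be shown side by side. A minimal sketch (the file `/tmp/demo.mk` and variable names are made up for illustration; assumes `make` is installed):

```shell
# In bash, $(...) is command substitution: it runs the command and
# substitutes its output
year=$(date +%Y)
echo "bash: $year"

# In a Makefile the identical syntax is plain variable expansion; to run a
# shell command you need make's $(shell ...) function instead
printf 'YEAR := $(shell date +%%Y)\nall:\n\t@echo "make: $(YEAR)"\n' > /tmp/demo.mk
make -f /tmp/demo.mk
```

Both commands print the current year, but only because the Makefile explicitly routes through `$(shell ...)`; a bare `$(date +%Y)` in make would expand an (empty) variable named `date +%Y`.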
arendtio · almost 5 years ago
First, I think it is great that you found a tool that suits your needs. A few weeks ago I was mangling some data too (just about 17 million records) and would like to contribute my experience.

My tools of choice were awk, R, and Go (in that order). Sometimes I could calculate something within a few seconds with awk. But for various calculations, R proved to be a lot faster. At some point, I reached a problem where the simple R implementation I borrowed from Stack Overflow (which was supposed to be much faster than the other posted solutions) did not satisfy my expectations, and I spent 4 hours writing an implementation in Go which was an order of magnitude faster (I think it was about 20 minutes vs. 20 seconds).

So my advice is to broaden your toolset. When you reach the point where a single execution of your awk program takes 48 minutes, it might be worth considering another tool. However, that doesn't mean awk isn't a good tool; I still use it for simple things, as writing 2 lines in awk is much faster than writing 30 in Go for the same task.
schmichael · almost 5 years ago
https://mobile.twitter.com/awkdb was a joke account made in frustration by a coworker trying to operate a Hadoop cluster almost a decade ago. Maybe it's time to hand over the account...
gautamcgoel · almost 5 years ago
Your system had 512-core Xeons? Did you mean that you had 5 12-core Xeons? Or 512 cores total?
winrid · almost 5 years ago
and here I am working on a big distributed system that has to handle 200k records a day (and hardly does successfully). sigh.
_wldu · almost 5 years ago
Turning JSON data into tabular data using jq was pretty neat. So many JSON APIs in use today, yet still a need for CSV and Excel docs.
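The JSON-to-CSV step can be sketched with jq's `@csv` filter. The input file and fields below are hypothetical, stand-ins for whatever an API returns (assumes jq is installed):

```shell
# Hypothetical API response: an array of JSON objects
printf '[{"name":"ada","score":3},{"name":"bob","score":7}]' > /tmp/users.json

# @csv turns each array into a properly quoted CSV row;
# -r emits the rows as raw text instead of JSON strings
jq -r '.[] | [.name, .score] | @csv' /tmp/users.json
# prints:
#   "ada",3
#   "bob",7
```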
tarun_anand · almost 5 years ago
Amazing work. Keep it up.
nmz · almost 5 years ago
You couldn't have used FS="\036" or "\r"?
nojito · almost 5 years ago
Why not just use data.table?

The solution would be much less error-prone and most likely much quicker as well.
gh123man · almost 5 years ago
Slightly off topic, but as a Swift developer (https://swift.org/) the usage of Swift/T in this project really confused me. Is Swift/T in any way related to Apple's Swift language?

The naming conflict makes googling the differences fairly challenging.
skanga · almost 5 years ago
Try mawk if you can. I find that it's even faster.
flatfilefan · almost 5 years ago
GNU Parallel + AWK = even less code to write.