
Running Awk in parallel to process 256M records

358 points · by ketanmaheshwari · almost 5 years ago

22 comments

tetha · almost 5 years ago
Hm. I'm fully aware that I'm currently turning into a bearded DBA. And I may be just misreading the article and I probably don't understand the article.

But, I started being somewhat confused by something:

> Fortunately, I had access to a large-memory (24 T) SGI system with 512-core Intel Xeon (2.5GHz) CPUs. All the IO is memory (/dev/shm) bound ie. the data is read from and written to /dev/shm.

> The total data size is 329GB.

At first glance, that's an awful lot of hardware for a ... decently sized but not awfully large dataset. We're dealing with datasets that size at 32G or 64G of RAM, just a wee bit less.

The article presents a lot more AWK knowledge than I have. I'm impressed by that. I acknowledge that.

But I'd probably put all of that into a postgres instance, compute indexes and rely on automated query optimization and parallelization from there. Maybe tinker with pgstorm to offload huge index operations to a GPU. A lot of the shown scripting would be done by postgres, the parallelization is done automatically based on indexes, while eliminating the string serializations.

I do agree with the underlying sentiment of "We don't need hadoop". I'm impressed that AWK goes so far. I'd still recommend postgres in this case as a first solution. Maybe I just work with too many silly people at the moment.
tobias2014 · almost 5 years ago
I'm sure that with tools like MPI-Bash [1] and more generally libcircle [2] many embarrassingly parallelizable problems can easily be tackled with standard *nix tools.

[1] https://github.com/lanl/MPI-Bash

[2] http://hpc.github.io/libcircle/
tannhaeuser · almost 5 years ago
It's odd that TFA has this focus on performance but doesn't mention *which* awk implementation was used; at least I haven't found any mention of it. There are 3-4 implementations in mainstream use: nawk (the one true awk, an ancient version of which is installed on macOS by default), mawk (installed on e.g. Ubuntu by default), gawk (on RHEL by default, last I checked), or busybox awk. Tip: mawk is much faster than the others, and to get performance out of gawk you should use LANG=C (and also because of crashes with complex regexps in Unicode locales in some versions of gawk 3 and 4).
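A minimal sketch of the locale tip, using hypothetical throwaway data (the effect is most pronounced with gawk, whose regex engine is multibyte-aware in UTF-8 locales; substitute `gawk` for `awk` below to measure it on your system):

```shell
# Build a throwaway two-column file (hypothetical test data)
seq 1000000 | awk '{print $1, $1 * 2}' > /tmp/sample.txt

# Count rows whose second field ends in "42". Prefixing LC_ALL=C forces the
# C locale, which lets gawk skip multibyte character handling while matching.
LC_ALL=C awk '$2 ~ /42$/ { n++ } END { print n }' /tmp/sample.txt
# prints 20000
```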
hidiegomariani · almost 5 years ago
This somewhat reminds me of Taco Bell programming: http://widgetsandshit.com/teddziuba/2010/10/taco-bell-programming.html
Upvoter33 · almost 5 years ago
I've always wanted to build a parallel awk. And call it pawk. And have an O'Reilly book about it. With a chicken on the cover. pawk, pawk, pawk! This is a true story, sadly.
svnpenn · almost 5 years ago
    !($1 in a) && FILENAME ~ /aminer/ { print }

This uses a regular expression. As regex is not actually needed in this case, you might be able to get better performance with something like this:

    !($1 in a) && index(FILENAME, "aminer") != 0 { print }
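A quick sanity check that the two filters select the same lines (the filenames and data here are hypothetical, not from the article; with no dedup array populated, `!($1 in a)` is always true):

```shell
# Two hypothetical input files; only the one named like "aminer" should match
mkdir -p /tmp/awkdemo
printf 'x 1\ny 2\nz 3\n' > /tmp/awkdemo/aminer_part1.txt
printf 'x 9\nw 4\n'      > /tmp/awkdemo/other.txt
cd /tmp/awkdemo

# Regex version and index() version should print identical output
awk '!($1 in a) && FILENAME ~ /aminer/ { print }' aminer_part1.txt other.txt
awk '!($1 in a) && index(FILENAME, "aminer") != 0 { print }' aminer_part1.txt other.txt
```

Both commands print only the three lines from `aminer_part1.txt`.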
co_dh · almost 5 years ago
I like the idea of using AWK for this. But you could give kdb/q a try. 250M rows is nothing for kdb, and it seems you can afford the license.
mjcohen · almost 5 years ago
gawk has been my go-to text processing program for many years. I have written a number of multi-thousand-line programs in it. I always use the lint option; it catches many of my errors. One of these programs had to read a 300,000,000-byte file into a single string so it could be searched. The file was in 32-byte lines. At first, I read each line in and appended it to the result string, but that took way too long, since the result string was reallocated each time it was appended to. So I read in about 1000 of the 32-byte lines, appending them to a local string. This 32000-byte string was then appended to the result, so the expensive append was only done about 10000 times. Worked fine.
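The buffering trick described above can be sketched in awk. This is a small-scale, hypothetical reconstruction (1000 short lines instead of the commenter's ~300 MB file), but the chunking structure is the same:

```shell
# Generate 1000 fixed-width lines of hypothetical test data
awk 'BEGIN { for (i = 1; i <= 1000; i++) printf "line%04d\n", i }' > /tmp/lines.txt

# Appending every line directly to one result string forces a reallocation per
# append (quadratic overall). Buffering ~100 lines into a chunk first, then
# appending the chunk, cuts the number of large reallocations by ~100x.
awk '
{
    buf = buf $0
    if (++n % 100 == 0) { result = result buf; buf = "" }
}
END {
    result = result buf              # flush the final partial chunk
    print length(result)             # 1000 lines x 8 chars = 8000
}' /tmp/lines.txt
```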
FDSGSG · almost 5 years ago
Spending *minutes* on these tasks on hardware like this is pretty silly. awk is fine if these are just one-off scripts where development time is the priority; otherwise you're wasting tons of compute time.

Querying things like these on such a small dataset should take seconds, not minutes.
ineedasername · almost 5 years ago
Definitely an under-appreciated tool. Very useful for one-off tasks that would take a fair bit longer to code in something like python.
Zeebrommer · almost 5 years ago
I am often impressed by the things that can be done with these old-school UNIX tools. I'm trying to learn a few of them, and the most difficult part is these very implicit syntax constructions. How is the naive observer to know that in bash `$(something)` is command substitution, but in a Makefile `$(something)` is just a normal variable? With `awk`, `sed` and friends it gets even worse, of course.

Is the proper answer 'just learn it'? Are these tools one of those things (like musical instruments or painting) where the initial learning phase is tedious and frustrating, but the potential is basically limitless?
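The bash/Makefile contrast above can be shown side by side. A minimal sketch (the file `/tmp/demo.mk` and variable names are made up for illustration; assumes `make` is installed):

```shell
# In bash, $(...) is command substitution: it runs the command and
# substitutes its output
year=$(date +%Y)
echo "bash: $year"

# In a Makefile the identical syntax is plain variable expansion; to run a
# shell command you need make's $(shell ...) function instead
printf 'YEAR := $(shell date +%%Y)\nall:\n\t@echo "make: $(YEAR)"\n' > /tmp/demo.mk
make -f /tmp/demo.mk
```

Both commands print the current year, but only because the Makefile explicitly routes through `$(shell ...)`; a bare `$(date +%Y)` in make would expand an (empty) variable named `date +%Y`.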
arendtio · almost 5 years ago
First, I think it is great that you found a tool that suits your needs. A few weeks ago I was mangling some data too (just about 17 million records) and would like to contribute my experience.

My tools of choice were awk, R, and Go (in that order). Sometimes I could calculate something within a few seconds with awk. But for various calculations, R proved to be a lot faster. At some point, I reached a problem where the simple R implementation I borrowed from Stack Overflow (which was supposed to be much faster than the other posted solutions) did not satisfy my expectations, and I spent 4 hours writing an implementation in Go which was an order of magnitude faster (I think it was about 20 minutes vs. 20 seconds).

So my advice is to broaden your toolset. When you reach the point where a single execution of your awk program takes 48 minutes, it might be worth considering another tool. However, that doesn't mean awk isn't a good tool; I still use it for simple things, as writing 2 lines in awk is much faster than writing 30 in Go for the same task.
schmichael · almost 5 years ago
https://mobile.twitter.com/awkdb was a joke account made in frustration by a coworker trying to operate a Hadoop cluster almost a decade ago. Maybe it's time to hand over the account...
gautamcgoel · almost 5 years ago
Your system had 512-core Xeons? Did you mean that you had 5 12-core Xeons? Or 512 cores total?
winrid · almost 5 years ago
and here I am working on a big distributed system that has to handle 200k records a day (and hardly does successfully). sigh.
_wldu · almost 5 years ago
Turning JSON data into tabular data using jq was pretty neat. So many JSON APIs in use today, yet still a need for CSV and Excel docs.
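The JSON-to-CSV step can be sketched with jq's `@csv` filter. The input file and fields below are hypothetical, stand-ins for whatever an API returns (assumes jq is installed):

```shell
# Hypothetical API response: an array of JSON objects
printf '[{"name":"ada","score":3},{"name":"bob","score":7}]' > /tmp/users.json

# @csv turns each array into a properly quoted CSV row;
# -r emits the rows as raw text instead of JSON strings
jq -r '.[] | [.name, .score] | @csv' /tmp/users.json
# prints:
#   "ada",3
#   "bob",7
```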
tarun_anand · almost 5 years ago
Amazing work. Keep it up.
nmz · almost 5 years ago
You couldn't have used FS="\036" or "\r"?
nojito · almost 5 years ago
Why not just use data.table?

The solution would be much less error-prone and most likely much quicker as well.
gh123man · almost 5 years ago
Slightly off topic, but as a Swift developer (https://swift.org/) the usage of Swift/T in this project really confused me. Is Swift/T in any way related to Apple's Swift language?

The naming conflict makes googling the differences fairly challenging.
skanga · almost 5 years ago
Try mawk if you can. I find that it's even faster.
flatfilefan · almost 5 years ago
GNU Parallel + AWK = even less code to write.