Take this post with a grain of salt, since I have the zeal of a recently saved sinner, but you should try using Hive and Hadoop for this sort of thing.<p>We recently switched from a workflow very similar to the one you describe to Hive on Amazon's Elastic MapReduce. Hive puts a SQL-like layer of abstraction over exactly this kind of work. Instead of doing the sorting and merging by hand, you write it as a series of joins. It feels like writing SQL, except that the underlying implementation works almost exactly like what you're already doing.<p>Integrating simple Ruby scripts for JSON processing was also trivial.<p>Elastic MapReduce has also meant near-zero infrastructure and management overhead for us (besides the 10% Amazon charges on top of the machine instances). We use S3 for all data input and output, which is perfect for us.<p>Even when running on a single machine, Hive was a big win in development time, and the jobs seemed only slightly slower than using Unix utilities on big text files. It's almost a bonus that we can also scale out to dozens of machines for a huge speedup. A job that took several hours on a single machine finished in less than five minutes on a cluster, and used only a few hours of EC2 machine time in total. Cheap and easy!
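<p>To make the "sorting and merging by hand" comparison concrete, here is a rough Python sketch of a sort-merge inner join over two small in-memory tables. The table names, fields, and the Hive query in the comment are made up for illustration, and this is not how Hive is implemented internally. It only shows the sort-then-merge idea that a join expresses in one line.

```python
# Hypothetical Hive equivalent (table and column names are assumptions):
#   SELECT u.id, u.name, e.action
#   FROM users u JOIN events e ON (u.id = e.user_id);

def sort_merge_join(left, right):
    """Inner-join two lists of (key, value) rows on their key."""
    # Step 1: sort both inputs by key, like running `sort` on each file.
    left = sorted(left, key=lambda row: row[0])
    right = sorted(right, key=lambda row: row[0])

    joined = []
    i = j = 0
    # Step 2: walk both sorted inputs in lockstep, like `join` or a manual merge.
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1  # no matching key on the right, skip this left row
        elif left_key > right_key:
            j += 1  # no matching key on the left, skip this right row
        else:
            # Find the run of right rows sharing this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == left_key:
                j_end += 1
            # Pair every left row with that key against the whole run.
            while i < len(left) and left[i][0] == left_key:
                for _, right_value in right[j:j_end]:
                    joined.append((left_key, left[i][1], right_value))
                i += 1
            j = j_end
    return joined


# Made-up sample data: (user_id, name) and (user_id, action).
users = [(2, "bob"), (1, "alice"), (3, "carol")]
events = [(1, "login"), (3, "purchase"), (1, "logout"), (4, "login")]

for row in sort_merge_join(users, events):
    print(row)
```

Users with no events and events with no matching user are dropped, which is what an inner join does. With Hive you write just the query in the comment, and the sorting and merging happen for you across however many machines you run it on.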