With all the hipster tech being released recently, the headline statement unfortunately holds true for a lot of things.

We recently discussed new logging tools at work. It was either a redundant Amazon EC2 cluster with ElasticSearch for $50K monthly, or two large bare-metal servers with rsyslog and grep for $400 monthly. Log ingestion and search performance was roughly the same...

EDIT: To give everyone a sense of scale, those $200-each bare-metal servers are 2x Intel Xeon 6-core + 256GB RAM + 15x 10TB 7200 rpm. We retain logs for 30 days and handle 4-5TB per day.
I did some testing on the same (kind of) dataset and task.

First test: a single 2.9GB file

time rg Result all.pgn | sort --radixsort | uniq -c
13 [Result "*"]
1106547 [Result "0-1"]
1377248 [Result "1-0"]
1077663 [Result "1/2-1/2"]
rg Result all.pgn 1.12s user 0.55s system 99% cpu 1.680 total
sort --radixsort 3.87s user 0.37s system 71% cpu 5.911 total
uniq -c 2.69s user 0.02s system 45% cpu 5.909 total

Using Apache Flink and a naive implementation, it took 13.969 seconds.

Second test: same dataset, split between 4 files

time rg Result chessdata/* | awk -F ':' '{print $2}' - | sort --radixsort | uniq -c
13 [Result "*"]
1106547 [Result "0-1"]
1377248 [Result "1-0"]
1077663 [Result "1/2-1/2"]
rg Result chessdata/* 1.70s user 0.97s system 42% cpu 6.292 total
awk -F ':' '{print $2}' - 5.47s user 0.07s system 88% cpu 6.289 total
sort --radixsort 4.13s user 0.42s system 43% cpu 10.559 total
uniq -c 2.73s user 0.03s system 26% cpu 10.559 total

Flink: 12.724s

Conclusion: For this kind of workload, both approaches have comparable runtimes, even though Taco Bell programming has the upper hand (as it should for simply filtering a text file). It took me about equally long to implement both. I think both approaches have their use case.

I ran this locally on my laptop with 4 logical cores.
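For reference, a rough single-machine Python equivalent of the pipeline above; the file name is taken from the first test, and matching on the "[Result" tag prefix is an assumption about the PGN data:

    # Count the [Result ...] tag lines in a PGN dump, the same task the
    # rg | sort --radixsort | uniq -c pipeline performs above.
    from collections import Counter

    counts = Counter()
    with open("all.pgn", encoding="utf-8", errors="replace") as f:
        for line in f:
            if line.startswith("[Result"):
                counts[line.strip()] += 1

    for result, n in counts.most_common():
        print(f"{n:>10} {result}")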
A classic from 2015 along the same lines: Scalability, but at what COST?

http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html
The author is experimenting with 1.75 GB of data. At that scale, sure, a local machine will be faster. Hadoop's real use case, though, is when your data doesn't fit in memory, and even that is kind of debatable. It makes sense to measure performance with some prototypes and then settle on a final design rather than just use whatever AWS offers. Besides, packaged services in AWS are also a bit more costly than basic services like EC2 instances and network goodies.
This reminds me of my experience from a company-internal hackathon. My colleague started writing a Spark program to process the data we needed (a few hundred GB uncompressed). Before he finished writing it, I was able to process all the data on a single machine with a Unix pipeline. The computationally intensive steps were basically just grep, sort and uniq. When he finished the program, it couldn't run because of some operational issues on the cluster at the moment, so we never even got to compare speeds.

For me, the moral is that the cheap hardware saves money/time twice:

1. It's faster if a program can run on a single machine.

2. It's easier to write a program that runs on a single machine.

With this in mind, cloud works great for analytical data processing. Just start a big enough machine, download the data, do the computation, upload the result and turn the machine off. If you develop the program on a sample of the data so you can work locally, it will be even cheaper because you only use the powerful server for a short time.
The two approaches aren't necessarily mutually exclusive. Spark can easily shell out using pipe(). Plus, you can use that to compose and schedule arbitrarily large data sets through your bash pipeline across a multi-node cluster.

Beyond that, while the Unix tools are amazing for per-line FIFO-based processing, they really don't do a great job at anything requiring any sort of relational algebra.
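To illustrate the pipe() point, a minimal PySpark sketch (the input path and the grep command are assumptions, not anything from the article): each partition of the distributed dataset is streamed through the shell command on whichever executor holds it, and the aggregation step stays in Spark.

    # Sketch: shell out from Spark with RDD.pipe(), then aggregate in Spark.
    # The HDFS path and the grep command are assumptions.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pipe-sketch").getOrCreate()
    sc = spark.sparkContext

    lines = sc.textFile("hdfs:///data/chess/*.pgn")   # distributed input
    matched = lines.pipe("grep Result")               # runs on each executor;
                                                      # exit codes not checked by default
    counts = (matched.map(lambda l: (l.strip(), 1))
                     .reduceByKey(lambda a, b: a + b))

    for result, n in counts.collect():
        print(n, result)

    spark.stop()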
Very simple processing, not memory bound, tiny dataset - of course it's going to be faster locally when the command itself takes less time than the networking, distribution, coordination and collation overhead of using any distributed tool...
Once you get to the stage where your laptop is just not enough anymore (or your laptop has some spare cores you want to add to the processing as well), GNU parallel might be of use.

https://www.gnu.org/software/parallel/
Is there any benefit of

    cat files* | grep pattern

over this

    grep -h pattern files*

aside from result color highlighting?
When all you have is a hammer, every problem starts looking like a nail.

The basic premise is fine: if you have a simple problem, using simple tools will give you a good result. Here you have text files, you just want to iterate through them and find a result from ONE line that's the same in every file, then collate the results. No further analysis required.

Every problem in the world can be solved by a bash one-liner, right!?

There's an interesting dichotomy with bash scripts: one school says any bash script over 100 lines should be rewritten in Python, because it's already overcomplex. Another school says any Python script over 100 lines that's used daily should be rewritten in bash, so there are no delusions about it being easy to maintain.

The original article is from 2013 and doesn't try to do any optimization (I guess; the original article is unavailable at the time of writing this comment), so it would be interesting to see what you could do at the Hadoop end to make the query faster. I would imagine quite a lot.
We had a poorly performing service that read from a number of REST endpoints and wrote to S3 in a date-prefixed format. Offshore wrote 3,600 lines of code targeting Kinesis Firehose. By just piping the URL endpoints to a named pipe and cycling the S3 file in Python, my code was 55 lines and did the same thing without Kinesis. Wrapping things in GNU parallel and using bash flags, it handles any failure cases super gracefully, which is something the offshore code did not do. The offshore code had a global exception catch-all that would print the error and return a success exit code... but I guess someone got to put Kinesis on their resume.
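Not the actual 55-line script, of course, but a rough sketch of the shape described: producers (e.g. curl loops) write into a named pipe, and a small Python reader rolls the buffered data into date-prefixed S3 objects. The pipe path, bucket name, and rotation size are all assumptions.

    # Drain a named pipe and roll its contents into date-prefixed S3 objects.
    # Pipe path, bucket, and rotation threshold are assumptions; a real
    # service would reopen the FIFO and retry uploads on failure.
    import io
    import os
    import time
    from datetime import datetime, timezone

    import boto3

    FIFO_PATH = "/tmp/ingest.fifo"        # producers write lines here
    BUCKET = "my-ingest-bucket"           # assumed bucket name
    ROTATE_BYTES = 64 * 1024 * 1024       # start a new object past 64 MiB

    s3 = boto3.client("s3")

    if not os.path.exists(FIFO_PATH):
        os.mkfifo(FIFO_PATH)

    def flush(buf: io.BytesIO) -> None:
        """Upload the buffered data as a date-prefixed object and reset the buffer."""
        if buf.tell() == 0:
            return
        key = datetime.now(timezone.utc).strftime("%Y/%m/%d/%H%M%S")
        key += f"-{int(time.time() * 1000)}.log"
        buf.seek(0)
        s3.upload_fileobj(buf, BUCKET, key)
        buf.seek(0)
        buf.truncate(0)

    buf = io.BytesIO()
    with open(FIFO_PATH, "rb") as fifo:   # blocks until a producer connects
        for chunk in fifo:
            buf.write(chunk)
            if buf.tell() >= ROTATE_BYTES:
                flush(buf)
    flush(buf)                            # producers closed the pipe: flush the tail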
I maintain a very small command-line cheatsheet that I come back to for reference, mostly for data analysis tasks: https://tinyurl.com/tomercli
Been saying that for years. Also, get this: 99.999% of companies do not need "big data" or distributed systems of any kind. I feel like the old "cheap commodity hardware" pendulum swung way too far; more expensive, less "commodity" hardware can often be cheaper if correctly deployed. E.g., you don't need a distributed database if your database is below 1TB and QPS is reasonable (and what's "reasonable" can surprise you today, with large NVMe SSDs, hundreds of gigabytes of RAM, and 64-core machines being affordable).
This was a straw man article in 2014, it was a straw man article the other times it's been posted to HN in the intervening years, and it's still a straw man article in 2020. As noted in another comment here, the contemporary technology of Apache Flink really isn't far off from command-line tools running on a single machine. Meanwhile, HDFS has made a lot of progress on its overhead, particularly unnecessary buffer copies. There are datasets where a Hadoop approach makes sense. But not ones where the data fits in RAM on a single system, and no one has ever argued that it does.
While I personally would use a similar pipeline to OP's for such a small data set, saying Hadoop would take 50 minutes for this is just flat-out wrong. It shows a clear lack of understanding of how to use Hadoop.
Amen. You can do a lot with pipes, various utils (sed, awk, grep, GNU parallel, etc.), sockets, and so on. I see folks abuse Hadoop way too often for simple jobs.
If you're disappointed with the speed and complexity of your Hadoop cluster, and especially if you're trying to crack a bit, you should give ClickHouse a spin.
If you're doing Spark or Hadoop today and are a Python shop... you should definitely look at Dask: https://dask.org/

Works as well as Spark. Very lightweight. Works through Docker.

Integrated with Kubernetes from the ground up (runs on EKS/GKE, etc.).

And no serialization between Java/Python, fat-JAR stuff, etc.
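As a taste, a minimal dask.bag sketch of the grep/sort/uniq-style counting from earlier in the thread; the glob path is an assumption, and by default this runs on local workers, with the same code able to target a distributed cluster via dask.distributed.

    # Count distinct [Result ...] lines with dask.bag; the glob is an assumption.
    import dask.bag as db

    lines = db.read_text("chessdata/*.pgn")                   # lazily partitioned
    counts = (lines.filter(lambda line: line.startswith("[Result"))
                   .map(str.strip)
                   .frequencies())                            # (value, count) pairs

    for result, n in counts.compute():
        print(n, result)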
Command-line tools like grep, awk, sed, etc. are great for structured, line-based files like logs. For JSON documents I can add a recommendation for jq:

https://stedolan.github.io/jq/
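For example, here's a sketch of running newline-delimited JSON logs through jq from Python; the file name, field names, and filter are assumptions, and the same filter works just as well directly in a shell pipeline.

    # Filter newline-delimited JSON logs with jq, driven from Python.
    # File name, field names, and the filter itself are assumptions.
    import subprocess

    JQ_FILTER = 'select(.level == "error") | .message'   # -r prints raw strings

    with open("app.ndjson", "rb") as logs:
        proc = subprocess.run(
            ["jq", "-r", JQ_FILTER],
            stdin=logs,
            capture_output=True,
            check=True,
        )

    print(proc.stdout.decode("utf-8", errors="replace"), end="")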
Cloud computing is kind of a joke. Yeah, keep paying someone for shared "virtual computers"; that sounds suspiciously similar to shared hosting from a decade or two ago... Oh, but this is different, you get isolation from containers/VMs! Yeah, OK; meanwhile, new exploits emerge every couple of weeks. It's like tech-debt ideology on steroids... just keep pumping out instances until the company either goes hyperbolic or goes bankrupt. Realistically, just buy a few physical servers and actually work to build efficiency into the system instead of just throwing compute at your public-facing web app.

I recently bought a Dell R710 just for fun and was pleasantly surprised that even days after spinning up a bunch of VMs, I don't have a 30GB logfile of all the failed attempts at getting into my instance (this was my experience recently with two cloud providers!).

It'll be interesting to see how the market reacts when you have a "first-of-its-kind" massive, massive security breach that affects popular "pure play" internet companies hosted on top of the mythical "cloud."

Seriously, $READER, look at your cloud-computing-dependent startup and calculate egress costs for your storage as if you HAD to stop using cloud tomorrow. How much does it cost you? How could you adapt? It's designed to keep you dependent on third parties... Idk, IMO it is really not great.