With all the hipster tech being released recently, the headline statement unfortunately holds true for a lot of things.

We recently discussed new logging tools at work. It was either a redundant Amazon EC2 cluster with ElasticSearch for $50K monthly, or two large bare-metal servers with rsyslog and grep for $400 monthly. Log ingestion and search performance was roughly the same...

EDIT: To give everyone a sense of scale, those $200-each bare-metal servers are 2x Intel Xeon 6-core + 256GB RAM + 15x 10TB 7200 rpm. We retain logs for 30 days and handle 4-5TB per day.
I did some testing on the same (kind of) dataset and task.

First test: a single 2.9GB file

time rg Result all.pgn | sort --radixsort | uniq -c
13 [Result "*"]
1106547 [Result "0-1"]
1377248 [Result "1-0"]
1077663 [Result "1/2-1/2"]
rg Result all.pgn 1.12s user 0.55s system 99% cpu 1.680 total
sort --radixsort 3.87s user 0.37s system 71% cpu 5.911 total
uniq -c 2.69s user 0.02s system 45% cpu 5.909 total

Using Apache Flink and a naive implementation, it took 13.969 seconds.

Second test: same dataset, split between 4 files

time rg Result chessdata/* | awk -F ':' '{print $2}' - | sort --radixsort | uniq -c
13 [Result "*"]
1106547 [Result "0-1"]
1377248 [Result "1-0"]
1077663 [Result "1/2-1/2"]
rg Result chessdata/* 1.70s user 0.97s system 42% cpu 6.292 total
awk -F ':' '{print $2}' - 5.47s user 0.07s system 88% cpu 6.289 total
sort --radixsort 4.13s user 0.42s system 43% cpu 10.559 total
uniq -c 2.73s user 0.03s system 26% cpu 10.559 total

Flink: 12.724s

Conclusion: For this kind of workload, both approaches have comparable runtimes, even though Taco Bell programming has the upper hand (as it should for simply filtering a text file). It took me about equally long to implement both. I think both approaches have their use case.

I ran this locally on my laptop with 4 logical cores.
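For reference, a rough single-machine Python equivalent of the pipeline above; the file name is taken from the first test, and matching on the "[Result" tag prefix is an assumption about the PGN data:

    # Count the [Result ...] tag lines in a PGN dump, the same task the
    # rg | sort --radixsort | uniq -c pipeline performs above.
    from collections import Counter

    counts = Counter()
    with open("all.pgn", encoding="utf-8", errors="replace") as f:
        for line in f:
            if line.startswith("[Result"):
                counts[line.strip()] += 1

    for result, n in counts.most_common():
        print(f"{n:>10} {result}")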
A classic from 2015 along the same lines: Scalability, but at what COST?

http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html
The author is experimenting with 1.75 GB of data. At that scale, sure, a local machine will be faster. Hadoop's real use case, though, is when your data doesn't fit in memory, and even that is kind of debatable. It makes sense to measure performance with some prototypes and then settle on a final design rather than just use whatever AWS offers. Besides, packaged services in AWS are also a bit more costly than basic services like EC2 instances and network goodies.
This reminds me of my experience from a company-internal hackathon. My colleague started writing a Spark program to process the data we needed (a few hundred GB uncompressed). Before he finished writing it, I was able to process all the data on a single machine with a Unix pipeline. The computationally intensive steps were basically just grep, sort and uniq. When he finished the program, it couldn't run because of some operational issues on the cluster at the moment, so we never even got to compare speeds.

For me, the moral is that the cheap hardware saves money/time twice:

1. It's faster if a program can run on a single machine.

2. It's easier to write a program that runs on a single machine.

With this in mind, cloud works great for analytical data processing. Just start a big enough machine, download the data, do the computation, upload the result and turn the machine off. If you develop the program on a sample of the data so you can work locally, it will be even cheaper because you only use the powerful server for a short time.
The two approaches aren't necessarily mutually exclusive. Spark can easily shell out using pipe(). Plus, you can use that to compose and schedule arbitrarily large data sets through your bash pipeline across a multi-node cluster.

Beyond that, while the Unix tools are amazing for per-line FIFO-based processing, they really don't do a great job at anything requiring any sort of relational algebra.
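To illustrate the pipe() point, a minimal PySpark sketch (the input path and the grep command are assumptions, not anything from the article): each partition of the distributed dataset is streamed through the shell command on whichever executor holds it, and the aggregation step stays in Spark.

    # Sketch: shell out from Spark with RDD.pipe(), then aggregate in Spark.
    # The HDFS path and the grep command are assumptions.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pipe-sketch").getOrCreate()
    sc = spark.sparkContext

    lines = sc.textFile("hdfs:///data/chess/*.pgn")   # distributed input
    matched = lines.pipe("grep Result")               # runs on each executor;
                                                      # exit codes not checked by default
    counts = (matched.map(lambda l: (l.strip(), 1))
                     .reduceByKey(lambda a, b: a + b))

    for result, n in counts.collect():
        print(n, result)

    spark.stop()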
Very simple processing, not memory bound, tiny dataset - of course it's going to be faster locally when the command itself takes less time than the networking, distribution, coordination and collation overhead of using any distributed tool...
Once you get to the stage where your laptop is just not enough anymore (or your laptop has some spare cores you want to add to the processing as well), GNU parallel might be of use.

https://www.gnu.org/software/parallel/
Is there any benefit of

    cat files* | grep pattern

over this

    grep -h pattern files*

aside from result color highlighting?
When all you have is a hammer, every problem starts looking like a nail.

The basic premise is fine: if you have a simple problem, using simple tools will give you a good result. Here you have text files, you just want to iterate through them and find a result from ONE line that's the same in every file, then collate the results. No further analysis required.

Every problem in the world can be solved by a bash one-liner, right!?

There's an interesting dichotomy with bash scripts: one school says any bash script over 100 lines should be rewritten in Python, because it's already overcomplex. Another school says any Python script over 100 lines that's used daily should be rewritten in bash, so there are no delusions about it being easy to maintain.

The original article is from 2013 and doesn't try to do any optimization (I guess; the original article is unavailable at the time of writing this comment), so it would be interesting to see what you could do at the Hadoop end to make the query faster. I would imagine quite a lot.
We had a poorly performing service that read from a number of REST endpoints and wrote to S3 in a date-prefixed format. Offshore wrote 3,600 lines of code targeting Kinesis Firehose. By just piping the URL endpoints to a named pipe and cycling the S3 file in Python, my code was 55 lines and did the same thing without Kinesis. Wrapping things in GNU parallel and using bash flags, it handles any failure cases super gracefully, which is something the offshore code did not do. The offshore code had a global exception catch-all that would print the error and return a success exit code... but I guess someone got to put Kinesis on their resume.
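Not the actual 55-line script, of course, but a rough sketch of the shape described: producers (e.g. curl loops) write into a named pipe, and a small Python reader rolls the buffered data into date-prefixed S3 objects. The pipe path, bucket name, and rotation size are all assumptions.

    # Drain a named pipe and roll its contents into date-prefixed S3 objects.
    # Pipe path, bucket, and rotation threshold are assumptions; a real
    # service would reopen the FIFO and retry uploads on failure.
    import io
    import os
    import time
    from datetime import datetime, timezone

    import boto3

    FIFO_PATH = "/tmp/ingest.fifo"        # producers write lines here
    BUCKET = "my-ingest-bucket"           # assumed bucket name
    ROTATE_BYTES = 64 * 1024 * 1024       # start a new object past 64 MiB

    s3 = boto3.client("s3")

    if not os.path.exists(FIFO_PATH):
        os.mkfifo(FIFO_PATH)

    def flush(buf: io.BytesIO) -> None:
        """Upload the buffered data as a date-prefixed object and reset the buffer."""
        if buf.tell() == 0:
            return
        key = datetime.now(timezone.utc).strftime("%Y/%m/%d/%H%M%S")
        key += f"-{int(time.time() * 1000)}.log"
        buf.seek(0)
        s3.upload_fileobj(buf, BUCKET, key)
        buf.seek(0)
        buf.truncate(0)

    buf = io.BytesIO()
    with open(FIFO_PATH, "rb") as fifo:   # blocks until a producer connects
        for chunk in fifo:
            buf.write(chunk)
            if buf.tell() >= ROTATE_BYTES:
                flush(buf)
    flush(buf)                            # producers closed the pipe: flush the tail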
I maintain a very small command-line cheatsheet that I come back to for reference, mostly for data analysis tasks: https://tinyurl.com/tomercli
Been saying that for years. Also, get this: 99.999% of companies do not need "big data" or distributed systems of any kind. I feel like the old "cheap commodity hardware" pendulum swung way too far; more expensive, less "commodity" hardware can often be cheaper if correctly deployed. E.g., you don't need a distributed database if your database is below 1TB and QPS is reasonable (and what's "reasonable" can surprise you today, with large NVMe SSDs, hundreds of gigabytes of RAM, and 64-core machines being affordable).
This was a straw man article in 2014, it was a straw man article the other times it's been posted to HN in the intervening years, and it's still a straw man article in 2020. As noted in another comment here, the contemporary technology of Apache Flink really isn't far off from command-line tools running on a single machine. Meanwhile, HDFS has made a lot of progress on its overhead, particularly unnecessary buffer copies. There are datasets where a Hadoop approach makes sense. But not ones where the data fits in RAM on a single system, and no one has ever argued that it does.
While I personally would use a similar pipeline to OP's for such a small data set, saying Hadoop would take 50 minutes for this is just flat-out wrong. It shows a clear lack of understanding of how to use Hadoop.
Amen. You can do a lot with pipes, various utils (sed, awk, grep, GNU parallel, etc.), sockets, and so on. I see folks abuse Hadoop way too often for simple jobs.
If you're disappointed with the speed and complexity of your Hadoop cluster, and especially if you're trying to crack a bit, you should give ClickHouse a spin.
If you're doing Spark or Hadoop today and are a Python shop... you should definitely look at Dask: https://dask.org/

Works as well as Spark. Very lightweight. Works through Docker.

Integrated with Kubernetes from the ground up (runs on EKS/GKE, etc.).

And no serialization between Java/Python, fat-JAR stuff, etc.
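As a taste, a minimal dask.bag sketch of the grep/sort/uniq-style counting from earlier in the thread; the glob path is an assumption, and by default this runs on local workers, with the same code able to target a distributed cluster via dask.distributed.

    # Count distinct [Result ...] lines with dask.bag; the glob is an assumption.
    import dask.bag as db

    lines = db.read_text("chessdata/*.pgn")                   # lazily partitioned
    counts = (lines.filter(lambda line: line.startswith("[Result"))
                   .map(str.strip)
                   .frequencies())                            # (value, count) pairs

    for result, n in counts.compute():
        print(n, result)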
Command-line tools like grep, awk, sed, etc. are great for structured, line-based files like logs. For JSON documents I can add a recommendation for jq:

https://stedolan.github.io/jq/
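For example, here's a sketch of running newline-delimited JSON logs through jq from Python; the file name, field names, and filter are assumptions, and the same filter works just as well directly in a shell pipeline.

    # Filter newline-delimited JSON logs with jq, driven from Python.
    # File name, field names, and the filter itself are assumptions.
    import subprocess

    JQ_FILTER = 'select(.level == "error") | .message'   # -r prints raw strings

    with open("app.ndjson", "rb") as logs:
        proc = subprocess.run(
            ["jq", "-r", JQ_FILTER],
            stdin=logs,
            capture_output=True,
            check=True,
        )

    print(proc.stdout.decode("utf-8", errors="replace"), end="")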
Cloud computing is kind of a joke. Yeah, keep paying someone for shared "virtual computers"; that sounds suspiciously similar to shared hosting from a decade or two ago... Oh, but this is different, you get isolation from containers/VMs! Yeah, OK; meanwhile, new exploits emerge every couple of weeks. It's like tech-debt ideology on steroids... just keep pumping out instances until the company either goes hyperbolic or goes bankrupt. Realistically, just buy a few physical servers and actually work to build efficiency into the system instead of just throwing compute at your public-facing web app.

I recently bought a Dell R710 just for fun and was pleasantly surprised that even days after spinning up a bunch of VMs, I don't have a 30GB logfile of all the failed attempts at getting into my instance (this was my experience recently with two cloud providers!).

It'll be interesting to see how the market reacts when you have a "first-of-its-kind" massive, massive security breach that affects popular "pure play" internet companies hosted on top of the mythical "cloud."

Seriously, $READER, look at your cloud-computing-dependent startup and calculate egress costs for your storage as if you HAD to stop using cloud tomorrow. How much does it cost you? How could you adapt? It's designed to keep you dependent on third parties... Idk, IMO it is really not great.