As someone who is currently dealing with this sort of thing, I can tell you this article hits the nail on the head.<p>Most, heck something like 99.99%, of all the so-called big data I've dealt with is something I wouldn't even classify as small data. I've seen data feeds measured in KBs sent over to be handled as big data. It happens all the time. A simple data problem that could easily be solved with something like a small SQLite database is generally taken to 'the grid' these days. It reminds me of the XML days, when everything had to be XML. I mean every damn thing; these days it's NoSQL and big data.<p>People wrongly design their schemas just so the data can go into a NoSQL store, then use something like Pig to generate data for it. The net result is that they end up badly reinventing parts of SQL all over the place. If only they understood a little SQL and why it exists, they could save themselves all the pointless complexity they get into. Besides, avoiding SQL where it's appropriate creates all sorts of data problems in your system. You will go on endlessly reinventing ways of doing things SQL already offers while bloating your code. You will read through a big chunk of code only to figure out that the person actually intended something like a nested SELECT query, just implemented very badly.<p>Besides, I find much of this big data thing a total sham. Back in the day we would write Perl scripts to do all sorts of complex data processing (with SQL, of course). Heck, I've run some very big analytics systems and automation setups in Perl doing far more difficult things than people do with 'big data tools' today.<p>In larger corporations this has become the fashion now. If you want to be known as a great 'architect', all you need to do is bring in these pointless complexities. Ensure the setup becomes so complicated it can't be explained without a hundred pieces of jargon totally incomprehensible to anybody beyond your cubicle. That is how you get promoted to architect these days.
From the Berkeley paper on Facebook:<p><i>Nonetheless, large jobs are important too. Over 80% of the IO and over 90% of cluster cycles are consumed by less than 10% of the largest jobs (7% of the largest jobs in the Facebook cluster). These large jobs, in the clusters we considered, are typically revenue-generating critical production jobs feeding front-end applications.</i><p>So MR job characteristics might follow a power law distribution, and @mims is focusing on one end of the tail. Sure, that's cool!<p>But then @mims also selectively quotes the TC article, which ends with an excellent point that contradicts his thesis:<p><i>The big data fallacy may sound disappointing, but it is actually a strong argument for why we need even bigger data. Because the amount of valuable insights we can derive from big data is so very tiny, we need to collect even more data and use more powerful analytics to increase our chance of finding them.</i><p>I think @mims over-pursues the stupid Forbes/BI straw man here. As one would expect with data, the story is complicated. Mom and pop stores don't need to worry about Cloudera's latest offering, but companies working on the cutting edge of analysis still absolutely need tools like Hadoop, Impala, and Redshift.
I've maintained for a while now that the distinction isn't between "big" and "small" data, but between coarse and fine data. Now that everything is done through the web, previously common data sources (surveys, sales summaries, etc.) are being supplanted by microdata (web logs, click logs, etc.). It takes a different skill set to analyze noisy, machine-generated data than to analyze clean, survey-like data; it's a skill set biased more towards computational knowledge than classical experimental design, hence the shift in emphasis.
Of course one could see it as "IT's revenge" after Scott McNealy so famously said it was dead. There is a lot of power to be had by creating an interface for the customer and then keeping everything behind that interface 'obscure'. They have to have that interface to survive, and if they don't know what goes on behind it they have no way of discerning outrageous costs from reasonable ones. The current exemplar seems to be medical costs.<p>Back in the 60s there was this chamber of secrets called "the Machine Room", which held the "Mainframe" and various and sundry high priests who went in and out, and if you literally played your cards (as in punched cards) right, you could get a report on how sales or manufacturing was doing this month.<p>That got lost when everyone had a PC on their desk, and now some folks are trying to reclaim it :-)<p>That said, the article is still poorly argued. The cost of data management <i>is</i> fairly high. And generally a big chunk of that cost is the cost of specialists who provide business 'continuance', which is code for "makes sure that you can always get your data when you need it, and you can get the answers you need from it in a timely and repeatable fashion." That hasn't changed at all, and whether you have some youngster doing "IT" on the creaky Windows 2000 machine running BackOffice or you are using a SaaS company like Salesforce.com, data management is and will continue to be a mission-critical part of staying in business.
If I ever want to get rich, I'll set up shop convincing small businesses they need to do things the way Google does if they want to remain competitive.<p>Oracle has used exactly this business model to great success, and obscene profit, for over 30 years.
There's an important distinction to be made between the storage layer and the analysis layer. Something like HDFS can make sense as a storage layer once you hit the > 10TB range even if your average dataset for analysis is reasonably small (and it should be; 99% of the time you can get by with sampling down to single-machine size). That doesn't mean you need to be setting up all your analysis jobs to run via map-reduce; you can usually dump the dataset to a dedicated machine and do it all in one go with sequential algorithms. As a side benefit, you have access to algorithms that are really difficult to express efficiently as map-reduce (eg, computations over ordered time series).
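To make the last point a bit more concrete, here is a minimal sketch (Python, with hypothetical column names) of the kind of ordered-time-series computation that is trivial as a sequential scan on one machine but awkward to express as map-reduce, because each row's result depends on the previous row for the same user:
<pre><code>import csv
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)

def sessionize(path):
    """Yield (user_id, session_id, timestamp) for a log sorted by time."""
    last_seen = {}   # user_id -> timestamp of that user's previous event
    current = {}     # user_id -> session id of that user's current session
    next_id = 0
    with open(path) as f:
        for row in csv.DictReader(f):          # assumed columns: user_id, ts
            uid = row["user_id"]
            ts = datetime.fromisoformat(row["ts"])
            # start a new session if this is the user's first event, or the
            # gap since their previous event exceeds the threshold
            if uid not in last_seen or ts - last_seen[uid] > SESSION_GAP:
                next_id += 1
                current[uid] = next_id
            last_seen[uid] = ts
            yield uid, current[uid], ts
</code></pre>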
I think big data has made math sexy, and what is actually happening in the market is applied statistics and operations research being sold to small and medium-sized businesses under the guise of "big data".
I am grateful to finally see this in an article. The "big data" craze is being pushed in areas where it really doesn't make sense. We've been bit by the Big Data bug where I'm at, but it's not coming from the statisticians. It's usually the executives proposing a shift to big data.<p>People underestimate how much work it would be to shift an old server onto modern technologies and tell the statisticians to use MapReduce and NoSQL instead of SAS and SQL. If the Fortune 500 world has taken this long to catch on to R, imagine how long it'll take to completely change the DBMS and analysis software!
Sure, if you're dealing with 1GB of data it probably isn't worth spinning up a Hadoop cluster to run your analysis. However, if you already have Hadoop up and running for something that genuinely requires it, that 1GB job might make sense there. The data may already be in HDFS, and you already have the infrastructure there to manage and monitor jobs.<p>The references to Facebook & Yahoo running small jobs on huge clusters may be a little misleading. It may simply be the easiest place for them to deploy those jobs consistently.<p>But yeah... "Big Data" is a totally meaningless buzzword.
For most data, it is in fact a waste of money.<p>Personally, I load the data I play with into a PostgreSQL database on my laptop (if you have a Mac and want to do that quickly, you may want to check out the link I just submitted <a href="http://en.blog.guylhem.net/post/50310070182/running-postgresql-on-mac-osx-mountain-lion-in-2" rel="nofollow">http://en.blog.guylhem.net/post/50310070182/running-postgres...</a> )<p>You can do crazy things with current hardware specs, like loading all the data the World Bank offers for download, indexing it, and using it for regressions (I do). In 2013 you only need a laptop for that.<p>Most data is not big. Big data is "big" like in a gold rush, where the ones selling the tools make the biggest profits.<p>EDIT: Thanks for the postgresapp.com link! It is a little bit different: here I wanted to use the very same sources as Apple, without adding too much cruft (like a UI to start/stop the daemon, as I had seen in other packages). I also wanted to see for myself how hard it was to 'make it work' with OSX (quite easy, besides the missing AEP.make and the logfile error). It was basically an experiment in recompiling from the sources on Apple's open source website, while staying as close to the OSX spirit as possible (e.g. keeping the same user group, using dscl, using a launch daemon to start the daemon automatically during the boot sequence, as for Apache).<p>That being said, you're right, for most people postgresapp.com will be a simpler and faster way to run a PostgreSQL server :-)
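As a rough illustration of the regression workflow mentioned above (the database, table and column names here are made-up examples, and statsmodels is just one convenient choice), pulling an indicator out of a local PostgreSQL database and fitting a model on a laptop can be as simple as:
<pre><code>import pandas as pd
import statsmodels.api as sm
from sqlalchemy import create_engine

# connect to a local database; "worldbank", "indicators" and the column
# names are hypothetical stand-ins for whatever you actually loaded
engine = create_engine("postgresql:///worldbank")
df = pd.read_sql("SELECT gdp_per_capita, life_expectancy FROM indicators", engine)

# ordinary least squares on a single machine
X = sm.add_constant(df["gdp_per_capita"])
result = sm.OLS(df["life_expectancy"], X).fit()
print(result.summary())
</code></pre>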
As someone else who has been in the thick of some of the "big data" projects in the industry recently, I have to agree with the article.<p>One of the terms I learnt at PyData Silicon Valley in March is "medium data". Unless you are dealing with terabytes of RAM and exabytes of storage, Google style, the overhead of maintaining a cluster is something most (intelligent) people try to avoid.<p>When you can't avoid hundreds of machines, the cluster is a necessity and you design that way. But given where the Moore's law curve stands today, most organisations really don't need that.<p>You can rent servers on Amazon with 250 gigs of RAM for a few dollars an hour; they are specifically marketed for big data. It is possible to analyse the data fairly easily using tools like Pandas/Matplotlib and others in the scientific Python ecosystem.<p>These tools have been used by scientists and industry for a really long time; they just aren't advertised that way.<p>For instance, here is some analysis I was doing recently of children's names in the US since 1880, with 3 million records: <a href="http://nbviewer.ipython.org/53ec0c5a2fabcfebb358" rel="nofollow">http://nbviewer.ipython.org/53ec0c5a2fabcfebb358</a>. My Mac could handle it without even breaking a sweat.
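To give a flavour of that kind of single-machine analysis (the linked notebook is the real thing; the snippet below is only a sketch assuming a single CSV with columns name, sex, births, year), a few million rows is nothing for Pandas:
<pre><code>import pandas as pd

names = pd.read_csv("names_1880_onwards.csv")   # ~3M rows, fits in RAM easily

# total births per year, split by sex
births_by_year = names.pivot_table("births", index="year",
                                   columns="sex", aggfunc="sum")

# the ten most common names over the whole period
top10 = (names.groupby("name")["births"].sum()
              .sort_values(ascending=False)
              .head(10))
print(top10)
</code></pre>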
Even if the data isn't big, there can be a benefit from the Hadoop infrastructure. Say you have just 86,400 rows of data but each row takes 1 second to process. That adds up to 24 hours of elapsed time, and waiting for that run can be painful, especially if you are trying to experiment and iterate. With HDFS/MapReduce you can distribute that work across N machines and divide the elapsed time by N, speeding up the pace of iteration. I've worked on a project that had exactly this challenge before Hadoop was available, and so we had to invent our own crappy ways of distributing the data to the N machines, monitoring them, and collecting the results. Hadoop HDFS and MapReduce, with the JobTracker etc., would have been much better than what we came up with.
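For what it's worth, the distributed version of that pattern is not much code today. Here is a minimal Hadoop Streaming sketch, where process_row is a hypothetical stand-in for the expensive per-row work: each mapper reads rows from stdin and writes results to stdout, and Hadoop takes care of splitting the input across the N machines, monitoring them, and collecting the output (run it as a map-only job, i.e. with zero reducers).
<pre><code>#!/usr/bin/env python
# mapper.py -- run as a map-only Hadoop Streaming job so each input split
# (a chunk of the 86,400 rows) is processed on a different machine.
import sys

def process_row(line):
    # hypothetical placeholder for the ~1 second of real work per row
    return line.strip().upper()

for line in sys.stdin:
    sys.stdout.write(process_row(line) + "\n")
</code></pre>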
I've seen a fair number of startups that throw around how they are going to make big money by utilizing the data they gather (called "big data" regardless of size). It's all a bit of magical underpants thinking: we'll gather a bunch of people/users, we can't figure out how to make money off advertising or charging them, so we'll talk about how the "big data" they produce will be worth a fortune and people will pay to have access to it. I know some folks in the HR SaaS space who think this is how they'll hit $100m. It's just comedy.
"We don't have big data. Our data is small, and could be easily stored in a MySQL or even a flat file" said no dev team ever. Everyone is "just like Google" so they need NoSQL, scaling, clouds and so on.
As someone who runs jobs on giant clusters day in, day out, I just looked at my last job. It did indeed have input data of ~100GB. However, the size of the input data is misleading. The job does a lot of processing and generates ~5TB of intermediate data, and it took 800+ machine-hours to complete. If I'd run that on my desktop I would have been waiting a month for it to finish. On the cluster it took ~4 hours.<p>I had to smile at the statement "Is more data always better? Hardly". There is an old saying in the world of data scientists: there is no data like more data. Yes, its value may be diminishing, but when your competitor is trying to squeeze out a gain in the second decimal, you are probably better off taking in more data.<p>So the moral of the story is, it all really depends. People do get fired for buying clusters. Modern cluster management software tracks several utilization metrics, and someone, some day, is going to look at them and point out what a bad decision it was.
The reason the "big data" pimps can get away with this is that most of the people who should know (those who aren't DB programmers, DBAs, true scientists or engineers in the domain) don't know shit about data and are generally too fscking lazy to learn. So they buy into the latest wave of buzzwords and hype.
It's precious to read through almost every post in this thread complaining about 'big data' and saying that everyone can just use a normal relational database or whatever. But 'big data' has brought markets for HN-type entrepreneurs to exploit, and jobs and loads of prestige for HN-type engineers, whom I have never noticed to be shy about bragging about how much data is in their systems, without regard to whether that data is particularly meaningful.
I'm getting to the age where things start coming back under new branding. I remember in my childhood when my father would talk about bell-bottoms and how trendy they once were. Then they came back and he was shocked.<p>I remember Doc Martens. They're back.
I remember the gumby haircut. It's back.
I remember ripped jeans... also back.<p>Technology follows this cyclical trend as well; we just give it fancy names like Big Data, Cloud, and Anything-as-a-Service.
> Most data isn’t “big,” and businesses are wasting money pretending it is<p>Most business leaders are not rational, and we should stop pretending they are.
Related discussion from three weeks ago: <a href="https://news.ycombinator.com/item?id=5602727" rel="nofollow">https://news.ycombinator.com/item?id=5602727</a>
In an effort to <i>Store All the Things</i>, a lot of companies have talked themselves into a rhetorical corner of poorly fitting shoes. At this point, we've stopped wasting time when asked about Bigtable and NoSQL and instead demo their storage stack on a different framework until their eyes widen.<p>Then, when the questions come about how much engineering went into this "thing" that does such a good job of keeping so much data secure, we say it's built on Postgres.<p>As DevOps Borat says:
<a href="https://twitter.com/DEVOPS_BORAT/status/313322958997295104" rel="nofollow">https://twitter.com/DEVOPS_BORAT/status/313322958997295104</a><p>How well you can utilize it and how quickly is just as important as what kind of data you store in the first place.
Be wary about drawing conclusions from "most of the jobs were small." Most of my jobs are small -- because I'm running experiments so I won't have to redo the big one.<p>That said, I'm a huge proponent of running stuff simply at first. Few businesses will ever grow to the point that they need more than a single large database server and one or two backups. Don't waste your time prepping for something you'll probably never need, especially when fixing the problem when the time comes is only marginally more painful than doing it right in the first place.
For me, "big" data is about increasing the linkage between your data. It's not simply more data, but much richer, less formal data relationships. It's taking your sales data and linking it to your website clicks, then linking that to the weather (or whatever). Or you take something traditionally static and add a temporal dimension.<p>This kind of deep linking can't be measured in straight megabytes. A few gigs don't seem that large, but if it's a complex graph with a complex hypothesis, then, sure, that's big.
Most companies today are already using scaled up servers to host their medium size warehouses (think Teradata or Exadata). That approach is very expensive (> millions of dollars), only works well with well-defined data, and does not scale well beyond a few TBs.<p>Hadoop is not just about running large jobs on very large data. Hadoop also makes sense when trying to scale on commodity hardware or running ad hoc queries (which can target a small amount of data) on medium to large data sets.
Oracle writes shitty 'enterprise apps' (god I hate that phrase) that they sell to big companies, because their salesmen/women wear great attire and are good at mirroring dumb CEOs/CIOs, like the ones that run several companies I have worked for. Will someone please end this nonsense? At what point do usability, stability and utility become factors?
There are two ways to define BigData.<p>1. The accumulation, integration and analysis of a larger number of data sources.<p>2. A volume of data that presents challenges running analysis functions across it, due to the limits of the tools available.<p>1 is fraught with the kind of statistical pitfalls that are mentioned in the posted article. 2 describes a set of problems and boundaries that are time-sensitive. What was BigData in 2006 (to, say, LiveJournal or Digg) may no longer hold. As a data engineer, it's important to keep a skeptical eye on marketing and make sure we're delivering valuable solutions that increase the bottom line for our business, not just producing "ain't it cool" type correlations.
Sampling relatively few truly random data points from massive datasets, for analysis and modeling, is what "Big Data" is all about. The article would have you think that working with clusters or snippets of impossibly ginormous datasets is somehow less "Big", but that's sorta the point. Perhaps someone should inform the author that having more data available doesn't translate into working with more data.
> The “bigger” your data, the more false positives will turn up in it, when you’re looking for correlations<p>I think they are talking about the Sharpshooter Fallacy<p><a href="http://en.wikipedia.org/wiki/Texas_sharpshooter_fallacy" rel="nofollow">http://en.wikipedia.org/wiki/Texas_sharpshooter_fallacy</a>
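A toy illustration of that point, using pure noise with no real relationships: the more variables you compare, the more 'significant' correlations turn up by chance alone.
<pre><code>import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 50))        # 50 independent random variables

false_positives = 0
pairs = 0
for i in range(50):
    for j in range(i + 1, 50):
        pairs += 1
        r, p = stats.pearsonr(data[:, i], data[:, j])
        if p < 0.05:
            false_positives += 1

# expect roughly 5% of the 1,225 pairs to look "significant" despite the
# data being pure noise
print(false_positives, "of", pairs, "pairs correlate at p < 0.05")
</code></pre>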
Said this before, still want to see it fixed, as I can't stand to read the page with the huge grey box on the right side that disables scrolling without being moused over content. Last time though, I didn't have my environment info.<p>Windows 7 Ultimate SP1
Chrome Version 26.0.1410.64 m
I've been thinking the same thing as the premise of this article for a while now. More often, I think people just write horrible code and poorly designed systems that perform sluggishly and underwhelm... and then someone cues Mr. Big Data as the silver bullet.
Clarifying the scope of a project or of a data collection and analysis effort is paramount. You never want to attempt to boil the ocean. The key is to figure out the data that matters most for your company or organization's strategy.
Does anyone know how much indexed data Google has for their search? (Not the size of the database.) I'd bet it won't be over a few hundred TB, something that could fit on most desks in the not-too-distant future.
This article is the equivalent of "horse drawn carriages are perfectly adequate for most journeys, and much more pleasant and commodious to boot." Good luck with that, buddy.<p>You're not going to know what correlations are important and which are not until you study the data. Telling people to just collect the "important data" is like telling someone who has lost his keys just to go back to where he left them.<p>It's also more than a little insulting to FB and Yahoo to insist they are not web scale. The problem of small jobs on MR clusters is real, but even with small jobs, Hadoop turns out to be a lot more cost-effective than various other proprietary solutions which are your only real enterprise alternative.
The problem of small MR jobs is being solved by things like Cloudera Impala, which can run on top of raw HDFS to perform interactive queries.
Yeah. I'm a scientist that deals with huge datasets. <i>Huge</i>. I must admit that I do cringe a little every time I see the words 'big data'.<p>Disclaimer: I haven't read the post. Only the title.
Sometimes, though, you really do have lots of data and need appropriate solutions. At Quantcast, our cluster processes petabytes per day and our edge datacenters handle hundreds of thousands of transactions per second. In fact we recently open sourced our file system (QFS[1]), an alternative to HDFS that can up to double FS capacity on the same hardware. Although it's certainly true that not every company (or even most) needs all that horsepower, there are definitely some for whom it's the core of their business.<p>[1]. <a href="http://quantcast.github.io/qfs/" rel="nofollow">http://quantcast.github.io/qfs/</a>
What is the author of this article trying to say here?<p>><i>it appears that for both Facebook and Yahoo, those same clusters are unnecessary for many of the tasks which they’re handed. In the case of Facebook, most of the jobs engineers ask their clusters to perform are in the “megabyte to gigabyte” range (pdf), which means they could easily be handled on a single computer—even a laptop.</i><p>That Facebook or Yahoo could be run from a laptop?