Big data: are we making a big mistake?

199 points by pietro about 11 years ago

25 comments

nemesisj about 11 years ago
Another conclusion to draw from this article (which I really enjoyed, by the way) is that Big Data has been turned into one of the most abstract buzzwords ever. You thought &quot;cloud&quot; was bad? &quot;Big Data&quot; is far worse in its lack of specificity.<p>I can&#x27;t count the number of times I&#x27;ll be talking to some sales rep and they&#x27;ll describe how they scan the data within whatever application they&#x27;re demoing and &quot;suggest&quot; items using &quot;big data techniques&quot;. In almost all cases they&#x27;re talking about a few thousand or hundred thousand records, tops.<p>I&#x27;ve found that when non-hardcore techies talk about Big Data, what they really mean is &quot;they have some data&quot; vs before, when they had zero data.<p>From the article:<p><i>&quot;Consultants urge the data-naive to wise up to the potential of big data. A recent report from the McKinsey Global Institute reckoned that the US healthcare system could save $300bn a year – $1,000 per American – through better integration and analysis of the data produced by everything from clinical trials to health insurance transactions to smart running shoes.&quot;</i><p>What these consultants mean is that by having just some data compared to the silo&#x27;d data that is the norm in US healthcare, they could save a lot, and they&#x27;re right. My previous company had a large data set (20+ million patients) and we&#x27;d find millions of dollars of savings opportunities for every hospital we implemented in, but that&#x27;s because we had the data, not because we were running some kind of non-causal correlation analysis like the article references. It was just because we could actually run queries on a data set.<p>-----<p>Off Topic - how annoying is it that when you copy &amp; paste from the FT, they preface your copy with the following text?<p><i>High quality global journalism requires investment.
Please share this article with others using the link below, do not cut &amp; paste the article. See our Ts&amp;Cs and Copyright Policy for more detail. Email ftsales.support@ft.com to buy additional rights. <a href="http://www.ft.com/cms/s/2/21a6e7d8-b479-11e3-a09a-00144feabdc0.html#ixzz2xSKoQYaW*" rel="nofollow">http:&#x2F;&#x2F;www.ft.com&#x2F;cms&#x2F;s&#x2F;2&#x2F;21a6e7d8-b479-11e3-a09a-00144feabd...</a></i>
amirmc about 11 years ago
<i>&quot;But while big data promise much to scientists, entrepreneurs and governments, they are doomed to disappoint us if we ignore some very familiar statistical lessons.<p>“There are a lot of small data problems that occur in big data,” says Spiegelhalter. “They don’t disappear because you’ve got lots of the stuff. They get worse.”&quot;</i><p>This should be the main learning point. Humans can be astonishingly bad at dealing with stats and biases, which can lead to erroneous decisions being made. If you want an example where such decisions by very smart people can have catastrophic consequences, look up the Challenger disaster [1].<p>I rarely see people stating their assumptions upfront, which doesn&#x27;t help the problem (I guess it&#x27;s not cool to admit potential weaknesses). The more people&#x2F;companies that get into &#x27;big data&#x27; (without adequate training) the more false positives we&#x27;re going to see.<p>[1] <a href="http://en.wikipedia.org/wiki/Space_Shuttle_Challenger_disaster" rel="nofollow">http:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Space_Shuttle_Challenger_disast...</a>
sitkack about 11 years ago
This article reminds me of the argument [0] between Noam Chomsky [1] and Peter Norvig [2]. TL;DR (paraphrased with hyperbole) Chomsky claims the statistical AI of Norvig is a fancy sideshow that doesn&#x27;t understand _why_ it is doing a thing. It just throws gigabytes of data at an ensemble and comes out with an answer.<p>[0] - <a href="http://www.theatlantic.com/technology/archive/2012/11/noam-chomsky-on-where-artificial-intelligence-went-wrong/261637/" rel="nofollow">http:&#x2F;&#x2F;www.theatlantic.com&#x2F;technology&#x2F;archive&#x2F;2012&#x2F;11&#x2F;noam-c...</a><p>[1] - <a href="http://en.wikipedia.org/wiki/Noam_Chomsky" rel="nofollow">http:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Noam_Chomsky</a><p>[2] - <a href="http://en.wikipedia.org/wiki/Peter_Norvig" rel="nofollow">http:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Peter_Norvig</a><p>----<p>Norvig&#x27;s rebuttal, <a href="http://norvig.com/chomsky.html" rel="nofollow">http:&#x2F;&#x2F;norvig.com&#x2F;chomsky.html</a>
SixSigma about 11 years ago
&gt; a provocative essay published in Wired in 2008, “with enough data, the numbers speak for themselves”<p>I think that&#x27;s indicative of Wired&#x27;s breathless enthusiasm for technology that turned me off buying the print version many years ago.<p>Scrape away some of the hyperbole and it is true that data driven management has made many companies more competitive and, if I dare mention the hobgoblin, efficient.<p>Hunches and ideas can only get you so far. It is important to visit the data gemba and do the genchi genbutsu.<p><a href="http://en.wikipedia.org/wiki/Gemba" rel="nofollow">http:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Gemba</a><p><a href="http://en.wikipedia.org/wiki/Gembutsu" rel="nofollow">http:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Gembutsu</a>
RA_Fisher about 11 years ago
I&#x27;m much more impressed when someone can squeeze information out of small data. W.S. Gosset was extracting tons of information from as little as two observations. I&#x27;m very grateful that my advisor guided my cohort to work with two-observation MLE in many contexts. This type of practice focuses the analyst on squeezing out as much information as possible. When applied to big data, this approach can be very useful. Big data comes with data wrangling challenges, but if you don&#x27;t carefully squeeze out information, you&#x27;ll be leaving tons and tons on the table.
hawkharris about 11 years ago
The misconceptions about big data are similar to those surrounding the word science.<p>Many people associate &quot;science&quot; with things: cells, microscopes, the inner workings of the body. But science isn&#x27;t a set of things; it&#x27;s a process, a method of thinking, that can be applied to any facet of life.<p>Big data is similar, in my opinion. It&#x27;s not so much about the stuff —  the size or diversity of a company&#x27;s datasets. It has more to do with the types of observations you&#x27;re making and the statistical methods involved.<p>This distinction is important for two reasons:<p>1. If Big Data is recognized as a process rather than a circumstance, businesses will be more deliberate in deciding whether to use the methods. They will weigh the benefits of, say, MapReduce against other approaches.<p>2. The idea that &quot;Big Data&quot; techniques have everything to do with size is somewhat misleading. A comprehensive query of a 50,000 user dataset can be more computationally expensive than a simple operation on a 100,000-record dataset.
nobbyclark about 11 years ago
I get the impression from looking at local &quot;big data&quot; events that the enterprise software crowd has tuned into big data.<p>I fear that now that SOAP and enterprise buses have gone their way, they are looking for a new buzzword to sell. More solutions looking for problems...
hibikir about 11 years ago
I find it amusing that the article talks about big mistakes in polling data, when the clear winner of the last two US elections is one Nate Silver, who aggregated polls to get predictions so close to the actual results, one wonders why people actually vote anymore.<p>Now, just like with every other technological solution, we only learn about the limits of its use by overuse. There&#x27;s plenty of people out there storing large amounts of data and getting no valuable conclusions out of it. But the fact that many people will fail doesn&#x27;t mean the concept is not worth pursuing.<p>Chasing what is cool is a pretty dangerous impulse. The trick is to be able to tell when it can pay off, and to quickly learn when it will not, and cut your losses. Maybe you don&#x27;t need big data, just like maybe your shiny cutting edge library might not be ready for production.
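A minimal sketch of why aggregating polls tightens an estimate, as the comment above describes. It assumes simple random samples and pools them by sample size; the poll figures are made up for illustration, and this is not a description of Nate Silver's actual model. Pooling only shrinks sampling error: a bias shared by every poll survives aggregation untouched, which is the article's point about sampling bias.

```python
import math

def standard_error(p: float, n: int) -> float:
    """Standard error of a sample proportion p from n independent respondents."""
    return math.sqrt(p * (1.0 - p) / n)

def pooled_estimate(polls: list[tuple[float, int]]) -> tuple[float, float]:
    """Pool polls given as (share, sample_size) pairs.

    Returns the sample-size-weighted share and its standard error,
    treating all respondents as one combined simple random sample
    (an assumption; real polls differ in method and quality).
    """
    total = sum(n for _, n in polls)
    share = sum(p * n for p, n in polls) / total
    return share, standard_error(share, total)

if __name__ == "__main__":
    # Hypothetical polls: (candidate share, respondents).
    polls = [(0.52, 1000), (0.50, 1000), (0.51, 1000), (0.49, 1000)]
    share, se = pooled_estimate(polls)
    print(f"pooled share: {share:.3f}")
    print(f"pooled standard error: {se:.4f}")
    print(f"single-poll standard error: {standard_error(0.5, 1000):.4f}")
```

Four polls of equal size roughly halve the standard error of any one of them, because the error shrinks with the square root of the combined sample size.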
emiliobumachar about 11 years ago
Great article. I think the brightest gem here is the Multiple comparisons problem:<p><a href="http://en.wikipedia.org/wiki/Multiple_comparisons" rel="nofollow">http:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Multiple_comparisons</a>
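A minimal sketch of the multiple comparisons problem linked above, assuming independent tests in which every null hypothesis is true. The function names are illustrative, and the Bonferroni correction shown is one common remedy among several, not something the article prescribes.

```python
import random

def familywise_error_rate(alpha: float, m: int) -> float:
    """Chance of at least one false positive across m independent tests,
    each run at significance level alpha, when every null is true."""
    return 1.0 - (1.0 - alpha) ** m

def bonferroni_threshold(alpha: float, m: int) -> float:
    """Per-test threshold that keeps the familywise rate at or below alpha."""
    return alpha / m

def simulate_false_positives(m: int, alpha: float, seed: int = 0) -> int:
    """Under a true null, p-values are uniform on [0, 1).
    Draw m of them and count how many fall below alpha."""
    rng = random.Random(seed)
    return sum(rng.random() < alpha for _ in range(m))

if __name__ == "__main__":
    alpha, m = 0.05, 100
    print(f"familywise error rate, uncorrected: {familywise_error_rate(alpha, m):.3f}")
    corrected = bonferroni_threshold(alpha, m)
    print(f"Bonferroni per-test threshold: {corrected:.4f}")
    print(f"familywise error rate, corrected: {familywise_error_rate(corrected, m):.3f}")
    print(f"false positives in one simulated run: {simulate_false_positives(m, alpha)}")
```

With 100 uncorrected tests at the 5% level, at least one spurious "finding" is almost guaranteed, which is why searching a large dataset for correlations without correction produces so many false positives.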
akadien about 11 years ago
This is my favorite line and the one that damns so many &quot;big data&quot; efforts:<p>&quot;They cared about correlation rather than causation.&quot;<p>Analytics are a tool to help find correlations and patterns so that humans can do the hard work of determining and testing for causation. Computers are doing their jobs; humans aren&#x27;t.
dj-wonk about 11 years ago
The “with enough data, the numbers speak for themselves” statement has several meanings.<p>In one sense, if you can observe real phenomena, you don&#x27;t have to guess at what is happening. For businesses that collect troves of it, they may need statistics &#x27;less&#x27; because the sample size may approach the population size.<p>But calculating basic (mean, standard deviation, etc.) statistics is hardly the most interesting part. Inferential statistics is often more useful: how does one variable affect another?<p>As the article points out, the “... the numbers speak for themselves” statement may also be interpreted as &quot;traditional statistical methods (which you might call theory-driven) are less important as you get more data&quot;. I don&#x27;t want to wade into the theory-driven vs. exploratory argument, because I think they both have their places. Both are important, and anyone who says that only one is important is half blind.<p>Here is my main point: data -- in the senses that many people care about; e.g. prediction, intuition, or causation -- does <i>not</i> speak for itself. The difficult task of thinking and reasoning about data is, by definition, driven by both the data and the reasoning. So I&#x27;m a big proponent of (1) making your model clear and (2) sharing your model along with your interpretations. (This is analogous to sharing your logic when you make a conclusion; hardly a controversial claim.)
stillsut about 11 years ago
What executives say it does...<p>&quot;Facebook’s mission is to give people the power to share and make the world more open and connected.&quot;<p>What it actually does... (that will be left to the reader.)<p>&quot;Big Data&quot; is often sold as one thing by Enterprise software folks. But what value the data, or processing of it actually has is usually much more dependent on the user and his context (like FB!) and usually doesn&#x27;t fit as nicely onto a PPT slide.<p>Articles like this usually confuse the PR definition and the analyst definition.
MCarusi about 11 years ago
A few other comments have raised this point, but Big Data is basically the new Web 2.0. Aside from being a buzzword, as a term it&#x27;s so nebulous that half of the articles about it don&#x27;t really define what it is. When does &quot;data&quot; become &quot;big data&quot;?
sam_sach about 11 years ago
Conclusion: &quot;Big Data&quot; is a stupid buzzword and it makes me cringe every time I&#x27;m forced to say it to sell some new solution or frame something in a way someone who barely knows anything about computer science can understand.<p>It&#x27;s nebulous. I&#x27;ve seen it applied to machine learning, data management, data transfer, etc. These are all things that existed long before the term, but bloggers just won&#x27;t STFU about it. Businesses, systems, etc. generate data. If you don&#x27;t analyze that data to test your hypotheses and theories, at the end of the day, you don&#x27;t understand your own business and are relying on intuition for decision making.
bsbechtel about 11 years ago
There is definitely value to big data, but isn&#x27;t it also a form of legitimizing stereotypes, at least in some cases? I mean, the general premise of big data, is to glean conclusions and new knowledge of the world from billions of records. When humans are the source of the data that is being extracted and analyzed, are the conclusions not stereotypes of those individuals, unless the correlation is 100%? This might be ok, and even useful, when trying to optimize clicks on ads, but what about when the government uses it to make policy decisions?
SworDsy about 11 years ago
if i work for facebook and i want to figure out something about my users, isn&#x27;t it safe to say N = All since the data i&#x27;m accessing is all user data from fb? it&#x27;s easy to go wrong with big data, and although the article glossed over some fairly important things (assuming the people who work on these datasets are much dumber than they are in reality), they&#x27;re right on about the idea that the scope and scale of what big data promises may be too grandiose for its capabilities
linuxhansl about 11 years ago
Any either-or discussion is doomed to fail. Saying that BigData is the end of theory is clearly nonsense.<p>BigData vs. Theory, Java vs. C++, Capitalism vs. Socialism, Industry vs. Nature, Good vs. Bad, etc.<p>BigData allows you to store a lot of data and provides a means to run some computation on that data. No more, and no less.
pella about 11 years ago
<i>&quot;Big data can tell us what’s wrong, not what’s right.&quot;</i><p>from: <a href="http://www.wired.com/2013/02/big-data-means-big-errors-people/" rel="nofollow">http:&#x2F;&#x2F;www.wired.com&#x2F;2013&#x2F;02&#x2F;big-data-means-big-errors-peopl...</a>
Sami_Lehtinen about 11 years ago
I think this site is really related to this topic, even if it doesn&#x27;t involve the term &#x27;big data&#x27;. <a href="http://www.statisticsdonewrong.com/" rel="nofollow">http:&#x2F;&#x2F;www.statisticsdonewrong.com&#x2F;</a>
dreamfactory2 about 11 years ago
Reminds me of chartism vs <a href="http://en.wikipedia.org/wiki/Efficient-market_hypothesis" rel="nofollow">http:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Efficient-market_hypothesis</a>
ddmma about 11 years ago
I like to think that every object or living being in this world has properties and methods as in programming ... This is the source of data, small or big depending on actions or complexity
instaheat about 11 years ago
Well I was considering a career as a Data Scientist having a strong interest in this sort of thing and as a poker player.<p>This just kills my vibe, man.
wglb about 11 years ago
Excellent article.<p>New favorite phrases &quot;data exhaust&quot; and &quot;digital exhaust&quot;.
kushti about 11 years ago
We&#x27;re making a big mistake with every big thing; that&#x27;s the way we handle buzzwords.
Houshalter about 11 years ago
Nonsense. Google Flu was not &quot;Big&quot; data; they had only a few years&#x27; worth of data at best. Additionally, when combined with current CDC data, its predictions were better than models based on CDC data alone. And in all likelihood they can improve it with better methods.