There's a lot of buzz around Hadoop and Big Data, but I really wonder: how big is Big Data? How do big data startups measure this?<p>In terms of total storage, is it a few terabytes for most enterprises?
I've had a similar question before. I've heard of as few as 60,000 observations being considered "big data" [1], yet at my company we generate about 60 million pharmacy claims every 3 months, and no one here calls it big data. In terms of storage, it's on the order of a couple hundred terabytes for all our data. This is considered small enough that we can query it with traditional SQL.<p>Big Data, the experts say, is more about the novel way you analyze data, relative to the difficulty of the problem. A speaker at PyCon who was talking about algorithms and data structures for handling genetic data had a term that I like quite a bit better: "Data of Unusual Size" (C. Titus Brown at MSU was the speaker).<p>"Big Data" is a really big buzzword right now, but the term is overused and often does not convey the meaning it's supposed to. "Big" is a relative term. What makes data "big" is the novelty of how much is being used compared to how much used to be used (in the sumo case from [1], they had never handled so much data before).<p>As for Hadoop, you'd want to use it when it's no longer feasible to keep your data stored in an RDBMS, when speed becomes an issue, or when you want your schema to be more flexible than an RDBMS allows. If you are not concerned with strict reliability guarantees (RDBMSs make the safety of the data a paramount priority--read the Wikipedia page on ACID [2] to see these guarantees), there are plenty of reasons for choosing Hadoop [3].<p>[1]: <a href="http://www.wired.com/wiredenterprise/2013/03/big-data/" rel="nofollow">http://www.wired.com/wiredenterprise/2013/03/big-data/</a>
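To illustrate the point that data at this scale is still "traditional SQL" territory, here's a minimal sketch using Python's built-in sqlite3 module. The claims schema and values are hypothetical stand-ins, not the commenter's actual data; any RDBMS would handle the same aggregate query.

```python
import sqlite3

# Hypothetical pharmacy-claims schema (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE claims (
        claim_id INTEGER PRIMARY KEY,
        drug     TEXT,
        paid     REAL
    )
""")

# Toy rows standing in for ~60M quarterly claims.
conn.executemany(
    "INSERT INTO claims (drug, paid) VALUES (?, ?)",
    [("aspirin", 4.50), ("aspirin", 5.25), ("statin", 22.00)],
)

# A plain SQL aggregate -- no MapReduce job required.
for drug, total in conn.execute(
    "SELECT drug, SUM(paid) FROM claims GROUP BY drug ORDER BY drug"
):
    print(drug, total)
```

The same GROUP BY runs fine on hundreds of millions of rows in a production RDBMS with sensible indexes; the decision point for Hadoop is when even a well-modeled query no longer fits one machine.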
[2]: <a href="http://en.wikipedia.org/wiki/ACID" rel="nofollow">http://en.wikipedia.org/wiki/ACID</a>
[3]: <a href="http://hortonworks.com/blog/4-reasons-to-use-hadoop-for-data-science/" rel="nofollow">http://hortonworks.com/blog/4-reasons-to-use-hadoop-for-data...</a>
<a href="http://www.youtube.com/watch?v=B27SpLOOhWw" rel="nofollow">http://www.youtube.com/watch?v=B27SpLOOhWw</a><p>Look at the above it will give you a good idea. I'm not going to go into examples as the above video really give some good examples. A Terabytes of data is nothing these days, I see Terabytes databases within companies on a regular bases within my job. The size does not mater, but the three big components of "Big Data" are: multiple sources of information in multiple formats, volume of data and rate of ingest/rate of new incoming data. The basic idea is to be able to process all of the incoming data within your company and get some kind of intelligent information out of all this data that you can use.
In my experience, and I have worked in this field since 2001, "Big Data" is the answer to the problem of poorly implemented reporting models. The excuse for bad performance has always been "the size of the dataset," not the real culprit: lack of technical knowledge on how to build an appropriate and performant model. The size of the dataset is almost irrelevant when it's done right, but when it's done wrong, even a small dataset (100,000 records) can be sold as "the problem."<p>Business people love to chase a holy grail, this being yet another one.
<i>"If a program manipulates a large amount of data, it does so in a small number of ways."</i> Alan Perlis<p>"Big Data" has an operational definition: it's relative to current technology. Less than two decades ago, a terabyte was big enough that Microsoft created TerraServer as a technology demo [<a href="http://en.wikipedia.org/wiki/TerraServer-USA" rel="nofollow">http://en.wikipedia.org/wiki/TerraServer-USA</a>]. TerraServer would dwarf the big data of the era when Perlis wrote Epigram 4. Today, TerraServer is dwarfed by YouTube.
I don't know if I'm alone in this, but I have this question.<p>What is Big Data? I hear it a lot from non-technical people, so my first thought was that it is a company. After doing a little research, I am starting to think it is a concept. Sometimes it sounds like a marketing term. What is it?
I think of it in more human terms -- it has more to do with <i>time to process</i> than actual size. I think data becomes <i>big</i> when it takes longer than I'd like to answer the questions I want answered. As a corollary, the longer it takes, the "bigger" it becomes.