A lot of innovation happens in big data (data that needs distributed compute, e.g. Spark): blending data across sources, schemaless / schema-on-the-fly storage, deploying analytical models to production, and so on. Is there a case for similar innovation in small-to-medium data (working with ~10M-row datasets), blended across data sources, with simple analytical models and such? What percentage of use cases are in the big-data realm vs. small/medium data?
This is an incredibly interesting question, but I have no idea how you would ever figure out the answer. What defines a data set? What about a huge data set that references back to a relatively small mapping table: is that one big data set or two data sets of different sizes? Maybe a cloud hosting provider would have some insight into hosted data, but even if the public had that information, we still wouldn't know anything about data sets collected and stored on local machines. Similar problems arise for cataloguing models by their complexity. What is the broader question here; what are you trying to figure out?

There is definitely research being done on sparse data sets. Early statistical methods were applied to what we would now consider small data. Tukey did a lot of important work on data visualization and exploratory data analysis that applies to small data sets. Many medical experiments use small data sets, and Bayesian methods can apply to them as well.
I'm kind of sad that the term "data mining" has fallen out of favour, because large datasets (as with mines) tend to contain a lot of worthless dirt that just has to be sifted through.

10 million rows of data is still pretty big, all the same. You can get away with invoking the Central Limit Theorem after about 30 observations, for instance (with all the usual assumptions and caveats). Sometimes all you're getting for the extra effort is a tighter confidence interval around something that could be pretty well estimated with a couple of hundred rows of data.
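As a rough illustration of that point, here's a minimal sketch (Python, with a made-up skewed "population" standing in for a 10M-row dataset) comparing the width of a CLT-based 95% confidence interval for the mean at a few hundred rows versus the full data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: 10 million skewed observations (e.g. purchase amounts).
population = rng.lognormal(mean=3.0, sigma=1.0, size=10_000_000)

def ci_width(sample, z=1.96):
    """Width of a CLT-based 95% confidence interval for the mean."""
    return 2 * z * sample.std(ddof=1) / np.sqrt(len(sample))

for n in (30, 300, 10_000, 10_000_000):
    sample = rng.choice(population, size=n, replace=False)
    print(f"n={n:>10,}  mean≈{sample.mean():8.2f}  95% CI width≈{ci_width(sample):6.2f}")
```

The interval width shrinks roughly as 1/sqrt(n), which is the sense in which a couple of hundred rows already gets you most of the way there.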
Yes, there are huge opportunities for small/medium data. Maybe 90%+ of data problems are in the small size range. The biggest pain point is converting the insights from the analysis into an actionable plan that actually improves things.

Thinking this through, the main pain point I've experienced is convincing people to act on the data, follow through, and connect changes in the product (however defined) to changes in the metrics.
In some cases there can be a large improvement going from the status quo (if the status quo is rather lacklustre) to a simple model, and it may not be worth doing anything more complicated if the accuracy of the component in question is no longer a bottleneck of overall system performance.

Maybe a simple model with a well-chosen prior informed by domain knowledge does the job.
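To make that concrete, here's a minimal sketch of one such simple model: a Beta-Binomial estimate of a hypothetical conversion rate, where an informative prior from domain knowledge (the specific numbers below are invented for illustration) carries most of the weight when data is scarce:

```python
from scipy import stats

# Hypothetical small-data setting: estimating a conversion rate from 40 trials.
successes, trials = 7, 40

# Assumed domain knowledge: rates historically sit around 15-25%,
# encoded as a Beta(18, 82) prior (mean 0.18, fairly concentrated).
prior_a, prior_b = 18, 82

# Conjugate update: posterior is Beta(prior_a + successes, prior_b + failures).
post = stats.beta(prior_a + successes, prior_b + (trials - successes))

print(f"posterior mean:   {post.mean():.3f}")
print(f"95% credible int: ({post.ppf(0.025):.3f}, {post.ppf(0.975):.3f})")
```

With only 40 observations the prior still does real work; as data accumulates, the likelihood dominates, so the simple model stays useful until accuracy genuinely becomes the bottleneck.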