A lot of innovation happens in big data (data that needs distributed compute, e.g. Spark): blending data across sources, schemaless / schema-on-the-fly storage, deploying analytical models to production, and so on. Is there a case for similar innovation in small-to-medium data (working with ~10M-row datasets), blended across data sources, with simple analytical models and such? What percentage of use cases are in the big-data realm vs. small/medium data?
This is an incredibly interesting question, but I have no idea how you would ever figure out the answer. What defines a data set? What about a huge data set that references back to a relatively small mapping table: is that one big data set or two data sets of different sizes? Maybe a cloud hosting provider would have some insight into hosted data, but even if the public had that information, we still wouldn't know anything about data sets collected and stored on local machines. Similar problems arise for cataloguing models by their complexity. What is the broader question here; what are you trying to figure out?

There is definitely research being done on sparse data sets. Early statistical methods were applied to what we would now consider small data. Tukey did a lot of important work on data visualization and exploratory data analysis that applies to small data sets. Many medical experiments use small data sets, and Bayesian methods can apply to them as well.
I'm kind of sad that the term "data mining" has fallen out of favour, because large datasets (as with mines) tend to contain a lot of worthless dirt that just has to be sifted through.

10 million rows of data is still pretty big, all the same. You can get away with invoking the Central Limit Theorem after about 30 observations, for instance (with all the usual assumptions and caveats). Sometimes all you're getting for the extra effort is a tighter confidence interval around something that could be pretty well estimated with a couple of hundred rows of data.
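As a rough illustration of that point, here's a minimal sketch (Python, with a made-up skewed "population" standing in for a 10M-row dataset) comparing the width of a CLT-based 95% confidence interval for the mean at a few hundred rows versus the full data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: 10 million skewed observations (e.g. purchase amounts).
population = rng.lognormal(mean=3.0, sigma=1.0, size=10_000_000)

def ci_width(sample, z=1.96):
    """Width of a CLT-based 95% confidence interval for the mean."""
    return 2 * z * sample.std(ddof=1) / np.sqrt(len(sample))

for n in (30, 300, 10_000, 10_000_000):
    sample = rng.choice(population, size=n, replace=False)
    print(f"n={n:>10,}  mean≈{sample.mean():8.2f}  95% CI width≈{ci_width(sample):6.2f}")
```

The interval width shrinks roughly as 1/sqrt(n), which is the sense in which a couple of hundred rows already gets you most of the way there.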
Yes, there are huge opportunities for small/medium data. Maybe 90%+ of data problems are in the small size range. The biggest pain point is converting the insights from the analysis into an actionable plan that actually improves things.

Thinking this through, the main pain point I've experienced is convincing people to act on the data, follow through, and connect changes in the product (however defined) to changes in the metrics.
In some cases there can be a large improvement going from the status quo (if the status quo is rather lacklustre) to a simple model, and it may not be worth doing anything more complicated if the accuracy of the component in question is no longer a bottleneck of overall system performance.

Maybe a simple model with a well-chosen prior informed by domain knowledge does the job.
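To make that concrete, here's a minimal sketch of one such simple model: a Beta-Binomial estimate of a hypothetical conversion rate, where an informative prior from domain knowledge (the specific numbers below are invented for illustration) carries most of the weight when data is scarce:

```python
from scipy import stats

# Hypothetical small-data setting: estimating a conversion rate from 40 trials.
successes, trials = 7, 40

# Assumed domain knowledge: rates historically sit around 15-25%,
# encoded as a Beta(18, 82) prior (mean 0.18, fairly concentrated).
prior_a, prior_b = 18, 82

# Conjugate update: posterior is Beta(prior_a + successes, prior_b + failures).
post = stats.beta(prior_a + successes, prior_b + (trials - successes))

print(f"posterior mean:   {post.mean():.3f}")
print(f"95% credible int: ({post.ppf(0.025):.3f}, {post.ppf(0.975):.3f})")
```

With only 40 observations the prior still does real work; as data accumulates, the likelihood dominates, so the simple model stays useful until accuracy genuinely becomes the bottleneck.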