科技回声

It's good that the "Big Data" community is finally shifting the paradigm (back) to stream processing. I mean, it is the abstraction behind pipes, which were invented when data we now consider small was Big. Now if only someone would take the UNIX pipe and make it transparently multi-machine instead of writing an ungodly large Java framework to emulate it badly, slowly, and verbosely...

However, I was a little disappointed by the "probabilistic" methods. I was thinking of things like approximate kNN, online regression, that sort of thing, in which you actually trade accuracy for speed and streamability. Bloom filters don't actually lose any accuracy in the example given, since there is a fallback to a database in the case of a false positive. Instead they are an optimization technique.

The more interesting probabilistic methods to me are the ones that say: we are willing to give up the accuracy of the traditional technique, but are hoping to make up for it by being able to process more data. But of course "probabilistic method" is a broad and context-dependent term.
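To make the Bloom filter point concrete, here is a minimal Python sketch of the "filter in front of a database" pattern. It is not the article's code: `BloomFilter`, `FilteredStore`, and `slow_store` are hypothetical names, and a plain Python set stands in for the database. The filter only decides whether the slow lookup can be skipped, so the final answer is always exact.

```python
import hashlib


class BloomFilter:
    """Tiny illustrative Bloom filter (hypothetical, not a library API)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Derive several bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(item))


class FilteredStore:
    """Exact membership test with a Bloom filter in front of a slow store."""

    def __init__(self, slow_store, bloom):
        self.slow_store = slow_store  # assumption: stand-in for the database
        self.bloom = bloom
        self.slow_lookups = 0
        for key in slow_store:
            bloom.add(key)

    def contains(self, key):
        if not self.bloom.might_contain(key):
            return False  # no false negatives, so skipping the store is safe
        self.slow_lookups += 1
        # The fallback confirms a real hit or rejects a false positive.
        return key in self.slow_store


db = {f"user{i}" for i in range(100)}
store = FilteredStore(db, BloomFilter(num_bits=4096, num_hashes=3))
queries = [f"user{i}" for i in range(200)]
answers = [store.contains(q) for q in queries]
print(sum(answers), "hits,", store.slow_lookups, "slow lookups for", len(queries), "queries")
```

Every answer matches what the store alone would return; the only thing that varies with the filter's size is how many slow lookups get skipped. That is why this reads as an optimization rather than an accuracy trade-off.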

Another pretty good article on this topic: https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/

Stream Processing and Probabilistic Methods: Data at Scale

2 comments
