I think one way to summarize it is: by constructing feature vector components based purely on the mere <i>position</i> within some data, you introduce correlation and causality effects between the components of your feature vector.<p>For example, think about using regression instead of clustering. Your design matrix is still going to be S from the paper, where the p-th row of S is the p-th "iteration" of the window (as it is slid from the start to the end of the data).<p>So column 1 of the p=1 window will trivially be perfectly correlated with column 2 of the p=0 window, and with column 0 of the p=2 window: it is literally the same data point, at least until it slides out of the window entirely. So in some sense, past observations (earlier columns in S) will "cause" later observations (later columns in S) purely due to the sliding mechanism. (There's a minimal numerical sketch of this at the end of this comment.)<p>In regression this is a well-explored problem: multicollinearity, and also distorted conditional effect sizes when causal relationships are not separated into distinct equations or corrected by adding dummy variables for conditional levels of one covariate given others.<p>It's not at all surprising that something similar would happen when clustering.<p>Sadly, it's also not that surprising that the research community at large doesn't care too much and is happy to crank out papers with this issue anyway. It's similar to data mining for statistical significance and other such problems, and machine learning is not at all a silver bullet for getting around them.<p>One approach that might help reduce the problem is to randomly sub-sample windows. It would be a lot of work, and probably computationally costly, but you could in principle devise a sampling scheme where you jointly sample <i>all</i> of the sub-windows at once, in such a way as to guarantee certain bounds on the total amount of overlap between them, probably using some sort of Metropolis-like scheme. It's not clear to me whether this is worthwhile, and I tend to prefer the idea of quantizing the signal in some way to make it a lexicographic problem instead.<p>Also note that the problem is <i>a lot</i> less worrisome in supervised learning, like training a decision tree or SVM on overlapping sub-windows. The reason is that you (presumably) have control over the labels. So you can ensure that two highly correlated sub-samples (like window p and window p+1) get the same label most of the time when it matters (e.g. when you're right on top of the interesting phenomenon you want to classify). <i>Crucially</i>, the reason this helps is that you can ensure not only that two sample windows which differ by a slight slide get the same label, but also that those windows get the same label as other instances of the same phenomenon elsewhere in the signal, perhaps in some other sub-window far away. It means you are clamping the positive training samples to reflect the signal properties you want, rather than hoping the clustering will intrinsically pick that as the notion of similarity it uses when grouping samples (which, as the paper points out, you have no reason to expect it will).<p>But it will still be somewhat sensitive to subjective choices about when, exactly, the window has slid far enough away from the labeled phenomenon to have "rolled off" enough that the label should flip to the other class (in a binary case).
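To make that concrete, here's a toy sketch of what that labeling choice looks like. Everything in it (the label_windows name, the roll_off parameter, the particular overlap rule) is hypothetical, just one of many ways you might codify it:

    import numpy as np

    # Hypothetical setup: an "interesting" event occupies samples
    # [event_start, event_end) of the signal, and each width-w window
    # starting at sample p gets a binary label. The roll_off threshold
    # decides how much of the event a window must still cover before
    # its label flips to the negative class.
    def label_windows(n_windows, w, event_start, event_end, roll_off):
        labels = np.zeros(n_windows, dtype=int)
        event_len = event_end - event_start
        for p in range(n_windows):
            overlap = max(0, min(p + w, event_end) - max(p, event_start))
            if overlap >= roll_off * event_len:
                labels[p] = 1
        return labels

    # Two analysts picking different roll-offs get different training
    # sets from the exact same raw signal:
    print(label_windows(20, 8, 5, 11, roll_off=0.3))
    # [1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0]
    print(label_windows(20, 8, 5, 11, roll_off=0.7))
    # [0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0]

The point isn't this particular rule; it's that some rule like it always exists when you label overlapping windows, whether or not anyone writes it down.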
What's bad about this is that however that choice gets codified into the training set, it needs to be treated as part of the experimental design and accounted for when reporting anything like statistical significance results later on. But because this choice, which is effectively a "roll-off" parameter, is not written down or made an explicit part of the model anywhere, most people just ignore it, or never think about the effect it might have on the result.
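And for anyone who wants to see the window-overlap correlation from the top of this comment numerically, here's a minimal numpy sketch (the random-walk signal and the window width are arbitrary stand-ins, and stride 1 is assumed):

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.cumsum(rng.standard_normal(1000))  # arbitrary toy signal
    w = 32

    # Row p of S is the width-w window starting at sample p,
    # matching the S described above.
    S = np.lib.stride_tricks.sliding_window_view(x, w)  # shape (969, 32)

    # Column 0 of window p is literally the same data point as
    # column 1 of window p-1:
    assert np.array_equal(S[1:, 0], S[:-1, 1])

    # And the correlation between adjacent columns across rows is just
    # the lag-1 autocorrelation of the signal: an artifact of the
    # windowing, not structure you discovered in the data.
    print(np.corrcoef(S[:, 0], S[:, 1])[0, 1])  # very close to 1 here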