Clustering of Time Series Subsequences Is Meaningless (2005) [pdf]

95 points by silverpikezero about 10 years ago

9 comments

daemonk about 10 years ago
So the heart of the problem, from what I can gather, is the sliding window approach to generating subsequences. This approach necessarily generates "redundantly similar" windows, i.e. each window will be most similar to its closest neighboring windows, since consecutive windows overlap by window length - 1 points.

The sliding window approach is not adding random noise to the clustering; it is adding redundantly similar data to the clustering, resulting in almost perfect sine-wave-like cluster centers.
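A minimal sketch of the effect described above (my own illustration with assumed parameters such as w = 32 and k = 3, not code from the paper): running k-means on the overlapping sliding-window subsequences of a pure random walk already produces smooth, sine-wave-like cluster centers.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
t = np.cumsum(rng.standard_normal(1024))      # random-walk "time series"
w = 32                                        # assumed window length

# Sliding-window (STS) subsequence matrix: row p is t[p:p+w]
S = np.lib.stride_tricks.sliding_window_view(t, w).copy()
# z-normalize each subsequence, as is commonly done before STS clustering
S = (S - S.mean(axis=1, keepdims=True)) / (S.std(axis=1, keepdims=True) + 1e-12)

centers = KMeans(n_clusters=3, n_init=10, random_state=0).fit(S).cluster_centers_
# Plotting the rows of `centers` shows smooth, roughly sinusoidal shapes even
# though the input is noise: neighboring windows are nearly identical, so the
# centers average out to gently shifted copies of one smooth curve.
```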
humanarity about 10 years ago
"The cluster centers found by STS clustering on any particular run of k-means on stock market dataset are not significantly more similar to each other than they are to cluster centers taken from random walk data! In other words, if we were asked to perform clustering on a particular stock market dataset, we could reuse an old clustering obtained from random walk data, and no one could tell the difference."

"As the sliding window passes by, the datapoint first appears as the rightmost value in the window, then it goes on to appear exactly once in every possible location within the sliding window. So the t_i datapoint contribution to the overall shape is the same everywhere..."

"Another way to look at it is that every value v_i in the mean vector, 1 ≤ i ≤ w, is computed by averaging essentially every value in the original time series; more precisely, from t_i to t_(m-w+i). So for a time series of m = 1024 and w = 32, the first value in the mean vector is the average of t[1..993]; the second value is the average of t[2..994], and so forth. Again, the only datapoints not being included in every computation are the ones at the very beginning and at the very end, and their effects are negligible asymptotically."
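A small numeric check of the quoted computation (my own sketch, using the m = 1024 and w = 32 from the quote): each component of the mean of all sliding windows really is the average of almost the entire series, so the mean vector is close to a horizontal line.

```python
import numpy as np

rng = np.random.default_rng(1)
m, w = 1024, 32
t = np.cumsum(rng.standard_normal(m))

S = np.lib.stride_tricks.sliding_window_view(t, w)   # shape (m - w + 1, w)
mean_vector = S.mean(axis=0)

# Component i (0-indexed) is literally the mean of t[i : m - w + 1 + i],
# matching "the first value ... is the average of t[1..993]" above.
assert np.allclose(mean_vector[0], t[0:m - w + 1].mean())
assert np.allclose(mean_vector[1], t[1:m - w + 2].mean())

# The spread across components is small relative to the spread of t itself.
print(np.ptp(mean_vector), np.ptp(t))
```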
GFK_of_xmaspast about 10 years ago
Eamonn Keogh is a major advocate of Dynamic Time Warping and this should be kept in mind while reading this paper.

(He does have interesting stuff here: http://www.cs.ucr.edu/~eamonn/selected_publications.htm)
kilbuz about 10 years ago
From the abstract: "forced to obey a certain constraint that is pathologically unlikely to be satisfied by any dataset"

Like assuming your data come from a normal distribution? Box guides us: 'all models are wrong, some are useful.'
baldeagle about 10 years ago
When I last looked into time series clustering, it seemed like other statistical techniques worked better for predicting failure modes. I still think there is merit in motif recognition and rules based around that concept, but I haven't seen very much of interest outside of EKGs.
amelius about 10 years ago
From the abstract:

> While this constraint can be intuitively demonstrated with a simple illustration and is simple to prove, it has never appeared in the literature.

Ok, I would love to see a simple illustration here :)
engi_nerd about 10 years ago
What about cluster analysis of whole time series (which are of differing lengths)? Is that meaningful?
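For reference, a hedged sketch (my own, not from the thread or the paper) of how whole-series clustering over differing lengths is usually set up: a pairwise DTW distance matrix fed into ordinary hierarchical clustering. The procedure is at least well defined; whether its output is meaningful is exactly the question being asked.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic-time-warping distance."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

rng = np.random.default_rng(2)
# Ten toy random-walk series of varying length (80 to 120 points)
series = [np.cumsum(rng.standard_normal(rng.integers(80, 121))) for _ in range(10)]

dist = np.array([[dtw_distance(a, b) for b in series] for a in series])
Z = linkage(squareform(dist), method="average")   # condensed distance matrix
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the dendrogram into 3 clusters
print(labels)
```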
eveningcoffee about 10 years ago
I think that it should read "Clustering of Time Series Subsequences With K-means Is Meaningless".
p4wnc6 about 10 years ago
I think one way to summarize it is: by constructing feature vector components that are parochially based upon the mere *position* within some data, you introduce correlation and causality effects between the components of your feature vector.

For example, think about using regression instead of clustering. Your design matrix is still going to be S from the paper, where the p-th row of S is the p-th "iteration" of the window (while it is being slid from start to finish of the data).

So column 1 of the p=1 window will trivially be perfectly correlated with column 2 of the p=0 window, and column 0 of the p=2 window, at least for a number of rows until that data point has slid out of the window. So in some sense, past observations (earlier columns in S) will "cause" later observations (later columns in S) due to the sliding mechanism.

In regression this is a well-explored problem with multicollinearity, and also with conditional effect sizes when causal relationships are not separated into separate equations or corrected by adding dummy variables for conditional levels of one covariate given others.

It's not at all surprising that something similar would happen when clustering.

Sadly, it's also not that surprising that the entire research community doesn't care too much and is happy to crank out papers with this issue anyway. It's similar to data mining for statistical significance or other problems, and indeed machine learning is not at all a silver bullet to get around those problems.

One approach that might help reduce the problem is to randomly sub-sample windows. It would be a lot of work, and probably be computationally costly, but you could in principle devise a sampling scheme where you jointly sample *all* of the sub-windows at once, in such a way as to ensure certain bounds on the total amount of overlap between them, using probably some sort of Metropolis-like scheme. It's not clear to me if this is worthwhile or not, and I tend to prefer the idea of quantizing the signal in some way to make it a lexicographic problem instead.

Also note that the problem is *a lot* less worrisome in supervised learning, like training a decision tree or SVM on overlapping sub-windows. The reason is that you (presumably) have control over the labels. So you can ensure that two highly correlated sub-samples (like window p and window p+1) get the same label most of the time when it matters (e.g. when you're right on top of the interesting phenomenon you want to classify). *Crucially*, the reason this helps is that you not only can ensure that two sample windows which differ by only a slight amount of sliding will get the same label, but also that those windows will get the same label as other instances of the same phenomenon elsewhere in the signal, perhaps far away in some other sub-window nowhere near it. It means you are clamping the positive training samples to reflect the signal properties you want, rather than hoping the clustering will intrinsically pick that to be the notion of similarity it uses when classifying samples (which, as the paper points out, you have no reason to expect that it will).

But it will still suffer somewhat from subjective choices about when, exactly, the window has slid far enough away from the labeled phenomenon to have "rolled off" enough that the label should be for the other class (in a binary case). What's bad about this is that however this gets codified into the training set needs to be taken into account as part of the experimental design and accounted for when reporting anything like statistical significance results later on. But because this choice, which is effectively a "roll-off" parameter, is not written down or made explicitly part of the model anywhere, most people just ignore, or don't think about, the way it might have an effect on the result.
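To make the multicollinearity point above concrete, a quick check (my own sketch, not the commenter's code, with an assumed window length of 32): in the sliding-window design matrix S, a shifted column of one window holds literally the same datapoint as a neighboring column of the adjacent window.

```python
import numpy as np

rng = np.random.default_rng(3)
t = np.cumsum(rng.standard_normal(1024))
w = 32
S = np.lib.stride_tricks.sliding_window_view(t, w)   # row p is t[p:p+w], so S[p, c] = t[p + c]

# Column 1 of window p equals column 0 of window p+1: S[p, 1] == t[p+1] == S[p+1, 0]
print(np.array_equal(S[:-1, 1], S[1:, 0]))           # True: identical, not merely correlated

# Neighboring windows (rows p and p+1) are themselves almost perfectly correlated
print(np.corrcoef(S[0], S[1])[0, 1])                 # close to 1 for this random walk
```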