TechEcho
A tech news platform built with Next.js, providing global tech news and discussions.

© 2025 TechEcho. All rights reserved.

Winsorized mean

108 points by dedalus over 1 year ago

11 comments

Alligaturtle over 1 year ago
I'm not sure I would feel comfortable using the Winsorized mean -- it doesn't have any particular statistical properties, and it lacks any intuitive appeal because it's not clear what the value represents.

I can understand a line of logic that would give rise to something like the Winsorized mean -- after you look at your data, you see some obvious outliers. It feels dirty to just drop those values (which would lead to the truncated mean) because the information from an implausible value is more likely to be near the extreme than it is to be near the center of mass.

What to do with those extreme values?

Here's something I now want to experiment with -- bootstrapping the extreme values. Take note of the original empirical distribution. Then, create a new distribution by removing the top and bottom X% of the observations and replacing them with values drawn i.i.d. from the original empirical distribution. This could lead to some values being replaced with the outliers that we originally wanted to drop. After we do this, record the mean. Then create new sample distributions until we have a distribution of new means. What I am curious about is how the shape of this distribution of means changes depending on the X% value selected at the beginning.

What are some well-known distributions that appear to have outliers? A log-normal distribution maybe?
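
The resampling experiment proposed above can be sketched in a few lines. This is only a rough illustration, assuming a log-normal sample and the Python standard library; `resampled_tail_means`, the sample size, the seeds, and the X% values tried are all hypothetical choices, not anything specified in the thread.

```python
import random
import statistics


def resampled_tail_means(values, fraction, n_rounds, seed=0):
    """Replace the top and bottom `fraction` of values with i.i.d. draws
    from the original sample, record the mean, and repeat `n_rounds` times."""
    if not 0 <= fraction < 0.5:
        raise ValueError("fraction must be in [0, 0.5)")
    rng = random.Random(seed)
    xs = sorted(values)
    k = int(len(xs) * fraction)          # values removed from each tail
    core = xs[k:len(xs) - k]             # observations that are kept as-is
    means = []
    for _ in range(n_rounds):
        # Draws come from the full original sample, so an outlier can come back.
        replacements = rng.choices(xs, k=2 * k)
        means.append(statistics.fmean(core + replacements))
    return means


# Assumed test data: 1000 draws from a log-normal distribution.
data_rng = random.Random(42)
sample = [data_rng.lognormvariate(0.0, 1.0) for _ in range(1000)]

for fraction in (0.01, 0.05, 0.10):
    means = resampled_tail_means(sample, fraction, n_rounds=2000)
    print(fraction, statistics.fmean(means), statistics.stdev(means))
```

Comparing the printed spread across the X% values is one starting point for the question about the distribution's shape; a histogram of `means` would show more.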
jedberg over 1 year ago
The trimmed mean and Winsorized mean are both super useful in monitoring and metrics systems. In most cases you don't actually want the median, but you also don't want the extreme outliers to throw everything off with a mean.

Both methods give you better metrics for periodic comparisons, like day over day or week over week.
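
For readers new to the two terms, here is a minimal sketch of how they differ, using only the Python standard library. The function names and the latency numbers are made up for illustration, and the cut is taken as a simple count from each tail rather than an interpolated percentile.

```python
import statistics


def trimmed_mean(values, fraction):
    """Drop the lowest and highest `fraction` of values, then average the rest."""
    if not 0 <= fraction < 0.5:
        raise ValueError("fraction must be in [0, 0.5)")
    xs = sorted(values)
    k = int(len(xs) * fraction)          # values cut from each tail
    return statistics.fmean(xs[k:len(xs) - k])


def winsorized_mean(values, fraction):
    """Clamp the lowest and highest `fraction` of values to the nearest
    value that is kept, then average everything."""
    if not 0 <= fraction < 0.5:
        raise ValueError("fraction must be in [0, 0.5)")
    xs = sorted(values)
    k = int(len(xs) * fraction)
    if k > 0:
        lo, hi = xs[k], xs[-k - 1]
        xs = [min(max(x, lo), hi) for x in xs]
    return statistics.fmean(xs)


# Hypothetical request latencies in ms, with one extreme outlier.
latencies = [12, 14, 15, 15, 16, 17, 18, 19, 21, 950]

print(statistics.fmean(latencies))      # 109.7, dominated by the outlier
print(statistics.median(latencies))     # 16.5
print(trimmed_mean(latencies, 0.1))     # 16.875
print(winsorized_mean(latencies, 0.1))  # 17.0
```

Unlike the median, both robust means still respond to shifts in the bulk of the data, which is what makes them handy for day-over-day comparisons.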
bluenose69 over 1 year ago
I use it mainly as a data-exploration method. For example, if the Winsorized mean gives a value that differs a lot from the conventional mean, then I might examine the outliers in a bit more detail with tools like a boxplot or a histogram.

The source of the data matters a lot in what methods make sense. For example, hand-entered numbers might involve transposed digits, or missing signs, or decimal points in the wrong place. Numbers deriving from some electronic measurements might have problems with numbers "pegging out" at some limit. In other cases, those numbers might "wrap around". Data that have been examined at an earlier stage might have numbers changed to something that is obviously wrong, like a temperature of -999.999 or something. The list goes on.

My point is that exploring outliers is often quite productive, and comparing means to Winsorized means can be a very quick way to see if outliers are an issue. This is not so much an issue for interactive work, for which plotting data is usually an early step, but it can come in handy during a preliminary stage of processing large datasets non-interactively. It can also be handy as part of a quality-control pipeline in a data stream.
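
The mean-versus-Winsorized-mean comparison described above could look something like the following in a non-interactive pipeline. This is a sketch under stated assumptions: `outliers_suspected`, the 5% fraction, and the 0.5 threshold (gap measured in units of the Winsorized data's standard deviation) are hypothetical choices, and the temperatures are invented.

```python
import statistics


def winsorize(values, fraction):
    """Return a sorted copy with each tail's `fraction` of values clamped."""
    xs = sorted(values)
    k = int(len(xs) * fraction)
    if k == 0:
        return xs
    lo, hi = xs[k], xs[-k - 1]
    return [min(max(x, lo), hi) for x in xs]


def outliers_suspected(values, fraction=0.05, threshold=0.5):
    """Flag a batch when the plain mean and the Winsorized mean disagree
    by more than `threshold` standard deviations of the clamped data."""
    clamped = winsorize(values, fraction)
    gap = abs(statistics.fmean(values) - statistics.fmean(clamped))
    spread = statistics.pstdev(clamped)
    if spread == 0:
        return gap > 0
    return gap / spread > threshold


# Invented temperature readings; the last one is a -999.999 sentinel.
temps = [20.1, 19.8, 20.4, 21.0, 20.7, 19.9, 20.3, 20.6, 20.2, 20.5,
         20.0, 20.8, 19.7, 20.9, 20.4, 20.1, 20.6, 20.3, 20.2, -999.999]

if outliers_suspected(temps):
    print("means disagree: inspect this batch with a boxplot or histogram")
```

A flag here only says the batch deserves a closer look; it does not say which values are wrong or why.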
some_random over 1 year ago
One of the fascinating things about statistics is that in some ways it's more an art than a science, and the question of "when would I choose to use this over a normal mean or median" is a great example of that.
phlip9 over 1 year ago
A related example out in the wild:

Rust's `cargo bench` "winsorizes" the benchmark samples before computing summary statistics (incl. the mean).

https://github.com/rust-lang/rust/blob/master/library/test/src/bench.rs#L150
SubiculumCode over 1 year ago
It can be a useful data-cleaning method when used judiciously, but I'm surprised it's at the top of HN.
hornban over 1 year ago
There's some interesting discussion in this thread about truncated vs winsorized means. For my own part, this is the first time I've come across either of these terms.

I tend to benefit the most from seeing the entire distribution visually, and that helps me decide if I'm looking for a median, a "normal" mean, a "mean minus some weird outliers", or something different entirely.

Does anybody happen to know of a good visual guide for how different measures of central tendency apply to various distributions? Anything that emphasizes pathological cases is helpful.
thih9 over 1 year ago
It's all fun and games until one or two of the remaining extreme values (the ones later used to replace the rest) turn out to be outliers themselves.
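
A small sketch of this failure mode, with invented numbers and a hypothetical `winsorized_mean` helper: when there are more outliers in a tail than the chosen fraction covers, the value used as the replacement is itself an outlier.

```python
import statistics


def winsorized_mean(values, fraction):
    """Clamp each tail's `fraction` of values to the nearest kept value, then average."""
    xs = sorted(values)
    k = int(len(xs) * fraction)
    if k > 0:
        lo, hi = xs[k], xs[-k - 1]
        xs = [min(max(x, lo), hi) for x in xs]
    return statistics.fmean(xs)


# Two outliers in the upper tail, but a 10% cut only covers one of them.
data = [10, 11, 12, 12, 13, 13, 14, 15, 900, 950]

print(winsorized_mean(data, 0.1))  # 950 is clamped to 900, so the result is still about 190.1
print(winsorized_mean(data, 0.2))  # both are clamped to 15, giving about 13.3
```

So the fraction has to be at least as large as the share of contaminated points in each tail for the clamping to help.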
snicker7 over 1 year ago
Why would I prefer a winsorized mean over a median?
laughy over 1 year ago
A better alternative is to assume t-distributed errors.
croisillon over 1 year ago
not to be confused with Florida mean