I'm not sure I would feel comfortable using the Winsorized mean -- it doesn't have any particular statistical properties, and it lacks intuitive appeal because it's not clear what the value represents.<p>I can understand a line of logic that would give rise to something like the Winsorized mean -- after you look at your data, you see some obvious outliers. It feels dirty to just drop those values (which would lead to the truncated mean), because the information from an implausible value is more likely to lie near the extreme than near the central mass.<p>What to do with those extreme values?<p>Here's something I now want to experiment with -- bootstrapping the extreme values. Take note of the original empirical distribution. Then create a new distribution by removing the top and bottom X% of the observations and replacing them with values drawn i.i.d. from the original empirical distribution. This could lead to some values being replaced with the very outliers we originally wanted to drop. After we do this, record the mean. Then create new sample distributions until we have a distribution of new means. What I am curious about is how the shape of this distribution of means is affected by the X% value selected at the beginning.<p>What are some well-known distributions that appear to have outliers? A log-normal distribution, maybe?
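A rough sketch of that experiment, in case anyone wants to try it (Python/NumPy; the function name, the 5% trim fraction, and the log-normal test data are just placeholder choices of mine):

```python
import numpy as np

def bootstrap_extremes_mean(data, trim_frac, rng):
    """Replace the top and bottom trim_frac of observations with
    i.i.d. draws from the original empirical distribution, then
    return the mean of the modified sample."""
    data = np.sort(data)
    k = int(len(data) * trim_frac)
    if k == 0:
        return data.mean()
    kept = data[k:-k]
    # Replacements are drawn from the *full* original sample, so an
    # extreme value can re-enter the modified sample by chance.
    replacements = rng.choice(data, size=2 * k, replace=True)
    return np.concatenate([kept, replacements]).mean()

rng = np.random.default_rng(0)
# Example: a log-normal sample, which tends to show apparent outliers.
sample = rng.lognormal(mean=0.0, sigma=1.0, size=1_000)
means = [bootstrap_extremes_mean(sample, trim_frac=0.05, rng=rng)
         for _ in range(10_000)]
print(np.mean(means), np.std(means))
```

Repeating this over a range of trim fractions and plotting the resulting distributions of means would show how sensitive the shape is to the X% choice.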
The trimmed mean and Winsorized mean are both super useful in monitoring and metrics systems. In most cases you don't actually want the median, but you also don't want extreme outliers to throw everything off the way they do with a plain mean.<p>Both methods give you better metrics for period-over-period comparisons, like day over day or week over week.
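For what it's worth, both are essentially one-liners in SciPy. A toy example (the latency numbers and the 10% cut fraction are made up):

```python
import numpy as np
from scipy import stats
from scipy.stats.mstats import winsorize

latencies = np.array([12, 14, 13, 15, 14, 13, 900, 12, 14, 13])  # ms, made up

# Trimmed mean: drop the top and bottom 10% outright.
trimmed = stats.trim_mean(latencies, proportiontocut=0.1)

# Winsorized mean: clamp the top and bottom 10% to the nearest
# remaining values, then average as usual.
winsorized = winsorize(latencies, limits=(0.1, 0.1)).mean()

print(latencies.mean(), trimmed, winsorized)  # -> 102.0 13.5 13.5
```

Either one tracks the typical latency far better than the plain mean, which the single 900 ms sample drags up to 102 ms.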
I use it mainly as a data-exploration method. For example, if the Winsorized mean gives a value that differs a lot from the conventional mean, then I might examine the outliers in more detail with tools like a boxplot or a histogram.<p>The source of the data matters a lot in deciding which methods make sense. For example, hand-entered numbers might involve transposed digits, missing signs, or decimal points in the wrong place. Numbers derived from electronic measurements might have problems with values "pegging out" at some limit. In other cases, those numbers might "wrap around". Data that have been examined at an earlier stage might have numbers changed to something that is obviously wrong, like a temperature of -999.999. The list goes on.<p>My point is that exploring outliers is often quite productive, and comparing means to Winsorized means can be a very quick way to see whether outliers are an issue. This matters less for interactive work, where plotting the data is usually an early step, but it can come in handy during a preliminary stage of processing large datasets non-interactively. It can also be handy as part of a quality-control pipeline in a data stream.
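A sketch of that kind of quick check (the function name, the 5% clamp fraction, and the 1% relative threshold are all arbitrary choices of mine, not anyone's standard):

```python
import numpy as np
from scipy.stats.mstats import winsorize

def outliers_suspected(data, limits=0.05, rel_tol=0.01):
    """Hypothetical QC check: flag a batch for closer inspection when
    the Winsorized mean differs from the plain mean by more than
    rel_tol, relative to the Winsorized mean."""
    data = np.asarray(data, dtype=float)
    m = data.mean()
    wm = winsorize(data, limits=(limits, limits)).mean()
    return abs(m - wm) > rel_tol * max(abs(wm), 1e-12)

rng = np.random.default_rng(1)
batch = rng.normal(20.0, 0.1, size=50)  # fifty plausible sensor readings
batch[7] = -999.999                     # a sentinel value that slipped through

if outliers_suspected(batch):
    print("outliers suspected -- inspect with a histogram or boxplot")
```

Cheap enough to run on every batch in a non-interactive pipeline, and the batches it flags are the ones worth plotting.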
One of the fascinating things about statistics is that in some ways it's more an art than a science, and the question of "when would I choose to use this over a normal mean or median" is a great example of that.
A related example out in the wild:<p>Rust's `cargo bench` "winsorizes" the benchmark samples before computing summary statistics (incl. the mean).<p><a href="https://github.com/rust-lang/rust/blob/master/library/test/src/bench.rs#L150">https://github.com/rust-lang/rust/blob/master/library/test/s...</a>
There's some interesting discussion in this thread about truncated vs. Winsorized means. For my own part, this is the first time I've come across either of these terms.<p>I tend to benefit the most from seeing the entire distribution visually, which helps me decide whether I'm looking for a median, a "normal" mean, a "mean minus some weird outliers", or something else entirely.<p>Does anybody happen to know of a good visual guide to how different measures of central tendency behave on various distributions? Anything that emphasizes pathological cases would be helpful.