11 comments

I'm not sure I would feel comfortable using the Winsorized mean -- it doesn't have any particular statistical properties, and it lacks intuitive appeal because it's not clear what the value represents.

I can understand a line of logic that would give rise to something like the Winsorized mean -- after you look at your data, you see some obvious outliers. It feels dirty to just drop those values (which would lead to the truncated mean), because whatever information an implausible value carries is more likely to point toward the extreme than toward the center of mass.

What to do with those extreme values?

Here's something I now want to experiment with -- bootstrapping the extreme values. Take note of the original empirical distribution. Then create a new distribution by removing the top and bottom X% of the observations and replacing them with values drawn i.i.d. from the original empirical distribution. This could lead to some values being replaced with the very outliers that we originally wanted to drop. After we do this, record the mean. Then create new sample distributions until we have a distribution of new means. What I am curious about is how the shape of this distribution of means depends on the X% value selected at the beginning.

What are some well-known distributions that appear to have outliers? A log-normal distribution, maybe?
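For anyone who wants to try the experiment described above, here is a minimal sketch in Python using only the standard library. The function names (`replaced_tail_mean`, `mean_distribution`), the log-normal test sample, and the choice of 2000 replicates are all assumptions made for illustration, not something specified in the comment.

```python
import random
import statistics

def replaced_tail_mean(data, frac, rng):
    """Drop the lowest and highest `frac` of observations, replace them with
    i.i.d. draws from the full original sample, and return the mean."""
    xs = sorted(data)
    k = int(len(xs) * frac)              # observations replaced in each tail
    core = xs[k:len(xs) - k]             # middle observations are kept as-is
    redraws = rng.choices(xs, k=2 * k)   # may re-draw the original outliers
    return statistics.fmean(core + redraws)

def mean_distribution(data, frac, n_rep=2000, seed=0):
    """Repeat the replacement n_rep times and collect the resulting means."""
    rng = random.Random(seed)
    return [replaced_tail_mean(data, frac, rng) for _ in range(n_rep)]

if __name__ == "__main__":
    # Hypothetical test data: a log-normal sample, as suggested in the comment.
    rng = random.Random(42)
    sample = [rng.lognormvariate(0.0, 1.0) for _ in range(500)]
    for frac in (0.01, 0.05, 0.10, 0.25):
        means = mean_distribution(sample, frac)
        print(f"X={frac:.0%}  mean of means={statistics.fmean(means):.3f}  "
              f"sd={statistics.stdev(means):.3f}")
```

Plotting a histogram of `means` for each X would show how the shape changes; the printed summary is only a first look.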

jedberg over 1 year ago

The trimmed mean and Winsorized mean are both super useful in monitoring and metrics systems. In most cases you don't actually want the median, but you also don't want the extreme outliers to throw everything off, as they do with a plain mean.

Both methods give you better metrics for periodic comparisons, like day over day or week over week.
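To make the difference between the two concrete, here is a small Python sketch. The function names and the latency numbers are made up for illustration; `frac` is the fraction cut or clamped in each tail, rounded down to a whole number of observations.

```python
import statistics

def trimmed_mean(data, frac):
    """Mean after dropping the lowest and highest `frac` of observations."""
    xs = sorted(data)
    k = int(len(xs) * frac)
    if 2 * k >= len(xs):
        raise ValueError("frac too large for sample size")
    return statistics.fmean(xs[k:len(xs) - k])

def winsorized_mean(data, frac):
    """Mean after clamping the lowest and highest `frac` of observations
    to the nearest remaining value instead of dropping them."""
    xs = sorted(data)
    n = len(xs)
    k = int(n * frac)
    if 2 * k >= n:
        raise ValueError("frac too large for sample size")
    lo, hi = xs[k], xs[n - k - 1]
    return statistics.fmean([min(max(x, lo), hi) for x in xs])

if __name__ == "__main__":
    # Hypothetical request latencies in ms, with one pathological spike.
    latencies = [12, 13, 11, 14, 12, 13, 15, 12, 11, 950]
    print(f"mean       {statistics.fmean(latencies):.2f}")     # 106.30
    print(f"median     {statistics.median(latencies):.2f}")    # 12.50
    print(f"trimmed    {trimmed_mean(latencies, 0.1):.2f}")    # 12.75
    print(f"winsorized {winsorized_mean(latencies, 0.1):.2f}") # 12.80
```

In this toy sample the single spike drags the plain mean far from typical latency, while both robust variants stay close to the bulk of the data and still respond to shifts that the median would ignore.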

bluenose69 over 1 year ago

I use it mainly as a data-exploration method. For example, if the Winsorized mean gives a value that differs a lot from the conventional mean, then I might examine the outliers in a bit more detail with tools like a boxplot or a histogram.

The source of the data matters a lot in what methods make sense. For example, hand-entered numbers might involve transposed digits, or missing signs, or decimal points in the wrong place. Numbers deriving from some electronic measurements might have problems with numbers "pegging out" at some limit. In other cases, those numbers might "wrap around". Data that have been examined at an earlier stage might have numbers changed to something that is obviously wrong, like a temperature of -999.999 or something. The list goes on.

My point is that exploring outliers is often quite productive, and comparing means to Winsorized means can be a very quick way to see if outliers are an issue. This is not so much an issue for interactive work, for which plotting data is usually an early step, but it can come in handy during a preliminary stage of processing large datasets non-interactively. It can also be handy as part of a quality-control pipeline in a data stream.
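A rough sketch of what such a non-interactive check might look like in Python. The function name `outliers_suspected`, the 10% Winsorizing fraction, and the threshold `tol=0.5` are all arbitrary assumptions for illustration; a real pipeline would tune them to the data source.

```python
import statistics

def winsorized(data, frac):
    """Return a copy of data with each tail clamped to the nearest kept value."""
    xs = sorted(data)
    n = len(xs)
    k = int(n * frac)
    if 2 * k >= n:
        raise ValueError("frac too large for sample size")
    lo, hi = xs[k], xs[n - k - 1]
    return [min(max(x, lo), hi) for x in data]

def outliers_suspected(data, frac=0.1, tol=0.5):
    """Flag a batch when the mean and the Winsorized mean disagree by more
    than `tol` Winsorized standard deviations (an arbitrary yardstick)."""
    w = winsorized(data, frac)
    gap = abs(statistics.fmean(data) - statistics.fmean(w))
    scale = statistics.pstdev(w)
    if scale == 0.0:
        return gap > 0.0
    return gap / scale > tol

if __name__ == "__main__":
    # Hypothetical temperature batches; the second contains a -999.999 sentinel.
    clean = [20.1, 19.8, 20.4, 21.0, 20.7, 19.9, 20.3, 20.6, 20.2, 20.5]
    dirty = [20.1, 19.8, 20.4, 21.0, 20.7, 19.9, 20.3, 20.6, 20.2, -999.999]
    print(outliers_suspected(clean))  # False
    print(outliers_suspected(dirty))  # True
```

A flag here only says "go look at a boxplot or histogram"; it does not say which values are wrong or why.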

some_random over 1 year ago

One of the fascinating things about statistics is that in some ways it's more an art than a science, and the question of "when would I choose to use this over a normal mean or median" is a great example of that.

phlip9 over 1 year ago

A related example out in the wild: Rust's `cargo bench` "winsorizes" the benchmark samples before computing summary statistics (incl. the mean).

https://github.com/rust-lang/rust/blob/master/library/test/src/bench.rs#L150

SubiculumCode over 1 year ago

It can be a useful data cleaning method when used judiciously, but I'm surprised it's at the top of HN.

hornban over 1 year ago

There's some interesting discussion in this thread about truncated vs winsorized means. For my own part, this is the first time I've come across either of these terms.

I tend to benefit the most from seeing the entire distribution visually, and that helps me decide if I'm looking for a median, a "normal" mean, a "mean minus some weird outliers", or something different entirely.

Does anybody happen to know of a good visual guide for how different measures of central tendency apply to various distributions? Anything that emphasizes pathological cases is helpful.

thih9 over 1 year ago

It's all fun and games until one or two of the remaining extreme values (the ones later used to replace the rest) turn out to be outliers themselves.

snicker7 over 1 year ago

Why would I prefer a winsorized mean over a median?

laughy over 1 year ago

A better alternative is to assume t-distributed errors.
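One way to read this suggestion, sketched in Python: estimate the location by maximum likelihood under a Student-t error model, using the standard EM reweighting scheme. The fixed degrees of freedom `nu=4`, the iteration count, and the function name `t_location` are assumptions for illustration; the comment does not specify any of them.

```python
import statistics

def t_location(data, nu=4.0, n_iter=100):
    """Estimate location and scale assuming Student-t errors with fixed
    degrees of freedom `nu`, via EM (iteratively reweighted averaging)."""
    n = len(data)
    mu = statistics.median(data)                 # robust starting point
    sigma2 = statistics.pvariance(data) or 1.0   # avoid a zero start
    for _ in range(n_iter):
        # E-step: points far from mu (in units of sigma) get small weights.
        w = [(nu + 1.0) / (nu + (x - mu) ** 2 / sigma2) for x in data]
        # M-step: weighted mean and weighted variance.
        mu = sum(wi * x for wi, x in zip(w, data)) / sum(w)
        sigma2 = sum(wi * (x - mu) ** 2 for wi, x in zip(w, data)) / n
        if sigma2 == 0.0:                        # all points identical
            break
    return mu, sigma2 ** 0.5

if __name__ == "__main__":
    # Hypothetical sample with one gross outlier.
    sample = [12, 13, 11, 14, 12, 13, 15, 12, 11, 950]
    mu, sigma = t_location(sample)
    print(f"location={mu:.2f} scale={sigma:.2f}")
```

Unlike trimming or Winsorizing, nothing is cut at a fixed percentile: each point is downweighted smoothly according to how far it sits from the current estimate, and `nu` controls how aggressive that downweighting is.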

croisillon over 1 year ago

not to be confused with Florida mean