Other, related points:<p>- With heavy right tails, the sample mean (i.e. the number you can see) is very likely to underestimate the population mean.<p>- With heavy enough tails, higher moments like variance (and therefore standard deviation) do not exist at all -- they're infinite.<p>- Critically: With heavy tails, the central limit theorem breaks down in practice. When the variance is finite, sums of heavy-tailed samples converge to a normal distribution so slowly that it might not realistically ever happen with your finite data; when the variance is infinite, the classical CLT does not apply at all. Any computation you do that explicitly or implicitly relies on the CLT will give you junk results!
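To make the first point concrete, here is a rough simulation sketch. It assumes a Pareto distribution with tail index 1.2 (an arbitrary choice for illustration, not something from the article) and counts how often the mean of 100 draws lands below the true mean. The names `ALPHA`, `sample_mean` and `fraction_underestimating` are made up for this example.

```python
import random

ALPHA = 1.2                      # tail index, chosen only for illustration
TRUE_MEAN = ALPHA / (ALPHA - 1)  # mean of a Pareto with minimum value 1


def sample_mean(rng, n):
    """Mean of n draws from a Pareto(ALPHA) distribution with minimum 1."""
    return sum(rng.paretovariate(ALPHA) for _ in range(n)) / n


def fraction_underestimating(n=100, trials=2000, seed=0):
    """Share of simulated samples whose mean falls below the true mean."""
    rng = random.Random(seed)
    below = sum(sample_mean(rng, n) < TRUE_MEAN for _ in range(trials))
    return below / trials


if __name__ == "__main__":
    print(f"true mean: {TRUE_MEAN:.2f}")
    print(f"share of samples below it: {fraction_underestimating():.2f}")
```

If the claim holds, the printed share should come out well above one half: most samples miss the rare huge values that carry much of the mean.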
If the author is seeing this thread: I couldn't find an RSS feed for your site. I don't know if they are difficult to set up, but if it's very little effort, I would appreciate seeing what you post next :)<p>As for the wariness about the mean: a lot of people are much further behind than thinking about different distributions. Even something you assume is normally distributed needs a mean <i>and</i> a variance! As for visualising, histograms are incredibly underrated tools. You can infer a lot of information by just looking at a distribution.
The mean is misleading.
The median is misleading.
The mode is misleading.
Any reduction of a range of data to a single representative datum is misleading.<p>However, the fight back against providing something a bit more meaningful than a single value can sometimes be quite strong.<p>I try hard to provide software estimates as probability distributions, but when someone sees a line with a probability peak somewhere around two days (could be really simple), and then a wide hump somewhere around two weeks (if it's not simple, it will mean a significant rewrite), with a very low line between them and then a long, long tail off to several months, it is not well-received.<p>I can see their point; they're trying to plan things, and the whole system is set up to work with single numbers. If everyone provided probability graphs for their estimates, and we had a tool that could then combine them and deliver the net probability graph of the combined pieces, I expect they'd be a lot more amenable.
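A tool like the one imagined above could, as a very rough sketch, be a Monte Carlo simulation: draw one duration per task from its estimate distribution, add them up, and repeat. Everything here is an assumption for illustration: the 60/40 split between "simple" and "rewrite", the lognormal shapes, and the helper names `sample_task_days`, `simulate_total` and `percentile`.

```python
import math
import random


def sample_task_days(rng):
    """One draw from a hypothetical two-humped estimate (in working days)."""
    if rng.random() < 0.6:
        # "Really simple" case: median around 2 days, fairly narrow.
        return rng.lognormvariate(math.log(2), 0.3)
    # "Significant rewrite" case: median around 10 days, long right tail.
    return rng.lognormvariate(math.log(10), 0.6)


def simulate_total(n_tasks, n_trials, seed=0):
    """Sorted simulated totals for n_tasks independent tasks."""
    rng = random.Random(seed)
    totals = [
        sum(sample_task_days(rng) for _ in range(n_tasks))
        for _ in range(n_trials)
    ]
    return sorted(totals)


def percentile(sorted_xs, q):
    """Nearest-rank style percentile of an already sorted list (0 <= q <= 1)."""
    idx = min(len(sorted_xs) - 1, int(q * len(sorted_xs)))
    return sorted_xs[idx]


if __name__ == "__main__":
    totals = simulate_total(n_tasks=5, n_trials=2000)
    for q in (0.5, 0.9, 0.99):
        print(f"p{int(q * 100)}: {percentile(totals, q):.1f} days")
```

Reporting a few percentiles of the combined total gives planners single numbers to work with while still keeping the tail visible. Treating the tasks as independent is a simplification; real projects often have correlated overruns.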
Nassim Taleb greatly expands on this point in <i>Antifragile</i>. For a freely available, technical argument, check out <i>Doing Statistics Under Fat Tails</i>[1].<p>[1]: <a href="https://www.fooledbyrandomness.com/FatTails.html" rel="nofollow">https://www.fooledbyrandomness.com/FatTails.html</a>
This is a worthy posting, particularly as so much becomes iterative statistics in "A.I." clothing. The two old (slightly hackneyed) counter-examples which are popular in lectures about measures of the <i>central tendency</i> are:<p>- One is trying to get a sense of the common sort of income in a room and then Bill Gates wanders in. Suddenly the average income becomes an amount which <i>no one</i> experiences.<p>- What is the average number of testicles in the human population? That computed central tendency is quite rare.
I would quibble some here. When we look at revenue, I agree: ignore the mean. If there's a whole bunch of people not paying you anything, that's OK... Look at the 50th and 90th percentile.<p>But <i>profit</i>, and similarly <i>costs</i>? Your mean customer better be profitable, or you won't be. How much the people on the left of the graph <i>cost</i> you is <i>important</i>.<p>Part of this is definitional, too. Do you include that far left part of the graph where people are not really paying you as a "customer"?
The mean is not so bad for many purposes because it is an expectation value.<p>If you add up your revenue, subtract your expenses, and divide by the number of customers, that gives you a real profit number (conditional on how you define revenue & expenses). Whether that number is negative or positive is meaningful.<p>The median, on the other hand, has a different set of problems. If you are running a game like Fate Grand Order you'd better cultivate the guy who spends $70k because he has to "catch them all". The median player probably pays little or nothing. The guy who sells ero comics at Comiket complains about what it costs to get (say) Saber Bride, but it is worth more to him than it is to the median player.<p>Mean and median are terrible numbers to use for latency; what drives you nuts with your computer being unresponsive is not the median latency, but the 99th-percentile latency.
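A small sketch of the latency point, using made-up numbers: 980 requests at 20 ms and 20 requests at 2000 ms. The data and the `summarize` helper are hypothetical; `statistics.quantiles` is from the Python standard library (3.8+).

```python
import statistics


def summarize(latencies_ms):
    """Return mean, median and 99th percentile of a list of latencies."""
    # quantiles(..., n=100) returns 99 cut points; the last one is p99.
    cuts = statistics.quantiles(latencies_ms, n=100)
    return {
        "mean": statistics.mean(latencies_ms),
        "median": statistics.median(latencies_ms),
        "p99": cuts[98],
    }


# Hypothetical data: mostly fast requests plus a 2% slow tail.
latencies = [20.0] * 980 + [2000.0] * 20

if __name__ == "__main__":
    for name, value in summarize(latencies).items():
        print(f"{name}: {value:.1f} ms")
```

With data shaped like this, the median says nothing about the slow tail and the mean blends the two groups into a value no request actually had, while the 99th percentile lands on the stalls a user would notice.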
I have come across too many people who value the mean soooooo much in their analysis. Well, some of them made mistakes and the project died.
Hypothesis: heavy reliance on the mean increases the probability of failure in the internet industry.
This reminds me of PG's essay <i>mean people fail</i>:
<a href="http://www.paulgraham.com/mean.html" rel="nofollow">http://www.paulgraham.com/mean.html</a><p>Pun intended :)
The Iranian civilization can draw continuity to Susa, circa 3000BC, further than China. The Mesopotamian and Indian civilizations are older still but broke continuity.