Liked it better when it was called "Programmers Need To Learn Statistics Or I Will Kill Them All" <a href="http://zedshaw.com/essays/programmer_stats.html" rel="nofollow">http://zedshaw.com/essays/programmer_stats.html</a>, but still kind of wrong.<p>If you need to reduce a distribution to a single number, the most informative number is going to be the mean.<p>I understand their point about the 99th percentile, but consider that it's possible to improve the 99th percentile measure, while increasing the mean and degrading the performance of all but 1% of the users.<p>The real issue is reducing a distribution to one number.
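To make that concrete, here's a made-up sketch (Python, purely hypothetical numbers) of a change that improves the 99th percentile while making the mean, and the experience of 98% of users, worse:

    import numpy as np

    # Hypothetical latency samples (ms): the "fix" trims the slow tail but
    # makes the common case twice as slow.
    before = np.array([100] * 980 + [3000] * 20)   # 98% at 100 ms, 2% at 3000 ms
    after  = np.array([200] * 995 + [1000] * 5)    # 99.5% at 200 ms, 0.5% at 1000 ms

    for name, lat in (("before", before), ("after", after)):
        print(name, "p99 =", np.percentile(lat, 99), "mean =", lat.mean())

    # before p99 = 3000.0 mean = 158.0
    # after  p99 = 200.0  mean = 204.0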
Using FPS as a measure of your UI's performance is equally problematic. FPS is a great measurement for games, since performance dips usually occur over a span of many frames, but for UIs a lot of work tends to get concentrated into a single frame. A single frame that takes 110ms (or, heaven forbid, 500ms) to render won't move the needle on your FPS meter, but it will be instantly noticeable to the user.<p>I've complained about this before; use maximum frame delay [0] instead of FPS when measuring UI responsiveness.<p>[0] The maximum time elapsed between any two sequential frames during your test.
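For what it's worth, a minimal sketch of that metric (assumed input: a list of per-frame timestamps in milliseconds collected during a test run; the numbers here are made up):

    def max_frame_delay(frame_times):
        """Largest gap (ms) between any two consecutive frames."""
        return max(b - a for a, b in zip(frame_times, frame_times[1:]))

    def average_fps(frame_times):
        """Frames per second averaged over the whole run."""
        return 1000.0 * (len(frame_times) - 1) / (frame_times[-1] - frame_times[0])

    # Ten seconds at ~60 fps with a single 500 ms hitch in the middle.
    frames = [i * 16.7 for i in range(300)]
    frames += [frames[-1] + 500 + i * 16.7 for i in range(300)]

    print(round(average_fps(frames), 1))  # ~57 fps: the FPS meter barely moves
    print(max_frame_delay(frames))        # 500.0 ms: the hitch is obvious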
Why not break the numbers down more granularly, to something like 25%, 50%, 70%, 90%, 95%, 99%? (See the sketch below.)<p>Understanding where your users are on the curve is probably more interesting than a single number. Worrying about that last 1% really only makes a meaningful difference if your user base is huge enough that fixing something for 1% of your users can meaningfully move revenue.<p>Mentally, I try to think of the 80 or 90% of users with a similar experience, needs, etc., and make it better for them. In this case, speed is good for everybody, but I care very little about the needs of that last 1% if your customers are all paying the same. No sense in putting the needs of a small number of users ahead of the needs of a much larger set of users.
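A quick sketch of that breakdown (load_times_ms and its distribution are made up for illustration):

    import numpy as np

    load_times_ms = np.random.lognormal(mean=5.5, sigma=0.6, size=10_000)  # fake data

    for p in (25, 50, 70, 90, 95, 99):
        print(f"p{p}: {np.percentile(load_times_ms, p):.0f} ms")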
Big fan of Optimizely here. I've used their product with a handful of my clients.
The thing that struck me most about the article was how well it was written. Very engaging all the way through. That quality of writing is, I would say, quite rare.<p>Anyway, I'm really glad they've improved the load times for their snippet, because this has always been a genuine concern that needed resolving.
Sadly, it's hard to say anything without any values on the axes. Is the difference between the mean and the 99th percentile 5s or 20ms?
As someone said here, a threshold for "slow loading" should be defined before picking a metric for measuring it.
Take the first graph and draw a line where users start to whine about slow loading, then check how many fall below and above it. If the number of users below the threshold is sufficiently "greater" than the number above it, you shouldn't be so worried. How much greater counts as sufficient can be derived from the standard error of the threshold as measured from users.
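Roughly, something like this sketch (the data, the threshold, and its standard error are all hypothetical placeholders):

    import numpy as np

    load_times_ms = np.random.lognormal(mean=5.5, sigma=0.6, size=10_000)  # fake data
    threshold_ms, threshold_se = 1000, 150  # "users start whining" point and its standard error

    below = np.count_nonzero(load_times_ms <= threshold_ms)
    above = load_times_ms.size - below
    print(below, "users below the threshold,", above, "above")

    # Re-count against the low edge of the threshold estimate to stay conservative.
    below_conservative = np.count_nonzero(load_times_ms <= threshold_ms - threshold_se)
    print(below_conservative, "users below even the conservative threshold")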
Regarding mean vs 99% etc.: in this case all you care about is: did loading the script delay page rendering to an extent that was perceptible to the user? It's basically a step function. 99% is appropriate in this case.<p>Want to do it yourself? This talk by Etsy a few weeks ago has some detail on how they did a similar thing:<p><a href="http://www.slideshare.net/marcusbarczak/integrating-multiple-cdn-providers-at-etsy" rel="nofollow">http://www.slideshare.net/marcusbarczak/integrating-multiple...</a><p>Some links at the end of the talk. Infrastructure-wise, I think you have to be prepared to pay for some expensive DNS before this kind of thing is viable.
The problem with using the average for many performance stats is that it hides issues. There is a great paper on the topic - <a href="http://method-r.com/downloads/doc_details/44-thinking-clearly-about-performance" rel="nofollow">http://method-r.com/downloads/doc_details/44-thinking-clearl...</a><p>It is only about 13 pages, making it a quick but very informative read. I highly recommend it for anyone trying to measure performance, throughput, response time, efficiency, skew and load.
I think in Optimizely's case, the most important factor is making sure that there's no statistically significant correlation between higher response times and the A/B test itself. In other words, if the higher response times result in an imbalanced impact on the test, the test is invalid.
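One way to sanity-check that (a sketch, not Optimizely's actual method; the data here is simulated) is to compare the load-time distributions seen by each variant and test whether they differ:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    variant_a_ms = rng.lognormal(mean=5.5, sigma=0.6, size=5000)  # simulated load times
    variant_b_ms = rng.lognormal(mean=5.5, sigma=0.6, size=5000)

    # Mann-Whitney U test: does one variant tend to see slower loads than the other?
    stat, p_value = stats.mannwhitneyu(variant_a_ms, variant_b_ms)
    print(f"p = {p_value:.3f}")  # a small p-value suggests the snippet is skewing the test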
I've used ApDex [1] before to get a better measure of response times from the user-experience point of view.<p>1: <a href="http://apdex.org/" rel="nofollow">http://apdex.org/</a>
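For reference, the score itself is simple to compute; a minimal sketch with made-up numbers (T is the target response time: at or under T counts as "satisfied", between T and 4T counts as "tolerating" at half weight, beyond 4T counts as "frustrated"):

    def apdex(response_times_ms, target_ms):
        satisfied = sum(1 for t in response_times_ms if t <= target_ms)
        tolerating = sum(1 for t in response_times_ms if target_ms < t <= 4 * target_ms)
        return (satisfied + tolerating / 2) / len(response_times_ms)

    print(apdex([80, 120, 300, 450, 2500], target_ms=100))  # -> 0.4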