"This was not a statistically controlled study: the subjects took a free test online and of their own accord."<p>This is by far the most important point made in the article in The Economist, but so far it is little reflected in the comments here on HN. The reported results have some broad plausibility to me, as a native speaker of General American English who has lived overseas (including living in an international dormitory with residents from all over the world, who variously used English-as-a-second-language, French-as-a-first-or-second language, Spanish-as-a-first-language, or Chinese-as-a-second-language as interlanguages). But the reported results may or may not reflect the reality of the situation in the real world, as the editor of The Economist takes care to note.<p>It's time to dust of the electrons on my FAQ post on voluntary response polls.<p>VOLUNTARY RESPONSE POLLS<p>As I commented previously when we had a poll on the ages of HNers, the data can't be relied on to make such an inference (what the average age of HN participants is). That's because the data are not from a random sample of the relevant population. One professor of statistics, who is a co-author of a highly regarded AP statistics textbook, has tried to popularize the phrase that "voluntary response data are worthless" to go along with the phrase "correlation does not imply causation." Other statistics teachers are gradually picking up this phrase.<p>-----Original Message----- From: Paul Velleman [SMTPfv2@cornell.edu] Sent: Wednesday, January 14, 1998 5:10 PM To: apstat-l@etc.bc.ca; Kim Robinson Cc: mmbalach@mtu.edu Subject: Re: qualtiative study<p>Sorry Kim, but it just aint so. Voluntary response data are worthless. One excellent example is the books by Shere Hite. She collected many responses from biased lists with voluntary response and drew conclusions that are roundly contradicted by all responsible studies. She claimed to be doing only qualitative work, but what she got was just plain garbage. Another famous example is the Literary Digest "poll". All you learn from voluntary response is what is said by those who choose to respond. Unless the respondents are a substantially large fraction of the population, they are very likely to be a biased -- possibly a very biased -- subset. Anecdotes tell you nothing at all about the state of the world. They can't be "used only as a description" because they describe nothing but themselves.<p><a href="http://mathforum.org/kb/thread.jspa?threadID=194473&tstart=36420" rel="nofollow">http://mathforum.org/kb/thread.jspa?threadID=194473&tsta...</a><p>For more on the distinction between statistics and mathematics, see<p><a href="http://statland.org/MAAFIXED.PDF" rel="nofollow">http://statland.org/MAAFIXED.PDF</a><p>and<p><a href="http://escholarship.org/uc/item/6hb3k0nz" rel="nofollow">http://escholarship.org/uc/item/6hb3k0nz</a><p>I think Professor Velleman promotes "Voluntary response data are worthless" as a slogan for the same reason an earlier generation of statisticians taught their students the slogan "correlation does not imply causation." That's because common human cognitive errors run strongly in one direction on each issue, so the slogan has take the cognitive error head-on. 
Of course, a distinct pattern in voluntary responses tells us SOMETHING (maybe about what kind of people come forward to respond), just as a correlation tells us SOMETHING (maybe about a lurking variable correlated with both things we observe), but it doesn't tell us enough to warrant a firm conclusion about facts of the world. The Literary Digest poll intended to predict the 1936 presidential election in the United States

http://historymatters.gmu.edu/d/5168/

http://www.math.uah.edu/stat/data/LiteraryDigest.pdf

is a spectacular historical example of a voluntary response poll with a HUGE sample size (more than two million returned ballots) that didn't give a correct picture of reality at all.

When I have brought up this issue before, some other HNers have replied that there are statistical tools for correcting for response-bias effects, IF one can obtain a simple random sample of the population of interest and evaluate what kinds of people respond (a weighting sketch along those lines appears at the end of this comment). But we can't do that in the case being discussed here in this thread on HN.

Another reply I frequently see when I bring up this issue is that the public relies on voluntary response data all the time to draw conclusions about reality. To that I refer careful readers to what Professor Velleman is quoted as saying above (the general public often believes statements that are baloney) and to what Google's director of research, Peter Norvig, says about research conducted with better data,

http://norvig.com/experiment-design.html

namely that even good data (and Norvig would not generally characterize voluntary response data as good data) can lead to wrong conclusions if there isn't careful thinking behind a study design. Again, human beings have strong predilections to believe certain kinds of wrong data and wrong conclusions. We are not neutral evaluators of data and conclusions; we have predispositions (cognitive illusions) that lead to mistakes without careful training and thought.

Another frequent reply is that a "convenience sample" (the common term among statisticians for a sample that can't be counted on to be random) offers just that, convenience, and should not be rejected on that basis alone. But the most thoughtful version of that reply I have seen correctly pointed out that if we know from the start that the sample was not drawn in a statistically sound way, then even if we are confident (enough) that the Norwegians who responded to an online test are reasonably fluent in English, we should not extrapolate to the conclusion that any particular social or educational factor present in Norway provides an advantage in learning English, or that Panamanians are on average among the least fluent speakers of English in the world.

On my part, I wildly guess that most western Europeans who have completed secondary education in the last three decades are moderately fluent in English, if only because they have occasion to use English as an interlanguage when speaking with other Europeans (something I have seen happen many, many times) and because they have much exposure to English-language media (books, movies, radio broadcasts, TV shows).
People who live in Latin America and travel to neighboring countries (including Brazil) have considerably more occasions to use Spanish as an interlanguage, even with native speakers of Portuguese, and thus somewhat less occasion to keep their English in practice.

The way to know which social, educational, or economic factor is most important in the spread of English as the global interlanguage would be to do an even more careful study than the interesting preliminary study reported here. Meanwhile, we will be trading anecdotes based on personal experience, which I will read with interest to supplement my own.
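And here is the weighting sketch promised above: a minimal post-stratification example in Python. Every stratum name, population share, respondent count, and score below is invented for illustration. Note that it assumes the very thing a self-selected online test denies us: known population shares for each stratum.

    # Post-stratification sketch; all numbers are hypothetical.
    # stratum: (population share, respondents, mean test score)
    strata = {
        "university_students": (0.10, 600, 82.0),
        "working_adults":      (0.55, 300, 64.0),
        "retirees":            (0.35, 100, 51.0),
    }

    total_respondents = sum(n for _, n, _ in strata.values())

    # Naive mean: dominated by whichever groups chose to respond.
    naive = sum(n * mean for _, n, mean in strata.values()) / total_respondents

    # Post-stratified mean: weight each stratum by its known population share.
    weighted = sum(share * mean for share, _, mean in strata.values())

    print(f"unweighted mean of respondents: {naive:.1f}")
    print(f"post-stratified mean:           {weighted:.1f}")

    # Even this repair works only across strata we can measure: if the
    # volunteers within a stratum differ from its non-volunteers, the
    # bias remains -- and the population shares themselves are exactly
    # what a self-selected online test cannot tell us.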