Simpson’s Paradox (2016)

370 points by mromnia about 6 years ago

20 comments

mcguire about 6 years ago
I'd like to say that the author has been reading The Book of Why, but it seems that he hasn't, because he missed the punch line of the section on the paradox: you need a causal model to separate the two branches of the paradox. It's as easy to construct examples where the overall view is correct as it is to construct examples where the separate views are.
knappa about 6 years ago
The sex-discrimination lawsuit against UC Berkeley seems to be a kind of academic urban myth; the administration was apparently afraid of such a lawsuit and the study was done in response to those administrative fears.
gok about 6 years ago
The last example of software optimization causing mean slowdown because users actually use the software is so true. Another example I've seen is better ML models causing accuracy to go down; users try harder things.
freddex about 6 years ago
I like the way this is written. Very clear and to the point, with a tone of "Hey, check out this cool thing".
IngoBlechschmid about 6 years ago
An explorable explanation of Simpson's Paradox, neatly complementing the article, is here: https://pwacker.com/simpson.html
currymj about 6 years ago
Judea Pearl's explanations of this in terms of causality are the only way it really makes sense, in my view.

https://ftp.cs.ucla.edu/pub/stat_ser/r414.pdf
YeGoblynQueenne about 6 years ago
So, this is the data that the Wikipedia page on Simpson's Paradox cites for the Berkeley study, and that the author of the article has quoted:

    Department   Men                  Women
                 Applied   Admitted   Applied   Admitted
    A            [825]     62%        108       [82%]
    B            [560]     63%        25        [68%]
    C            325       [37%]      [593]     34%
    D            [417]     33%        375       [35%]
    E            191       [28%]      [393]     24%
    F            [373]     6%         341       [7%]

Above, I've bracketed in each pair of columns a) the sex with the most applicants and b) the sex with the highest admission rate, in a department. If that data is really the Berkeley data, then it's clear that the bias is against the sex with the most applicants, rather than either men or women.

I can propose a mechanism for this kind of (with some abuse of terminology) selection bias. A department accepts some applications, then realises it has admitted too many applicants of one sex and starts rejecting applicants from the dominant sex in an attempt to redress the balance. They make a mess of it and end up biased too far in the opposite direction from where they originally started.

Also note that in 4 out of 6 departments, more men applied than women, explaining why more departments appear biased against men (provided my observation holds).

However, I can't be sure whether this is actually the original data because it's nowhere to be found in my PDF copy of the study (Sex bias in graduate admission), which I believe I got from here: https://homepage.stat.uiowa.edu/~mbognar/1030/Bickel-Berkeley.pdf. If anyone knows where this data actually comes from, I'd welcome a pointer.
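
As a rough check of the reversal in the table above, the pooled admission rates can be recomputed from the per-department figures. This is only a sketch: the admitted counts are reconstructed from the rounded percentages in the comment, so the pooled values are approximate, and the names `overall_rate` and `women_ahead` are made up for the illustration.

```python
# Figures copied from the table in the comment: (applicants, admission rate).
# Admitted counts are rebuilt from rounded percentages, so they are approximate.
men = {"A": (825, 0.62), "B": (560, 0.63), "C": (325, 0.37),
       "D": (417, 0.33), "E": (191, 0.28), "F": (373, 0.06)}
women = {"A": (108, 0.82), "B": (25, 0.68), "C": (593, 0.34),
         "D": (375, 0.35), "E": (393, 0.24), "F": (341, 0.07)}


def overall_rate(groups):
    """Pooled admission rate: total admitted divided by total applicants."""
    applied = sum(n for n, _ in groups.values())
    admitted = sum(n * rate for n, rate in groups.values())
    return admitted / applied


men_overall = overall_rate(men)
women_overall = overall_rate(women)

# Departments where the women's admission rate exceeds the men's.
women_ahead = [dept for dept in men if women[dept][1] > men[dept][1]]

print(f"pooled rate, men:   {men_overall:.1%}")
print(f"pooled rate, women: {women_overall:.1%}")
print("departments where the women's rate is higher:", women_ahead)
```

With these figures the pooled rate comes out higher for men, even though women have the higher rate in four of the six departments. Most women applied to departments C through F, where admission rates are low for everyone, and that weighting drives the pooled number.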
srean about 6 years ago
Simpson's Paradox is one of the many phenomena that show how different applied ML is from regular software engineering. Another one is feedback loops between decomposed subproblems.

In ML, encapsulation (shielding away inner details) often does not work. One needs to know what is happening on the other side of the abstraction boundary. This is a problem for managers and PMs coming to ML from a purely software engineering background. They are used to encapsulation and decomposition serving them well and they expect the same.
jzl about 6 years ago
Observation #2: the paradox is essentially describing statistical gerrymandering. :)
esquire_900 about 6 years ago
This is the exact feeling I've been having for years, nicely described in easy-to-understand language. At least in data science and (god forbid) behavioral psychology, you can answer any question any way you like, with statistical validity, by slightly shifting the level of focus (as described here), the definitions, or the angle of attack. The more data, the easier.

Thanks for putting it in such a clear way :)
throway88989898 about 6 years ago
Neatly phrased:

Trends which appear in slices of data may disappear or reverse when the groups are combined.
sopooneo about 6 years ago
In simple cases at least, such as with the kidney stones, can we reduce our risk of reaching wrong conclusions by increasing our sample size of patients and randomizing which patients receive each treatment?
jzl about 6 years ago
Cool article. My knowledge of statistics is really rusty, but isn't this another way of approaching the topic of "Bayesian Thinking"? If you think about the scenarios in the article from the standpoint of predicting any given outcome in advance, male vs. female and hard department vs. easy department should be treated as "priors". Or to put it another way, Bayesian thinking means asking the question "What is the chance of X happening given Y?"

A nice intro to the topic: https://betterexplained.com/articles/an-intuitive-and-short-explanation-of-bayes-theorem/

Which explains why a positive test on a mammogram means you only have an 8% chance of having breast cancer:

> The chance of getting a real, positive result is .008. The chance of getting any type of positive result is the chance of a true positive plus the chance of a false positive (.008 + 0.09504 = .10304).

> So, our chance of cancer is .008/.10304 = 0.0776, or about 7.8%.

> Interesting: a positive mammogram only means you have a 7.8% chance of cancer, rather than 80% (the supposed accuracy of the test). It might seem strange at first but it makes sense: the test gives a false positive 9.6% of the time (quite high), so there will be many false positives in a given population. For a rare disease, most of the positive test results will be wrong.
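
The arithmetic in the quoted passage can be reproduced in a few lines. This is a minimal sketch, assuming a 1% prior and 80% sensitivity: the prior is not stated in the quote but is inferred from .008 = 0.01 × 0.80, while the 9.6% false positive rate is quoted directly. The function name `posterior` is made up for the example.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(disease | positive test) via Bayes' theorem."""
    true_pos = prior * sensitivity                  # 0.01 * 0.80  = 0.008
    false_pos = (1 - prior) * false_positive_rate   # 0.99 * 0.096 = 0.09504
    return true_pos / (true_pos + false_pos)


# Assumed inputs: the 1% prior is inferred from the quoted .008, not stated outright.
p = posterior(prior=0.01, sensitivity=0.80, false_positive_rate=0.096)
print(f"chance of cancer given a positive test: {p:.4f}")
```

This prints a value of about 0.0776, matching the figure in the quote.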
TicklishTiger about 6 years ago
That is not a paradox. It's just the fact that a theory about something might not hold when you take a closer look at that something.

In the article's example, the admission rates of a university seemed to indicate that there is a bias against women.

Zooming in and looking at the admission rates of the individual departments seems to indicate that there is a bias against men.

The article makes it sound like the first theory was wrong, and the second theory (the bias against men) is the real truth.

Zooming in further might indicate the opposite again.

Take two boxers. So far, one of them has won 86% of his fights and the other one has won 100%. According to the article, "The data is clear".

Now we add more data:

One fighter is Mike Tyson. He won 50 of his 58 fights. The other one is me. I did one fight in kindergarten and won it. But to be honest: I would not want to fight Tyson. As paradoxical as it sounds.
air7 about 6 years ago
This is one of my favorite paradoxes too. Here's why:

"... given the same table, one should sometimes follow the partitioned and sometimes the aggregated data, depending on the story behind the data, with each story dictating its own choice. Pearl considers this to be the real paradox behind Simpson's reversal." [0]

[0] https://en.wikipedia.org/wiki/Simpson%27s_paradox
emmelaich about 6 years ago
I sometimes wonder why people expect there to be any fixed, categorical semantic relationship between any set of numbers and set of natural language statements.

Very rarely do the words or the numbers cover even a tiny amount of the possible interpretations.
_bxg1 about 6 years ago
This is basically how gerrymandering works, isn't it?
jdhzzz about 6 years ago
I am reminded of this XKCD comic: https://xkcd.com/2080/
clircle about 6 years ago
IIRC, you can guard against Simpson's paradox by designing/collecting balanced data.
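
One way to see why balance helps is to pool per-group success rates under different case mixes. The sketch below uses made-up rates for two treatments, "A" and "B", and two severity groups; the numbers and the helper `pooled_rate` are assumptions for illustration, not data from the article. It only covers balance on a grouping variable that was actually measured.

```python
# Hypothetical per-group success rates: treatment A is better in both groups.
success = {
    "A": {"mild": 0.93, "severe": 0.73},
    "B": {"mild": 0.87, "severe": 0.69},
}


def pooled_rate(rates, share_severe):
    """Overall success rate when a fraction `share_severe` of cases is severe."""
    return (1 - share_severe) * rates["mild"] + share_severe * rates["severe"]


# Unbalanced data: A is given mostly severe cases, B mostly mild ones.
unbalanced_a = pooled_rate(success["A"], share_severe=0.75)
unbalanced_b = pooled_rate(success["B"], share_severe=0.25)

# Balanced data: both treatments see the same mix of cases.
balanced_a = pooled_rate(success["A"], share_severe=0.5)
balanced_b = pooled_rate(success["B"], share_severe=0.5)

print(f"unbalanced: A={unbalanced_a:.3f}  B={unbalanced_b:.3f}")
print(f"balanced:   A={balanced_a:.3f}  B={balanced_b:.3f}")
```

With the unbalanced mix, B looks better overall despite losing in both groups. With the same mix for both treatments, the pooled comparison agrees with the per-group comparison, because both pooled rates are the same weighted average of the group rates.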
lettergram about 6 years ago
Idk why the 2016 needs to be in the title here. I understand it for date-relevant content, but this is not.