Bayes is guaranteed to overfit

157 points by Ambolia almost 2 years ago

13 comments

radford-neal almost 2 years ago
As the author admits at the end, this is rather misleading. In normal usage, "overfit" is by definition a bad thing (it wouldn't be "over" if it was good). And the argument given does nothing to show that Bayesian inference is doing anything bad.

To take a trivial example, suppose you have a uniform(0,1) prior for the probability of a coin landing heads. Integrating over this gives a probability for heads of 1/2. You flip the coin once, and it lands heads. If you integrate over the posterior given this observation, you'll find that the probability of the value in the observation, which is heads, is now 2/3, greater than it was under the prior.

And that's OVERFITTING, according to the definition in the blog post.

Not according to any sensible definition, however.
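A minimal numeric sketch of that coin example (assuming the conjugate Beta-Bernoulli setup the comment describes, where the Uniform(0,1) prior is Beta(1,1)), just to make the jump from 1/2 to 2/3 concrete:

    from fractions import Fraction

    # Uniform(0,1) = Beta(1,1) prior on the heads probability theta.
    a, b = Fraction(1), Fraction(1)

    # Prior predictive probability of heads: E[theta] = a / (a + b).
    prior_pred_heads = a / (a + b)                 # 1/2

    # Observe one flip landing heads -> posterior is Beta(a + 1, b) = Beta(2, 1).
    a_post, b_post = a + 1, b

    # Posterior predictive probability of heads: E[theta | heads].
    post_pred_heads = a_post / (a_post + b_post)   # 2/3

    print(prior_pred_heads, post_pred_heads)       # 1/2 2/3

By the post's definition, the observed outcome becoming more probable after being observed is already "overfitting", which is the comment's point.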
CrazyStat almost 2 years ago
I'm on my phone so I haven't tried to work through the math to see where the error is, but the author's conclusion is wrong and the counterexample is simple.

> Unless in degenerating cases (the posterior density is point mass), then the harmonic mean inequality guarantees a strict inequality p(y_i | y_{-i}) < p(y_i | y), for any point i and any model.

Let y_1, ..., y_n be iid from a Uniform(0, theta) distribution, with some nice prior on theta (e.g. Exponential(1)). Then the posterior for theta, and hence the predictive density for a new y_i, depends only on max(y_1, ..., y_n). So for all but one of the n observations the author's strict inequality does not hold.
syntaxing almost 2 years ago
The author mentions he defines overfit as "test error is always larger than training error". Is there an algorithm or model where that's not the case?
to-mi almost 2 years ago
It seems that the post is comparing a predictive distribution conditioned on N data points to one conditioned on N-1 data points. The latter is a biased estimate of the former (e.g., https://users.aalto.fi/~ave/publications/VehtariLampinen_NC2002_preprint.pdf)
MontyCarloHall almost 2 years ago
I don't follow the math. WLOG, for N total datapoints, let y_i = y_N. Then the leave-one-out posterior predictive is

    \int p(y_N|θ) p(θ|{y_1...y_{N-1}}) dθ = p(y_N|{y_1...y_{N-1}})

by the law of total probability.

Expanding the leave-one-out posterior (via Bayes' rule), we have

    p(θ|{y_1...y_{N-1}}) = p({y_1...y_{N-1}}|θ) p(θ) / \int p({y_1...y_{N-1}}|θ') p(θ') dθ'

which when plugged back into the first equation is

    \int p(y_N|θ) p({y_1...y_{N-1}}|θ) p(θ) dθ / (\int p({y_1...y_{N-1}}|θ') p(θ') dθ')

I don't see how this simplifies to the harmonic mean expression in the post.

Regardless, the author is asserting that

    p(y_N|{y_1...y_{N-1}}) ≤ p(y_N|{y_1...y_N})

which seems intuitively plausible for any trained model — given a model trained on data {y_1...y_N}, performing inference on any datapoint y_1...y_N in the training set will generally be more accurate than performing inference on a datapoint y_{N+1} not in the training set.
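For what it's worth, here is a sketch of the standard leave-one-out identity, which I assume is where the post's harmonic-mean expression comes from (it requires the observations to be conditionally independent given θ). Bayes' rule applied to the single point y_i gives

    p(\theta \mid y_{-i}) \;=\; \frac{p(\theta \mid y)\, p(y_i \mid y_{-i})}{p(y_i \mid \theta)}
    \quad\Longrightarrow\quad
    p(y_i \mid y_{-i}) \;=\; \left[ \int \frac{p(\theta \mid y)}{p(y_i \mid \theta)}\, d\theta \right]^{-1}

where the implication follows by integrating the first equation over θ (the left side integrates to 1). So the leave-one-out predictive is a harmonic mean of p(y_i|θ) under the full-data posterior, and Jensen's inequality gives

    p(y_i \mid y_{-i}) \;=\; \Big( \mathbb{E}_{\theta \mid y}\big[\, p(y_i \mid \theta)^{-1} \big] \Big)^{-1}
    \;\le\; \mathbb{E}_{\theta \mid y}\big[\, p(y_i \mid \theta) \big] \;=\; p(y_i \mid y)

with equality only when p(y_i|θ) is constant over the posterior, which appears to be exactly the inequality the post asserts.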
alexmolas almost 2 years ago
I got lost in the second equation, when the author says

    p(y_i|y_{-i}) = \int p(y_i|\theta)\, p(\theta|y)\, \frac{p(y_i|\theta)^{-1}}{\int p(y_i|\theta')^{-1}\, p(\theta'|y)\, d\theta'}\, d\theta

Why is that? Can someone explain the rationale behind this?
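As a sanity check on that expression (assuming it is the leave-one-out identity sketched above), here is a small conjugate Beta-Bernoulli example where everything has a closed form; the harmonic-mean form reproduces p(y_i|y_{-i}) exactly and is strictly smaller than the within-sample p(y_i|y):

    from fractions import Fraction

    # Three coin flips; leave out the first one (y_1 = 1 = heads).
    y = [1, 1, 0]
    a0, b0 = Fraction(1), Fraction(1)                       # Beta(1,1) prior

    # Full-data posterior Beta(a, b) and leave-one-out posterior Beta(a_, b_).
    a,  b  = a0 + sum(y),     b0 + len(y) - sum(y)          # Beta(3, 2)
    a_, b_ = a0 + sum(y[1:]), b0 + len(y[1:]) - sum(y[1:])  # Beta(2, 2)

    # Exact leave-one-out predictive for y_1 = 1: posterior mean of Beta(2, 2).
    loo_exact = a_ / (a_ + b_)                              # 1/2

    # Harmonic-mean form: 1 / E_post[1/theta], with E_Beta(a,b)[1/theta] = (a+b-1)/(a-1).
    loo_harmonic = 1 / ((a + b - 1) / (a - 1))              # 1/2

    # Full-data (within-sample) predictive for y_1 = 1: posterior mean of Beta(3, 2).
    full_pred = a / (a + b)                                 # 3/5

    print(loo_exact, loo_harmonic, full_pred)               # 1/2 1/2 3/5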
vervez almost 2 years ago
Here's a good recent paper that looks at this problem and provides remedies in a Bayesian manner: https://arxiv.org/abs/2202.11678
joshjob42 almost 2 years ago
This is a pretty silly blog post. The complaint comes down to "a Bayesian model will always place a lower probability on outcomes that have not been observed than outcomes which have been observed already", which... of course it would! In what situation where you're trying to understand an exchangeable set of outcomes would you think you should put more probability on things that you haven't seen than those you have? The only things I can dream up violate exchangeability (e.g., a finite bag of different color marbles, where you draw without replacement).
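A toy version of the marbles case (hypothetical counts made up for illustration), showing how drawing without replacement breaks exchangeability, so an already-observed color becomes less probable rather than more:

    from collections import Counter

    # Hypothetical bag: 1 red marble and 3 blue marbles, drawn without replacement.
    bag = Counter(red=1, blue=3)

    def p(color, bag):
        """Probability of drawing `color` from the current bag."""
        total = sum(bag.values())
        return bag[color] / total

    print(p("red", bag))   # 0.25 before any draws

    # Draw the red marble; without replacement it leaves the bag.
    bag["red"] -= 1
    print(p("red", bag))   # 0.0 -- the observed outcome got *less* likely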
bbminner almost 2 years ago
So the argument is essentially that "not only if you pick the best thing fitting your finite data, but even if you take a weighted average over things that fit your finite data, proportionally to how well they fit your finite data, you will still almost surely end up with something that fits your finite sample better than the general population that this sample was drawn from"?
psyklic almost 2 years ago
Typically, we think of overfitting and underfitting as exclusive properties. IMO a large problem here is that the author's definition of overfitting does not exclude underfitting. (Underfitting generally indicates a poor fit on both the training and test sets.)

For example, a simple model might underfit in general, but it may still fit the training set better than the test set. If this happens yet both are poor fits, it is clearly underfitting and not overfitting. Yet by the article's definition, it would be both underfitting and overfitting simultaneously. So, I suspect this is not an ideal definition.
dmurray almost 2 years ago
Am I missing something or is this argument only as strong as the (perfectly reasonable) claim that all models overfit unless they have a regularisation term?
chunsj almost 2 years ago
We can safely say that Bayes is guaranteed to underfit compared to MLE.
tesdinger almost 2 years ago
Bayesian statistics is dumb. What's the point of using prior assumptions that are based on speculation? It's better to admit your lack of knowledge, not jump to conclusions without sufficient data, and process data in an unbiased manner.