As the author admits at the end, this is rather misleading. In normal usage, "overfit" is by definition a bad thing (it wouldn't be "over" if it were good). And the argument given does nothing to show that Bayesian inference is doing anything bad.<p>To take a trivial example, suppose you have a uniform(0,1) prior for the probability of a coin landing heads. Integrating over this gives a probability for heads of 1/2. You flip the coin once, and it lands heads. If you integrate over the posterior given this observation, you'll find that the predictive probability of the observed value, heads, is now 2/3, greater than it was under the prior.<p>And that's OVERFITTING, according to the definition in the blog post.<p>Not according to any sensible definition, however.
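For concreteness, here is a minimal sketch of that arithmetic (standard Beta-Bernoulli conjugacy; nothing beyond the numbers above):<p><pre><code> from fractions import Fraction
 
 # Uniform(0,1) prior on the heads probability is Beta(1,1).
 a, b = Fraction(1), Fraction(1)
 prior_pred = a / (a + b)                # prior predictive P(heads) = 1/2
 a_post, b_post = a + 1, b               # observe one head -> posterior Beta(2,1)
 post_pred = a_post / (a_post + b_post)  # posterior predictive P(heads) = 2/3
 print(prior_pred, post_pred)            # 1/2 2/3
</code></pre>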
I'm on my phone so I haven't tried to work through the math to see where the error is, but the author's conclusion is wrong and the counterexample is simple.<p>> Unless in degenerating cases (the posterior density is point mass), then the harmonic mean inequality guarantees a strict inequality p(y_i|y_{-i}) < p(y_i|y), for any point i and any model.<p>Let y_1, ... y_n be iid from a Uniform(0,theta) distribution, with some nice prior on theta (e.g. Exponential(1)). Then the posterior for theta, and hence the predictive density for a new y_i, depends only on max(y_1, ..., y_n). So for all but one of the n observations the author's strict inequality does not hold.
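For anyone with a laptop handy, here is a quadrature sketch of that setup (my own numbers: an Exponential(1) prior and n = 10 draws) that computes both densities for a non-maximum point, so the claim can be checked directly:<p><pre><code> import numpy as np
 from scipy import integrate
 
 rng = np.random.default_rng(0)
 y = rng.uniform(0, 2.0, size=10)   # data from Uniform(0, theta_true), theta_true = 2
 
 def post_pred(y_i, data):
     # p(y_i | data) under y ~ Uniform(0, theta), theta ~ Exponential(1):
     # posterior density is proportional to theta^(-n) * exp(-theta) on theta >= max(data)
     n, m = len(data), max(data)
     num, _ = integrate.quad(lambda t: t**(-(n + 1)) * np.exp(-t), max(m, y_i), np.inf)
     den, _ = integrate.quad(lambda t: t**(-n) * np.exp(-t), m, np.inf)
     return num / den
 
 i = int(np.argmin(y))                     # a non-maximum observation
 p_full = post_pred(y[i], y)               # p(y_i | y)
 p_loo = post_pred(y[i], np.delete(y, i))  # p(y_i | y_{-i})
 print(p_loo, p_full, p_loo < p_full)
</code></pre>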
The author mentions that he defines overfit as “Test error is always larger than training error”. Is there an algorithm or model where that’s not the case?
It seems that the post is comparing a predictive distribution conditioned on N data points to one conditioned on N-1 data points. The latter is a biased estimate of the former (e.g., <a href="https://users.aalto.fi/~ave/publications/VehtariLampinen_NC2002_preprint.pdf" rel="nofollow">https://users.aalto.fi/~ave/publications/VehtariLampinen_NC2...</a>)
I don't follow the math. WLOG, for N total datapoints, let y_i = y_N. Then the leave-one-out posterior predictive is<p><pre><code> \int p(y_N|θ)p(θ|{y_1...y_{N-1}}) dθ = p(y_N|{y_1...y_{N-1}})
</code></pre>
by the law of total probability.<p>Expanding the leave-one-out posterior (via Bayes' rule), we have<p><pre><code> p(θ|{y_1...y_{N-1}}) = p({y_1...y_{N-1}}|θ)p(θ)/\int p({y_1...y_{N-1}}|θ')p(θ') dθ'
</code></pre>
which when plugged back into the first equation is<p><pre><code> \int p(y_N|θ) p({y_1...y_{N-1}}|θ)p(θ) dθ/(\int p({y_1...y_{N-1}}|θ')p(θ') dθ')
</code></pre>
I don't see how this simplifies to the harmonic mean expression in the post.<p>Regardless, the author is asserting that<p><pre><code> p(y_N|{y_1...y_{N-1}}) ≤ p(y_N|{y_1...y_N})
</code></pre>
which seems intuitively plausible for any trained model: given a model trained on data {y_1...y_N}, performing inference on any of the training points y_1...y_N will generally be more accurate than performing inference on a datapoint y_{N+1} not in the training set.
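In the conjugate Beta-Bernoulli case both sides of that inequality are available in closed form, so the comparison can be sanity-checked directly (a sketch with made-up counts and a Beta(1,1) prior):<p><pre><code> # N flips, h of them heads; take y_N to be one of the heads.
 N, h = 10, 7
 # y_N held out: Beta(1,1) prior + (h-1) heads in (N-1) flips -> Beta(h, N-h+1)
 p_loo = h / (N + 1)
 # y_N included: Beta(1,1) prior + h heads in N flips -> Beta(h+1, N-h+1)
 p_full = (h + 1) / (N + 2)
 print(p_loo, p_full, p_loo < p_full)   # 0.636..., 0.666..., True
</code></pre>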
I got lost in the second equation, when the author says<p>p(y_i|y_{-i}) = \int p(y_i|\theta) p(\theta|y) \frac{p(y_i|\theta)^{-1}}{\int p(y_i|\theta^\prime)^{-1} p(\theta^\prime|y) d\theta^\prime} d\theta<p>why is that? Can someone explain the rationale behind this?
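One derivation that seems to match the post's identity (a sketch, assuming y_i is conditionally independent of the other observations given \theta):<p><pre><code> % Bayes' rule on the held-out point:
 p(\theta|y) = \frac{p(y_i|\theta) p(\theta|y_{-i})}{p(y_i|y_{-i})}
 % rearranged:
 p(\theta|y_{-i}) = p(y_i|y_{-i}) \frac{p(\theta|y)}{p(y_i|\theta)}
 % integrate both sides over \theta; the left side integrates to 1:
 1 = p(y_i|y_{-i}) \int \frac{p(\theta|y)}{p(y_i|\theta)} d\theta
 % hence the harmonic-mean form:
 p(y_i|y_{-i}) = \left[ \int p(y_i|\theta)^{-1} p(\theta|y) d\theta \right]^{-1}
</code></pre>
The quoted expression is the same thing: the p(y_i|\theta) and p(y_i|\theta)^{-1} factors cancel, leaving 1 over \int p(y_i|\theta^\prime)^{-1} p(\theta^\prime|y) d\theta^\prime. Since p(y_i|y) = \int p(y_i|\theta) p(\theta|y) d\theta is the arithmetic posterior mean of p(y_i|\theta) and p(y_i|y_{-i}) is its harmonic posterior mean, the AM-HM (Jensen) inequality gives the post's strict inequality unless p(y_i|\theta) is constant over the posterior.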
Here's a good recent paper that looks at this problem and provides remedies in a Bayesian manner. <a href="https://arxiv.org/abs/2202.11678" rel="nofollow">https://arxiv.org/abs/2202.11678</a>
This is a pretty silly blog post. The complaint comes down to "a Bayesian model will always place a lower probability on outcomes that have not been observed than on outcomes which have been observed already", which... of course it would! In what situation, when you're trying to understand an exchangeable set of outcomes, would you think you should put more probability on things that you haven't seen than on those you have? The only things I can dream up violate exchangeability (e.g. a finite bag of different-colored marbles, where you draw without replacement).
So the argument is essentially that "not only if you pick the best thing fitting your finite data, but even if you take a weighted average over things that fit your finite data, in proportion to how well they fit your finite data - you still almost surely end up with something that fits your finite sample better than the general population (that this sample was drawn from)"?
Typically, we think of overfitting and underfitting as exclusive properties. IMO a large problem here is that the author's definition of overfitting does not exclude underfitting. (Underfitting indicates a poor fit on both the training and test sets, in general.)<p>For example, a simple model might underfit in general, but it may still fit the training set better than the test set. If this happens yet both are poor fits, it is clearly underfitting and not overfitting. Yet by the article's definition, it would be both underfitting and overfitting simultaneously. So, I suspect this is not an ideal definition.
Am I missing something or is this argument only as strong as the (perfectly reasonable) claim that all models overfit unless they have a regularisation term?
Bayesian statistics is dumb. What's the point of using prior assumptions that are based on speculation? It's better to admit your lack of knowledge, not jump to conclusions without sufficient data, and process data in an unbiased manner.