This is only the toy version of the actual problems solved by the Allies, which were more nuanced, and involved reasoning about the tank manufacturing pipeline. The write-up [0] doesn't go into the math but makes an interesting read.<p>[0] <a href="https://sci-hub.tw/10.2307/2280189" rel="nofollow">https://sci-hub.tw/10.2307/2280189</a>
More about Frequentist and Bayesian analysis can be found here:<p><a href="https://en.wikipedia.org/wiki/German_tank_problem" rel="nofollow">https://en.wikipedia.org/wiki/German_tank_problem</a><p>Matter of fact...<p><pre><code> According to conventional Allied intelligence estimates, the Germans
were producing around 1,400 tanks a month between June 1940 and September 1942.
Applying the formula below to the serial numbers of captured tanks, the number
was calculated to be 246 a month. After the war, captured German production
figures from the ministry of Albert Speer showed the actual number to be 245.</code></pre>
How ironic that the nation that led the world in the frontiers of maths in the 19th century completely missed the boat in the applied math of signals intelligence in WWII. I'm referring to the tank serial numbers and the lack of care with Enigma codes, except by the Kriegsmarine, but even they eventually lost a code book to the Allies, which they apparently considered an impossibility.
Interesting article, though I think it incorrectly leaves
the reader thinking that there is some interesting
information hidden in the average spacing of the numbers.
In fact, all you need to know is the maximum observation
and the number of observations. Once you simplify, the average
spacing goes away.<p>If M is the maximum serial number and N is the total number of
observations, using the formula in the post:<p><pre><code> M + (avg. spacing) = M + (M / N - 1) = (N + 1) / N * M - 1
</code></pre>
To me that gives a clearer picture of what the unbiased
estimator is doing: inflate the maximum value by a factor that
tends toward one as the sample size grows, then subtract one.
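A quick way to sanity-check that algebra is to compute the estimate both ways. This is only a minimal sketch: the function names and the sample serials are made up for illustration, and it assumes the post's average spacing is (M - N) / N, i.e. the M - N unobserved numbers below the maximum spread over N gaps. Note that the simplified form keeps a trailing "- 1".

```python
from fractions import Fraction

def estimate_from_gaps(serials):
    """Maximum plus the average gap (assumed here to be (M - N) / N)."""
    m, n = max(serials), len(serials)
    avg_gap = Fraction(m - n, n)  # unobserved numbers below M, shared over N gaps
    return m + avg_gap

def estimate_from_max(serials):
    """Same estimate written using only M and N: (N + 1) / N * M - 1."""
    m, n = max(serials), len(serials)
    return Fraction(n + 1, n) * m - 1

sample = [19, 40, 42, 60]  # hypothetical captured serial numbers
print(estimate_from_gaps(sample), estimate_from_max(sample))
```

Exact fractions are used so the two forms can be compared with `==` without floating point noise.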
For anyone else interested in WW2 reverse engineering and design etc., <a href="https://www.youtube.com/watch?v=GJCF-Ufapu8" rel="nofollow">https://www.youtube.com/watch?v=GJCF-Ufapu8</a>
"The Secret War" is a huge documentary covering British efforts to counter German electronic warfare and V-weapons.
Why didn't they use randomized and scrambled serial numbers? Sort of like what Amazon does to their order numbers. I know it can still be cracked, but serially numbering military equipment is not very smart. I was setting up a Shopify store the other day and it doesn't allow for a lookup table to be used for order numbers. I don't want competitors to know that I've sold so many X items. Same thing with Squarespace and Square e-commerce stores. It blows my mind that a multi-billion dollar ecom giant has not implemented this despite forum posts and requests from users.
The job interview version: If you are being interviewed for a position by engineers who have their employee IDs (serially allocated) on their badges, find the number of employees from those IDs, assuming all engineers are equally likely to be on the panel of 8.
See also Doomsday Argument.<p><a href="https://en.wikipedia.org/wiki/Doomsday_argument" rel="nofollow">https://en.wikipedia.org/wiki/Doomsday_argument</a>
This all seems to assume the tank serial numbers would be captured at one moment in time ("captured 15 of these tanks uniformly at random.") But in fact the tank shells dribble in over time, which biases the gaps: the gaps at the highest numbers are going to be greater, because earlier tanks have had many more chances to be destroyed or captured. So using the average gap is clearly not going to give the best estimate. If you restrict yourself to tanks from the latest large battle, that will cancel out the dribble effect though.
Since no one else commented on that yet; I just wanted to say that I like the simple layout OP is using.<p>Not much clutter & straight to the point. Loads fast and it’s under 630KB.<p>Could certainly be improved but it’s nice not having to load >25MB just to read an article.
This is also a good (applied, with simple code) example of the use of probabilistic programming. I can't get myself to read full books, but somehow this simple example gave me some intuition and additional pointers to follow.
It depends on how you intend to 'score' the estimate.<p>Are you looking for the answer that is the 'most likely', or one that has the 'lowest mean squared error', or maybe one that is 'unbiased' (zero mean error)?<p><a href="http://datagenetics.com/blog/march22014/index.html" rel="nofollow">http://datagenetics.com/blog/march22014/index.html</a>
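To make the scoring point concrete, here is a rough simulation sketch comparing two candidates under two of those scores. Everything in it is an assumption chosen for illustration: a true total of 1000, 15 captures drawn without replacement, the sample maximum as the 'most likely' answer, and M + M/k - 1 as the 'unbiased' one. The function name `simulate` is hypothetical.

```python
import random

def simulate(total=1000, k=15, trials=20000, seed=0):
    """Return {estimator_name: (bias, mean_squared_error)} over many samples."""
    rng = random.Random(seed)
    population = range(1, total + 1)
    errors = {"most_likely": [], "unbiased": []}
    for _ in range(trials):
        m = max(rng.sample(population, k))  # captures without replacement
        errors["most_likely"].append(m - total)            # the maximum itself
        errors["unbiased"].append(m + m / k - 1 - total)   # inflated maximum
    summary = {}
    for name, errs in errors.items():
        bias = sum(errs) / trials
        mse = sum(e * e for e in errs) / trials
        summary[name] = (bias, mse)
    return summary

scores = simulate()
for name, (bias, mse) in scores.items():
    print(f"{name:12s} bias={bias:8.2f}  mse={mse:10.1f}")
```

The 'most likely' answer can never overshoot, so its bias is always negative; which estimator you prefer then depends on which column you care about.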
I recommend Think Bayes by Allen Downey if you want to study more. It's a free book available online. <a href="http://www.greenteapress.com/thinkbayes/thinkbayes.pdf" rel="nofollow">http://www.greenteapress.com/thinkbayes/thinkbayes.pdf</a>
"the Germans, being Germans, had numbered their parts in the order they rolled off the production line"<p>Probably in today's world this is racist or nationalist or something. But (as someone of German descent) I have to admit it's funny.
I remember studying this problem in the context of anonymity a few years back, defining immeasurability as the property whereby an adversary cannot distinguish between different node counts, for example. The tank problem is related to mark-recapture techniques for animal population size estimation. Shameless plug,<p><a href="http://www.cs.technion.ac.il/users/wwwb/cgi-bin/tr-get.cgi/2011/MSC/MSC-2011-06.pdf" rel="nofollow">http://www.cs.technion.ac.il/users/wwwb/cgi-bin/tr-get.cgi/2...</a>
Nice write up.<p>Little bit of a funny though: Note how num_tanks ~ Unif(max(captured),2000) was defined, so you already have p[ parameter | data ]. Isn't this already a posterior?<p>I get however how if you had the r.v.s num_tanks ~ Unif(M,2000), observed | num_tanks ~ Unif(1,num_tanks), M some constant, that you could find a posterior distribution num_tanks | vector<observed> by first finding the joint via
E[ 1[num_tanks < t]P[observed | num_tanks] ]
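One way to poke at this numerically is to compute the posterior on an integer grid, once with the prior's lower bound set to max(captured) and once with a lower bound of 1. This is a sketch under my own assumptions, not the article's code: a flat prior on the grid, independent draws from Unif(1, num_tanks) so the likelihood is num_tanks^(-k) when num_tanks >= max(captured) and zero otherwise, and made-up serial numbers. `posterior` is a hypothetical helper.

```python
def posterior(serials, lower, upper):
    """Posterior over num_tanks on the integer grid lower..upper.

    Assumes a flat prior on the grid and a likelihood of num_tanks**-k
    for num_tanks >= max(serials), zero below that.
    """
    m, k = max(serials), len(serials)
    weights = {}
    for n in range(lower, upper + 1):
        # (m / n) ** k is the likelihood rescaled by m**k to avoid tiny floats
        weights[n] = (m / n) ** k if n >= m else 0.0
    total = sum(weights.values())
    return {n: w / total for n, w in weights.items()}

captured = [19, 40, 42, 60]  # hypothetical serial numbers
post_data_bound = posterior(captured, lower=max(captured), upper=2000)
post_flat = posterior(captured, lower=1, upper=2000)
posterior_mean = sum(n * p for n, p in post_flat.items())
print(posterior_mean)
```

Under these assumptions the two grids give the same posterior, because the likelihood already zeroes out every value below the observed maximum. So the data-dependent lower bound looks more like a shortcut than a second use of the data, at least in this setup.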
Given the praise for Bayesian methods here, I'm surprised the author didn't discuss the Bayesian solution. See <a href="http://isaacslavitt.com/2015/12/19/german-tank-problem-with-pymc-and-pystan/" rel="nofollow">http://isaacslavitt.com/2015/12/19/german-tank-problem-with-...</a> for a similar exposition.
Very interesting. Especially the three links and in particular<p><a href="https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers" rel="nofollow">https://github.com/CamDavidsonPilon/Probabilistic-Programmin...</a>
Seeing that it's a uniform distribution, let's start out by assuming our sample mean (the average serial number we find) has the same expected value as the true mean (about half the actual number of tanks in existence). If this is true, then:<p>2 x mean<p>should be a roughly unbiased estimator of the number of tanks. But because we are probably undersampling the extremes, we could use the Bessel correction:<p>1/(n-1) x summation_{i=1}^n(sample_i)<p>I would guess this comes out to a better estimate than what the article says.<p>Bessel's correction might be a bit of overkill, since it's intended to work with normal distributions. But I still suspect it comes out to a better estimate than what the blog post says.
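A guess like this is easy to check with a simulation. The sketch below is only one reading of the suggestion: it takes "2 x mean" literally, interprets the 1/(n-1) version as 2 x sum / (n - 1), and compares both against the maximum-based estimate (n + 1) / n * max - 1. The true total of 1000, the 15 captures, and the names `compare` and `bessel_style` are all assumptions for illustration.

```python
import math
import random

def compare(total=1000, k=15, trials=20000, seed=1):
    """Return {estimator_name: (bias, rmse)} from repeated random captures."""
    rng = random.Random(seed)
    population = range(1, total + 1)
    estimators = {
        # overshoots by about 1 on average, since serials start at 1
        "twice_mean": lambda s: 2 * sum(s) / len(s),
        # assumed reading of the 1/(n-1) suggestion
        "bessel_style": lambda s: 2 * sum(s) / (len(s) - 1),
        "max_based": lambda s: max(s) * (len(s) + 1) / len(s) - 1,
    }
    errors = {name: [] for name in estimators}
    for _ in range(trials):
        sample = rng.sample(population, k)  # captures without replacement
        for name, estimate in estimators.items():
            errors[name].append(estimate(sample) - total)
    return {
        name: (sum(errs) / trials, math.sqrt(sum(e * e for e in errs) / trials))
        for name, errs in errors.items()
    }

results = compare()
for name, (bias, rmse) in results.items():
    print(f"{name:12s} bias={bias:8.2f}  rmse={rmse:8.2f}")
```

For what it's worth, the maximum-based estimator is the minimum-variance unbiased one for this setup, so I would expect it to come out ahead on RMSE here rather than the mean-based versions.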