Fun to think about, but in the real world, no question neatly divides people, even the gender one. To quote Reddit's u/tailcalled[1], the exo-software/meatspace world is even less standardized than the software world:<p>Falsehoods programmers believe about gender:
<a href="http://www.cscyphers.com/blog/2012/06/28/falsehoods-programmers-believe-about-gender/" rel="nofollow">http://www.cscyphers.com/blog/2012/06/28/falsehoods-programm...</a><p>Falsehoods programmers believe about names:
<a href="http://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/" rel="nofollow">http://www.kalzumeus.com/2010/06/17/falsehoods-programmers-b...</a><p>Falsehoods programmers believe about addresses:
<a href="http://www.mjt.me.uk/posts/falsehoods-programmers-believe-about-addresses/" rel="nofollow">http://www.mjt.me.uk/posts/falsehoods-programmers-believe-ab...</a><p>Falsehoods programmers believe about time:
<a href="http://infiniteundo.com/post/25326999628/falsehoods-programmers-believe-about-time" rel="nofollow">http://infiniteundo.com/post/25326999628/falsehoods-programm...</a><p>More falsehoods programmers believe about time:
<a href="http://infiniteundo.com/post/25509354022/more-falsehoods-programmers-believe-about-time-wisdom" rel="nofollow">http://infiniteundo.com/post/25509354022/more-falsehoods-pro...</a><p>Falsehoods programmers believe about geography:
<a href="http://wiesmann.codiferes.net/wordpress/?p=15187&lang=en" rel="nofollow">http://wiesmann.codiferes.net/wordpress/?p=15187&lang=en</a><p>[1] <a href="http://www.reddit.com/r/programming/comments/1fc147/falsehoods_programmers_believe_about_addresses/ca8sirp" rel="nofollow">http://www.reddit.com/r/programming/comments/1fc147/falsehoo...</a>
I don't think this problem is solvable in any elegant form, but it is solvable. You'll just end up with massively disjunctive questions that you can't even hold in your head at once, like "27: Are you a non-practicing Catholic with exactly three children, or an Asian owner of a minivan produced between 1998 and 2004 that isn't green, or a licensed boat mechanic with astigmatism, or..." and so on for the next 6 pages.<p>In short, you can draw categories to include or exclude as precise a number as you like; you just have to be willing to draw really, really complicated boundaries.
> To contribute to the project, open up a pull request and add your question to the list below. All questions are open to debate and discussion.<p>This is a completely wrong way to approach the problem. Because the questions should all divide the population into two parts the questions should be 'matched' to each other. This approach is a bit like doing a PCA by figuring out one component, then the other, then the rest...<p>One way to solve this problem is to have a lot of yes/no questions (like a big Karnaugh-table), then everybody would have a long bitstring as his unique ID. Now you need to compress that bitstring -- like the minimization of the Karnaugh-table.<p><a href="http://en.wikipedia.org/wiki/Karnaugh_map" rel="nofollow">http://en.wikipedia.org/wiki/Karnaugh_map</a><p>-- you need to generalize this for N number of questions (which can be done), then you'd have 33 complex questions like
'is it true that
(you live in NA
AND
you are male)
OR
(you live in Canada
AND you are white
AND ...)
... and so on and on.'
Assuming that the person doesn't necessarily need to know their answer (which is important for babies anyways) the answer is trivial. The first question would be "Given that we ordered all humans in order of the time of their birth, would the 1st bit of your position in the ordering be 1?", continue the other 32 questions with the remaining 32 bits.
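A minimal sketch of the birth-order idea, assuming an oracle has already assigned every person a 0-based rank in the global birth ordering. The function names `answers_from_rank` and `rank_from_answers` are made up for illustration; the 33 answers are just the binary digits of the rank, most significant bit first.

```python
def answers_from_rank(rank: int, n_questions: int = 33) -> list[bool]:
    """Return the yes/no answers (most significant bit first) for a 0-based birth rank."""
    if not 0 <= rank < 2 ** n_questions:
        raise ValueError("rank does not fit in the available questions")
    # Question k asks: "is bit k of your position in the ordering a 1?"
    return [bool((rank >> bit) & 1) for bit in range(n_questions - 1, -1, -1)]


def rank_from_answers(answers: list[bool]) -> int:
    """Invert answers_from_rank: rebuild the rank from the list of answers."""
    rank = 0
    for answer in answers:
        rank = (rank << 1) | int(answer)
    return rank
```

Because the mapping is invertible, two people get the same answers only if they share a rank, which is why the ordering needs tiebreakers for simultaneous births.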
Very interesting thought experiment. A few random thoughts:<p>Reminds me of Panopticlick by the EFF: <a href="https://panopticlick.eff.org/" rel="nofollow">https://panopticlick.eff.org/</a><p>Everyone's ID would change as time passed (if they move, if they age, if they get a sex change, etc).<p>The best questions for this are inherently "irrelevant", since "relevant" questions tend to be statistically linked. So a question like "Was the second letter of your first girlfriend's middle name between A and M?" is better than "Were you younger than 20 when you had your first girlfriend?", since we can likely guess the latter based on the other statistics.<p>It's very unlikely every ID will be unique if only asking 33 yes/no questions. I mean, look at two twins living together -- very few questions will be able to differentiate between them.<p>I think it's possible to do based on a random snapshot in time, but less feasible if it's meant to last a lifetime.<p>I also think the questions exist, but not in a manner that we'd be able to come up with on our own. As in, I believe that a program that knew every detail about every human could create 33 yes/no questions that differentiated people, but I don't believe we could do it ourselves.<p>I also wonder how many questions would be required if we asked non-yes/no questions and wanted a completely unique ID for everyone. For example, questions like "weight? languages spoken? birth place?".
33 questions is sort of the Shannon-Hartley optimal encoding of identifying information about human beings.<p>That means to come up with them is identical to finding an <i>optimal compression</i> of identifying data.<p>Necessarily, as the second question already implies, for this question to correctly divide the population in half, you would have to group large amounts of small populations together, resulting in <i>very</i> long questions.<p>For example, if you'd like to make another geographical question that's independent of the second one, it would have to divide in half every population of the 6 countries you mentioned. The next question would necessarily have to divide those 12 again.<p>By the way, the first question you ask is already suboptimal when combined with the second question, as those countries together probably do not have a clean 50% male/female split. (if they do, you should really explain that as it's not obvious)
Interesting exercise, which I'd call impossible in the given form. Imagine someone magically came up with 32 statistically independent binary indicators. Now you need to come up with the 33rd question Q such that if you pick any two persons who are identical up to the 32nd bit, that single question must distinguish them. Sounds hard.
Just use:<p>- Birthday (~19 bits)<p>- Rough location (remaining bits)<p>And base the questions around those two, for example: were you born on the 1st-15th of the month? Does the city you were born in start with one of the letters A-K? This part would be an exercise in statistics, I would think.<p>edit: And one bit for if you were the first to be born of two identical twins =p
This project assumes we can know things that are not really knowable for everyone. It starts with gender and birthplace, both tricky questions in some situations.<p>So maybe we get to assume we have some oracle that helps us simplify the hard questions.<p>At that stage, it's easy. Begin with, "Assume we build a list of people sorted by time of birth (with some arbitrary tiebreakers, like proximity of birthplace to Barbados, or darkness of hair color...)."<p>Question 1: Are you on the top half or bottom half of this list?<p>Question 2: Are you on the top quarter or bottom quarter of the half?<p>Question 3: ...
It's not enough to find 33 independent questions that evenly split the world's population.<p>An optimal, though inelegant solution to that goal might look something like this:<p>"Is the {1..33}th bit of sha1(name : location : date of birth) 1?".<p>Clearly you'll have tons of collisions with that solution, as you would have with any solution using 33 independent questions.<p>To uniquely identify people, we'd either need to use more bits, or look very closely at the population and derive very specific questions.
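A rough sketch of the hash-bit questions, using Python's standard `hashlib`. The field separator, the choice of the leading 33 bits, and the function name `sha1_answers` are assumptions for illustration, not part of the original proposal.

```python
import hashlib


def sha1_answers(name: str, location: str, date_of_birth: str,
                 n_questions: int = 33) -> list[bool]:
    """Answer 'is the k-th bit of sha1(name:location:dob) a 1?' for the first bits."""
    record = f"{name}:{location}:{date_of_birth}".encode("utf-8")
    digest = hashlib.sha1(record).digest()            # 20 bytes = 160 bits
    as_int = int.from_bytes(digest, "big")
    top_bits = as_int >> (160 - n_questions)          # keep only the leading bits
    return [bool((top_bits >> bit) & 1) for bit in range(n_questions - 1, -1, -1)]
```

Each bit splits a large population close to 50/50 and the bits look independent, yet nothing prevents two different records from sharing all 33 bits, which is exactly the collision problem described above.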
I thought it would be more plausible and probably more interesting to do this in maybe 40 questions. To do this in 33, as several others have pointed out, would require 33 questions that each almost perfectly bisect the population and are almost perfectly independent of each other.<p>With 40 or 45, we could relax that a bit and use questions that are actually meaningful. Two people who are within a few bits of each other would actually be similar in ways we care about, unlike two people who are similar because their transliterated last names both appear in the last half of the alphabet.
So you want to create a data set with entropy = 1.
Think of this in terms of a hash function.
You want to create a hash which only has an address space of 33 bits.
Something along the lines of
H(Alice) = 0x12321 (H is a function which generates 0x12321 as the address to store Alice's data)<p>Doesn't this sound like perfect hashing with limited memory?
I don't really think that this can be done with such memory constraints.
Even now we cannot produce a perfect hash function that uses 1 bit / key.
The theoretical best we can do is 1.44 bits / key, and the practical best we have done so far is 2.5 bits per key. [1]<p>This may just be possible without the memory constraint, that is, if you answer N questions which uniquely identify you (where N > 48).<p>[1] <a href="http://en.wikipedia.org/wiki/Perfect_hash_function#Minimal_perfect_hash_function" rel="nofollow">http://en.wikipedia.org/wiki/Perfect_hash_function#Minimal_p...</a>
No one here seems to have mentioned Hunch (<a href="http://en.wikipedia.org/wiki/Hunch_(website)" rel="nofollow">http://en.wikipedia.org/wiki/Hunch_(website)</a> ).<p>Picking discrete questions like this is equivalent to building a decision tree for humanity. This is actually something that could be approached as an engineering problem (and there are mechanisms for optimizing decision trees).<p>The problem still remains in the face of both the technological capabilities of decision trees, and practical implementations like Hunch.com, that decision trees are reductive and discrete. Reality is neither discrete nor reductive.<p>It may very well be the case that there is a set of questions that could uniquely identify humans, but the insight that could be drawn from those questions might be essentially pointless.<p>For example:<p>* Were you born in the northern hemisphere?<p>* Were you born on an even numbered year in the Gregorian calendar?<p>* Is the country of your birth governed through a representative system?
This reminds me of Akinator: <a href="http://en.akinator.com" rel="nofollow">http://en.akinator.com</a><p>It's a little spammy nowadays, but it's had enough input that it seems pretty amazingly accurate at "guessing" what / who you are thinking of in ~ 25 questions.
I don't think it is possible with exactly 33 questions. It will probably require more than that. Binary numbers have the property that every new bit doubles the number of values you can represent. For example, if you already have 7 bits and you add an 8th one, you'll be able to represent 128 numbers with that bit off and another 128 with that bit on.<p>To properly mimic this property with yes/no questions, you will have to come up with questions that divide the whole Earth's population equally AT EVERY NEW QUESTION. Even the most obvious one, "are you (fe)male?", is slightly biased toward men (according to Wikipedia). For every question that skews your 50/50 split, you'll have to add another question beyond 33 to catch up.
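To put numbers on the cost of a skewed question, here is a small sketch. It assumes a world population of 7 billion (an illustrative round figure, not taken from the thread) and hypothetical helper names; the only idea used is that after a question, the larger of the two groups still has to fit into the remaining questions.

```python
import math


def questions_needed(population: int) -> int:
    """Minimum number of yes/no questions that can give everyone a distinct answer string."""
    return math.ceil(math.log2(population))


def questions_after_split(population: int, yes_fraction: float) -> int:
    """Total questions needed if the first question sends `yes_fraction` of people to 'yes'."""
    # The larger side of the split is the bottleneck for all later questions.
    larger_group = math.ceil(population * max(yes_fraction, 1.0 - yes_fraction))
    return 1 + questions_needed(larger_group)
```

Under these assumed numbers, a single mildly skewed question (say 50.4% yes) still fits in 33 questions because 7 billion is somewhat below 2^33, while a 70/30 split already costs a 34th question. The slack is small, and skews compound across questions.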
I think first you have to show a question exists that effectively separates identical twins before you spend much time working on broad questions like gender and geography.
This is a fun exercise, but as others have pointed out likely impossible in its current form.<p>We don't have true constraints on space though; why limit to 33 bits? How could we still provide a meaningful UUID to each person?<p>A UUID based on time and location of birth might be more feasible than any other approach, since neither will change and it's the least likely to be ambiguous. Capturing UTC at the time of cutting or otherwise removing the umbilical cord could be one way of choosing as precise, non-debatable a timestamp as any. Adding lat/long and, say, the first byte of the UTF-8 character of the mother's name (or an aspect of the mother's UUID?) could get you the rest of the way there.<p>Of course, this falls over in places without access to precise timing and geolocation.
Do the answers have to be knowable? Time independent?<p>For example, "are you below the median age at this exact second?" That is not a knowable answer, and changes by the second, but it does give you an exact 50/50 split.<p>Repeat N times for each split and we're getting very very close.
These need to be questions that are invariant over a lifetime:<p>- Were you born in the northern hemisphere or southern?<p>$2^{33}$ is sufficient for those alive now, but the human population is a dynamic function. Set a bit when the person dies?
Added pull requests to extend the address space from 33 bits to 36 bits to accommodate our revered ancestors, and a bit to indicate liveness.<p>TODO: don't implement zombies or ghosts at this time (YAGNI principle).
On a related note, this has a very interesting significance in the world of privacy and anonymous tracking.<p><a href="http://33bits.org/about/" rel="nofollow">http://33bits.org/about/</a>
This is a really cool concept, but one that is totally impossible. In the actual world, few things are truly independent. Even if you could find 33 binary questions that did not correlate with each other at all, you still run the risk of having multiple people yield the same 33 answers.<p>Just because two things aren't statistically linked does not mean that they will never overlap.
Wouldn't the best way to do this be to ask questions related to genetic markers? You require 33 yes/no questions that each independently divide the population in half, but have near-uniform distribution otherwise (which half you fall into on one question says nothing about the other questions).<p>Are there 33 genetic markers, each of which has no correlation with the presence of the others?
33 is not the constraint. If we increase the limit to 50 and these 50 questions can fingerprint an individual, then that will be really interesting.<p>Some hard problems:
1. Distinguishing twins
2. Using characters in names, as some people (e.g. Chinese speakers) have non-ASCII names.
If anyone is interested in seeing such an application in a fictional setting, I suggest the anime Death Note, if nothing else for its entertainment value. For those who are familiar with the story, the questions L asked in order to narrow down Kira suspects to a limited demographic in a small region in Japan, among billions of candidates, were some good ones. A good article that analyzes the plot from an information theory perspective: <a href="http://www.gwern.net/Death%20Note%20Anonymity" rel="nofollow">http://www.gwern.net/Death%20Note%20Anonymity</a>.
Even if the questions were perfect (each question splitting the population into two exact halves, and all questions totally independent from each other), so that the algorithm gave each person a perfectly random number, the birthday paradox [1] tells us that even for just sqrt(2^33) ≈ 93k people we would already have about a 39% probability of a collision (50% is reached at roughly 109k people). For this to work we would need more bits. (Either that, or create questions that are _not_ independent, crafted in a way that makes sure each person gets a different number.)<p>[1] <a href="http://en.wikipedia.org/wiki/Birthday_problem" rel="nofollow">http://en.wikipedia.org/wiki/Birthday_problem</a>
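The birthday estimate can be checked with the standard approximation p ≈ 1 - exp(-n(n-1) / 2N), where N = 2^33 is the ID space. This is a sketch with hypothetical function names, and it assumes IDs are uniformly random.

```python
import math


def collision_probability(people: int, bits: int = 33) -> float:
    """Approximate chance that at least two of `people` share a random `bits`-bit ID."""
    space = 2.0 ** bits
    return 1.0 - math.exp(-people * (people - 1) / (2.0 * space))


def people_for_probability(target: float, bits: int = 33) -> int:
    """Approximate group size at which the collision chance reaches `target`."""
    space = 2.0 ** bits
    return math.ceil(math.sqrt(2.0 * space * math.log(1.0 / (1.0 - target))))
```

For example, `collision_probability(92_682)` comes out a little above 0.39, and `people_for_probability(0.5)` lands near 109 thousand, both tiny compared with billions of people.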
How many questions would you need to differentiate between identical twins, particularly if they live and work together? Take identical twin sons of a subsistence farmer - they live together, work together on the same things, know the same people, have the same genetic makeup, and whichever was the first twin born may not have been recorded. You could ask their names, but that's not a yes/no question.<p>Or even twins who are still babies, no work required? Some cultures wouldn't even have named them yet.
"As an example, having the questions "Are you male?" and "Are you below the median age?" will not work "<p>First question is "Are you male?". Made me laugh.
Another problem not mentioned is that the questions should be about the content that does not allow for the answer to change over time. Otherwise the ID is no good.
Could you not have way more than 33 questions created (maybe a couple hundred), but change which questions are asked based on previous answers? Use the previous answers to determine the strongest next question to ask?<p>If an early answer states the candidate lives in the northern hemisphere, there's no point in asking them if they live in a landlocked African country... or whatever much more complicated questions could arise.
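One way to sketch this adaptive selection is a greedy rule: among the people still consistent with the answers so far, ask whichever question splits them closest to 50/50. The people, fields, and question names below are toy data invented for illustration.

```python
from typing import Callable, Dict, List

Person = Dict[str, bool]
Question = Callable[[Person], bool]

# Hypothetical question pool; a real one would hold hundreds of entries.
QUESTIONS: Dict[str, Question] = {
    "northern_hemisphere": lambda p: p["north"],
    "landlocked_african_country": lambda p: p["landlocked_africa"],
    "born_before_1990": lambda p: p["before_1990"],
}

# Toy population, invented for the example.
PEOPLE: List[Person] = [
    {"north": True, "landlocked_africa": False, "before_1990": True},
    {"north": True, "landlocked_africa": False, "before_1990": False},
    {"north": False, "landlocked_africa": True, "before_1990": True},
    {"north": False, "landlocked_africa": False, "before_1990": True},
]


def best_next_question(candidates: List[Person], questions: Dict[str, Question]) -> str:
    """Pick the question whose 'yes' share among the candidates is closest to 50%."""
    if not candidates:
        raise ValueError("no candidates left to split")

    def imbalance(name: str) -> float:
        yes = sum(1 for person in candidates if questions[name](person))
        return abs(yes / len(candidates) - 0.5)

    return min(questions, key=imbalance)
```

On this toy data the hemisphere question is asked first; for the northern group the landlocked question carries no information, so the birth-year question is chosen instead. This is effectively a decision tree, so different people end up answering different questions.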
A boring solution:<p>Question n (for n = 0 to 32): Consider your position in birth order among all people currently alive. When you divide it by 2^n and round down, is the result odd?
This seems like it would be a lot of work. The intensity and specificity of the questions that would need to be asked would have to be quite unique. It might be possible, but without excluding people of the world because they get lumped into a group, it seems like maybe 33 questions might not be enough to uniquely identify everyone in the world.
A useful question might revolve around language or concepts a person knows, but then this becomes a lot more difficult if the questioner doesn't know which language/concepts a person wouldn't understand (and therefore whether they could even answer the question) - and if they do know, there is a priori knowledge effectively.
The 33-question issue is a tough one for sure.<p>I'm instead left wondering how many extra questions (35 bits? 36 bits?) it would have to be expanded to in order to produce unique results but without having to be particularly clever in producing the questions. I bet it wouldn't take as many extra as one might be inclined to think.
Do the questions have to be constant over time? If not, it can trivially be solved by asking:
Were you born before or after time <i>t</i>? Ask this 33 times, where <i>t</i> is the median date of birth of your remaining population. You just need to recompute <i>t</i> 33 times (and know the date of birth of every single person in the world).
If the goal is to have questions which can be answered only with yes or no, I don't think asking for the location of the person is a good thing, because there would be as many questions as there are locations.<p>"Do you live in China, India, The United States, Indonesia, Brazil or Pakistan?" is not a good question.
Is the intent of this exercise to build a unique identifier that the individual could reproduce over the course of their life, or does it just uniquely identify them at the time they answered the questions?<p>I ask because questions like number of siblings, favorite movie, etc. would change over time.
Are you male? This will not split the population 50/50. One group will be slightly larger, and you then only have 32 questions to subdivide this larger group into further categories which is impossible.<p>This is not possible unless the categories _precisely_ bisect the group each time.
Seems impossible to me. For example, what question would separate two identical twins (identical in DNA and in when they were born)?<p>And let's say you find such a question: there is no way that question would also divide the population in half.
Isn't this solvable to a degree by just asking a large number of yes/no questions to a large number of people and then removing all those questions that didn't identify people any further?
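A rough sketch of that pruning step, assuming the answers have already been collected. It walks through the questions in order and keeps one only if it separates at least two people who were indistinguishable so far. The function name and the toy data in the usage note are made up.

```python
from typing import Callable, Dict, List

Person = Dict[str, bool]
Question = Callable[[Person], bool]


def prune_questions(people: List[Person], questions: Dict[str, Question]) -> List[str]:
    """Greedily keep only questions that increase the number of distinct answer strings."""
    kept: List[str] = []
    signatures = [() for _ in people]          # answers so far, per person
    groups = 1 if people else 0                # everyone starts indistinguishable
    for name, ask in questions.items():
        trial = [sig + (ask(person),) for sig, person in zip(signatures, people)]
        trial_groups = len(set(trial))
        if trial_groups > groups:              # the question told some people apart
            kept.append(name)
            signatures, groups = trial, trial_groups
    return kept
```

With three toy people and the questions "is_male", a duplicate of "is_male", and "is_tall", the duplicate is dropped because it separates nobody new. The result depends on the order in which questions are tried, so this is a heuristic rather than a minimal set.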
I bet you could make a lot of progress by dividing GPS coordinates evenly by population. Simple binary search by primary residence and then leave some space for division within a household.
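A small sketch of what a population-weighted split could look like: alternate between latitude and longitude, cutting each group at its median so both halves hold the same number of people (a k-d-tree-style split, which is my assumption about what "dividing evenly by population" would mean in practice). The function name and coordinates are illustrative.

```python
from typing import Dict, List, Tuple


def geo_ids(points: List[Tuple[float, float]], depth: int) -> Dict[int, str]:
    """Map each point index to a bitstring via alternating median splits on lat and lon."""
    ids = {i: "" for i in range(len(points))}

    def split(indices: List[int], level: int) -> None:
        if level == depth or len(indices) <= 1:
            return
        axis = level % 2                       # 0 = latitude, 1 = longitude
        ordered = sorted(indices, key=lambda i: points[i][axis])
        half = len(ordered) // 2               # median cut: equal head count per side
        low, high = ordered[:half], ordered[half:]
        for i in low:
            ids[i] += "0"
        for i in high:
            ids[i] += "1"
        split(low, level + 1)
        split(high, level + 1)

    split(list(range(len(points))), 0)
    return ids
```

People sharing the same coordinates (a household) are only separated by sort order here, which is where the spare bits for division within a household would come in.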
It's an interesting idea ... But no I don't think it is possible in any way that is not turning the list into a set of questions about their genetics or DNA.
An easy way to construct the questions is to ask for increasingly precise time and location of birth.<p>There will be corner cases, but then so does asking if someone is male.