The Difficulty Of Private Contact Discovery

99 点作者 FredericJ超过 11 年前

17 条评论

This is great until you realize the average address book is a disgusting mess of spelling errors, wrong values in wrong fields, punctuation in places where punctuation isn't necessary (commas in phone number fields, digits in name fields, etc.)The only field that might be fairly consistent is email address and even that is no guarantee. Sure, it's possible to clean up a contact before hashing, but when you consider the fact that the typical phone contains >500 contacts it's not something that can be done in a reasonable amount of time on a modern smartphone. For the last two years I've been working on a product that involves cleaning/organizing contacts and making them searchable, when I started I had no idea what a massive undertaking it would be.So, hash till your hearts content, but if you have 10 different people with the same contact in their address book I would be willing to bet that you will get 10 different hashes because humans are not good at data entry.

评论 #7010311 未加载

评论 #7008270 未加载

评论 #7009615 未加载

sigil超过 11 年前

Interesting problem!> It’s also possible to compress “updates” to the bloom filter. The server just needs to calculate the XOR of the version the client has and the updated version, then run that through LZMA (the input will be mostly zeros), and transmit the compressed diff to the client.> Unfortunately, for a reasonably large user base, this strategy doesn’t work because the bloom filters themselves are too large to transmit to mobile clients. For 10 million TextSecure users, a bloom filter like this would be ~40MB, requested from the server 116 times a second if every client refreshes it once a day.Wait, the LZMA-compressed daily diffs would be still be ~40MB? If the 40MB is just the initial bloom filter download, that's not so bad. If 40MB is the daily diff, I'd be interested in seeing the calculation for that.

评论 #7008445 未加载

评论 #7008258 未加载

pbsd超过 11 年前

A new private-set intersection paper, claiming to scale orders of magnitude better than predecessors, has just been released [1]. I don't know whether it solves your problem, as I haven't read anything beyond the abstract, but you might want to check it out anyway.[1] <a href="https://research.microsoft.com/en-us/um/people/senyk/pubs/sapsi.pdf" rel="nofollow">https://research.microsoft.com/en-us/um/people/senyk/pubs/sa...</a>

tlb超过 11 年前

The problem is that it's easy to upload 1000 randomly generated phone numbers and get back results. But I think you can distinguish random or sequentially generated numbers from a real user's actual contact list, by looking at the distribution of friends among the contacts. A real contact list should have a high density of friendships between contacts. The system should only return results if the uploader provides a subgraph that's more highly connected than a random one would be.You'd need to look at real data and tune the parameters to make this effective.

spullara超过 11 年前

I think there is a big difference between private contact discovery and contact discovery that isn't brute forceable like we are seeing with SnapChat and others. One change that could be made for these very small pre image corpuses would be to ask the target user if they should be exposed to the searcher.1) I upload the hashes of my contacts which enqueues a request to those that match the hashes2) The target of the request dequeues those requests and is given the option to expose themselves to the other user3) If approved, the original uploader gets access to that user on the serviceThis would stop the massive discovery issue and would only put a medium amount of friction into the growth engine. It obviously doesn't cover the issue that the service itself has access to the contacts which means they could potentially be exposed. However, if the service discards hashes without matching queues and removes matches when dequeued, the risk to the user is much reduced from the full address book storage case.This was written off the top of my head, so forgive me if I have left a big hole somewhere.

评论 #7007930 未加载

jcampbell1超过 11 年前

I understand how the Encrypted Bloom Filter they describe keeps my contacts private from the service, but it seems to leak the entire contact list the same as the trivial solution:I install millions of copies of the app, all with a different contact list. I do this signed query process for every possible phone number. I now have the complete list of users.

评论 #7008412 未加载

robrenaud超过 11 年前

Is this a real problem?Is it reasonable to trust for me to trust you to you run some arbitrary binary on my phone that does some magical black box computation on my contact list and then phone home to your service, but not trust to that you will just not be a dick with the raw data?Stated more simply, if I am paranoid about my privacy, then why am I running your apps and letting you read my contact list at all?

评论 #7009152 未加载

评论 #7009007 未加载

评论 #7008728 未加载

评论 #7008745 未加载

maxerickson超过 11 年前

In a similar discussion, I said something about hashing phone number pairs. The graph has a lot more edges than nodes.Still doesn't stop someone from attacking a single number though.

评论 #7014479 未加载

评论 #7008076 未加载

pyrocat超过 11 年前

So don't do it. I know I'm probably in the minority here but I've never given another app permission to use my facebook or twitter account to "Find Friends" and I personally find the whole practice offensive. Just have an option to invite people by their email address/number/whatever, but not in bulk.

评论 #7007986 未加载

abentspoon超过 11 年前

What about this?1) Client uploads a bloom filter with all contacts on phone2) Server responds with a bloom filter with all registered contacts that match the client's bloom filter3) Client displays contacts that match server's bloom filterYou can optionally trade contacts back and forth again with a larger bits/contact ratio to decrease false positives.I think it works out so that in exchange for 7 bits of information about each contact from the client, you can reduce the server's response by a factor of 128.

评论 #7008060 未加载

评论 #7008435 未加载

评论 #7008101 未加载

Strilanc超过 11 年前

40 MiB for ten million users? I think you should ditch the bloom filter if the false-positive rate is set so low it takes that much space.Assuming you can map phone numbers onto the integers from 0 to ten billion, trivial compression of the is-a-user/is-not-a-user bit sequence should easily fit in ~14MiB [1].See also: succinct data structures [2] for accessing data without decompressing it.(This is obviously a short term solution. Growth will overcome the savings.)1: <a href="http://www.wolframalpha.com/input/?i=%28-p+lg_2%28p%29-%281-p%29+lg_2%281-p%29%29+*+ten+billion++%2F+2^23+where++p+%3D+ten+million+%2F+ten+billion" rel="nofollow">http://www.wolframalpha.com/input/?i=%28-p+lg_2%28p%29-%281-...</a>2: <a href="http://en.wikipedia.org/wiki/Succinct_data_structure" rel="nofollow">http://en.wikipedia.org/wiki/Succinct_data_structure</a>

fleitz超过 11 年前

It's not a technical problem it's a value prop problem. Most users don't care about privacy....Fundamentally there is more value for the network operator to grow the network than for users to disclose their details. Email works perfectly fine without having to disclose your address book because no one owns the network and therefore do not care how many people are using email.Alternatively one could just allow users to choose whether to make their info public and then the client can just search a directory.

Illniyar超过 11 年前

What about artificially increasing the size of the hash input. For instance: when finding out if a a contact exists the client sends 10 iterations of the contact (I.E. the contact number with another digit, each 0-9) hashed, the server keeps only one random hashed one to compare to (presumably the client hashes one of them randomly when sending it's own contact information).That way you increase the size of the problem from 10^10 to 10^11 but only increase your bandwidth 10 fold.It still keeps the need to balance the amount of bandwidth used, but the bandwidth is no longer determined by the amount of all the users in the system, only the increase in problem size and the amount of contacts a person has.Is that a worthwhile solution?

cristiantincu超过 11 年前

> Building a social network is not easy. Social networks have value proportional to their size, so participants aren’t motivated to join new social networks which aren’t already large. It’s a paradox where if people haven’t already joined, people aren’t motivated to join.Value is relative. For stakeholders, value is indeed given by the network’s size. For users (participants), it’s very debatable whether size equals value. The author seems to be mistaking the two perspectives for a single one. No paradox here.

评论 #7008753 未加载

abdullahkhalids超过 11 年前

Is is possible to leverage some sort of peer-assisted discovery? Where you ask some sub-set of your trusted contacts already on the system, to call the server on your behalf. So that the server is less likely to know whose contacts are being sent to it.Of course, this can only be part of the full solution.Disclaimer: just throwing a wild idea out there. don't know anything about the field.

评论 #7008781 未加载

jahewson超过 11 年前

Could you use an enormously expensive hash function so that building a lookup table is infeasible?

评论 #7010051 未加载

amix超过 11 年前

I don't really think you need to solve this problem using crypto. I think a simpler solution would be to use the "hash it" solution, but restrict how many lookups each client can do. You can either do this by the using the client's phone number, IP or other unique information etc. This way an attacker would have a very hard time brute forcing this.Additionally you could use captchas (or other humanity tests, such as SMS verification) to limit hackers creating automated accounts (in order to fight automated bots spamming and polluting the network).

评论 #7008613 未加载