Show HN: Read Wikipedia privately using homomorphic encryption

331 points by blintz almost 3 years ago
Hi, creator here.

This is a demo of our recent work presented at Oakland (IEEE S&P): https://eprint.iacr.org/2022/368. The server and client code are written in Rust and available here: https://github.com/menonsamir/spiral-rs. The general aim of our work is to show that homomorphic encryption is practical today for real-world applications. The server we use to serve this costs $35/month!

A quick overview: the client uses homomorphic encryption to encrypt the article number that they would like to retrieve. The server processes the query and produces an encrypted result containing the desired article, and sends this back to the client, who can decrypt and obtain the article. A malicious server is unable to determine which article the client retrieved. All search and autocomplete is done locally. The technical details are in the paper, but the high level summary is that the client creates a large one-hot vector of encrypted bits (0’s except for the index of the desired article, where they place a 1) and then the server computes something like a ‘homomorphic dot product’ between the query and the plaintext articles.

I’d like to caveat that this is an in-browser demo to show it is practical to use homomorphic encryption at this scale. As a real product, you’d probably want to distribute a signed client executable (or Electron app) since otherwise, a malicious server could simply deliver bad client JS on the fly.

Happy to answer any questions!
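
The one-hot query and 'homomorphic dot product' from the overview above, as an added example that is not part of the original post and is not code from spiral-rs. It swaps in textbook Paillier (additively homomorphic) for Spiral's lattice-based scheme; the primes and sample records are constructed, and the names `client_query`, `server_answer` and `fetch` are made up for illustration.

```python
import secrets

# Constructed demo parameters: two Mersenne primes. Enough to show the
# algebra, useless as real keys. Paillier is a stand-in, not Spiral's scheme.
P, Q = 2**107 - 1, 2**127 - 1

def keygen(p=P, q=Q):
    n, phi = p * q, (p - 1) * (q - 1)
    return n, (phi, pow(phi, -1, n))      # public n, secret (phi, phi^-1 mod n)

def encrypt(n, m):
    r = secrets.randbelow(n - 2) + 2      # fresh randomness for every ciphertext
    # Enc(m) = (1+n)^m * r^n mod n^2, and (1+n)^m = 1 + m*n mod n^2
    return (1 + m * n) * pow(r, n, n * n) % (n * n)

def decrypt(n, sk, c):
    phi, phi_inv = sk
    return (pow(c, phi, n * n) - 1) // n * phi_inv % n

def client_query(n, index, db_size):
    """One-hot vector over the article numbers, encrypted bit by bit."""
    return [encrypt(n, int(i == index)) for i in range(db_size)]

def server_answer(n, query, records):
    """'Homomorphic dot product': prod c_i^d_i = Enc(sum bit_i * d_i).
    Every record is touched, whichever one the client asked for."""
    acc = 1
    for c, rec in zip(query, records):
        acc = acc * pow(c, rec, n * n) % (n * n)
    return acc

def to_int(b):
    return int.from_bytes(b, "big")

def to_bytes(m):
    return m.to_bytes((m.bit_length() + 7) // 8, "big")

# Constructed sample "articles", each short enough to encode below n.
ARTICLES = [b"Ada Lovelace", b"Bletchley Park", b"Claude Shannon", b"Diffie-Hellman"]

def fetch(index):
    n, sk = keygen()                                                  # client
    query = client_query(n, index, len(ARTICLES))                     # client
    answer = server_answer(n, query, [to_int(a) for a in ARTICLES])   # server
    return to_bytes(decrypt(n, sk, answer))                           # client

if __name__ == "__main__":
    print(fetch(2))
```

The server only ever sees `query` and `answer`; because `encrypt` is randomized, two queries for the same index are different ciphertext vectors.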

28 comments

jerf almost 3 years ago
This is the first thing out of homomorphic encryption *I personally have seen* that seems to be in the ballpark of useful for some practical use, which is impressive. Have I missed out on any other such things of interest?

(And this is not a criticism; this is a compliment. You start so far behind the eight-ball with homomorphic encryption with regard to the resources it consumes I wasn't convinced it was ever going to be even remotely useful for much of anything. Precisely because I was so skeptical, I am that impressed to see something work this well. It's not the fastest Wikipedia mirror, but... honestly... I've been on slower websites! Websites with *far* less excuse.)
gojomo almost 3 years ago
Interesting! But, it'd be helpful to clarify further the strength of the following claim:

> This demo allows private access to 6 GB (~30%) of English Wikipedia. In theory, even if the server is malicious, it will be unable to learn which articles you request. All article title searches are performed locally, and no images are available.

In this demo, the number of article-titles is relatively small – a few million – & enumerable.

If the server is truly malicious, and it issues *itself* requests for every known title, does it remain true that this "Private Information Retrieval" (PIR) scheme still gives it *no* hints that subsequent requests from others for individual articles retrieve particular data?

(Presumably: *every* request touches every byte of the same full 6GB of data, and involves every such byte in constant-run-time calculations that vary per request, and thus have the effect of returning only what each request wanted – but not at all in any way correlatable with other requests for the exact same article, from the same or different clients?)
Canada almost 3 years ago
Can this be applied usefully to non-public datasets?

Would it be feasible to add some other zero knowledge proof to this that would confirm a user has paid a subscription for access? For example, if this were a news site, the user would have to prove a valid subscription to read articles, but the site would not be able to know which articles any subscriber decided to read?

If that is possible, what could the site do to prevent a paying subscriber from sharing their access to an unreasonable number of others? Would it be possible to impose a rate limit per subscriber?
jl6 almost 3 years ago
In another comment you’ve said:

> With a proper implementation of PIR, the server still needs to scan through the entire encrypted dataset (this is unavoidable, otherwise its I/O patterns would leak information)

Is this technique therefore practical only when the server side dataset is relatively small (or full scans for every query are tolerable)?

(edit: sorry, misattributed the quote)
0cVlTeIATBs almost 3 years ago
Could this be used for DNS?
mihaitodor almost 3 years ago
Last year, there was a detailed presentation with several speakers on state of the art Secure Multi-Party Computation for practical applications in healthcare, fighting financial crime and machine learning from CWI (Centrum Wiskunde & Informatica) Netherlands. The recording is here (2,5h): https://www.youtube.com/watch?v=gE7-S1sEf6Q
JanisErdmanis almost 3 years ago
> A malicious server is unable to determine which article the client retrieved.

This sounds like magic :O. How does it behave when new articles (elements) are added, does it need to rebuild the whole database and distribute new parameters?

I wonder how practical it would be for clients to synchronize content without server not being able to deduce the synchronization state at which the client is.
raxxorraxor almost 3 years ago
Does homophobic in this case mean that I can edit the content of an article and the diff is directly applied to the crypt?
syrrim almost 3 years ago
What is the maximum throughput the server can maintain? Or, in other words, how much does it cost per query?
f38zf5vdt almost 3 years ago
Extremely cool. Now we can serve content without any ability to observe what people are being served exactly. I was hoping that someday soon such technology could be used to serve search results and give us a *truly* private search engine experience.
ajconway almost 3 years ago
Theoretically, can this scheme be turned into a generic O(N) key-value retrieval for non-static content (in this example — supporting adding, removing and replacing articles without re-encrypting the whole database and re-sending the client setup data)?
rkagerer almost 3 years ago
Not able to read the full paper at the moment, and confused about something:

If the server needs to go pull the article from Wikipedia, how is it blind to which one is being requested?

If you've pre-seeded the server with an encrypted 30% of Wikipedia, how can I trust you haven't retained information that would enable you to derive what I requested?

The only way I understand this works is if the client itself seeded the encrypted data in the first place (or at least an encrypted index if all the server pushes back is article numbers).

Maybe I'm ignorant of something; if so thanks for ELI5.
yarg almost 3 years ago
Can this functionality be implemented as a peer-to-peer (or federated) service?

I'm assuming it'll depend on breaking down questions into hierarchical sub-questions that can either be recomposed locally or in another homomorphic context. But can that sort of thing be done without data-leaks, or prohibitively expensive inter-node communication?

Are there any introductory resources (that you know of) on homomorphic encryption and compute that'll turn this into less of a mind-fuck?
Labo333 almost 3 years ago
I understand that you do some kind of dot product (with two steps, Regev and GSW). However, it looks to me that those steps involve fixed dimension vectors.

- How do you handle variable length data? Do you need to pad it?

- What is the memory overhead of the storage of encrypted data?

I think that at least for video data, the streaming scheme "leaks" the size of the encrypted data with the number of streaming packets.
throwaway81523 almost 3 years ago
If you say a malicious server can't determine which article was retrieved, is that private information retrieval (PIR)? Something must be different here. I thought there was a theorem that for single-server PIR to work, the client has to download the entire DB, which is the right way to read Wikipedia privately anyway.
j2kun almost 3 years ago
Do you have a blog or Twitter? I'd like to keep up with any other cool projects you're working on!
iFire almost 3 years ago
I wonder if this can be done on sqlite?

http://static.wiki/

See the previous news article. https://news.ycombinator.com/item?id=28012829
cobbzilla almost 3 years ago
Fantastic project.

Have you considered running (# of cpus) parallel scanners continuously? An inbound query “hops on” the least-loaded scanner; at each article/chunk the scanner runs all the queries; each query “hops off” and returns after it has completed the cycle through the entire DB.
fragmede almost 3 years ago
Well but you get into the security space and license your server db technology for shit like IoT lights. I don't want the company knowing if my lights are on or off, but if they had a homomorphic encrypted backend and app, I might trust it.
eternityforest almost 3 years ago
This is wonderful! I've never seen anything like this in practical form.

I hope it doesn't become standard practice for general websites (As I imagine some would like to see), but it's an amazing tool and there will probably be many wonderful uses.
nixpulvis almost 3 years ago
This kind of stuff gives some of the best arguments for open source software (OSS) to date. Otherwise, it has to be taken completely on faith, which then defeats nearly the entire purpose and makes the performance overhead untenable.
sedatk almost 3 years ago
> As a real product, you’d probably want to distribute a signed client executable (or Electron app) since otherwise, a malicious server could simply deliver bad client JS on the fly.

Arguably, a malicious server could deliver a bad executable too.
dorgo almost 3 years ago
Idea: Apply this to personalized advertising. Client sends his interests + habits + personal info encrypted to the server. Server finds and sends back to client the best ad based on client's info.
barbazoo almost 3 years ago
Can anyone recommend an explanation of this concept geared towards people with only a superficial knowledge of encryption?

This seems to be some kind of search applied on an encrypted dataset, is that right?
badrabbit almost 3 years ago
Very nice! Great against snoopers that lack authority but for when they do have some authority (bosses, government) without plausible deniability it can do more harm than good.
sizzle almost 3 years ago
This sounds like the ultimate anti-user profiling and targeted advertising solution. I hope Google and other advertising giants can’t stop this. Thoughts?
dontbenebby almost 3 years ago
This is very cool OP! I interviewed to be a privacy engineer with Wikimedia a while back.

I suggested that my goal would be to add a v3 onion service. They actually had listed years of "homomorphic encryption" as a requirement. I phoned up the recruiter and basically said it's ok if there is a personality conflict, but the role as written was impossible to fill, and it scared me that very good suggestions for privacy as well as the health of the Tor network were discarded.

(If you set up a dot onion, that frees up traffic on exit nodes, whose capacity are limited.)

Big thanks to the OP for being willing to share this work, it's very cool and I'm about to read your eprint.

I'm excited about the potential of homomorphic encryption, though I worry about things like CPU cost -- I recall when folks had to really be nudged not to encrypt huge blocks of data with PGP, but instead use it to encrypt the passphrase to a Truecrypt volume using a symmetric cipher like AES.

(I'd love how to know we got to a point Twitter added an onion service then banned me, but Wikipedia continues to not even support MFA for logins -- I recently registered an account intending to eventually upload some art to the commons, but the perpetual refusal to allow folks to make healthy choices disturbs me.

In fact, after reading articles like these ones[1][2], it makes me question the integrity of the folks I interacted with during the interview process.

On my end, it was especially disturbing since prior to enrolling in my PhD, the alternative path I discussed was becoming an FBI agent focused on counter intelligence in the "cyber" realm.

The agent I spoke with told me I'd serve "at the needs of the bureau", so that would mean probably not using my computer skills, which would then languish, then after a couple years I might still not get my desired position, and gave me a card, which I eventually lost.

Years later, prior to the insurrection, I had to walk down to Carnegie Mellon and ask if anyone had his contact information, and was shocked that folks refused to even point me at a link to the lecture, which had been listed as open to the public.

I'm someone who *reads* Wikipedia, not really edits, but the vast majority of users are readers not editors, and this perpetual pattern of refusing to enable privacy enhancing technologies, paired with using privileges access to make hiring decisions against folks who lack the physical ability to make good privacy decisions offended me on a deep, personal level, and is why I often post in brash, erratic manner.

Because I see zero incentive to stay silent -- if I'm quiet, people will slowly drain my bank account.

If I post, there is a chance someone will see what I say, notice my skills, and offer full time employment. So I have to continue risking offending folks until I find a full time job, which I have not had since I left the Center for Democracy and Technology under duress following a series of electronic and physical attacks, paired with threats and harassment by staffers in the organization.

TL;DR: Great research, but I hope they also add an onion service rather than jump straight to using this :-)

[1] https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/thread/6ANVSSZWOGH27OXAIN2XMJ2X7NWRVURF/#6ANVSSZWOGH27OXAIN2XMJ2X7NWRVURF

[2] https://slate.com/technology/2021/10/wikipedia-mainland-china-admins-banned.html
ddjsn111 almost 3 years ago
How does the server select the article in a way that we can be sure they don't record the article sent back? Are the articles encrypted on the server too?