This is one of those announcements that seems unremarkable on first read-through but could be industry-changing in a decade. The driving force behind consolidation and monopoly in the tech industry is that bigger firms with more data have an advantage over smaller firms: they can deliver features (often built on machine learning) that users want and that small startups or individuals simply cannot implement. Federated learning, in theory, provides a way for users to maintain control of their data while granting permission for machine-learning algorithms to inspect it and "phone home" with an improved model, *without revealing the individual data*. Couple it with a P2P protocol and a good on-device UI platform and you could in theory construct something similar to the WWW, with data stored locally, but with all the convenience features of centralized cloud-based servers.
Their papers mentioned in the article:

Federated Learning: Strategies for Improving Communication Efficiency (2016)
https://arxiv.org/abs/1610.05492

Federated Optimization: Distributed Machine Learning for On-Device Intelligence (2016)
https://arxiv.org/abs/1610.02527

Communication-Efficient Learning of Deep Networks from Decentralized Data (2017)
https://arxiv.org/abs/1602.05629

Practical Secure Aggregation for Privacy-Preserving Machine Learning (2017)
http://eprint.iacr.org/2017/281
Reminds me of a talk I saw by Stephen Boyd from Stanford a few years ago: https://www.youtube.com/watch?v=wqy-og_7SLs

(Slides only here: https://www.slideshare.net/0xdata/h2o-world-consensus-optimization-and-machine-learning-stephen-boyd)

At that time I was working at a healthcare startup, and the ramifications of consensus algorithms blew my mind, especially given the constraints of HIPAA. This could be massive in the medical space: being able to train an algorithm with data from everyone while still preserving privacy.
The paper: https://arxiv.org/pdf/1602.05629.pdf

The key algorithmic detail: each device performs multiple batch updates to the model, and the server then averages all the multi-batch updates. "That is, each client locally takes one step of gradient descent on the current model using its local data, and the server then takes a weighted average of the resulting models. Once the algorithm is written this way, we can add more computation to each client by iterating the local update."

They do some sensible things with model initialization to make sure weight-update averaging works, and show that in practice this way of doing things requires less communication and reaches the goal faster than a more naive approach. It seems like a fairly straightforward idea starting from baseline SGD, so the contribution is mostly in actually doing it.
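Here's a minimal sketch of that averaging step (hypothetical NumPy code, not the paper's implementation; logistic regression stands in for whatever model is actually trained, and the hyperparameters are made up):

    import numpy as np

    def local_sgd(weights, X, y, epochs=5, batch_size=10, lr=0.1):
        """Run a few epochs of plain SGD on one client's local data.
        Logistic regression here is just a stand-in for any model."""
        w = weights.copy()
        for i0 in range(0, len(X) * epochs, batch_size):
            i = i0 % len(X)
            xb, yb = X[i:i + batch_size], y[i:i + batch_size]
            preds = 1.0 / (1.0 + np.exp(-xb @ w))    # sigmoid
            w -= lr * xb.T @ (preds - yb) / len(xb)  # gradient step
        return w

    def federated_averaging(global_w, clients):
        """One round of FederatedAveraging: every client trains locally
        from the same initial weights, then the server takes an average
        weighted by each client's number of examples."""
        total = sum(len(X) for X, _ in clients)
        new_w = np.zeros_like(global_w)
        for X, y in clients:
            new_w += (len(X) / total) * local_sgd(global_w, X, y)
        return new_w

Each round, only the new weights (or their delta from `global_w`) cross the network; the raw `(X, y)` pairs never leave the client.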
"Federated Learning enables mobile phones to collaboratively learn a shared prediction model while keeping all the training data on device, decoupling the ability to do machine learning from the need to store the data in the cloud."<p>So I assume this would help with privacy in a sense that you can train model on user data without transmitting it to the server. Is this in any way similar to something Apple calls 'Differential Privacy' [0] ?<p>"The key idea is to use the powerful processors in modern mobile devices to compute higher quality updates than simple gradient steps."<p>"Careful scheduling ensures training happens only when the device is idle, plugged in, and on a free wireless connection, so there is no impact on the phone's performance."<p>It's crazy what the phones of near future will be doing while 'idle'.<p>------------------------<p>[0] <a href="https://www.wired.com/2016/06/apples-differential-privacy-collecting-data/" rel="nofollow">https://www.wired.com/2016/06/apples-differential-privacy-co...</a>
This is fascinating, and makes a lot of sense. There aren't many companies in the world that could pull something like this off. Amazing work.

Counterpoint: perhaps they don't need your data if they already have the model that describes you!

If the data is like oil, but the algorithm is like gold, then they still extract the gold without extracting the oil. You're still giving it away in exchange for the use of their service.

For that matter, run the model in reverse, and while you might not get the exact data, we've seen that machine learning has the ability to generate something that simulates the original input...
This is quite amazing. Beyond the privacy implications of secure aggregation being executed at scale in production, they're also finding a way to harness billions of phones to do training on all kinds of data. They don't need to pay for huge data centers when they can get users to do the training for them. They can also learn from data that might otherwise never have left the phone, given encryption trends.
This is speculative, but it seems like the privacy aspect is oversold, as it may be possible to reverse-engineer the input data from the model updates. The concern is that the model updates themselves are still specific to each user.
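As a toy illustration of why per-user updates can leak: for a linear model and a single example, the gradient is a scaled copy of the input, so whoever sees the raw update recovers the example up to scale (hypothetical code, squared-error loss assumed):

    import numpy as np

    x = np.array([0.2, -1.3, 0.7, 2.1])  # private input features
    y = 1.0                              # private label
    w = np.random.randn(4)

    # Loss L = 0.5 * (w.x - y)^2, so grad_w L = (w.x - y) * x:
    # the update is the input vector times an unknown scalar.
    grad = (w @ x - y) * x

    recovered = grad / np.linalg.norm(grad)
    print(np.allclose(np.abs(recovered),
                      np.abs(x / np.linalg.norm(x))))  # True

Averaging over many local batches (and the secure aggregation protocol) blunts this, but the caution seems fair.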
This is an amazing development. Google is in a unique position to run this at truly massive scale.

Reading this, I couldn't shake the feeling that I had heard all of this somewhere before in a work of fiction.

Then I remembered. Here's the relevant clip from "Ex Machina": https://youtu.be/39MdwJhp4Xc
While a neat architectural improvement, the cynic in me thinks this is a fig leaf for the voracious inhalation of your digital life they're already doing.
Even if this only enabled on-device training and offered no privacy advantages, it would be exciting purely as a form of compression. Rather than eating device upload bandwidth, you keep the data local and send only the tiny model-weight delta!
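A hedged sketch of what that could look like (the communication-efficiency paper above proposes structured and quantized updates; this toy version just keeps the top-k entries of a flattened delta in float16):

    import numpy as np

    def compress_delta(w_local, w_global, keep_frac=0.01):
        """Upload only the largest-magnitude entries of the 1-D weight
        delta, quantized: a crude stand-in for arXiv:1610.05492."""
        delta = w_local - w_global
        k = max(1, int(keep_frac * delta.size))
        idx = np.argsort(np.abs(delta))[-k:]  # top-k coordinates
        return idx, delta[idx].astype(np.float16)

    def apply_delta(w_global, idx, values):
        """Server side: apply the sparse, quantized update."""
        w = w_global.copy()
        w[idx] += values.astype(w.dtype)
        return w

For a million-parameter model that's tens of kilobytes per round instead of uploading the underlying data.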
Tangentially related: Numerai is a crowdsourced hedge fund that uses structure-preserving encryption to distribute its data while still ensuring it can be mined.

https://medium.com/numerai/encrypted-data-for-efficient-markets-fffbe9743ba8

Why did they not build something like this? I'm kind of concerned that my private keyboard data is being distributed without security. At first glance the secure aggregation protocol doesn't seem to be doing anything like this.
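Edit: reading the secure aggregation paper more closely, it does hide individual updates from the server: pairs of users share random masks that cancel when everything is summed, so the server only learns the aggregate. A toy sketch (the real protocol adds key agreement and dropout handling):

    import numpy as np

    def masked_updates(updates, seed=0):
        """Each pair (i, j) shares a random mask; i adds it, j subtracts
        it. Individually masked vectors look random, but the masks
        cancel in the server's sum."""
        rng = np.random.default_rng(seed)
        masked = [u.astype(np.float64).copy() for u in updates]
        for i in range(len(updates)):
            for j in range(i + 1, len(updates)):
                mask = rng.normal(size=updates[0].shape)
                masked[i] += mask
                masked[j] -= mask
        return masked

    updates = [np.ones(3), 2 * np.ones(3), 3 * np.ones(3)]
    masked = masked_updates(updates)
    print(sum(masked))  # ~[6. 6. 6.]: the true sum survives
    print(masked[0])    # looks nothing like that user's update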
This is literally just gradient descent without the "stochastic" part: each batch update comes from a single node and a correlated set of examples rather than a random sample. Nothing mind-blowing about it.
To be honest, I have thought about this for a long time in the context of distributed computing. If a problem takes a lot of time to compute, but can be computed in small pieces and then combined, why can't we pay users to subscribe their devices for the computation? This is a major step toward that big goal.
I don't work with ML for my day job but find it exhilaratingly interesting. (True story!)

When I first read this, I was thinking: surely we can already do distributed learning; isn't that what, for example, SparkML does?

Is the benefit of this in outsourcing the training of a large model to a bunch of weak devices?
I think the implications go even beyond privacy and efficiency. One could estimate each user's contribution to the fidelity gains of the model, at least as an average within a batch. I could imagine such an attribution being rewarded with money or credibility in the future.
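Purely speculative, but one cheap way to estimate that contribution would be a leave-one-out comparison against a held-out evaluation set (nothing in the announcement describes this; `eval_loss` is a placeholder and `client_updates` are NumPy weight vectors):

    def client_contributions(client_updates, weights, eval_loss):
        """For each client, measure how much worse the weighted-average
        model scores on held-out data when that client is excluded."""
        def average(excluded=None):
            keep = [i for i in range(len(client_updates)) if i != excluded]
            total = sum(weights[i] for i in keep)
            return sum(weights[i] / total * client_updates[i] for i in keep)

        baseline = eval_loss(average())
        return [eval_loss(average(excluded=i)) - baseline
                for i in range(len(client_updates))]

A positive score means the model got worse without that client, i.e. they contributed.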
What is the difference between this and distributed computing? Apart from the specific ML use case, I don't see many differences. SETI@home was an actual revolution made of actual volunteers (I don't know how many Google users will even be aware of this).
I had exactly this idea about a year ago!

I know ideas without execution aren't worth anything, but I'm just happy to see that my vision is pointed in the right direction.
I would argue there is no such thing. After the update, the model will incorporate your training data as a seen example; clever use of optimization could enable you to partly reconstruct that example.