Maybe I'm missing something, but from the paper <a href="https://www.cs.toronto.edu/~hinton/FFA13.pdf" rel="nofollow">https://www.cs.toronto.edu/~hinton/FFA13.pdf</a>, they use non-convolutional nets on CIFAR-10 for backprop, resulting in 63% accuracy, while FF achieves 59% accuracy (at best).<p>Those are relatively close figures, but good accuracy on CIFAR-10 is 99%+ and getting ~94% is trivial.<p>So, if an architecture ill-suited to the problem is used and the accuracy is poor, how compelling is using another optimization approach and achieving similar accuracy?<p>It's a unique and interesting approach, and the article specifically says it gets accuracy similar to backprop, but if this is the experiment that claim is based on, it loses some credibility in my eyes.
The article links to an old draft of the paper (it seems that the results in 4.1 couldn't be replicated). The arxiv has a more recent one: <a href="https://arxiv.org/abs/2212.13345" rel="nofollow">https://arxiv.org/abs/2212.13345</a>
I skimmed through the paper and am a bit confused. There's only one equation, and I feel like he rushed to publish a shower thought without even bothering to flesh it out mathematically.<p>So how do you optimize a layer? Do you still use gradient descent? So you have a per-layer loss with a positive and a negative component and then do gradient descent on it?<p>So then what is the label for each layer? Do you use the same label for each layer?<p>And what does he mean by the forward pass not being fully known? I don't get this application of a black box between layers. Why would you want to do that?
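If it helps, here is a minimal sketch of what the per-layer objective could look like, assuming the paper's "goodness" (squared activations compared against a threshold) and its supervised-MNIST trick of overlaying a one-hot label on the input, so every layer effectively sees the same label embedded in the data. The layer shape, learning rate, and threshold value are my own assumptions, and I use the mean of squared activations rather than the paper's sum, purely for scaling convenience:

```python
# Hedged sketch (not Hinton's reference code): one FF layer trained with a
# purely local objective. "Goodness" is the squared activation compared
# against a threshold; the label is embedded in the input itself.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def overlay_label(x, y, num_classes=10):
    """Replace the first 10 features with a one-hot label vector,
    as in the paper's supervised MNIST experiments."""
    x = x.clone()
    x[:, :num_classes] = F.one_hot(y, num_classes).float()
    return x

layer = torch.nn.Linear(784, 500)
opt = torch.optim.SGD(layer.parameters(), lr=0.03)
theta = 2.0  # goodness threshold (assumed value)

def ff_step(x_pos, x_neg):
    # Goodness = mean squared ReLU activation per example
    # (the paper uses the sum; the mean is just a scaled variant).
    g_pos = layer(x_pos).relu().pow(2).mean(dim=1)
    g_neg = layer(x_neg).relu().pow(2).mean(dim=1)
    # Push goodness above theta for positive data, below theta for negative.
    loss = (F.softplus(-(g_pos - theta)) + F.softplus(g_neg - theta)).mean()
    opt.zero_grad()
    loss.backward()  # autograd computes only this layer's local gradient
    opt.step()
    return loss.item()

# Toy usage: positive = correct label overlaid, negative = wrong label overlaid.
x = torch.rand(32, 784)
y = torch.randint(0, 10, (32,))
y_wrong = (y + torch.randint(1, 10, (32,))) % 10
print(ff_step(overlay_label(x, y), overlay_label(x, y_wrong)))
```

So, to the question about labels: every layer sees the same label, because it lives in the data rather than in a separate supervision signal; each layer just trains its own weights against its own loss.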
Deep dive tutorial for learning in a forward pass [1]<p>[1] <a href="https://amassivek.github.io/sigprop" rel="nofollow">https://amassivek.github.io/sigprop</a>
I found this paragraph from the paper very interesting:<p>> <i>7 The relevance of FF to analog hardware</i><p>> <i>An energy efficient way to multiply an activity vector by a weight matrix is to implement activities
as voltages and weights as conductances. Their products, per unit time, are charges which add
themselves. This seems a lot more sensible than driving transistors at high power to model the
individual bits in the digital representation of a number and then performing O(n^2) single bit
operations to multiply two n-bit numbers together. Unfortunately, it is difficult to implement the
backpropagation procedure in an equally efficient way, so people have resorted to using A-to-D
converters and digital computations for computing gradients (Kendall et al., 2020). The use of two
forward passes instead of a forward and a backward pass should make these A-to-D converters
unnecessary.</i><p>It was my impression that it is difficult to properly isolate an electronic system to use voltages in this way (hence computers sort of "cut" voltages into bits 0/1 using a step function).<p>Have these limitations been overcome or do they not matter as much, as neural networks can work with more fuzzy data?<p>Interesting to imagine such a processor though.
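A toy digital simulation of what the quoted paragraph describes, assuming an idealized, noise-free crossbar (all numbers and sizes below are arbitrary): activities are encoded as voltages, weights as conductances, and the multiply-accumulate falls out of Ohm's law and Kirchhoff's current law.

```python
# Idealized crossbar sketch: the "multiply" is Ohm's law (I = G * V) and the
# "accumulate" is Kirchhoff's current law (currents arriving on a wire add up).
import numpy as np

rng = np.random.default_rng(0)
V = rng.uniform(0.0, 1.0, size=4)         # activities encoded as voltages (volts)
G = rng.uniform(0.0, 1e-3, size=(3, 4))   # weights encoded as conductances (siemens)

# Each output wire collects the current contributed by every input wire.
I = np.array([sum(G[i, j] * V[j] for j in range(4)) for i in range(3)])

# Which is exactly a matrix-vector product, done "for free" by the physics.
assert np.allclose(I, G @ V)
print(I)
```

The noise, precision, and isolation problems you mention are exactly why real analog designs end up surrounding the array with A-to-D converters, which is the cost the paper hopes FF can avoid.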
The paragraph about Mortal Computation is worth repeating:<p>If these FF networks can be proven to scale, or made to scale similarly to BP networks, this would enable making hardware several orders of magnitude more efficient, at the price of losing the ability to make exact copies of models onto other computers.
(The loss of reproducibility sits well with the tradition of scientific papers anyway /s ;)<p>2.) How does this paper relate to Hinton's feedback alignment work from 5 years ago? I remember it was feedback without derivatives. What are the key new ideas? To adjust the output of each individual layer to be big for positive cases and small for negative cases, without any feedback? Have these approaches been combined?
Discussion last month when the preprint was released: <a href="https://news.ycombinator.com/item?id=33823170" rel="nofollow">https://news.ycombinator.com/item?id=33823170</a>
Not a deep learning expert, but: it seems that without backpropagation for model updates, the communication costs should be lower, and that would enable models that are easier to parallelize?<p>Nvidia isn't creating new versions of its NVLink/NVSwitch products just for the sake of it; better communication must be a key enabler.<p>Can someone with deeper knowledge comment on this? Is communication a bottleneck, and will this algorithm uncover a new design space for NNs?
This is an interesting approach, and I have read that it is closer to how our brains work.<p>We extract learning while we are imbibing the data, and there seems to be no mechanism in the brain that supports a backprop-like learning process.
Hinton's networks become the neurons of novel networks. It is important to note that these types of weights don't learn features; they map a compressed representation of the learned info, which is the input. Classification through error correction. That is actually what labels do for supervised learning (IOW they learn many ways to represent the label, and that is what the weights are).
Modern AI does that plus learns features, but the weights are nevertheless a representation of what was learned, plus a fancy way to encode and decode into that domain.<p>What Hinton and DeepMind will do is use neural-network-learned data, or perhaps the weights themselves, as input to this kind of network. In other words, the output of another NN is labeled a priori, ergo you can use it in "unsupervised" networks, which this research expounds. This will allow them to cook the input network into a specific dish, by labels even. Now give me my PhD.
There's an open source implementation of the paper in pytorch <a href="https://github.com/nebuly-ai/nebullvm/tree/main/apps/accelerate/forward_forward">https://github.com/nebuly-ai/nebullvm/tree/main/apps/acceler...</a> by @diegofiori_<p>He also wrote an interesting thread on the memory usage of this algo versus backprop <a href="https://twitter.com/diegofiori_/status/1605242573311709184?s=20&t=WjJVGJ_VOFuUKpRigaLLRw" rel="nofollow">https://twitter.com/diegofiori_/status/1605242573311709184?s...</a>
Direct link to an implementation on GitHub: <a href="https://github.com/nebuly-ai/nebullvm/tree/main/apps/accelerate/forward_forward">https://github.com/nebuly-ai/nebullvm/tree/main/apps/acceler...</a><p>--<p>The popularizing title is almost an understatement: the Forward-Forward algorithm is an alternative to backpropagation.<p>Edit: sorry, the earlier wording of this post about the advantages was due to a misreading. Hinton writes:<p>> <i>The Forward-Forward algorithm (FF) is comparable in speed to backpropagation but has the advantage that it can be used when the precise details of the forward computation are unknown. It also has the advantage that it can learn while pipelining sequential data through a neural network without ever storing the neural activities or stopping to propagate error derivatives....The two areas in which the forward-forward algorithm may be superior to backpropagation are as a model of learning in cortex and as a way of making use of very low-power analog hardware without resorting to reinforcement learning</i>
@dang<p>Meta question on HN implementation: why does submitting a previously submitted resource sometimes link automatically to the previous discussion, while other times it is considered a new submission?
Having read through the forward-forward paper, it feels like it's Oja's rule adapted for supervised learning but I can't really articulate why...
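For reference, Oja's rule in its standard single-neuron form is a purely local Hebbian update with a built-in normalization term, dw = eta * y * (x - y * w). The toy sketch below is just that textbook rule (nothing in it comes from the FF paper), which may help pin down the resemblance:

```python
# Oja's rule: Hebbian learning with a built-in normalization term,
# dw = eta * y * (x - y * w), driven only by locally available quantities.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=5)
eta = 0.01

for _ in range(5000):
    x = rng.normal(size=5)          # input sample
    y = w @ x                       # neuron output
    w += eta * y * (x - y * w)      # local update; no error signal from elsewhere

print(np.linalg.norm(w))            # the weight norm settles near 1
```

The family resemblance is that both update a layer using only quantities available at that layer; FF adds a supervised positive/negative contrast against a threshold, whereas Oja's rule is unsupervised.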
It’s incredible to think that dreams are just our brains generating training data, and lack of sleep causes us to overfit on our immediate surroundings.
It seems that the point is that the objective function is applied layerwise; it still computes gradients to get the update direction, it's just that the gradients don't propagate to previous layers (detached tensor).<p>As far as I can tell, this is almost the same as stacking multiple layers of ensembles, except worse, since each ensemble is trained while the previous ensembles are still learning. This causes context drift.<p>To deal with the context drift, Hinton proposes to normalise the output.<p>This isn't anything new or novel. Saying "ThIs LoOkS sImIlAr To HoW cOgNiTiOn WoRkS" to make it sound impressive doesn't make it impressive or good by any stretch of the imagination.<p>Hinton just took something that has existed for a long time, made it worse, gave it a different name and wrapped it in a paper under his name.<p>With every paper I am more convinced that the Laureates don't deserve the award.<p>Sorry, this "paper" smells from a mile away, and the fact that it is upvoted as much as it is shows that people will upvote anything if they see a pretty name attached.<p>Edit:<p>Due to the apparent controversy of my criticism, I can't respond with a reply, so here is my response to the comment below asking what exactly makes this worse.<p>> As far as I can tell, this is almost the same as stacking multiple layers of ensembles<p>It isn't new. Ensembling is used and has been used for a long time. All Kaggle competitions are won through ensembles, and even ensembles of ensembles. It is a well-studied field.<p>> except worse, since each ensemble is trained while the previous ensembles are still learning.<p>Ensembles exhibit certain properties, but only if they are trained independently of each other. This is well studied; you can read more about it in Bishop's Pattern Recognition book.<p>> This causes context drift.<p>Context drift occurs when a distribution changes over time. This changes the loss landscape, which means the global minima change / move.<p>> To deal with the context drift, Hinton proposes to normalise the output.<p>So not only is what Hinton built a variation of something that already existed, he made it worse by training the models simultaneously, and then, to handle the fact that it is worse, he adds additional computation to deal with said issue.
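For anyone who wants to see the mechanism described above in code, here is a rough two-layer sketch (my own naming and hyperparameters, not the paper's reference code): each layer minimizes its own local goodness loss, and the activity vector handed to the next layer is length-normalized and detached, so no gradient ever crosses a layer boundary.

```python
# Hedged sketch of the layer-wise scheme: local losses, normalized and detached
# hand-off between layers, no end-to-end backprop.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
theta = 2.0  # assumed goodness threshold

layers = [torch.nn.Linear(784, 500), torch.nn.Linear(500, 500)]
opts = [torch.optim.SGD(l.parameters(), lr=0.03) for l in layers]

def train_step(x_pos, x_neg):
    for layer, opt in zip(layers, opts):
        h_pos = layer(x_pos).relu()
        h_neg = layer(x_neg).relu()
        g_pos = h_pos.pow(2).mean(dim=1)
        g_neg = h_neg.pow(2).mean(dim=1)
        loss = (F.softplus(-(g_pos - theta)) + F.softplus(g_neg - theta)).mean()
        opt.zero_grad()
        loss.backward()  # only this layer's parameters receive gradients
        opt.step()
        # Normalize the activity vector (only its direction is passed on)
        # and detach it, so the next layer treats it as plain input data.
        x_pos = F.normalize(h_pos, dim=1).detach()
        x_neg = F.normalize(h_neg, dim=1).detach()

# Toy call with random "positive" and "negative" batches, just to show the flow.
train_step(torch.rand(32, 784), torch.rand(32, 784))
```

In the paper, the normalization exists so that a layer cannot score well simply by copying the previous layer's goodness; only the relative activities (the direction of the vector) are passed forward.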
I remember a long time ago there was a collection of Geoffrey Hinton facts. Like "Geoffrey Hinton once wrote a neural network that beat Chuck Norris."