Maybe I'm missing something, but from the paper <a href="https://www.cs.toronto.edu/~hinton/FFA13.pdf" rel="nofollow">https://www.cs.toronto.edu/~hinton/FFA13.pdf</a>, they use non-convolutional nets on CIFAR-10 for backprop, resulting in 63% accuracy, while FF achieves 59% accuracy (at best).<p>Those are relatively close figures, but good accuracy on CIFAR-10 is 99%+ and getting ~94% is trivial.<p>So, if an architecture ill-suited to the problem is used and the accuracy is poor, how compelling is using another optimization approach and achieving similar accuracy?<p>It's a unique and interesting approach, and the article specifically says it gets accuracy similar to backprop, but if this is the experiment that claim is based on, it loses some credibility in my eyes.
The article links to an old draft of the paper (it seems that the results in 4.1 couldn't be replicated). The arxiv has a more recent one: <a href="https://arxiv.org/abs/2212.13345" rel="nofollow">https://arxiv.org/abs/2212.13345</a>
I skimmed through the paper and am a bit confused. There's only one equation, and I feel like he rushed to publish a shower thought without even bothering to flesh it out mathematically.<p>So how do you optimize a layer? Do you still use gradient descent? So you have a per-layer loss with a positive and a negative component and then do gradient descent on it?<p>So then what is the label for each layer? Do you use the same label for each layer?<p>And what does he mean by the forward pass not being fully known? I don't get this application of a black box between layers. Why would you want to do that?
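If it helps, here is a minimal sketch of what the per-layer objective could look like, assuming the paper's "goodness" (squared activations compared against a threshold) and its supervised-MNIST trick of overlaying a one-hot label on the input, so every layer effectively sees the same label embedded in the data. The layer shape, learning rate, and threshold value are my own assumptions, and I use the mean of squared activations rather than the paper's sum, purely for scaling convenience:

```python
# Hedged sketch (not Hinton's reference code): one FF layer trained with a
# purely local objective. "Goodness" is the squared activation compared
# against a threshold; the label is embedded in the input itself.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def overlay_label(x, y, num_classes=10):
    """Replace the first 10 features with a one-hot label vector,
    as in the paper's supervised MNIST experiments."""
    x = x.clone()
    x[:, :num_classes] = F.one_hot(y, num_classes).float()
    return x

layer = torch.nn.Linear(784, 500)
opt = torch.optim.SGD(layer.parameters(), lr=0.03)
theta = 2.0  # goodness threshold (assumed value)

def ff_step(x_pos, x_neg):
    # Goodness = mean squared ReLU activation per example
    # (the paper uses the sum; the mean is just a scaled variant).
    g_pos = layer(x_pos).relu().pow(2).mean(dim=1)
    g_neg = layer(x_neg).relu().pow(2).mean(dim=1)
    # Push goodness above theta for positive data, below theta for negative.
    loss = (F.softplus(-(g_pos - theta)) + F.softplus(g_neg - theta)).mean()
    opt.zero_grad()
    loss.backward()  # autograd computes only this layer's local gradient
    opt.step()
    return loss.item()

# Toy usage: positive = correct label overlaid, negative = wrong label overlaid.
x = torch.rand(32, 784)
y = torch.randint(0, 10, (32,))
y_wrong = (y + torch.randint(1, 10, (32,))) % 10
print(ff_step(overlay_label(x, y), overlay_label(x, y_wrong)))
```

So, to the question about labels: every layer sees the same label, because it lives in the data rather than in a separate supervision signal; each layer just trains its own weights against its own loss.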
Deep dive tutorial for learning in a forward pass [1]<p>[1] <a href="https://amassivek.github.io/sigprop" rel="nofollow">https://amassivek.github.io/sigprop</a>
I found this paragraph from the paper very interesting:<p>> <i>7 The relevance of FF to analog hardware</i><p>> <i>An energy efficient way to multiply an activity vector by a weight matrix is to implement activities
as voltages and weights as conductances. Their products, per unit time, are charges which add
themselves. This seems a lot more sensible than driving transistors at high power to model the
individual bits in the digital representation of a number and then performing O(n^2) single bit
operations to multiply two n-bit numbers together. Unfortunately, it is difficult to implement the
backpropagation procedure in an equally efficient way, so people have resorted to using A-to-D
converters and digital computations for computing gradients (Kendall et al., 2020). The use of two
forward passes instead of a forward and a backward pass should make these A-to-D converters
unnecessary.</i><p>It was my impression that it is difficult to properly isolate an electronic system to use voltages in this way (hence computers sort of "cut" voltages into bits 0/1 using a step function).<p>Have these limitations been overcome or do they not matter as much, as neural networks can work with more fuzzy data?<p>Interesting to imagine such a processor though.
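A toy digital simulation of what the quoted paragraph describes, assuming an idealized, noise-free crossbar (all numbers and sizes below are arbitrary): activities are encoded as voltages, weights as conductances, and the multiply-accumulate falls out of Ohm's law and Kirchhoff's current law.

```python
# Idealized crossbar sketch: the "multiply" is Ohm's law (I = G * V) and the
# "accumulate" is Kirchhoff's current law (currents arriving on a wire add up).
import numpy as np

rng = np.random.default_rng(0)
V = rng.uniform(0.0, 1.0, size=4)         # activities encoded as voltages (volts)
G = rng.uniform(0.0, 1e-3, size=(3, 4))   # weights encoded as conductances (siemens)

# Each output wire collects the current contributed by every input wire.
I = np.array([sum(G[i, j] * V[j] for j in range(4)) for i in range(3)])

# Which is exactly a matrix-vector product, done "for free" by the physics.
assert np.allclose(I, G @ V)
print(I)
```

The noise, precision, and isolation problems you mention are exactly why real analog designs end up surrounding the array with A-to-D converters, which is the cost the paper hopes FF can avoid.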
The paragraph about Mortal Computation is worth repeating:<p>If these FF networks can be proven to scale, or made to scale similarly to BP networks, this would enable making hardware several orders of magnitude more efficient, at the price of losing the ability to make exact copies of models onto other computers.
(The loss of reproducibility sits well with the tradition of scientific papers anyway /s ;)<p>2.) How does this paper relate to Hinton's feedback alignment work from 5 years ago? I remember it was feedback without derivatives. What are the key new ideas? To adjust the output of each individual layer to be big for positive cases and small for negative cases, without any feedback? Have these approaches been combined?
Discussion last month when the preprint was released: <a href="https://news.ycombinator.com/item?id=33823170" rel="nofollow">https://news.ycombinator.com/item?id=33823170</a>
Not a deep learning expert, but: it seems that without backpropagation for model updates, the communication costs should be lower, and that would enable models that are easier to parallelize?<p>Nvidia isn't creating new versions of its NVLink/NVSwitch products just for the sake of it; better communication must be a key enabler.<p>Can someone with deeper knowledge comment on this? Is communication a bottleneck, and will this algorithm uncover a new design space for NNs?
This is an interesting approach, and I have read that it is closer to how our brains work.<p>We extract learning while we are imbibing the data, and there seems to be no mechanism in the brain that supports a backprop-like learning process.
Hinton's networks become the neurons of novel networks. It is important to note that these types of weights don't learn features; they map a compressed representation of the learned info, which is the input. Classification through error correction. That is actually what labels do for supervised learning (IOW they learn many ways to represent the label, and that is what the weights are).
Modern AI does that plus learns features, but the weights are nevertheless a representation of what was learned, plus a fancy way to encode and decode into that domain.<p>What Hinton and DeepMind will do is use neural-network-learned data, or perhaps the weights themselves, as input to this kind of network. In other words, the output of another NN is labeled a priori, ergo you can use it in "unsupervised" networks, which this research expounds. This will allow them to cook the input network into a specific dish, by labels even. Now give me my PhD.
There's an open source implementation of the paper in pytorch <a href="https://github.com/nebuly-ai/nebullvm/tree/main/apps/accelerate/forward_forward">https://github.com/nebuly-ai/nebullvm/tree/main/apps/acceler...</a> by @diegofiori_<p>He also wrote an interesting thread on the memory usage of this algo versus backprop <a href="https://twitter.com/diegofiori_/status/1605242573311709184?s=20&t=WjJVGJ_VOFuUKpRigaLLRw" rel="nofollow">https://twitter.com/diegofiori_/status/1605242573311709184?s...</a>
Direct link to an implementation on GitHub: <a href="https://github.com/nebuly-ai/nebullvm/tree/main/apps/accelerate/forward_forward">https://github.com/nebuly-ai/nebullvm/tree/main/apps/acceler...</a><p>--<p>The popularizing title is almost an understatement: the Forward-Forward algorithm is an alternative to backpropagation.<p>Edit: sorry, the earlier wording of this post about the advantages was due to a misreading. Hinton writes:<p>> <i>The Forward-Forward algorithm (FF) is comparable in speed to backpropagation but has the advantage that it can be used when the precise details of the forward computation are unknown. It also has the advantage that it can learn while pipelining sequential data through a neural network without ever storing the neural activities or stopping to propagate error derivatives....The two areas in which the forward-forward algorithm may be superior to backpropagation are as a model of learning in cortex and as a way of making use of very low-power analog hardware without resorting to reinforcement learning</i>
@dang<p>Meta question on HN implementation: why does submitting a previously submitted resource sometimes link automatically to the previous discussion, while other times it is considered a new submission?
Having read through the forward-forward paper, it feels like it's Oja's rule adapted for supervised learning but I can't really articulate why...
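For reference, Oja's rule in its standard single-neuron form is a purely local Hebbian update with a built-in normalization term, dw = eta * y * (x - y * w). The toy sketch below is just that textbook rule (nothing in it comes from the FF paper), which may help pin down the resemblance:

```python
# Oja's rule: Hebbian learning with a built-in normalization term,
# dw = eta * y * (x - y * w), driven only by locally available quantities.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=5)
eta = 0.01

for _ in range(5000):
    x = rng.normal(size=5)          # input sample
    y = w @ x                       # neuron output
    w += eta * y * (x - y * w)      # local update; no error signal from elsewhere

print(np.linalg.norm(w))            # the weight norm settles near 1
```

The family resemblance is that both update a layer using only quantities available at that layer; FF adds a supervised positive/negative contrast against a threshold, whereas Oja's rule is unsupervised.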
It’s incredible to think that dreams are just our brains generating training data, and lack of sleep causes us to overfit on our immediate surroundings.
It seems that the point is that the objective function is applied layerwise; it still computes gradients to get the update direction, it's just that the gradients don't propagate to previous layers (detached tensor).<p>As far as I can tell, this is almost the same as stacking multiple layers of ensembles, except worse, since each ensemble is trained while the previous ensembles are still learning. This causes context drift.<p>To deal with the context drift, Hinton proposes to normalise the output.<p>This isn't anything new or novel. Saying "ThIs LoOkS sImIlAr To HoW cOgNiTiOn WoRkS" to make it sound impressive doesn't make it impressive or good by any stretch of the imagination.<p>Hinton just took something that has existed for a long time, made it worse, gave it a different name and wrapped it in a paper under his name.<p>With every paper I am more convinced that the Laureates don't deserve the award.<p>Sorry, this "paper" smells from a mile away, and the fact that it is upvoted as much as it is shows that people will upvote anything if they see a pretty name attached.<p>Edit:<p>Due to the apparent controversy of my criticism, I can't respond with a reply, so here is my response to the comment below asking what exactly makes this worse.<p>> As far as I can tell, this is almost the same as stacking multiple layers of ensembles<p>It isn't new. Ensembling is used and has been used for a long time. All Kaggle competitions are won through ensembles, and even ensembles of ensembles. It is a well-studied field.<p>> except worse, since each ensemble is trained while the previous ensembles are still learning.<p>Ensembles exhibit certain properties, but only if they are trained independently of each other. This is well studied; you can read more about it in Bishop's Pattern Recognition book.<p>> This causes context drift.<p>Context drift occurs when a distribution changes over time. This changes the loss landscape, which means the global minima change / move.<p>> To deal with the context drift, Hinton proposes to normalise the output.<p>So not only is what Hinton built a variation of something that already existed, he made it worse by training the models simultaneously, and then, to handle the fact that it is worse, he adds additional computation to deal with said issue.
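For anyone who wants to see the mechanism described above in code, here is a rough two-layer sketch (my own naming and hyperparameters, not the paper's reference code): each layer minimizes its own local goodness loss, and the activity vector handed to the next layer is length-normalized and detached, so no gradient ever crosses a layer boundary.

```python
# Hedged sketch of the layer-wise scheme: local losses, normalized and detached
# hand-off between layers, no end-to-end backprop.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
theta = 2.0  # assumed goodness threshold

layers = [torch.nn.Linear(784, 500), torch.nn.Linear(500, 500)]
opts = [torch.optim.SGD(l.parameters(), lr=0.03) for l in layers]

def train_step(x_pos, x_neg):
    for layer, opt in zip(layers, opts):
        h_pos = layer(x_pos).relu()
        h_neg = layer(x_neg).relu()
        g_pos = h_pos.pow(2).mean(dim=1)
        g_neg = h_neg.pow(2).mean(dim=1)
        loss = (F.softplus(-(g_pos - theta)) + F.softplus(g_neg - theta)).mean()
        opt.zero_grad()
        loss.backward()  # only this layer's parameters receive gradients
        opt.step()
        # Normalize the activity vector (only its direction is passed on)
        # and detach it, so the next layer treats it as plain input data.
        x_pos = F.normalize(h_pos, dim=1).detach()
        x_neg = F.normalize(h_neg, dim=1).detach()

# Toy call with random "positive" and "negative" batches, just to show the flow.
train_step(torch.rand(32, 784), torch.rand(32, 784))
```

In the paper, the normalization exists so that a layer cannot score well simply by copying the previous layer's goodness; only the relative activities (the direction of the vector) are passed forward.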
I remember a long time ago there was a collection of Geoffrey Hinton facts. Like "Geoffrey Hinton once wrote a neural network that beat Chuck Norris."