TechEcho

A tech news platform built with Next.js, providing global tech news and discussions.

© 2025 TechEcho. All rights reserved.

Gradient Descent Optimisation Algorithms

83 points by raibosome, over 6 years ago

7 comments

Riverheart, over 6 years ago
For those unfamiliar with the concept, courtesy of Wikipedia: https://en.m.wikipedia.org/wiki/Gradient_descent

The basic intuition behind gradient descent can be illustrated by a hypothetical scenario. A person is stuck in the mountains and is trying to get down (i.e. trying to find the minima). There is heavy fog such that visibility is extremely low. Therefore, the path down the mountain is not visible, so he must use local information to find the minima. He can use the method of gradient descent, which involves looking at the steepness of the hill at his current position, then proceeding in the direction of steepest descent (i.e. downhill). If he were trying to find the top of the mountain (i.e. the maxima), he would instead proceed in the direction of steepest ascent (i.e. uphill). Using this method, he would eventually find his way down the mountain. However, assume also that the steepness of the hill is not immediately obvious from simple observation; rather, it requires a sophisticated instrument to measure, which the person happens to have with him. It takes quite some time to measure the steepness of the hill with the instrument, so he should minimize his use of it if he wants to get down the mountain before sunset. The difficulty, then, is choosing how frequently to measure the steepness of the hill so as not to go off track.
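The mountain analogy maps directly onto a few lines of code. A minimal sketch (a toy example of my own, not taken from the article): repeatedly "measure the steepness" (the gradient) and step downhill.

```python
# Plain gradient descent on the 2-D bowl f(x, y) = x**2 + 2*y**2,
# whose gradient is (2x, 4y) and whose minimum is at the origin.
def gradient_descent(grad, start, lr=0.1, steps=100):
    point = list(start)
    for _ in range(steps):
        g = grad(point)                                   # "measure the steepness"
        point = [p - lr * gi for p, gi in zip(point, g)]  # step downhill
    return point

bowl_grad = lambda p: (2 * p[0], 4 * p[1])
minimum = gradient_descent(bowl_grad, start=(3.0, -2.0))
# ends up very close to (0, 0)
```

The learning rate `lr` plays the role of how far he walks between measurements: too small and sunset arrives first, too large and he overshoots the valley floor.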
raibosome, over 6 years ago
At the end of this post you will find a cheat sheet of the 10 common gradient descent optimisation algorithms.

Using more readable notation, I will walk you through how vanilla stochastic gradient descent slowly evolved into the popular Adam optimiser and others. I also came up with an 'evolutionary map' of the optimisers to visualise this.

The motivation for writing this post is that there is a lack of simple-to-read equations for the parameter updates, and of a compiled list of these optimisers.

Hopefully this benefits the community.
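To give a flavour of that evolution, here are three of the update rules written out for a single scalar parameter. This is my own paraphrase in plain Python, not the post's notation, with textbook default hyperparameters:

```python
import math

def sgd_step(w, g, lr=0.01):
    """Vanilla SGD: step against the raw gradient."""
    return w - lr * g

def momentum_step(w, g, v, lr=0.01, beta=0.9):
    """Momentum: accumulate a decaying velocity, step against it."""
    v = beta * v + g
    return w - lr * v, v

def adam_step(w, g, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """Adam: bias-corrected first and second moment estimates."""
    m = b1 * m + (1 - b1) * g          # first moment (mean of gradients)
    v = b2 * v + (1 - b2) * g * g      # second moment (mean of squared gradients)
    m_hat = m / (1 - b1 ** t)          # bias correction for zero-initialisation
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v
```

Reading the three bodies top to bottom shows the lineage: momentum adds a state variable to SGD, and Adam adds a second state variable plus per-parameter step-size normalisation.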
usernomnomnom, over 6 years ago
This is very helpful! If I may make a shameless self-plug, this would be even better as something dynamic that can be played with interactively. A few years ago I made this iPython notebook for similar didactic purposes: https://github.com/turingbirds/gradient_descent/blob/master/...
raibosome, over 6 years ago
If I may, I had also built a simple demo of linear regression using gradient descent before writing this post: https://raiboso.me/backpropagation-demo/

This demo lets you choose between four optimisers and track the values of your variables during training.

Compare your runs with different optimisers using the graph at the bottom of the page.
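In the same spirit, here is a hypothetical miniature of what such a demo computes under the hood: fitting y = w·x + b by gradient descent on the mean squared error. The data and code are my own toy example, not taken from the demo.

```python
# Toy data generated by y = 2x + 1; gradient descent should recover w ≈ 2, b ≈ 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b, lr = 0.0, 0.0, 0.05
n = len(xs)
for _ in range(2000):
    # Gradients of the mean squared error with respect to w and b.
    dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    w, b = w - lr * dw, b - lr * db
```

Swapping the two update lines for a momentum or Adam rule is all it takes to turn this into a comparison like the one on the demo page.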
_jamesm_, over 6 years ago
It's always useful to see different SGD methods written with a consistent nomenclature. A few thoughts:

1. Is the 1999 Qian paper on momentum really the most appropriate one, given the comparison of the publication date to NAG? As a cursory examination of the paper reveals, momentum had been used for a long time before 1999!

2. Similarly, the original NAG paper isn't about stochastic gradient descent and doesn't really use the equation as written. A more appropriate reference is the Sutskever, Martens, Dahl and Hinton paper of 2013 (http://proceedings.mlr.press/v28/sutskever13.html), which is the publication that described/reworked NAG in this way.

3. It's worth noting the caveats about AMSGrad: https://www.fast.ai/2018/07/02/adam-weight-decay/
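On point 2, the contrast is easy to state in code. A sketch of classical momentum versus NAG in the Sutskever et al. (2013) formulation, with my own variable names; the only difference is where the gradient is evaluated:

```python
def momentum_update(w, v, grad, lr=0.01, mu=0.9):
    v = mu * v - lr * grad(w)             # gradient at the current point
    return w + v, v

def nesterov_update(w, v, grad, lr=0.01, mu=0.9):
    v = mu * v - lr * grad(w + mu * v)    # gradient at the "look-ahead" point
    return w + v, v
```

The look-ahead evaluation is the whole reworking: NAG corrects the velocity using the gradient where momentum is about to carry you, rather than where you currently stand.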
hnuser355, over 6 years ago
And for anyone who wants to know why unmodified gradient descent may be considered a piece of shit in certain circumstances: http://wikipedia.org/wiki/Rosenbrock_function

Gradient descent with a good line search (Wolfe conditions) applied to the multidimensional case should converge to the minimum, but it might take you thousands of iterations. Newton's method or the like might take fewer than 50.

But machine learning practitioners will know why gradient-based algorithms are often preferred despite this.
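To see the slowness concretely, here is a sketch (my own, with an assumed fixed step size rather than a line search) of plain gradient descent on the 2-D Rosenbrock function, whose minimum sits at (1, 1) at the bottom of a narrow curved valley:

```python
# f(x, y) = (1 - x)**2 + 100 * (y - x**2)**2
def rosenbrock_grad(x, y):
    dx = -2 * (1 - x) - 400 * x * (y - x * x)
    dy = 200 * (y - x * x)
    return dx, dy

# The steep valley walls force a tiny step size, so progress along
# the valley floor toward (1, 1) is a slow crawl.
x, y, lr = -1.0, 1.0, 1e-3
for _ in range(50000):
    gx, gy = rosenbrock_grad(x, y)
    x, y = x - lr * gx, y - lr * gy
```

Tens of thousands of fixed-step iterations to cross one banana-shaped valley is exactly the pathology the comment is pointing at.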
hoseja, over 6 years ago
Gradient descent is the abstraction of so many real-world problems it's not even funny. From protein folding to machine intelligence, gradient descent is everywhere...