What do you think about this analogy?

A simple process produces the Mandelbrot set.
A simple process (loss minimization through gradient descent) produces LLMs.
So what plays the role of the 2D plane, or the dense grid of points, in the case of LLMs?
It is the embeddings (or ordered combinations of embeddings) that are generated after pre-training.
In the case of a 2D plane, the closeness between two points is determined by our numerical representation scheme.
But in the case of embeddings, we learn this grid of words (the words playing the role of points) by looking at how the words are used in the corpus.
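As a rough sketch of what "closeness" means on that learned grid: the vectors below are made-up three-dimensional stand-ins for real learned embeddings, and cosine similarity is just one common way to compare them.

    import math

    def cosine_similarity(u, v):
        """Cosine of the angle between two embedding vectors."""
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    # Toy, hand-written vectors standing in for learned embeddings.
    embeddings = {
        "king":  [0.9, 0.7, 0.1],
        "queen": [0.85, 0.75, 0.15],
        "apple": [0.1, 0.2, 0.9],
    }

    print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # close
    print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # far
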
The following is a quote from Yuri Manin, an eminent mathematician (https://www.youtube.com/watch?v=BNzZt0QHj9U):

Of the properties of mathematics, as a language, the most peculiar one is that by playing formal games with an input mathematical text, one can get an output text which seemingly carries new knowledge. The basic examples are furnished by scientific or technological calculations: general laws plus initial conditions produce predictions, often only after time-consuming and computer-aided work. One can say that the input contains an implicit knowledge which is thereby made explicit.

I have a related idea, which I picked up from somewhere, that mirrors the above observation.

When we see beautiful fractals generated by simple equations and iterative processes,
we give importance only to the equations, not to the Cartesian grid on which that process operates.
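To make that concrete, here is a minimal sketch (plain Python, with hand-picked grid bounds and resolution) of the usual escape-time iteration z -> z*z + c applied at every point c of a dense grid over a patch of the complex plane; the grid is as much a part of the construction as the equation.

    # The "simple equation": z -> z*z + c, iterated at every grid point c.
    WIDTH, HEIGHT, MAX_ITER = 80, 40, 50

    for row in range(HEIGHT):
        line = ""
        for col in range(WIDTH):
            # The "grid": map character coordinates to a point c in the complex plane.
            c = complex(-2.0 + 3.0 * col / WIDTH, -1.2 + 2.4 * row / HEIGHT)
            z = 0j
            for _ in range(MAX_ITER):
                z = z * z + c
                if abs(z) > 2.0:   # escaped, so c is outside the set
                    line += " "
                    break
            else:
                line += "#"        # never escaped, so c is (approximately) in the set
        print(line)

Drop the grid of points and the same equation draws nothing; that quiet, essential role is the one the analogy assigns to embeddings for LLMs.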