For example, consider the attention mechanism applied to a phrase like: argument one is dog, argument two is cat, the function is concat, the result is dogcat.

So the query is the function, the keys are the arguments to the function, and the values are the result of applying the function to the arguments. Here the result of the function is memorized from the training set, not computed.
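Here's a minimal sketch of that lookup as scaled dot-product attention in NumPy. The vectors are toy embeddings invented for illustration, not learned weights:

    import numpy as np

    def attention(Q, K, V):
        # Similarity of each query to each key, scaled by sqrt(d_k)
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        # Softmax turns similarities into a probability distribution
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        # The output is a weighted blend of the values
        return weights @ V

    K = np.array([[1.0, 0.0], [0.0, 1.0]])  # keys for "dog" and "cat"
    V = np.array([[0.9, 0.1], [0.1, 0.9]])  # values associated with each key
    Q = np.array([[1.0, 0.1]])              # query close to the "dog" key
    print(attention(Q, K, V))               # ~[[0.62, 0.38]], mostly "dog"

The point is that the "lookup" is soft: the query doesn't select a single value, it blends all of them weighted by similarity.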
Basically yes. "Memorized" is wrong in the sense that neural networks, when they work well, learn an approximation of a function and often give the right answer for cases they haven't seen before. There is a danger, though, that a network will "overfit": memorize examples and fail to generalize to ones it hasn't seen.

The argument that "chatbots can't create anything new" is completely bogus (and is often tied up in a fetishization of creativity); there is no fundamental reason one can't attempt a literary task like "Write a play in the style of Shakespeare set in Russia's October Revolution". On the other hand, it can't (correctly) make up factual information that it hasn't been trained on: ChatGPT on its own resources can't talk about who won the Superbowl or Premier League this year because it hasn't seen any documents about it.

Note it doesn't have to be words; the same strategy works amazingly well for images and audio, see

https://en.wikipedia.org/wiki/Vision_transformer
It's not really a function, because it doesn't take the entire input and generate the whole output in one step. It's more of a stream of inputs and a stream of outputs, where the outputs go back in as inputs. It's basically an app, with layers doing different things: transforming words into tokens, processing them with an attention algorithm, and then feeding the result through a relatively simple neural net (compared to the architectures that preceded GPT).
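A sketch of that feedback loop, assuming a hypothetical model that maps a token sequence to next-token scores and a tokenizer with encode/decode (the names here are placeholders, not a real API):

    def generate(model, tokenizer, prompt, max_new_tokens=50):
        tokens = tokenizer.encode(prompt)      # words -> tokens
        for _ in range(max_new_tokens):
            scores = model(tokens)             # attention + feed-forward layers
            next_token = scores.argmax()       # greedy pick (sampling is also common)
            tokens.append(next_token)          # the output becomes part of the input
            if next_token == tokenizer.eos_id:
                break
        return tokenizer.decode(tokens)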
The idea is that an LLM trained on a programming language should use the functions as queries, learn to detect the keys (the arguments to those functions), and use the value matrix to hold a memorized version of the function computed with those arguments. So the sequence of learning in transformers goes something like this (a toy sketch follows the list):
1) query = is this a function? Which function?
2) keys = where are the arguments, and what are they?
3) values = embed the memorized result of computing the function with the given arguments.
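Here's a hand-built toy of those three steps, using the concat example from upthread. All vectors are invented for illustration; a real transformer learns them:

    import numpy as np

    tokens = ["concat", "dog", "cat"]
    # Step 1: the query at the "concat" position encodes "I'm a function, find my arguments"
    Q = np.array([[0.0, 1.0]])
    # Step 2: keys mark which positions are arguments (second component high)
    K = np.array([[1.0, 0.0],   # "concat" itself: not an argument
                  [0.0, 1.0],   # "dog": argument
                  [0.0, 1.0]])  # "cat": argument
    # Step 3: values carry each argument's (memorized) contribution to the result
    V = np.array([[0.0, 0.0],
                  [1.0, 0.0],   # the "dog" half of "dogcat"
                  [0.0, 1.0]])  # the "cat" half of "dogcat"

    scores = Q @ K.T / np.sqrt(2)
    w = np.exp(scores - scores.max()); w /= w.sum()
    print(w @ V)  # ~[[0.40, 0.40]]: attention splits evenly over the two arguments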
The most intuitive description I've heard is to imagine it like querying a continuous database. That is, it can give you responses from the space _between_ the data that it has memorized/stored/incorporated.

Caveat: I'm not super on top of AI stuff, but that description struck a chord with me.
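For what it's worth, that intuition is easy to demo with the same soft lookup as above: a query that falls between two stored keys gets back a blend of their values rather than a miss. Data and dimensions here are made up for illustration:

    import numpy as np

    keys   = np.array([[1.0, 0.0], [0.0, 1.0]])    # two "stored" entries
    values = np.array([[10.0, 0.0], [0.0, 10.0]])  # their associated answers
    query  = np.array([0.5, 0.5])                  # sits between the two entries

    scores = keys @ query
    w = np.exp(scores - scores.max()); w /= w.sum()
    print(w @ values)  # [5.0, 5.0] -- an interpolated answer, not a lookup miss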