The trick of performing binary addition by summing analog voltages was used in the IAS family of computers (1952), designed by John von Neumann. It implemented a full adder by converting two input bits and a carry-in bit into voltages that were summed. Vacuum tubes converted the analog voltage back into bits, using a threshold to generate the carry-out and more complex thresholds to generate the sum-out bit.
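Not the actual IAS circuitry, but a rough sketch of the thresholding idea (the specific thresholds here are my own illustration): sum the three input bits as if they were voltages, threshold the total for the carry-out, then threshold what's left after removing the carry's weight for the sum bit.

    def analog_style_full_adder(a, b, carry_in):
        # Treat the three input bits as unit 'voltages' and sum them (0..3).
        total = a + b + carry_in
        # A simple threshold recovers the carry-out.
        carry_out = 1 if total >= 2 else 0
        # A second threshold, applied after subtracting the carry's weight,
        # recovers the sum bit (standing in for the 'more complex thresholds').
        sum_out = 1 if total - 2 * carry_out >= 1 else 0
        return sum_out, carry_out

    # Exhaustive check against ordinary binary addition:
    for a in (0, 1):
        for b in (0, 1):
            for c in (0, 1):
                s, co = analog_style_full_adder(a, b, c)
                assert 2 * co + s == a + b + c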
That's awesome - the network's solution is essentially to convert the input to analog, *perform the actual addition in analog*, and then convert that back to digital. And the first two parts of that all happened in the input weights, no less.
Impressive. Am I right that what is really going on here is that the network implements the "+" operator by having the host CPU carry out the "+" when executing the network?

I.e., the network converts the binary input and output to floating point, and it is the CPU of the host the network runs on that *really* does the addition, in floating point.

So usually one does a bunch of FLOPs to get "emergent" behaviour that isn't doing arithmetic. But in this case the network does a bunch of transforms in and out so that, at the critical point, the addition executed by the network runner is used for exactly its original purpose: addition.

And I guess calling it "analog" is a good analogy for this.
These stories remind me of one from Discover Magazine: https://www.discovermagazine.com/technology/evolving-a-conscious-machine
A researcher used a process to "evolve" an FPGA, and the result was a circuit that was super efficient but worked in unexpected ways: part of the circuit seemed unconnected to the rest, yet if it was removed the whole thing stopped working, and it would only work at a specific temperature.
The essay linked from the article is interesting:
<a href="http://www.incompleteideas.net/IncIdeas/BitterLesson.html" rel="nofollow">http://www.incompleteideas.net/IncIdeas/BitterLesson.html</a>
Recent work on NNs in which applying Kolmogorov complexity optimization leads to a network that solves binary addition with a carry-over counter and provably 100% accuracy (see the Results section):

https://arxiv.org/abs/2111.00600
Related: an independent reverse engineering of a network that "grokked" modular addition:

https://www.alignmentforum.org/posts/N6WM6hs7RQMKDhYjB/a-mechanistic-interpretability-analysis-of-grokking

One interesting thing is that this network similarly does addition using a Fourier transform.
Not too surprising. Each layer can multiply each input by a set of weights and add them up, which is all you need to convert from base 2 to base 10: in base 2, the value of a string of digits is the sum of each digit times 2^0, 2^1, ..., 2^n, so each weight is twice the previous one. The activation function just makes it easier to center those factors on zero, which reduces the magnitude of the regularization term. I'm guessing you could get the job done with just two neurons in the first layer (each outputting the integer value of one input), one neuron in the second layer to do the addition (weights 1, 1), and 8 neurons in the final layer to convert back to binary digits.
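A minimal numerical sketch of that guess (my own construction, not the article's learned network; the final decode-back-to-bits step is done with plain integer bit operations rather than neurons, and the carry out of the top bit is ignored to match the 8-neuron guess):

    import numpy as np

    # Layer 1: two 'neurons', each a dot product with powers-of-two weights
    # (MSB first), turning each 8-bit input into its integer value.
    powers = np.array([2.0 ** k for k in range(7, -1, -1)])

    def add_8bit(bits_a, bits_b):
        a_val = powers @ np.asarray(bits_a, dtype=float)  # neuron 1: decode A
        b_val = powers @ np.asarray(bits_b, dtype=float)  # neuron 2: decode B
        total = 1.0 * a_val + 1.0 * b_val                 # layer 2: weights (1, 1)
        # 'Layer 3': recover 8 output bits, here by direct bit extraction.
        return [(int(total) >> k) & 1 for k in range(7, -1, -1)]

    print(add_8bit([0, 0, 0, 0, 0, 1, 0, 1],   # 5
                   [0, 0, 0, 0, 0, 0, 1, 1]))  # 3 -> [0, 0, 0, 0, 1, 0, 0, 0] (8)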
A couple of questions:

1. How much of this outcome is due to the unusual (pseudo-)periodic activation function? A lot of the DAC-like behavior seems to come from the periodicity of the first layer's output, which in turn seems to come from that activation function.

2. Would the behavior of the network change if the binary strings were encoded differently? The author encodes them as 1D arrays with 1 mapped to 1 and 0 mapped to -1, which is an unusual way of doing things. What if the author encoded them as literal binary arrays (i.e., 1 -> 1, 0 -> 0)? What about one-hot arrays (i.e., 2D arrays with 1 -> [1, 0] and 0 -> [0, 1]), which is the most common way to encode categorical data?
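For concreteness, here's my reading of the three encodings for a single 8-bit input (the variable names are mine, and the article's actual preprocessing may differ):

    import numpy as np

    bits = [1, 0, 1, 1, 0, 0, 1, 0]  # an example 8-bit input

    signed  = np.array([1 if b else -1 for b in bits])           # article: 1 -> 1, 0 -> -1
    literal = np.array(bits)                                     # plain binary: 1 -> 1, 0 -> 0
    one_hot = np.array([[1, 0] if b else [0, 1] for b in bits])  # categorical, shape (8, 2)

    print(signed)         # [ 1 -1  1  1 -1 -1  1 -1]
    print(literal)        # [1 0 1 1 0 0 1 0]
    print(one_hot.shape)  # (8, 2)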
Reminded me of the 0xFFFF0000 0xFF00FF00 0xF0F0F0F0 0xCCCCCCCC 0xAAAAAAAA trick, though I don't currently recall what it's used for...

edit: several uses, but probably not very relevant: https://graphics.stanford.edu/~seander/bithacks.html
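For what it's worth, one place that exact mask set shows up (on the bithacks page above, if memory serves) is finding the log base 2 of a value already known to be a power of two; a quick Python sketch:

    MASKS = [0xAAAAAAAA, 0xCCCCCCCC, 0xF0F0F0F0, 0xFF00FF00, 0xFFFF0000]

    def log2_of_power_of_two(v):
        # Each mask covers exactly the bit positions whose index has bit i set,
        # so OR-ing the per-mask tests rebuilds log2(v) one answer bit at a time.
        r = 0
        for i, mask in enumerate(MASKS):
            if v & mask:
                r |= 1 << i
        return r

    assert all(log2_of_power_of_two(1 << k) == k for k in range(32))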
When you expect your network to learn a binary adder and it learns FFT addition instead. :)

(For an ANN, an FFT is more natural, as it's a projection algorithm.)
In a similar vein, in the 1980s a "superoptimizer" was built for optimizing snippets of assembly by doing an exhaustive search over all combinations of instructions. It found several delightful and unexpected optimizations, which were quickly incorporated into compiler code generators.
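The flavor of it, as a toy (this instruction set and target are made up for illustration, nothing like the original assembly superoptimizer): enumerate every instruction sequence up to a length bound and keep the shortest one that matches the target function on all test inputs.

    from itertools import product

    # Toy single-register, 8-bit instruction set (purely illustrative).
    OPS = {
        "NOT": lambda x: ~x & 0xFF,
        "INC": lambda x: (x + 1) & 0xFF,
        "DEC": lambda x: (x - 1) & 0xFF,
        "SHL": lambda x: (x << 1) & 0xFF,
    }

    def run(program, x):
        for op in program:
            x = OPS[op](x)
        return x

    def superoptimize(target, max_len=3, tests=range(256)):
        # Exhaustive search, shortest programs first.
        for length in range(1, max_len + 1):
            for program in product(OPS, repeat=length):
                if all(run(program, x) == target(x) for x in tests):
                    return program
        return None

    # Rediscovers two's-complement negation: -x == ~x + 1 (mod 256).
    print(superoptimize(lambda x: -x & 0xFF))  # -> ('NOT', 'INC')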
So important. Here's a prediction: work like this on model explainability will continue and eventually will reveal patterns in the input that reliably produce features in the resulting models. That knowledge will tell us about *nature*, in addition to ML.
I liked this article a lot, but

> One thought that occurred to me after this investigation was the premise that the immense bleeding-edge models of today with billions of parameters might be able to be built using orders of magnitude fewer network resources by using more efficient or custom-designed architectures.

Transformer units themselves are already specialized things. Wikipedia says that GPT-3 is a standard transformer network, so I'm sure there is additional room for specialization. But that's not a new idea either, and it's often the case that after a model is released, smaller versions tend to follow.
Small nit: the analog DAC circuit has MSB and LSB reversed. That is, the LSB should go into the top 128k resistor. If you think of it one bit at a time, the gain of the circuit is -Rf/R, where R is one of the resistors in the ladder (Rf is the feedback resistor, which is unlabeled and has no value given in the article). Clearly, you get a bigger output when R is smaller.
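To put numbers on it (the 1k-to-128k ladder, Rf, and Vref below are assumed values for illustration, since the article leaves Rf unlabeled): in an inverting summing amp each bit contributes -Rf*Vref/R_i, so the smallest resistor must take the MSB and the 128k resistor the LSB.

    # Inverting summing-amp DAC: v_out = -Rf * Vref * sum(b_i / R_i)
    RESISTORS = [1e3 * 2 ** k for k in range(8)]  # 1k (MSB) ... 128k (LSB), assumed values
    RF, VREF = 1e3, 1.0                           # assumed feedback resistor and reference

    def dac(bits):  # bits MSB first
        return -RF * VREF * sum(b / r for b, r in zip(bits, RESISTORS))

    print(dac([0, 0, 0, 0, 0, 0, 0, 1]))  # LSB alone -> -0.0078125 (one step)
    print(dac([1, 0, 0, 0, 0, 0, 0, 0]))  # MSB alone -> -1.0       (128 steps)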
A more theoretical but increasingly popular approach to identifying behavior like this is interchange intervention: https://ai.stanford.edu/blog/causal-abstraction/
I don't see how this is surprising. It's cool, but come on: you have a network consisting of addition and multiplication, and you expect it not to use them? What else is it supposed to do, memorize every input/output pair like a lookup table?
Totally off topic: any pointers to tutorials/articles that teach how to implement common neural network models in JavaScript/Rust from scratch?
> One thought that occurred to me after this investigation was the premise that the immense bleeding-edge models of today with billions of parameters might be able to be built using orders of magnitude fewer network resources by using more efficient or custom-designed architectures.

I'm pretty convinced that something equivalent to GPT could run on consumer hardware today, and that the only reason it doesn't is that OpenAI has a vested interest in selling it as a service.

It's the same as Dall-E and Stable Diffusion: Dall-E makes no attempt to run on consumer hardware because it benefits OpenAI to make it so large that you must rely on someone with huge resources (i.e., them) to use it. Then some new research shows that effectively the same thing can be done on a consumer GPU.

I'm aware that there are plenty of other GPT-like models available on Huggingface, but (to my knowledge) nothing that reaches the same quality can run on consumer hardware, *yet*.