The relevant paper: <a href="https://arxiv.org/abs/2406.02528" rel="nofollow">https://arxiv.org/abs/2406.02528</a><p>In summary, they constrain the model's weights to ternary values ({-1, 0, +1}), which replaces matrix multiplications with additions, and then implement the model on a custom FPGA accelerator to process it more efficiently. It tests as "comparable" to small models (~3B params), theoretically scales to 70B, and is untested at SOTA scale (&gt;100B params).<p>We have always known custom hardware is more efficient, especially for tasks like these, where it is basically approximating an analog process (i.e. the brain). What is impressive is how fast it is progressing. These 3B-param models would demolish GPT-2, which is, what, 4-5 years old? And they would have been pure sci-fi tech 10 years ago.<p>Now they can run on your phone.<p>A machine, running locally on your phone, that can listen and respond to anything a human may say. Who could have confidently claimed this 10 years ago?
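<p>For the curious, here is a minimal sketch of what ternary quantization looks like, assuming the absmean scheme from BitNet b1.58 (which this line of work builds on); the function name is mine, not from the paper:<p>

```python
import numpy as np

def ternarize(W, eps=1e-8):
    """Absmean ternary quantization: scale by the mean absolute
    weight, then round each entry into {-1, 0, +1}."""
    gamma = np.mean(np.abs(W)) + eps          # per-tensor scale
    Q = np.clip(np.round(W / gamma), -1, 1)   # ternary weights
    return Q, gamma                           # W ≈ gamma * Q

# Toy weight matrix: after quantization every entry is -1, 0, or +1,
# so a "matmul" against Q needs only additions and subtractions.
W = np.array([[0.4, -1.2, 0.05],
              [0.9, -0.1, -0.7]])
Q, gamma = ternarize(W)
print(Q)   # entries drawn from {-1, 0, +1}
```

With weights restricted like this, the dot products in each layer reduce to signed accumulation, which is exactly the kind of operation an FPGA can pack very densely.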