Dear all,

You may have read the news of the [Raspberry Pi AI Kit](https://www.raspberrypi.com/news/raspberry-pi-ai-kit-available-now-at-70/) (Hailo-8L) with 13 TOPS. There is also a [blog post](https://hacks.mozilla.org/2024/05/experimenting-with-local-alt-text-generation-in-firefox-nightly/) from Mozilla saying that the new Firefox will have local AI ("182M parameters model using a Distilled version of GPT-2 alongside a Vision Transformer (ViT) image encoder."). JetBrains also stated somewhere that they ship a 100M-parameter model in their IDEs(?).

I read here that Phi-3 seems to have really good performance despite its small parameter size. I understand that VRAM or RAM often seems to be the bottleneck for generative AI, such as local LLMs. So I was wondering: now or in the near future, would it be possible for edge devices, such as the RPi AI Kit, to run relatively small language models, such as the distilled GPT-2, with vision and/or audio functionality?

Also, the Microsoft AI PC with the Snapdragon X Elite seems to have 40-45 TOPS, while the Hailo-8L only has 13 TOPS. From your experience, around how many TOPS are necessary and/or sufficient to run local AI, whether it's vision, audio, or NLP, at good-enough speed?

Many thanks!
If you have a modern multicore processor, you should be able to run just about any model as long as you have enough memory for it. Phi-3 fits well into the 4 GB Raspberry Pi memory profile when you quantize it, but you may want an even smaller model if you prefer speed over quality.

> around how many TOPS are necessary and/or sufficient to run local AI, whether it's vision, audio, or NLP, to have good-enough speed?

These are kinda individual questions.

- Vision is relatively compute-intensive and depends on how many frames per second of image detection you need.
- Audio is relatively low-latency if you use a smaller Whisper model.
- NLP/LLMs tend to be slow, but can stream their output, which makes it possible to speak answers while they're still being generated (see the sketch below).

So putting this all together, you really have to define what kind of end product you want. Making matters worse, not all TOPS are created equal; a 40 TOPS NPU that doesn't support every TensorFlow operation could hamstring your ability to run everything on one machine. Past a certain price point (like the Snapdragon X Elite) it really just makes more sense to buy a 12 GB 3060 and a cheap barebones desktop system to run it in.
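To make the quantization and streaming points concrete, here is a minimal sketch using llama-cpp-python with a quantized Phi-3-mini GGUF. The file name, context size, and thread count are assumptions (use whatever Q4 GGUF you actually download), not a benchmarked Pi configuration:

```python
# Minimal sketch: stream tokens from a quantized Phi-3 GGUF with llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="phi-3-mini-4k-instruct-q4.gguf",  # assumed file name; any Q4 GGUF works
    n_ctx=2048,    # keep the context small to save RAM on a 4 GB board
    n_threads=4,   # roughly one thread per Pi core
)

# stream=True yields tokens as they are generated, so a TTS engine
# could start speaking before the full answer is finished.
for chunk in llm.create_completion(
    "Explain what an NPU does in one sentence.",
    max_tokens=128,
    stream=True,
):
    print(chunk["choices"][0]["text"], end="", flush=True)
print()
```

A Q4-quantized Phi-3-mini is roughly 2-2.5 GB on disk, so it leaves some headroom on a 4 GB Pi as long as the context window stays small, and the same loop works unchanged with a smaller GGUF model if you want more speed.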