Hi everyone! We made a library for inference and fine-tuning of open 175B+ language models (like BLOOM) at home, without high-end GPUs. You join forces with other people over the Internet (BitTorrent-style), each serving a small part of the model's layers. Check it out in Colab:

https://colab.research.google.com/drive/1Ervk6HPNS6AYVr3xVdQnY5a-TjjmLCdQ?usp=sharing

Thing is, even though the BLOOM weights were publicly released, it was extremely difficult to run inference efficiently unless you had enough hardware to load the entire model into GPU memory (at least 3x A100 or 8x 3090 GPUs). With offloading, for example, you can only reach ~10 sec/step for sequential (non-parallel) generation. A possible alternative is to use hosted APIs, but they are paid and not always flexible (you can't adopt new fine-tuning/sampling methods or take a look at hidden states). So, Petals comes to the rescue!

Please share what you think of it!
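
P.S. To give a rough idea of what distributed inference looks like from the client side, here is a minimal sketch. The exact class and model names (DistributedBloomForCausalLM, "bigscience/bloom-petals") are written from memory and may differ from the actual Petals API, so treat it as an illustration rather than copy-paste code:

    from transformers import BloomTokenizerFast
    from petals import DistributedBloomForCausalLM  # assumed import path

    MODEL_NAME = "bigscience/bloom-petals"  # assumed model id for the public swarm

    tokenizer = BloomTokenizerFast.from_pretrained(MODEL_NAME)
    model = DistributedBloomForCausalLM.from_pretrained(MODEL_NAME)

    # Embeddings and the LM head run locally; the transformer blocks are
    # computed remotely by other peers in the swarm.
    inputs = tokenizer("A quick test of distributed inference:", return_tensors="pt")["input_ids"]
    outputs = model.generate(inputs, max_new_tokens=8)
    print(tokenizer.decode(outputs[0]))

The Colab above shows the actual, up-to-date usage.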