Simon, I opened an issue on your TIL repo with the pip incantation that I think will get the GPU working.

https://github.com/simonw/til/issues/69

I ran into that previously.
I read "Paperspace" as "paper space" so it reminded of this great article: <a href="http://www.righto.com/2014/09/mining-bitcoin-with-pencil-and-paper.html" rel="nofollow">http://www.righto.com/2014/09/mining-bitcoin-with-pencil-and...</a><p>Could someone do the same with some LLM to demonstrate a very simple example?
We'd love to help you all deploy this!

1. We just released a couple of models that are much smaller (https://huggingface.co/databricks/dolly-v2-6-9b), and these should be much easier to run on commodity hardware in a reasonable amount of time.

2. Regarding this particular issue, I suspect something is wrong with the setup. The example is generating a little over 100 words, which is probably something like 250 tokens. 12 minutes makes no sense for that if you're running on a modern GPU. I'd love to see details on which GPU was selected - I'm not aware of a modern GPU with 30GB of memory (the A10 is 24GB, the T4 is 16GB, and the A100 is 40/80GB). Are you sure you're using a version of PyTorch that installs CUDA correctly?

3. We have seen single-GPU inference work in 8-bit on the A10, so I'd suggest that as a follow-up.
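On point 2, a quick sanity check that the installed PyTorch build actually sees the GPU might look like this (a minimal sketch; the wheel index URL in the comment is just a common example, not necessarily the right one for Paperspace):

    import torch

    # False here usually means a CPU-only torch build was installed.
    # Reinstalling from a CUDA wheel index often fixes it, e.g.
    #   pip install torch --index-url https://download.pytorch.org/whl/cu118
    # (the exact index depends on the machine's CUDA version).
    print("CUDA available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("Device:", torch.cuda.get_device_name(0))
        total_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
        print(f"Memory: {total_gb:.1f} GB")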
I wrote a small POC of getting this model working on my box (I felt inspired after reading this). If anybody else wants to try this out, give it a shot here!

https://github.com/lunabrain-ai/dolly-v2-12b-8bit-example

(It's garbage code and should really just be used as a starting POC. I hope it helps!)
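For anyone who just wants the rough shape of the 8-bit route before digging into the repo, a sketch of loading dolly-v2-12b with bitsandbytes quantization might look like the following (this assumes transformers, accelerate, and bitsandbytes are installed; the prompt and generation parameters are illustrative, not taken from the linked repo):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "databricks/dolly-v2-12b"
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # load_in_8bit quantizes the weights via bitsandbytes, so the 12B model
    # fits in roughly half the GPU memory an fp16 load would need.
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map="auto",
        load_in_8bit=True,
        torch_dtype=torch.float16,
    )

    prompt = "Explain the difference between nuclear fission and fusion."
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=128)
    print(tokenizer.decode(output[0], skip_special_tokens=True))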