<a href="https://github.com/oobabooga/text-generation-webui/">https://github.com/oobabooga/text-generation-webui/</a><p>Works on all platforms, but runs much better on Linux.<p>Running this in Docker on my 2080 Ti, I can barely fit 13B 4-bit models into its 11 GB of VRAM, but it works fine and produces around 10-15 tokens/second most of the time. It also has an API that you can use with something like LangChain (rough sketch at the bottom of this comment).<p>It supports multiple ways to run the models: purely with CUDA (I think AMD support is coming too) or on CPU with llama.cpp (which can also offload part of the model to GPU VRAM, though performance is still nowhere near CUDA; see the second sketch below).<p>Don't expect open-source models to perform as well as ChatGPT though; they're still pretty limited in comparison. A good place to get the models is TheBloke's page - <a href="https://huggingface.co/TheBloke" rel="nofollow">https://huggingface.co/TheBloke</a>. Tom converts popular LLM builds into multiple formats that you can use with textgen, and he's a pillar of the local LLM community.<p>I'm still learning how to fine-tune/train LoRAs; it's pretty finicky but promising. I'd like to be able to feed personal data into the model and have it reliably answer questions about it (the last sketch below shows the rough shape).<p>In my opinion, these developments are way more exciting than whatever OpenAI is doing. There's no way I'm pushing my chat logs into some corporate datacenter, but running locally and storing checkpoints safely would achieve my end goal of having a model "impersonate" me on the web.
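<p>To make the API point concrete, here's roughly what a call against the webui's blocking API looks like. This is a minimal sketch: the endpoint, port, and payload shape are what the api extension exposed on the version I'm running (started with the --api flag), so check the project wiki if yours differs.<p><pre><code># Sketch of hitting text-generation-webui's blocking API
# (enable it with --api; the default port was 5000 on my install).
import requests

def generate(prompt, max_new_tokens=200):
    payload = {
        "prompt": prompt,
        "max_new_tokens": max_new_tokens,
        "temperature": 0.7,
    }
    resp = requests.post("http://localhost:5000/api/v1/generate",
                         json=payload, timeout=120)
    resp.raise_for_status()
    # Completions come back wrapped in a "results" list.
    return resp.json()["results"][0]["text"]

print(generate("### Instruction:\nSummarize LoRA in one sentence.\n\n### Response:\n"))
</code></pre><p>Last I checked, LangChain also ships a TextGen wrapper (langchain.llms.TextGen, pointed at the same server via model_url), so you don't have to hand-roll requests if you're already in that ecosystem.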
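<p>And the partial-offload idea, sketched with llama-cpp-python (the binding the webui uses for llama.cpp models): n_gpu_layers is the knob that moves transformer layers into VRAM while the rest stays on the CPU. The model path here is a placeholder.<p><pre><code># Sketch of partial GPU offload via llama-cpp-python.
# model_path is a placeholder; raise n_gpu_layers until VRAM runs out.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-13b.q4_0.bin",  # placeholder quantized model
    n_gpu_layers=32,  # 0 = pure CPU; each offloaded layer shifts work to the GPU
    n_ctx=2048,       # context window
)

out = llm("Q: Why offload layers to the GPU? A:", max_tokens=96, stop=["Q:"])
print(out["choices"][0]["text"])
</code></pre>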
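<p>Finally, the rough shape of the LoRA training I've been fighting with, written against Hugging Face peft/transformers. Everything here is illustrative, not a recipe: the base model, hyperparameters, and the toy dataset are placeholders, and in practice you'll spend far more time on VRAM limits and prompt formatting than this suggests.<p><pre><code># Rough LoRA fine-tuning sketch with Hugging Face peft + transformers.
# Base model, hyperparameters, and the toy dataset are placeholders.
from datasets import Dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "huggyllama/llama-7b"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA has no pad token

# 8-bit load (bitsandbytes) so it has a chance of fitting in consumer VRAM.
model = AutoModelForCausalLM.from_pretrained(base, load_in_8bit=True,
                                             device_map="auto")
model = prepare_model_for_int8_training(model)

# Inject low-rank adapters into the attention projections; only these
# small matrices get trained, the frozen base weights stay untouched.
model = get_peft_model(model, LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05, bias="none",
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
))
model.print_trainable_parameters()  # typically well under 1% of the weights

# Placeholder personal data; real use would be your chat logs / notes.
texts = ["Q: What city do I live in?\nA: ...", "Q: ...\nA: ..."]
ds = Dataset.from_dict({"text": texts}).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    remove_columns=["text"])

Trainer(
    model=model,
    args=TrainingArguments(output_dir="lora-out", num_train_epochs=3,
                           per_device_train_batch_size=1, learning_rate=2e-4,
                           fp16=True, logging_steps=10),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()

model.save_pretrained("lora-out")  # writes just the small adapter weights
</code></pre><p>The webui's Training tab does roughly this under the hood, and the saved adapter can be dropped into its loras/ directory and applied on top of the base model with the --lora flag.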