As in the title: if you're self-hosting an LLM, which one are you using and how did you set it up?

For context, I got an image generator (Fooocus) running locally in about 20 minutes. Keen to try the same with an LLM.
I recently found out about llama.cpp's official server example: https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md

It works like a charm for my (simple) use case. I'm running it on an Always Free Oracle Ampere A1 instance with 4 cores and ~20 GB of memory. (Obligatory "fuck Larry Ellison" here.)
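For anyone curious, once the server is running you can query it over HTTP from anything. A rough Python sketch, assuming the default localhost:8080 address and the `/completion` endpoint described in that README (adjust host, port, and prompt to your setup):

```python
# Minimal sketch of querying a running llama.cpp server.
# Assumes it is listening on localhost:8080 (the default port);
# change the URL and prompt to match your own instance.
import requests

resp = requests.post(
    "http://localhost:8080/completion",
    json={
        "prompt": "Explain what a context window is in one sentence.",
        "n_predict": 128,   # cap on the number of generated tokens
        "temperature": 0.7,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["content"])  # generated text is returned under "content"
```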
I am running ollama on my MSI GP76 Windows laptop with an RTX 3080 and 64 GB of RAM, inside the baked-in Linux installation. It recognized the graphics card right away and works pretty well. On my MacBook Pro M3 Max with 36 GB of RAM, I can't run the 70B-parameter model.
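If you go the ollama route, it also exposes a local REST API (port 11434 by default), so you can script against it. A rough sketch, assuming default settings and that you've already pulled a model such as llama3 with `ollama pull llama3`:

```python
# Minimal sketch of calling a local ollama instance.
# Assumes ollama is serving on localhost:11434 (the default) and that
# the "llama3" model has already been pulled; swap in whatever model you use.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",
        "prompt": "Summarize why quantization helps LLMs run on laptops.",
        "stream": False,  # return one JSON object instead of a token stream
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])  # generated text is returned under "response"
```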