If anyone is looking for a more cost-effective solution, Hetzner has 16 vCPU/32 GB RAM ARM VMs for €24/mo that will run a 34B Q4 GGUF at around 4 tok/sec. It's not very fast, but it is very cheap.
Something that would be extremely helpful is a good benchmark of various hardware for LLM inference. It's really hard to tell how well a given GPU will perform, or whether it will be supported at all.
So roughly how much does this instance cost per day? Like $30?
I'm kind of confused why it wasn't mentioned, but hey, maybe people aren't as cheap as me. Cool project though.
One use case for running LLMs on a CPU is executing long background tasks that don't require real-time responses. llama.cpp seems like a suitable platform for this, and it would be interesting to explore how to leverage the various acceleration techniques available on AWS.
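As a rough sketch of what such a background job could look like, here's a batch summarization loop using the llama-cpp-python bindings; the model path, prompt, and inputs are placeholders, and n_threads should be tuned to your instance:

    # Minimal sketch: batch-summarize documents offline with llama-cpp-python.
    # Model path and inputs are hypothetical; adjust n_threads to your vCPUs.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./models/example-34b-q4_k_m.gguf",  # hypothetical Q4 GGUF
        n_ctx=4096,
        n_threads=16,  # e.g. one per vCPU on a 16 vCPU instance
    )

    for doc in ["long text one...", "long text two..."]:  # placeholder inputs
        out = llm(f"Summarize the following text:\n\n{doc}\n\nSummary:",
                  max_tokens=256)
        print(out["choices"][0]["text"])

Since nothing here is latency-sensitive, 4 tok/sec is perfectly workable: you just queue the jobs up and collect the results whenever they finish.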