This is the point of it:

https://github.com/ggerganov/llama.cpp/pull/11016#issuecomment-2599740463
This looks great!

While we're at it, is there already some kind of standardized local storage location/scheme for LLM models? If not, this project could be a great place to set an example that others can follow if they want. I've been playing with different runtimes (Ollama, vLLM) over the last few days, and I really would have appreciated better interoperability in terms of shared model storage, instead of everybody defaulting to downloading everything all over again.
To make AI really boring, all these projects need to be more approachable to non-tech-savvy people, e.g. some minimal GUI for searching, listing, deleting, and installing AI models. I wish this or Ollama could work more as an invisible dependency manager for AI models. Right now every app that wants STT like Whisper bundles such a model inside. Users waste storage and have to wait to download big models. We had similar problems with static libraries and then moved to dynamically linked libraries.

I wish your app could add a model as a dependency and, on install, download it only if that model isn't already available locally. It could also check whether Ollama is installed and only bootstrap if nothing exists on the drive yet, ideally with a nice interface for the user to confirm the download and some nice onboarding.
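Roughly what I have in mind, as a hypothetical sketch — the cache paths and the download step are placeholders for illustration, not real RamaLama or Ollama APIs:

    from pathlib import Path

    # Hypothetical cache locations an installer might probe before downloading;
    # these are common defaults (Ollama, Hugging Face) but purely illustrative.
    MODEL_DIRS = [
        Path.home() / ".ollama" / "models",
        Path.home() / ".cache" / "huggingface" / "hub",
    ]

    def ensure_model(name: str) -> Path:
        """Return a local copy of `name`, downloading only if it is missing."""
        for cache in MODEL_DIRS:
            candidate = cache / name
            if candidate.exists():
                return candidate  # reuse whatever is already on disk
        # Placeholder: ask the user to confirm, then fetch the model here.
        raise FileNotFoundError(f"{name} not found locally; download needed")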
One of my primary goals for RamaLama was to allow users to move AI models into containers, so they can be stored in OCI registries. I believe there is going to be a proliferation of "private" models, and eventually "private" RAG data. (I'm working heavily on RAG support in RamaLama now.)

Once you have private models and RAG, I believe you will want to run these models and data on edge devices and in Kubernetes clusters. Getting the AI models and data into OCI content would let us take advantage of content signing, trust, and mirroring, and make running AI in production easier.

It would also allow users to block access to outside "untrusted" AI models stored on the internet, and allow companies to use only "trusted" AI.

Since companies already have OCI registries, it makes sense to store your AI models and content in the same location.
122 points 2 hours ago, yet this is currently #38 and not on the front page. Strange. At the same time I see numerous items on the front page posted 2 hours ago or older with fewer points.

I'm willing to take a reputation hit on this meta post. I wonder why this got demoted from the front page so quickly despite people clearly voting on it. I wonder if it has anything to do with being backed by YC.

I sincerely hope it's just my misunderstanding of the HN algorithm, though.
> Running in containers eliminates the need for users to configure the host system for AI.<p>When is that a problem?<p>Based on the linked issue in eigenvalue's comment[1], this seems like a very good thing. It sounds like ollama is up to no good and this is a good drop-in replacement. What is the deeper problem being solved here though, about configuring the host? I've not run into any such issue.<p>1. <a href="https://news.ycombinator.com/item?id=42888129">https://news.ycombinator.com/item?id=42888129</a>
What benefit does Ollama (or RamaLama) offer over plain llama.cpp or llamafile? The only thing I understand is that there is automatic downloading of models behind the scenes, but a big reason for me to use local models at all is that I want to know exactly what files I use and keep them sorted and backed up properly, so a tool automatically downloading models and dumping them in some cache directory just sounds annoying.
Does this provide an Ollama-compatible API endpoint? I've got at least one other project running that only supports Ollama's API or OpenAI's hosted solution (i.e. the API endpoint isn't configurable to use llama.cpp and friends).
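For reference, here's a minimal sketch of the kind of call involved, using the OpenAI-style /v1/chat/completions route that llama.cpp's llama-server exposes — the host, port, and model name are placeholder assumptions, and whether RamaLama also answers Ollama's native /api/* routes is exactly the open question:

    import json
    import urllib.request

    # Assumption: an OpenAI-compatible server (e.g. llama.cpp's llama-server,
    # or `ramalama serve`) is listening locally; adjust host/port/model to taste.
    payload = {
        "model": "tinyllama",  # placeholder model name
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    }
    req = urllib.request.Request(
        "http://localhost:8080/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)
    print(reply["choices"][0]["message"]["content"])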
So it's a replacement for Ollama?

The killer features of Ollama for me right now are the nice library of quantized models and the ability to automatically start and stop serving models in response to incoming requests and timeouts. The first seems to be solved by reusing the Ollama models, but I can't tell from a cursory look whether the second is possible.
I am doing a short talk on this tomorrow at FOSDEM:

https://fosdem.org/2025/schedule/event/fosdem-2025-4486-ramalama-making-working-with-ai-models-boring/