Hi HN, happy to see this here!<p>I highly recommend taking a look at the technical details of the server implementation that enables large context usage with this plugin - I think it is interesting and has some cool ideas [0].<p>Also, the same plugin is available for VS Code [1].<p>Let me know if you have any questions about the plugin - happy to explain. Btw, the performance has improved compared to what is seen in the README videos thanks to client-side caching.<p>[0] - <a href="https://github.com/ggerganov/llama.cpp/pull/9787">https://github.com/ggerganov/llama.cpp/pull/9787</a><p>[1] - <a href="https://github.com/ggml-org/llama.vscode">https://github.com/ggml-org/llama.vscode</a>
This guy is a national treasure and has contributed so much value to the open source AI ecosystem. I hope he’s able to attract enough funding to continue making software like this and releasing it as true “no strings attached” open source.
Very exciting - I'm a long-time vim user but most of my coworkers use VSCode, and I've been wanting to try out in-editor completion tools like this.<p>After using it for a couple hours (on Elixir code) with Qwen2.5-Coder-3B and no attempts to customize it, this checks a lot of boxes for me:<p><pre><code> - I pretty much want fancy autocomplete: filling in obvious things and saving my fingers the work, and these suggestions are pretty good
- the default keybindings work for me, I like that I can keep current line or multi-line suggestions
- no concerns around sending code off to a third-party
- works offline when I'm traveling
- it's fast!
</code></pre>
So that I don't need to remember how to run the server, I'll probably set up a script that checks whether it's running, starts it in the background if not, and then launches vim - and alias vim to use that. I looked in the help documents but didn't see a way to disable the "stats" text after the suggestions, though I'm not sure it will bother me that much.
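For anyone setting up the same thing, the wrapper I have in mind is roughly the sketch below - purely hypothetical, with the port, model path and server flags as placeholders for whatever you normally pass to llama-server:<p><pre><code>import os, socket, subprocess, sys, time

HOST, PORT = "127.0.0.1", 8012  # placeholder: whatever port your llama-server listens on

def server_up():
    # treat "something accepts TCP connections on the port" as "server is running"
    try:
        with socket.create_connection((HOST, PORT), timeout=0.5):
            return True
    except OSError:
        return False

if not server_up():
    # start the server detached in the background; model path and flags are placeholders
    subprocess.Popen(
        ["llama-server", "-m", os.path.expanduser("~/models/qwen2.5-coder-3b.gguf"),
         "--port", str(PORT)],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
        start_new_session=True)
    # wait briefly for the port to come up before vim starts asking for completions
    for _ in range(20):
        if server_up():
            break
        time.sleep(0.5)

# replace this process with vim, passing through any arguments
os.execvp("vim", ["vim"] + sys.argv[1:])
</code></pre>Alias vim to that script and the server comes up on demand.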
I wonder how the "ring context" works under the hood. I have previously had (and recently messed around with again) a somewhat similar project designed for a more toy/exploratory setting (<a href="https://github.com/blackhole89/autopen">https://github.com/blackhole89/autopen</a> - demo video at <a href="https://www.youtube.com/watch?v=1O1T2q2t7i4" rel="nofollow">https://www.youtube.com/watch?v=1O1T2q2t7i4</a>), and one of the main problems to definitively address is how to manage your KV cache cleverly so you don't have to constantly redo expensive computation whenever the buffer undergoes local changes.<p>The solution I came up with involved maintaining a tree of tokens branching whenever an alternative next token was explored, with full LLM state snapshots at fixed depth intervals so that the buffer would only have to be "replayed" for a few tokens when something changed. I wonder if there are some mathematical properties of how the important parts of the state (really, the KV cache, which can be thought of as a partial precomputation of the operation that one LLM iteration performs on the context) work that could have made this more efficient, such as avoiding full snapshots or pruning the "oldest" tokens out of a state efficiently.<p>(edit: Georgi's comment that beat me by 3 minutes appears to be pointing at information that would go some way toward answering my questions!)
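For the curious, the bookkeeping I described boils down to something like the toy Python sketch below - the real thing wraps the LLM's state save/restore calls, which I leave out here, and the snapshot interval and names are made up:<p><pre><code>from dataclasses import dataclass, field

SNAPSHOT_EVERY = 16  # keep a full state snapshot every N tokens along a branch (tunable)

@dataclass
class Node:
    token: int
    parent: "Node | None" = None
    children: dict = field(default_factory=dict)  # next_token -> Node
    depth: int = 0
    snapshot: object = None  # full KV-cache copy at this prefix, or None

def descend(node, token):
    # branch (or follow an existing branch) with the next token explored
    if token not in node.children:
        node.children[token] = Node(token, parent=node, depth=node.depth + 1)
    return node.children[token]

def restore_point(node):
    # nearest ancestor snapshot plus the tokens to replay on top of it;
    # the root is assumed to hold a snapshot of the empty/system prefix
    replay = []
    while node.snapshot is None:
        replay.append(node.token)
        node = node.parent
    return node.snapshot, list(reversed(replay))
</code></pre>A caller restores the snapshot, replays at most SNAPSHOT_EVERY tokens, and stores a fresh snapshot whenever the depth hits a multiple of the interval - bounded recomputation in exchange for one full state copy per interval on each explored branch.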
A little bit of a tangent, but I'm really curious what benefits could come from integrating these LLM tools more closely with data from LSPs, compilers, and other static analysis tools.<p>Intuitively, it seems like you could provide much more context and get better output as a result. Even better would be if you could fine-tune LLMs on a per-language basis and ship them alongside typical editor tooling.<p>A problem I see w/ these AI tools is that they work much better with old, popular languages, and I worry that this will become a significant factor when choosing a language. Anecdotally, I see far better results when using TypeScript than Gleam, for example.<p>It would be very cool to be able to install a Gleam-specific model that could be fed data from the LSP and compiler, and wouldn't constantly hallucinate invalid syntax. I also wonder if, with additional context & fine-tuning, you could make these models smaller and more feasible to run locally on modest hardware.
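To make the idea concrete, here is a purely hypothetical sketch of what feeding LSP output into a completion request could look like - the sentinel tokens and the shape of the extra context are assumptions, not any existing plugin's API:<p><pre><code># model-specific fill-in-the-middle sentinels; these vary between model families
FIM_PRE, FIM_SUF, FIM_MID = "<PRE>", "<SUF>", "<MID>"

def build_fim_prompt(prefix: str, suffix: str, lsp_notes: list[str]) -> str:
    # prepend editor-derived facts (signatures, inferred types, diagnostics pulled
    # from the language server) as comments ahead of the usual prefix/suffix prompt
    notes = "".join(f"// {note}\n" for note in lsp_notes)
    return f"{FIM_PRE}{notes}{prefix}{FIM_SUF}{suffix}{FIM_MID}"

# example: a Gleam completion request enriched with facts the compiler already knows
prompt = build_fim_prompt(
    prefix="pub fn handle(req) {\n  let user = ",
    suffix="\n}",
    lsp_notes=[
        "in scope: fetch_user(id: Int) -> Result(User, Nil)",
        "diagnostic: req has type Request(BitArray)",
    ],
)
</code></pre>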
Can anyone compare this to TabbyML?[0] I just set that up yesterday for emacs to check it out.<p>The context gathering seems very interesting[1], and very vim-integrated, so I'm guessing there isn't anything very similar for Tabby. I skimmed the docs and saw some stuff about context for the Tabby chat feature[2], which I'm not super interested in using even if adding docs to the context sounds nice, but nothing obvious for the autocompletion[3].<p>Does anyone have more insight or info to compare the two?<p>As a note, I quite like that the LLM context here "follows" what you're doing. It seems like a nice idea. Does anyone know if anyone else does something similar?<p>[0] <a href="https://www.tabbyml.com/" rel="nofollow">https://www.tabbyml.com/</a><p>[1] <a href="https://github.com/ggerganov/llama.cpp/pull/9787#issue-2572915687">https://github.com/ggerganov/llama.cpp/pull/9787#issue-25729...</a> "global context onwards"<p>[2] <a href="https://tabby.tabbyml.com/docs/administration/context/" rel="nofollow">https://tabby.tabbyml.com/docs/administration/context/</a><p>[3] <a href="https://tabby.tabbyml.com/docs/administration/code-completion/" rel="nofollow">https://tabby.tabbyml.com/docs/administration/code-completio...</a>
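For what it's worth, my rough mental model of the "follows you" behaviour (based on skimming [1], so treat this as a guess rather than the plugin's actual implementation) is a bounded ring of recently touched code chunks that rides along with every completion request, something like:<p><pre><code>from collections import deque

class RingContext:
    # toy sketch: remember the last N chunks of code the user has visited
    # (opened buffers, edits, yanks) and ship them as extra completion context
    def __init__(self, max_chunks=16, chunk_lines=30):
        self.chunks = deque(maxlen=max_chunks)  # the oldest chunk is evicted first
        self.chunk_lines = chunk_lines

    def add(self, filename, lines, lnum):
        # grab a window of lines around where the user is working
        lo = max(0, lnum - self.chunk_lines // 2)
        chunk = "\n".join(lines[lo:lo + self.chunk_lines])
        if all(chunk != existing for _, existing in self.chunks):  # skip exact duplicates
            self.chunks.append((filename, chunk))

    def as_extra_context(self):
        return "\n\n".join(f"// {name}\n{chunk}" for name, chunk in self.chunks)
</code></pre>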
Is anyone actually getting value out of these models? I wired one up to Emacs and the local models all produce a huge volume of garbage output.<p>Occasionally I find a hosted LLM useful but I haven't found any output from the models I can run in Ollama on my gaming PC to be useful.<p>It's all plausible-looking but incorrect. I feel like I'm taking crazy pills when I read about others' experiences. Surely I am not alone?
Is this more or less the same as your VSCode version? (<a href="https://github.com/ggml-org/llama.vscode">https://github.com/ggml-org/llama.vscode</a>)
I am curious to see what will be possible with consumer grade hardware and more improvements to quantization over the next decade. Right now, even a 24GB GPU running the best local models can’t match the barely acceptable performance of hosted services that I’m not even willing to pay $20 a month for.
Terminal coding FTW!<p>And when you're really stuck you can use DeepSeek R1 for a deeper analysis in your terminal using `askds`<p><a href="https://github.com/bodo-run/askds">https://github.com/bodo-run/askds</a>
Has anyone actually got this llama stuff to be usable on even moderate hardware? I find it just crashes because it doesn't find enough RAM. I've got 2G of VRAM on an AMD graphics card and 16G of system RAM and that doesn't seem to be enough. The impression I got from reading up was that it worked for most Apple stuff because the memory is unified and other than that, you need very expensive Nvidia GPUs with lots of VRAM. Are there any affordable options?
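Not an expert, but a back-of-the-envelope estimate shows why 2GB of VRAM is tight while 16GB of system RAM is plenty for a small model - the architecture numbers below are illustrative, not an exact config:<p><pre><code>def approx_mem_gib(params_b, bits_per_weight, n_layers, n_ctx, n_kv_heads, head_dim, kv_bits=16):
    weights = params_b * 1e9 * bits_per_weight / 8  # quantized weights, in bytes
    kv = 2 * n_layers * n_ctx * n_kv_heads * head_dim * kv_bits / 8  # K and V caches
    return (weights + kv) / 2**30

# a ~3B coder model at ~4.5 bits/weight with an 8k context lands a bit under 2 GiB
print(round(approx_mem_gib(3.0, 4.5, 36, 8192, 2, 128), 2))
</code></pre>So a 3B model won't fit entirely in 2GB of VRAM, but llama.cpp runs fine from system RAM on the CPU, and you can offload only some layers to the GPU with -ngl; it's the larger models where the hardware gets expensive.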
Been using this for a couple of hours, and it's really nice. It is a great alternative to something like GitHub Copilot. Appreciate how simple and fast it is.
I've seen several posts and projects like this. Is there a summary/comparison somewhere of the various ways of running local completion/copilot?
It’s funny because I actually use vim mostly when I don’t want LLM assisted code. Sometimes it just gets in the way.<p>If I do, I load up cursor with vim bindings.
Really awesome work! Does anyone know what tool/terminal configuration he's using in the video demo to embed CPU/GPU usage in the terminal like that? Much appreciated :)
Looking for advice from someone who knows about the space - suppose I'm willing to go out and buy a card for this purpose, what's a modestly priced graphics card with which I can get somewhat usable results running local LLMs?
Do people with "Copilot+ PCs" get benefits running stuff like this from the much-vaunted AI coprocessors in e.g. Snapdragon X Elite chips?