
Show HN: KTransformers – 236B Model and 1M Context LLM Inference on Local Machines

20 points by sssummer, 9 months ago
Hey Hacker News! We are excited to share our open-source project, KTransformers, a flexible framework for cutting-edge LLM inference optimizations. Leveraging state-of-the-art kernels from llamafile and marlin, KTransformers transparently enhances the performance of HuggingFace Transformers, making it possible to run large 236B MoE models or extremely long 1M-token contexts locally at promising speeds.

KTransformers is a Python-centric framework designed with extensibility at its core. By implementing and injecting an optimized module with a single line of code, users gain access to a Transformers-compatible interface, RESTful APIs compliant with OpenAI and Ollama, and even a simplified ChatGPT-like web UI. This lets it integrate with all your familiar frontends, such as the VS Code plugin backed by Tabby.

To demonstrate its capability, we present two showcase demos:

- GPT-4-level local VS Code Copilot: runs the huge 236B DeepSeek-Coder-V2 (Q4_K_M variant) using just 11GB VRAM and 136GB DRAM on a local machine, matching the score of GPT-4-0613 on BigCodeBench, with a promising 126 tokens/s for prompt prefill and 13.6 tokens/s for generation.

- 1M-context local inference: achieves 15 tokens/s with nearly 100% accuracy on the "Needle in a Haystack" test via the InternLM2.5-7B-Chat-1M model, using 24GB VRAM and 150GB DRAM, several times faster than llama.cpp.

Check it out on GitHub: https://github.com/kvcache-ai/ktransformers
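Since the post mentions an OpenAI-compatible RESTful API but does not show client code, here is a minimal sketch of how such a local endpoint is typically called from Python with the stock openai client. The base URL, port, API key, and model name below are placeholder assumptions for illustration, not values confirmed by the post or the KTransformers docs.

```python
# Minimal sketch: calling a local OpenAI-compatible endpoint with the official
# openai Python client. base_url, api_key, and model are placeholders; replace
# them with whatever your local server actually exposes.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:10002/v1",  # placeholder: wherever the local server listens
    api_key="not-needed-locally",          # local servers typically ignore the key
)

response = client.chat.completions.create(
    model="DeepSeek-Coder-V2",  # placeholder model name
    messages=[
        {"role": "user", "content": "Write a function that merges two sorted lists."},
    ],
)
print(response.choices[0].message.content)
```

Pointing the standard openai client at a local base_url is the usual pattern for any OpenAI-compatible server, so a snippet like this should also work against other local backends that expose the same API shape.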

3 comments

james0zan, 9 months ago
Our next step is supporting VL models. Please let us know if you have any requests.
iAzure, 9 months ago
Good news, so we can run DeepSeek-V2 on a 12GB GPU.
ervinxie, 9 months ago
The speedup is amazing!