Hey Hacker News! We're excited to share our open-source project, KTransformers, a flexible framework for cutting-edge LLM inference optimizations. Leveraging state-of-the-art kernels from llamafile and marlin, KTransformers transparently speeds up HuggingFace Transformers, making it possible to run a 236B-parameter MoE model or a 1M-token context locally at promising speeds.

KTransformers is a Python-centric framework designed with extensibility at its core. By implementing and injecting an optimized module with a single line of code, you get a Transformers-compatible interface, RESTful APIs compatible with OpenAI and Ollama, and even a simplified ChatGPT-like web UI. This lets you plug it into familiar frontends, such as the Tabby-backed VS Code plugin (a short sketch of the injection workflow is at the end of this post).

To demonstrate its capability, we present two showcase demos:

- GPT-4-level local VS Code Copilot: runs the Q4_K_M variant of the huge 236B DeepSeek-Coder-V2 on a local machine with just 11 GB of VRAM and 136 GB of DRAM, matching GPT-4-0613's score on BigCodeBench with a promising 126 tokens/s for prompt prefill and 13.6 tokens/s for generation.

- 1M-context local inference: achieves 15 tokens/s with nearly 100% accuracy on the "Needle in a Haystack" test using the InternLM2.5-7B-Chat-1M model, with 24 GB of VRAM and 150 GB of DRAM, several times faster than llama.cpp.

Check it out on GitHub: https://github.com/kvcache-ai/ktransformers
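
To make the one-line injection claim concrete, here's a simplified sketch of the intended workflow. Treat the entry-point names (optimize_and_load_gguf, prefill_and_generate) and the file paths below as approximations; the README has the canonical, up-to-date example:

    # Simplified sketch; exact module paths, function names, and arguments
    # may differ slightly from the current API (see the README).
    import torch
    from transformers import AutoConfig, AutoTokenizer, AutoModelForCausalLM
    from ktransformers.optimize.optimize import optimize_and_load_gguf
    from ktransformers.util.utils import prefill_and_generate

    model_path = "deepseek-ai/DeepSeek-Coder-V2-Instruct"
    gguf_path = "./DeepSeek-Coder-V2-GGUF"                # local Q4_K_M weights
    rule_path = "./optimize_rules/DeepSeek-V2-Chat.yaml"  # YAML injection rules

    config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

    # Build the model skeleton on the meta device (no weights materialized yet),
    # then inject the optimized modules (llamafile/marlin kernels) and load the
    # quantized GGUF weights in a single call.
    with torch.device("meta"):
        model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)
    optimize_and_load_gguf(model, rule_path, gguf_path, config)

    # After injection, the model still exposes a Transformers-style interface.
    input_ids = tokenizer("write a quicksort in python", return_tensors="pt").input_ids
    prefill_and_generate(model, tokenizer, input_ids.cuda(), max_new_tokens=256)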