Only 2000 lines! There are a few reasons this is really, really nice.

1. CPU inference can be surprisingly practical for long-tail use cases where there isn't much load. As you can see from the animation, you get a decent tok/sec rate, and that's not really optimized beyond using the Vector API. Think about tasks where you don't need to generate many tokens to be useful: comprehension or a yes/no type answer is sufficient. For example, using it as a fallback to faster but less general NLP libraries for things like date parsing (see the first sketch after this list).

2. Most LLM inference systems are set up as servers and require custom APIs or API clients. Being able to just copy/paste a single file into a Java/Kotlin/Scala/Clojure project and go means no additional deployment complexity beyond ensuring the model weights can be found on disk.

3. It's much easier to read, and better commented, than quite a few LLM implementations, which are unfortunately often "hacker output" not really written for comprehensibility.

4. Because it's such a small and comprehensible code base, it's much easier to experiment with use-case-specific tweaks to the inference algorithm, like forced decoding strategies (see the second sketch below). Doing that on the regular LLM stacks often means rummaging around in barely typed Python codebases littered with the remnants of many prior experiments.
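
To make point 1 concrete, here's a minimal sketch of the fallback pattern, assuming a hypothetical LocalLlm wrapper around whatever single-file inference class you dropped into the project (the interface and its complete method are stand-ins, not the project's actual API):

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.Optional;

// "Fast path first, LLM as fallback" for date parsing.
public class DateExtractor {

    private final LocalLlm llm; // hypothetical wrapper around the CPU inference code

    public DateExtractor(LocalLlm llm) {
        this.llm = llm;
    }

    public Optional<LocalDate> extract(String text) {
        // 1. Cheap, deterministic attempt: strict ISO parsing.
        try {
            return Optional.of(LocalDate.parse(text.trim()));
        } catch (DateTimeParseException ignored) {
            // fall through to the slower, more general path
        }
        // 2. Fallback: ask the local model to normalize the date.
        //    Only a handful of tokens are needed, so CPU latency is tolerable.
        String answer = llm.complete(
                "Rewrite the following date as YYYY-MM-DD, or NONE if there is no date:\n" + text);
        try {
            return Optional.of(LocalDate.parse(answer.trim()));
        } catch (DateTimeParseException e) {
            return Optional.empty();
        }
    }

    /** Hypothetical interface over the single-file inference code. */
    public interface LocalLlm {
        String complete(String prompt);
    }
}
```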
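
And for point 4, a minimal sketch of one forced decoding tweak: greedy decoding restricted to an allowed token set, e.g. just the "yes" and "no" token ids. The Model interface and its logits method are hypothetical stand-ins for wherever the real code exposes the next-token logits at the end of a forward pass, which is all this needs:

```java
import java.util.Set;

// Greedy decoding constrained to a small allowed vocabulary subset.
public final class ConstrainedDecoder {

    /** Hypothetical single-step interface onto the inference loop. */
    public interface Model {
        float[] logits(int[] promptTokens); // logits for the token following the prompt
    }

    /** Pick the highest-logit token among an allowed subset of the vocabulary. */
    static int argmaxOver(float[] logits, Set<Integer> allowedTokens) {
        int best = -1;
        float bestScore = Float.NEGATIVE_INFINITY;
        for (int token : allowedTokens) {
            if (logits[token] > bestScore) {
                bestScore = logits[token];
                best = token;
            }
        }
        return best;
    }

    /** Answer a yes/no question by forcing the next token to be one of two ids. */
    static int yesOrNo(Model model, int[] promptTokens, int yesTokenId, int noTokenId) {
        float[] logits = model.logits(promptTokens);
        return argmaxOver(logits, Set.of(yesTokenId, noTokenId));
    }
}
```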