I'm enthusiastic about BitNet and the potential of low-bit LLMs - the papers report perplexity on par with full-precision models while drastically reducing compute and memory requirements. What's puzzling is that no major provider has announced plans to leverage this for their flagship models, despite efficiency gains that could theoretically enable much larger architectures. I suspect there are hidden engineering challenges around specialized hardware requirements or training stability that the academic results don't fully capture, but I'd love insights from anyone closer to production deployment of these techniques.
Sorry for a stupid question, but to clarify: even though it is a 1-bit model, is it supposed to work with any type of embeddings, even ones taken from larger LLMs (in their example, they use HF1BitLLM/Llama3-8B-1.58-100B-tokens)? I.e., it doesn't have an embedding layer built in and relies on embeddings provided separately?
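Concretely, what I'm unsure about is whether something like the sketch below is the intended usage. This is just the standard transformers loading path with the checkpoint id from their example; the inputs_embeds part is my own guess at the alternative, not something I found in the repo:

    # Minimal sketch of what I'm asking -- checkpoint id from their example,
    # everything else is my assumption, not taken from the repo docs.
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        "HF1BitLLM/Llama3-8B-1.58-100B-tokens"
    )

    # Does the model ship its own embedding layer...
    print(model.get_input_embeddings())

    # ...or am I expected to pass embeddings computed elsewhere, e.g.:
    # outputs = model(inputs_embeds=embeddings_from_some_larger_llm)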
Can anyone help me understand how this works without hardware specialized for BitNet's precision? Is special hardware simply unnecessary? Does it fall short of BitNet's full potential without it, or does it get there with some fancy tricks? Thanks!
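To make the question concrete, my mental model is that with ternary weights the matmuls collapse into additions and subtractions, which any CPU can do. A toy numpy sketch of that idea is below - nothing from the actual bitnet.cpp kernels, which I assume pack the weights into a few bits each and use SIMD instead:

    # Toy illustration only: ternary weights mean no multiplications are needed.
    import numpy as np

    def ternary_matvec(W, x):
        """y = W @ x where W holds only -1, 0, +1: just adds and subtracts."""
        y = np.zeros(W.shape[0], dtype=x.dtype)
        for i in range(W.shape[0]):
            y[i] = x[W[i] == 1].sum() - x[W[i] == -1].sum()
        return y

    rng = np.random.default_rng(0)
    W = rng.integers(-1, 2, size=(4, 8))          # ternary weight matrix
    x = rng.standard_normal(8).astype(np.float32)  # activations
    print(np.allclose(ternary_matvec(W, x), W @ x))  # True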
I'm glad Microsoft uses Bash in the example instead of their own Windows shells. As a user, I would like to have something like "Git Bash" built into Windows as the default shell.
Neat. Would anyone know where the SDPA kernel equivalent is? I poked around the repo, but only saw some form of quantization code with vectorized intrinsics.
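For reference, by SDPA I mean plain scaled dot-product attention, as in this rough PyTorch sketch (my own reference code, not from the repo); I couldn't tell whether there's a dedicated kernel for this or whether attention is handled some other way:

    # Reference-only sketch of scaled dot-product attention.
    import math
    import torch

    def sdpa(q, k, v):
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        return torch.softmax(scores, dim=-1) @ v

    q = k = v = torch.randn(1, 4, 16)
    assert torch.allclose(
        sdpa(q, k, v),
        torch.nn.functional.scaled_dot_product_attention(q, k, v),
        atol=1e-5,
    )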