It's weird that not once do they mention or compare their results to the already-available quantization methods. I normally try to give the benefit of the doubt, but there's really no way they're unaware that widely used techniques already exist for accomplishing this same thing, so the comparison benchmarks *really* should be there.

To fill in the gap, here's llama.cpp's comparison chart [0] for the different quantizations available for Llama 1. We can't compare directly with their Llama 2 metrics, but just comparing the percent change in speed and perplexity, MK-1 looks very similar to Q5_1: a small but not insignificant hit to perplexity, and just over a 2x speedup.

If these numbers are accurate, you can download pre-quantized Llama 2 models from Hugging Face that will perform essentially the same as what MK-1 is offering, using the Q5 files here: https://huggingface.co/TheBloke/Llama-2-13B-GGML/tree/main (a sketch of one way to load them follows below).

[0] https://github.com/ggerganov/llama.cpp#quantization
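For anyone who wants to actually try this, here's a rough sketch of pulling one of those Q5_1 files and running it with llama-cpp-python (the Python bindings for llama.cpp). The repo id is from the link above, but the exact filename and generation parameters are my assumptions; also note these are GGML-era files, so newer versions of the bindings (which expect GGUF) may not load them. Check the repo's file listing and model card before running.

```python
# Sketch: download a pre-quantized Q5_1 GGML file from Hugging Face and
# run it locally. The filename below is a guess based on TheBloke's usual
# naming scheme -- verify it against the repo before running.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

model_path = hf_hub_download(
    repo_id="TheBloke/Llama-2-13B-GGML",
    filename="llama-2-13b.ggmlv3.q5_1.bin",  # hypothetical filename; check the repo
)

# Load the quantized model; n_ctx is the context window size.
llm = Llama(model_path=model_path, n_ctx=2048)

out = llm("Q: What does 5-bit quantization trade off? A:", max_tokens=64)
print(out["choices"][0]["text"])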