One thing worth mentioning about llama.cpp wrappers like ollama, LM Studio, and Faraday is that they don't yet support[1] sliding window attention, and instead fall back to the vanilla causal attention used by Llama 2. As noted in the Mistral 7B paper[2], SWA gives a longer effective attention span than regular causal attention, because information propagates across stacked layers beyond the window size.

Disclaimer: I have a competing universal macOS/iOS app[3] that does support SWA with Mistral models (using mlc-llm).

[1]: https://github.com/ggerganov/llama.cpp/issues/3377

[2]: https://arxiv.org/abs/2310.06825

[3]: https://apps.apple.com/us/app/private-llm/id6448106860
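For anyone curious what the difference looks like in practice, here's a rough NumPy sketch (my own illustration with a toy window size, not code from any of these projects; Mistral's actual window is 4096) comparing a plain causal mask to a sliding window mask:

    import numpy as np

    def causal_mask(seq_len: int) -> np.ndarray:
        # Standard causal mask: token i may attend to every token j <= i.
        return np.tril(np.ones((seq_len, seq_len), dtype=bool))

    def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
        # Sliding window mask: token i may attend only to tokens j with
        # i - window < j <= i, so per-layer attention (and the KV cache)
        # stays bounded by `window`.
        i = np.arange(seq_len)[:, None]
        j = np.arange(seq_len)[None, :]
        return (j <= i) & (j > i - window)

    if __name__ == "__main__":
        n, w = 8, 3
        print("causal:\n", causal_mask(n).astype(int))
        print("sliding window (w=3):\n", sliding_window_mask(n, w).astype(int))
        # With L layers stacked, information can still flow across roughly
        # w * L positions, since each layer's window applies to the previous
        # layer's outputs rather than the raw inputs.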