Use llama.cpp for quantized model inference. It is simpler (no Docker or Python required), faster (works well on CPUs), and supports many models.

Also, there are better models than the one suggested: Mistral at 7B parameters, Yi if you want to go larger and happen to have 32 GB of memory, and Mixtral MoE, which is the best but currently requires too much memory for most users.
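For anyone wanting to try it, here is a minimal sketch using the llama-cpp-python bindings around llama.cpp; the GGUF filename, context size, and thread count are just example assumptions, so substitute whatever quantized model you actually downloaded:

    # pip install llama-cpp-python
    from llama_cpp import Llama

    # model_path is an example; point it at the quantized GGUF file you downloaded
    llm = Llama(
        model_path="./mistral-7b-instruct.Q4_K_M.gguf",
        n_ctx=2048,    # context window
        n_threads=6,   # roughly match your physical core count
    )

    out = llm("Q: Name the planets of the solar system. A:", max_tokens=64, stop=["Q:"])
    print(out["choices"][0]["text"])

If you want to avoid Python entirely, the same GGUF file also works with the CLI that ships with llama.cpp.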
I’m a tad confused.

> TinyChatEngine provides an off-line open-source large language model (LLM) that has been reduced in size.

But then they download the models from Hugging Face. I don’t understand how these end up smaller. Or do they modify them locally?
I have used them and I can say it's pretty decent overall. I personally plan to use TinyEngine on IoT devices, which targets even smaller microcontroller-class hardware.
I tried this and installation was easy on macOS 10.14.6 (once I updated Clang correctly).

Performance on my relatively old i5-8600 CPU (6 cores at 3.10 GHz) with 32 GB of memory gives me about 150-250 ms per token on the default model, which is perfectly usable.