4 points by om8 5 months ago

1 comment

om8 5 months ago

This is a demo of what can run on edge devices using SOTA quantization. Other similar projects that try to run 8B models in the browser either use WebGPU or use 2-bit quantization that breaks the model. I implemented inference for the AQLM quantized representation, which gives a model with 2-bit quantization that does not blow up.

Show HN: Llama 3.1 8B CPU Inference in a Browser via WebAssembly
