Or, equivalently, how would you approach running a 180B+ LLM on a von Neumann computer with, say, 1 MB of main memory and virtually limitless secondary storage?
Or do you know of an approach (that you might have read about somewhere) that could help run a heavy LLM on virtually any Turing-equivalent device?

Picture this: you're stuck with your “potato computer” (small RAM, no external GPU, very large SSD), and your LLM is saved on an external SSD.

Your task: run that LLM on your “potato PC” and try to achieve reasonable response times (e.g., 1 h to 24 h).
Response times of a year or more would be impractical for most use cases.

On a side note: how would you estimate the response time of a language model on low-end devices (e.g., a Raspberry Pi, a business laptop, an MSP430)?
Would you just take basic linear algebra operations as a given and estimate the number of steps from there? (A rough sketch of that kind of estimate follows after the list below.)

I expect the usual suspects to be brought up in this discussion:

— Memory-mapped I/O, i.e., treating an I/O device such as an SSD as if it were actual RAM (mmap); see the sketch after this list.
BTW: `mmap` makes our secondary storage somewhat akin to the infinite tape of a Turing machine.

— “LLM in a flash: Efficient Large Language Model Inference with Limited Memory”, https://arxiv.org/html/2312.11514v2 (04 Jan 2024)

— SSD “wear and tear”
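On the estimation question: here is a minimal back-of-envelope sketch, under stated assumptions rather than measurements. It assumes a decoder-only transformer costs roughly 2 FLOPs per parameter per generated token, and that with only ~1 MB of RAM every token forces all weights to be re-streamed from the SSD, so the answer is whichever is slower: compute or SSD bandwidth. The default numbers (FLOP/s, SSD bandwidth, token count) are placeholders to replace with figures measured on the actual device.

```python
# Back-of-envelope response-time estimate; a hypothetical helper, not a benchmark.
# Assumptions: ~2 FLOPs per parameter per generated token for the matrix-vector
# products, and (with ~1 MB RAM) a full re-read of all weights from the SSD
# for every token. Whichever of the two is slower dominates the response time.

def estimate_response_time(
    n_params: float = 180e9,       # 180B-parameter model
    bytes_per_param: float = 0.5,  # 4-bit quantized weights (assumed)
    device_flops: float = 5e9,     # assumed sustained FLOP/s of the device
    ssd_bw: float = 500e6,         # assumed sustained SSD read bandwidth, bytes/s
    n_tokens: int = 200,           # length of the generated answer
) -> float:
    """Return a rough wall-clock estimate, in seconds, for one response."""
    compute_s_per_token = 2.0 * n_params / device_flops       # matmul cost
    io_s_per_token = n_params * bytes_per_param / ssd_bw      # weight-streaming cost
    return n_tokens * max(compute_s_per_token, io_s_per_token)

if __name__ == "__main__":
    hours = estimate_response_time() / 3600
    print(f"~{hours:.1f} h per response")  # ~10 h with the placeholder defaults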
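To make the mmap bullet concrete, here is a minimal sketch using numpy's `memmap` wrapper around `mmap(2)`: a weight matrix is streamed from the SSD tile by tile, so only a small slice is resident in RAM at any moment while the OS pages data in on demand (and evicts it again under memory pressure). The file name, raw float16 layout, and shapes are made up for illustration; this is not any real checkpoint format.

```python
import numpy as np

def matvec_from_ssd(path: str, rows: int, cols: int, x: np.ndarray,
                    tile_rows: int = 256) -> np.ndarray:
    """Compute W @ x while keeping only a small tile of W resident in RAM."""
    # np.memmap wraps mmap(2): the file is mapped into the address space and
    # weight pages are faulted in from the SSD only when touched.
    w = np.memmap(path, dtype=np.float16, mode="r", shape=(rows, cols))
    y = np.empty(rows, dtype=np.float32)
    for r0 in range(0, rows, tile_rows):
        # Copying the slice to float32 touches ~tile_rows * cols * 2 bytes of the file.
        tile = np.asarray(w[r0:r0 + tile_rows], dtype=np.float32)
        y[r0:r0 + tile_rows] = tile @ x
    return y

# Hypothetical usage with a 4096x4096 float16 matrix stored raw in "weights.bin":
# y = matvec_from_ssd("weights.bin", 4096, 4096, np.ones(4096, dtype=np.float32))
```

The same tiling idea is what makes the “LLM in a flash” approach and llama.cpp-style `mmap` loading workable: resident memory is bounded by the tile size, and the SSD effectively plays the role of that “infinite tape.”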