Currently, my product offers retrieval-augmented generation as a service. Anyone building LLM-based apps has probably noticed that usage costs are typically billed per token, which seems very "fair" in that each token takes some time, and more tokens = more time = higher underlying costs.

However, the feedback I've gotten from some CIOs is that this yields a very unpredictable billing cycle, and that they have a hard time mapping tokens to business outcomes. That's why we priced by request + storage instead, modeling the average token consumption of our users. We end up with larger margins on smaller requests/responses and smaller margins on larger ones, which is obviously less fair but more predictable.

Curious how you've felt about being charged per token. Are we making the right call by prioritizing predictability, or is it better to be less predictable but reflect the underlying costs more directly?
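
To make the margin asymmetry concrete, here's a toy sketch of flat per-request pricing against per-token underlying cost. All prices and token counts are hypothetical, not our actual numbers:

def margin(tokens_per_request, cost_per_1k_tokens=0.002, flat_price=0.05):
    """Margin (USD) on one request: flat price minus per-token underlying cost."""
    underlying_cost = (tokens_per_request / 1000) * cost_per_1k_tokens
    return flat_price - underlying_cost

for tokens in (500, 5_000, 25_000, 50_000):
    print(f"{tokens:>6} tokens -> margin ${margin(tokens):.4f}")
# Small requests leave a large margin; very large requests can even go negative.
# That's the "less fair but more predictable" trade-off described above.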