I've thought about building this for a while, glad it's out there!<p>Not only does this guarantee your output is JSON, it also lowers your generation cost and latency by filling in many of the repetitive schema tokens without passing them through the LLM.<p>For the very common case of "extracting multiple structured fields from a piece of unstructured text," I believe there's an even stronger optimization possible that would further decrease cost and latency, and potentially even improve accuracy.<p>Assuming the fields you want to extract are independent (and they often are), you don't <i>need</i> to generate them all in one go autoregressively. E.g. instead of running the following pseudo-prompt:<p><pre><code> "Input: 'It's sunny and cold today'
Output schema: {"sunny": boolean, "temperature": string}"
</code></pre>
You could instead run the following two:<p><pre><code> "Input: 'It's sunny and cold today'
Output schema: {"sunny": boolean}"
"Input: 'It's sunny and cold today'
Output schema: {"temperature": string}"
</code></pre>
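Concretely, the per-field split is just a map over the schema — a minimal sketch, where `field_prompts` is a hypothetical helper mirroring the pseudo-prompt format above, not any particular library's API:

```python
def field_prompts(text: str, schema: dict) -> list:
    # One prompt per field instead of one prompt for the whole schema.
    # The prompt format here is illustrative only.
    return [
        f"Input: {text!r}\nOutput schema: {{\"{name}\": {ftype}}}"
        for name, ftype in schema.items()
    ]

prompts = field_prompts("It's sunny and cold today",
                        {"sunny": "boolean", "temperature": "string"})
for p in prompts:
    print(p, end="\n\n")
```

Note that every prompt in the list shares an identical prefix (the input text), which is what makes the batching trick below possible.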
We don't do that today because, done naively, it's very inefficient -- you'd be tokenizing, passing to the GPU, and computing the KV cache of the shared part of the prompt twice. But a library with the right abstraction could run those two queries in a batch in parallel and reuse the same tokenization and KV cache for both of them. It would actually be <i>more</i> efficient than generating both fields in one go, since once you factor out the shared prefix, both the generated text and its context are shorter!<p>I mentioned above that this could also improve accuracy. Of course it doesn't do that by default (except that by excluding all the irrelevant fields it makes self-attention's job easier). But what it <i>does</i> do is give you an independent prompt for each field you're interested in. So for particularly tricky fields you're trying to extract, you have the flexibility to e.g. add several examples to make that field's generation N-shot.
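<p>A sketch of what that abstraction might look like — `batched_extraction` is a hypothetical helper, but the underlying mechanism is real: inference engines with automatic prefix caching (vLLM, for instance) will compute the shared prefix's KV cache once and reuse it across the batch:

```python
def batched_extraction(text: str, schema: dict, examples: dict = None):
    # The shared prefix is tokenized and KV-cached once; an engine with
    # automatic prefix caching reuses it for every field in the batch.
    examples = examples or {}
    prefix = f"Input: {text!r}\n"
    suffixes = {}
    for name, ftype in schema.items():
        # A tricky field can carry its own few-shot examples without
        # lengthening any other field's prompt.
        shots = "".join(f"{ex}\n" for ex in examples.get(name, []))
        suffixes[name] = shots + f'Output schema: {{"{name}": {ftype}}}'
    return prefix, suffixes

prefix, suffixes = batched_extraction(
    "It's sunny and cold today",
    {"sunny": "boolean", "temperature": "string"},
    examples={"temperature": [
        "Input: 'Scorching out' -> {\"temperature\": \"hot\"}"
    ]},
)
```

The full prompt for each field is `prefix + suffixes[name]`; only the suffixes differ, so the per-field marginal cost is just the few schema and example tokens, not the whole input.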