Outlines is a Python library that focuses on text generation with large language models. Brandon and I are not LLM experts and started the project a few months ago because we wanted to understand better how the generation process works. Our original background is in probabilistic, relational and symbolic programming.<p>Recently we came up with a fast way to generate text that matches a regex (<a href="https://blog.normalcomputing.ai/posts/2023-07-27-regex-guided-generation/regex-guided-generation.html" rel="nofollow noreferrer">https://blog.normalcomputing.ai/posts/2023-07-27-regex-guide...</a>). The basic idea is simple: regular expressions have an equivalent deterministic finite automaton (DFA) representation. We can transform this DFA into a generative model: in each state we get a list of symbols which correspond to completions that partially match the regular expression. We mask the other symbols in the logits returned by a large language model, sample a new symbol and move to the next state. The subtlety is that language models work with tokens, not symbols, so we derive a new FSM whose alphabet is the model's vocabulary. We can do this in only one pass over the vocabulary.<p>Generating the token masks thus only requires a dictionary lookup at each state. Our method blows other libraries like Microsoft's guidance out of the water.<p>From there it was only a small leap to be able to generate text that follows a JSON schema (<a href="https://json-schema.org/" rel="nofollow noreferrer">https://json-schema.org/</a>), or is parseable into a Pydantic model (<a href="https://docs.pydantic.dev/latest/usage/models/" rel="nofollow noreferrer">https://docs.pydantic.dev/latest/usage/models/</a>). The method works with union types, optional types, nested schemas, arrays, everything. It is guaranteed that the output is parseable.<p>I think it's cool, and I've spent a lot of time watching even tiny models output valid JSON over the weekend. Hope you will too.<p>I look forward to feedback, bug reports, feature requests and discussions!<p>Edit: Link to our pre-print explaining the method and how this can be extended to generate text that follows a context-free grammar: <a href="https://arxiv.org/abs/2307.09702" rel="nofollow noreferrer">https://arxiv.org/abs/2307.09702</a>
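To make the masking step concrete, here is a minimal, self-contained sketch of the idea (not Outlines' actual implementation): a toy FSM whose states map to the token ids that keep the output inside the target language, used to mask a logits vector before sampling. The vocabulary, state names and logits below are invented for illustration.<p><pre><code>import math
import random

# Toy vocabulary and a toy "FSM": each state lists the token ids that keep
# the partially generated text inside the target language, plus the state
# reached after emitting each of them.
vocab = {0: '{', 1: '"name"', 2: ':', 3: '"Alice"', 4: '}', 5: 'banana'}
fsm = {
    "start": {"allowed": {0}, "next": {0: "key"}},
    "key":   {"allowed": {1}, "next": {1: "colon"}},
    "colon": {"allowed": {2}, "next": {2: "value"}},
    "value": {"allowed": {3}, "next": {3: "close"}},
    "close": {"allowed": {4}, "next": {4: "done"}},
}

def constrained_sample(logits, allowed):
    """Mask disallowed tokens, renormalize, and sample one token id."""
    masked = {i: l for i, l in enumerate(logits) if i in allowed}
    z = sum(math.exp(l) for l in masked.values())
    r, acc = random.random(), 0.0
    for i, l in masked.items():
        acc += math.exp(l) / z
        if r <= acc:
            return i
    return next(iter(masked))

state, out = "start", []
while state != "done":
    logits = [random.gauss(0, 1) for _ in vocab]  # stand-in for the LLM forward pass
    token_id = constrained_sample(logits, fsm[state]["allowed"])
    out.append(vocab[token_id])
    state = fsm[state]["next"][token_id]

print("".join(out))  # {"name":"Alice"}</code></pre>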
Mechanistically, I think this library takes the simple idea of masking part of the vocabulary space at each step in time and implements it efficiently. Great!<p>I am curious, however, for those who have played around with such libraries wrapping base LLMs with output structure: do base models like Llama2 work very well?
My experience says "hell no!" and you do need a fair bit of instruction-tuning for specific use cases to actually get things to work.<p>And even then, it seems very counter-intuitive to me that, given an instruction-tuned model, post-hoc masking of the state space during generation just amounts to changing the generation distribution, which is potentially detrimental to the instruction-tuning?
I can make GPT4 return valid JSON simply by providing examples in the system message. This works nine times out of ten.<p>But it's still probabilistic, and nine times out of ten isn't good enough.<p>Occasionally it will hallucinate responses like this:<p>{"key1": "value1", "key2": "value2" for i in range(n)}<p>Re-prompting with the parsing error message is usually enough to get it on the second try.<p>But escaping double-quotes and newline characters is less reliable. Even after giving it multiple examples, it correctly escapes only about half the time.<p>Re-prompting for escaping errors still yields a ~50% success rate.
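For what it's worth, the retry loop described above is easy to sketch; call_llm here is a hypothetical stand-in for whatever chat-completion call you use:<p><pre><code>import json

def call_llm(messages):
    """Hypothetical wrapper around your chat-completion API of choice."""
    raise NotImplementedError

def get_json(prompt, max_attempts=3):
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_attempts):
        reply = call_llm(messages)
        try:
            return json.loads(reply)
        except json.JSONDecodeError as err:
            # Feed the parser error back and ask for a corrected response.
            messages += [
                {"role": "assistant", "content": reply},
                {"role": "user", "content": f"That was not valid JSON ({err}). Please try again."},
            ]
    raise ValueError("no valid JSON after retries")</code></pre>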
A major part of the power of an LLM is the calibrated probability distribution in its responses, and this technique probably throws that ability away. Why is it good enough?<p>As a brief example, suppose the only possible LLM outputs were "hello world", "food", "hello", and "good day" (and that they're all equally probable with no prompting). Suppose your grammar requires a space in the output somewhere and has no other constraints. If you sampled LLM outputs till something passed the grammar, you'd receive "hello world" and "good day" with equal probability. If you apply the website's technique you'll receive "hello world" twice as frequently as "good day".<p>The core problem is that an answer prefix might have been extremely unlikely to yield a valid response, but the technique (probably -- assuming it succeeds -- my example assumed retries would eventually succeed) constructs a valid response from it regardless. Assuming enough independence in the right places everything is fine and dandy still, but correlated errors compound quickly in autoregressive models.<p>As a brief JSON-specific question, is an LLM more or less likely to make factual errors (hallucinations, truncated strings, missing main characters, ...) when it produces a response failing to adhere to a schema? If factual error rate relates nontrivially to schema error rate then this path is more perilous than it seems. Given the outsized impact certain words or schmooshed together word-phrases seem to have on LLM output, I'd be surprised if details like schema adherence didn't bleed into other characteristics of the output.
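To spell out the arithmetic in this example (a toy enumeration; treating the first whitespace-free chunk as one sampling step is of course a simplification of real tokenization):<p><pre><code>from fractions import Fraction as F

# Unconstrained distribution over whole outputs (all equally likely).
p = {"hello world": F(1, 4), "food": F(1, 4), "hello": F(1, 4), "good day": F(1, 4)}
valid = {s for s in p if " " in s}

# Rejection sampling: condition the whole-sequence distribution on validity.
z = sum(p[s] for s in valid)
print({s: p[s] / z for s in valid})  # hello world: 1/2, good day: 1/2

# Left-to-right masking: the first chunk is drawn from the unconstrained
# marginal, and the constraint then forces whatever follows to be valid.
first_chunk_mass = {
    "hello": p["hello world"] + p["hello"],  # 1/2, always completed to "hello world"
    "good":  p["good day"],                  # 1/4, completed to "good day"
    "foo":   p["food"],                      # 1/4, forced into some other valid string
}
print(first_chunk_mass["hello"] / first_chunk_mass["good"])  # 2, i.e. twice as frequent</code></pre>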
Thanks for building this. The mechanics are such an obvious idea that it's astounding that the first-party platforms haven't done this yet. I would be interested to see how this could be used for other tasks outside of JSON that require structured input.
I'm not sure how this is different from:<p><a href="https://github.com/1rgs/jsonformer">https://github.com/1rgs/jsonformer</a><p>or<p><a href="https://github.com/newhouseb/clownfish">https://github.com/newhouseb/clownfish</a><p>or<p><a href="https://github.com/mkuchnik/relm">https://github.com/mkuchnik/relm</a><p>or<p><a href="https://github.com/ggerganov/llama.cpp/pull/1773">https://github.com/ggerganov/llama.cpp/pull/1773</a><p>or<p><a href="https://github.com/Shopify/torch-grammar">https://github.com/Shopify/torch-grammar</a><p>Overall there are a <i>ton</i> of these logit-based guidance systems; the reason they don't get much traction is that the SOTA models are behind REST APIs that don't enable this fine-grained approach.<p>Those models perform so much better that people generally settle for just re-requesting until they get the correct format (and with GPT-4 that ends up being a fairly rare occurrence in my experience).
So to explain this another way:<p>After each token generated by the LLM you update the logit bias “mask” to only allow the next token to be a valid json token?<p>Very slick!
Is this Brandon Willard the breakdancer from Detroit Brandon Willard?<p>Edit: It is! <a href="https://brandonwillard.github.io/" rel="nofollow noreferrer">https://brandonwillard.github.io/</a>
Hi, remilouf. You say that your background is in "probabilistic, relational and symbolic programming". In that case I suspect you understand that it is no problem to generate text from a regular or context-free grammar, or really any level of grammar. For example, you can do that very easily in Prolog (a relational language) given a grammar in Definite Clause Grammars notation.<p>As far as I can tell your approach requires a grammar to be given by a user. In that case, what is the advantage of using an LLM to generate text? Why can't you just run your grammar as a generator and generate the text you want? That would save you the considerable trouble and cost of training an LLM in the first place. And why would you need an LLM, a model of natural language, if all you want is to generate structured text, anyway?
This is exciting: we built a similar tool[1] recently, specifically targeted at constraining llama output to match a TypeScript interface.<p>I firmly believe that output format guarantees are going to be important for real (non-toy) use cases for LLMs.<p>[1] <a href="https://github.com/ggerganov/llama.cpp/discussions/2494">https://github.com/ggerganov/llama.cpp/discussions/2494</a>
Are there temperature or sampling parameters for generate.regex? I'm poking around trying to generate password mnemonics (<a href="https://rmmh.github.io/abbrase/" rel="nofollow noreferrer">https://rmmh.github.io/abbrase/</a>), and it really doesn't like actually giving me proper words:<p><pre><code> >> model = models.transformers("gpt2-medium")
>> generate.regex(model, r"Rea[a-z']{,10} lik[a-z']{,10} acr[a-z']{,10} ene[a-z']{,10} sta[a-z']{,10}\.", max_tokens=30)("A memorable phrase is:")
'Rearmingandme like acrowetteanda eneatubootank stackfishkies.'</code></pre>
One potential drawback I can see is if the viable tokens are far down the list of predictions. In that case, filtering down to just those tokens is a distribution shift, with the resulting output being less stable / less sensible.
Looks interesting! How would you say it compares to Microsoft's TypeChat (beyond the obvious Python/TypeScript difference)?<p><a href="https://microsoft.github.io/TypeChat/blog/introducing-typechat/" rel="nofollow noreferrer">https://microsoft.github.io/TypeChat/blog/introducing-typech...</a>
OpenAI has this capability built in with functions[0], I believe! Building my own project[1] I have implemented functions in combination with guidance[2] and haven’t had a hiccup yet! I have a JSON parser function there, just in case, but it seems to be working reliably.<p>Here’s a bit more of a description of using the functions API for JSON returns: <a href="https://yonom.substack.com/p/native-json-output-from-gpt-4" rel="nofollow noreferrer">https://yonom.substack.com/p/native-json-output-from-gpt-4</a><p>[0] <a href="https://openai.com/blog/function-calling-and-other-api-updates" rel="nofollow noreferrer">https://openai.com/blog/function-calling-and-other-api-updat...</a><p>[1] <a href="https://resgen.app" rel="nofollow noreferrer">https://resgen.app</a><p>[2] <a href="https://github.com/guidance-ai/guidance">https://github.com/guidance-ai/guidance</a>
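For reference, a minimal sketch of that flow with the 0.27-era openai-python client; the model name and schema below are placeholders:<p><pre><code>import json
import openai

resp = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",
    messages=[{"role": "user", "content": "Extract: Alice is 31 years old."}],
    functions=[{
        "name": "record_person",
        "description": "Record a person's name and age.",
        "parameters": {  # a JSON Schema
            "type": "object",
            "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
            "required": ["name", "age"],
        },
    }],
    function_call={"name": "record_person"},  # force this particular function
)

# The arguments come back as a JSON string (usually, but not provably, valid),
# hence the fallback parser mentioned above.
args = json.loads(resp["choices"][0]["message"]["function_call"]["arguments"])
print(args)</code></pre>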
OK, you get syntactically valid JSON, but does it contain the correct info? This is effectively a polisher, like spell check, which gives the output superficially correct form but doesn't understand the content. Right?
For complex tasks like coding, my experience is that asking for a complex output format hurts performance on the underlying task. This showed up clearly in code editing benchmarks of GPT-3.5 and GPT-4:<p><a href="https://aider.chat/docs/benchmarks.html" rel="nofollow noreferrer">https://aider.chat/docs/benchmarks.html</a><p>I’m curious if you have measured whether the “constrained generation” that you’re doing suffers from similar downsides?
I really hope OpenAI add something like this to their endpoints soon.<p>Being able to pass up some kind of grammar (a regular expression, or a JSON schema, or some other format) and have this trick run during their token sampling process to ensure the output was compliant would be incredibly useful.
As a more general comment, the repo README provides examples that all use gpt2. It would be nice to see at least one example that invokes llama2, since that would reassure the reader that this library works with more modern and interesting models.
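For instance, something along these lines should work, mirroring the gpt2 snippet shown elsewhere in the thread (the Llama-2 checkpoint name and import paths here are assumptions; you also need access to the weights on the Hub and enough VRAM):<p><pre><code>import outlines.models as models
import outlines.text.generate as generate

# Assumes you have been granted access to the meta-llama weights on the Hub.
model = models.transformers("meta-llama/Llama-2-7b-hf")

answer = generate.regex(
    model,
    r"(Positive|Negative)",
    max_tokens=5,
)("Review: This movie was fantastic.\nSentiment:")
print(answer)</code></pre>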
A few thoughts: you're effectively creating representations that convert to JSON (kudos!)<p>I can't go into how we did it (there are a lot of public patents, if interested), but back in 2018 we had a way to generate synthetic data (statistically and structurally similar) off any dataset - <a href="https://medium.com/capital-one-tech/why-you-dont-necessarily-need-data-for-data-science-48d7bf503074" rel="nofollow noreferrer">https://medium.com/capital-one-tech/why-you-dont-necessarily...</a>. You could also design datasets if you wanted.<p>It kept similar relations and worked pretty darn well. Not the exact same data, but it always produced valid JSON.
Enforcing JSON schemas, regexes and grammars is very useful. But how can we enforce decoding spans from a document? The decoded text should be copied from a list of spans in the input document. That would be useful for extractive tasks.
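One way to phrase that within the regex machinery described in the post (a sketch, not a feature of the library): build an alternation of the escaped candidate spans and use it as the constraint.<p><pre><code>import re

# Candidate spans extracted from the input document (toy example).
spans = ["Barack Obama", "44th president", "United States"]

# Constrain the output to be exactly one of the allowed spans; this pattern
# could then be handed to a regex-constrained generator like the one above.
span_regex = "(" + "|".join(re.escape(s) for s in spans) + ")"
print(span_regex)</code></pre>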
Generating an FSM over the vocabulary is a really interesting approach to guided sampling! I'm hacking on a structured inference library (<a href="https://github.com/gsuuon/ad-llama">https://github.com/gsuuon/ad-llama</a>) - I also tried to add a vocab preprocessing step to generate a valid-tokens mask (just with regex or static strings initially) but discovered that doing so would cause unlikely / unnatural tokens to be masked rather than the token which represents the natural encoding given the existing sampled tokens.<p>Given the stateful nature of tokenizers, I decided that trying to preprocess the individual token ids was a losing battle. Even in the simple case of whitespace, tokenizer merges can really screw up generating a static mask: e.g., we expect a space next, but a token decodes to 'foo' when it is actually a '_foo' that would have decoded with a whitespace had it followed a valid pair. When I go to construct the static vocab mask, it would then end up matching against 'foo' instead of ' foo'.<p>How did you work around this for the FSM approach? Does it somehow include information about merges / whitespace / tokenizer statefulness?
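For what it's worth, here is one way to see the issue and the usual workaround (a sketch with the GPT-2 tokenizer; Outlines' internals may differ):<p><pre><code>from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

ids = tok.encode(" hello")
# The raw token strings carry a byte-level space marker ("Ġ")...
print(tok.convert_ids_to_tokens(ids))  # e.g. ['Ġhello']
# ...while decoding the same ids yields the surface form with the real space.
print(repr(tok.decode(ids)))           # ' hello'

# So a static mask is best built from the decoded surface form of every
# individual token id, in one pass over the vocabulary. (Sentencepiece
# tokenizers need extra care: decoding a lone "▁foo"-style token can
# silently drop its leading space.)
surface_forms = {i: tok.decode([i]) for i in range(len(tok))}
print(sum(1 for t in surface_forms.values() if t.startswith(" ")), "tokens start with a space")</code></pre>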
I have a noob thought on the potential of these in formal path planning.
Specifically given a set of functions that basically map {State -> Actions} given preconditions, transition functions (heavily paraphrasing STRIPS[1]) can a correct and optionally "realistic" plan be generated[2]? I am quite interested in this. It seems clear that the issue is that there is no "guidance" like DFA on what is the correct next symbol for a Plan, but perhaps the AI can generate some kind of a probability or order on what is the best step and one can go from there...<p>Are you guys thinking about this direction?<p>[1] <a href="https://en.wikipedia.org/wiki/Stanford_Research_Institute_Problem_Solver#Complexity" rel="nofollow noreferrer">https://en.wikipedia.org/wiki/Stanford_Research_Institute_Pr...</a><p>[2] Formal Planning decision problem(plan exists) given STRIPS spec is at least NP-Complete[1]. There are several mathematical, logical and statistical "tricks"(e.g. [3]) that are used to bring down the complexity and try find a plan using heuristics(thinking MDPs, POMDPs here). This is not new, everyone in LLM research knows this.<p>[3] "Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning": <a href="https://www.sciencedirect.com/science/article/pii/S0004370299000521?ref=pdf_download" rel="nofollow noreferrer">https://www.sciencedirect.com/science/article/pii/S000437029...</a>
FYI llama.cpp has been able to do that for a "while" now: <a href="https://github.com/ggerganov/llama.cpp/pull/1773">https://github.com/ggerganov/llama.cpp/pull/1773</a><p>Somebody is also working on a whisper.cpp version, which is maybe even more interesting, because with a grammar you can speak not only JSON but also code (or anything).
This is amazing! For production and rapid-development use cases, though, we just use XML for information extraction. It's extremely easy to parse with regex, and the models rarely make mistakes since the start and end tokens are uncommon. At least, this is just with the OpenAI models, which are different from the use cases in this Show HN.
Having played around with this sort of thing in the llama.cpp ecosystem when they added it a few weeks ago, I will say that it also helps if your model a) is tuned to output JSON and b) is prompted to do so. Anything you can do to help the output fit the grammar helps.
How does this compare in terms of latency, cost, and effectiveness to jsonformer? <a href="https://github.com/1rgs/jsonformer">https://github.com/1rgs/jsonformer</a>
That looks intriguing. Managing that interface has proven challenging - especially on data cleaning tasks where the model ends up talking rather than doing. A bit more guardrails would be helpful there.
Would love to have a tutorial on how to install and run this locally with a nice model, for those of us who are behind the 8-ball with torch, transformers, diffusers, llama2 etc.
I feel like I'm missing something very basic here, but is this library intended to be used with an existing model? If so, could you point to an example?
Are there edge cases here due to context length?<p>1. I have a JSON schema with required fields. I complete the JSON, but do not include the required fields.<p>2. I run out of tokens from the model before I finish the JSON object because I'm in the middle of some deep, nested structure.<p>These seem solvable, just edge cases to control for by either reserving tokens, randomly generating required tokens until completing the JSON, or something more sophisticated.
I've spent two days trying to make this work with anything other than gpt2 and I just can't get it to work.<p>GPT2 doesn't seem to take instruction well. I've tried llama, gpt-medium, etc.<p>They all either pick up a different language or freeze.<p>EDIT: I see tons of activity and work in the GitHub issues, so ignore this for now.<p>Super excited for when I'll be able to have this working for myself!
Ok so:<p>- for what energy/processing cost per validation?<p>- how much of the input space was tested (unicode chars, escaped chars, newlines, etc)?<p>- are you doing this as a service? We've seen LLMs already evolve negatively in some capabilities over time, so do you have a constant "ping" test suite validating the LLM's performance?
it still blows my mind that OpenAI exposes an API with Functions calling, and yet <i>does not guarantee the model will call your function correctly</i>, in fact, it does not even guarantee the output will be valid JSON.<p>When this is, really, a solved problem. I've been using github.com/microsoft/guidance for weeks, and it genuinely, truly guarantees correct output, because <i>it simply does not sample from tokens that would be invalid.</i><p>It just seems so obvious, I still have no clue why OpenAI does not do this. Like, why fuss around with validating JSON after the fact, when you can simply guarantee it is correct in the first place, by only sampling tokens <i>if they conform to the grammar you are trying to emit?</i>
<a href="https://github.com/newhouseb/clownfish">https://github.com/newhouseb/clownfish</a><p>Which I've been using for a while now, also restricts the sampling space to force correct generation, but does so as the result of a different process than yours.
I tried slight modifications from the example pydantic model and it's incredibly slow. Maybe I'm doing something wrong but I've a hefty box and a 3090, an example using gpt-2 doesn't seem like it should be that taxing.
It says "Outlines 〰 is compatible with all models.".
But does this actually work with gpt3.5-turbo or gpt4?
I was using guidance before and you only get value when using davinci due to the constraints of chat api based models.
This is what we did at Trex (<a href="https://github.com/automorphic-ai/trex">https://github.com/automorphic-ai/trex</a>). The tricky part is doing it quickly and efficiently.
It does seem inapt to claim this “eliminates” hallucinations in your blog post. Sort of like unnamed FP languages claiming to eliminate bugs.<p>Both eliminate a subclass of failures, but don’t preclude failure categorically.
Notable that you can't seem to use this trick to have an LLM create JSON that has JSON embedded in it. Which... happens far more often than it probably should. :(
How is this different from generating such things without an LLM? In other words picking random valid tokens from the grammar via fuzzing or similar techniques.
You should probably look into Guidance [1](previously Microsoft Guidance but looks like it’s been separated from their main organization), which is a language for controlling the output of LLMs (so you can, among many other things, output JSON in a deterministic way)<p>[1]: <a href="https://github.com/guidance-ai/guidance">https://github.com/guidance-ai/guidance</a>
<p><pre><code> print(guided)
# What is the IP address of the Google DNS servers?
# 2.2.6.1
</code></pre>
correctly formatted wrong answers are still wrong answers.
> LLMs can generate valid JSON 100% of the time<p>If that seems surprising, it is worth doing a course like Karpathy's zero-to-hero NN series and having all the magic peeled away a layer at a time.<p>The reason you can do this is that LLMs don't just generate the next word or token: they produce a probability distribution over all tokens. A JSON parser can give you a list of next valid tokens. The tokens in each case might be from a different set, e.g. the LLM thinks of " The" whereas the JSON parser might think of "{", so you need some conversion there. But if you sample randomly from only the valid tokens, the output must be valid JSON.<p>What you can't build a parser for, though, is ... the truth! You may still be told lies or made-up stuff.
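A minimal sketch of that sampling step (toy numbers; valid_ids stands for whatever the JSON parser says is allowed next, after mapping parser symbols to tokenizer ids):<p><pre><code>import numpy as np

def sample_constrained(logits, valid_ids):
    """Zero out everything the parser forbids, renormalize, and sample."""
    mask = np.full_like(logits, -np.inf)
    mask[list(valid_ids)] = 0.0
    masked = logits + mask
    probs = np.exp(masked - masked.max())
    probs /= probs.sum()
    return int(np.random.choice(len(logits), p=probs))

# Toy example: a 6-token vocabulary where the parser only allows tokens 0 and 4.
logits = np.array([1.0, 3.0, 0.5, 2.0, 1.5, -1.0])
print(sample_constrained(logits, {0, 4}))  # always 0 or 4</code></pre>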
Regex-constrained GPT, what is a mnemonic for pi?<p>> It's a word, a short statement or phrase which you learn.<p>Can you make a good one?<p>> Man, I wish I could recommend an answer. You're not gonna remember something, because, obviously, pi's so big. Actually, let's forget pi. There's only one way: Googling for it.<p>(count the letters)
I also released a hosted version of my open-source libraries ReLLM and ParserLLM that already supports APIs for<p>* Regex completion for LLMs<p>* Context-free grammar completion for LLMs<p><a href="https://thiggle.com/" rel="nofollow noreferrer">https://thiggle.com/</a><p>[0] <a href="https://github.com/r2d4/rellm">https://github.com/r2d4/rellm</a><p>[1] <a href="https://github.com/r2d4/parserllm">https://github.com/r2d4/parserllm</a><p>[2] <a href="https://github.com/thiggle/api">https://github.com/thiggle/api</a><p>There's also another API on Thiggle that I've built that supports classification via a similar logit-based strategy.
The “trick” seems to blatantly rip off FlashText without citing it?<p><a href="https://arxiv.org/pdf/1711.00046.pdf" rel="nofollow noreferrer">https://arxiv.org/pdf/1711.00046.pdf</a><p>I’m a fan of the approach. I normally wouldn’t care if this was just another LLM library taking inspiration, but if you’re going to go out of your way to put a paper on the ArXiv, feels like doing a literature review is a good step?