Outlines is a Python library that focuses on text generation with large language models. Brandon and I are not LLM experts and started the project a few months ago because we wanted to understand better how the generation process works. Our original background is in probabilistic, relational and symbolic programming.<p>Recently we came up with a fast way to generate text that matches a regex (<a href="https://blog.normalcomputing.ai/posts/2023-07-27-regex-guided-generation/regex-guided-generation.html" rel="nofollow noreferrer">https://blog.normalcomputing.ai/posts/2023-07-27-regex-guide...</a>). The basic idea is simple: regular expressions have an equivalent deterministic finite automaton (DFA) representation. We can transform this DFA into a generative model: in each state we get a list of symbols which correspond to completions that partially match the regular expression. We mask the other symbols in the logits returned by a large language model, sample a new symbol and move to the next state. The subtlety is that language models work with tokens, not symbols, so we derive a new FSM whose alphabet is the model's vocabulary. We can do this in only one pass over the vocabulary.<p>Generating the token masks thus only requires a dictionary lookup at each state. Our method blows other libraries like Microsoft's guidance out of the water.<p>From there it was only a small leap to be able to generate text that follows a JSON schema (<a href="https://json-schema.org/" rel="nofollow noreferrer">https://json-schema.org/</a>), or is parseable into a Pydantic model (<a href="https://docs.pydantic.dev/latest/usage/models/" rel="nofollow noreferrer">https://docs.pydantic.dev/latest/usage/models/</a>). The method works with union types, optional types, nested schemas, arrays, everything. It is guaranteed that the output is parseable.<p>I think it's cool, and I've spent a lot of time watching even tiny models output valid JSON over the weekend. Hope you will too.<p>I look forward to feedback, bug reports, feature requests and discussions!<p>Edit: Link to our pre-print explaining the method and how this can be extended to generate text that follows a context-free grammar: <a href="https://arxiv.org/abs/2307.09702" rel="nofollow noreferrer">https://arxiv.org/abs/2307.09702</a>
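To make the masking step concrete, here is a minimal, self-contained sketch of the idea (not Outlines' actual implementation): a toy FSM whose states map to the token ids that keep the output inside the target language, used to mask a logits vector before sampling. The vocabulary, state names and logits below are invented for illustration.<p><pre><code>import math
import random

# Toy vocabulary and a toy "FSM": each state lists the token ids that keep
# the partially generated text inside the target language, plus the state
# reached after emitting each of them.
vocab = {0: '{', 1: '"name"', 2: ':', 3: '"Alice"', 4: '}', 5: 'banana'}
fsm = {
    "start": {"allowed": {0}, "next": {0: "key"}},
    "key":   {"allowed": {1}, "next": {1: "colon"}},
    "colon": {"allowed": {2}, "next": {2: "value"}},
    "value": {"allowed": {3}, "next": {3: "close"}},
    "close": {"allowed": {4}, "next": {4: "done"}},
}

def constrained_sample(logits, allowed):
    """Mask disallowed tokens, renormalize, and sample one token id."""
    masked = {i: l for i, l in enumerate(logits) if i in allowed}
    z = sum(math.exp(l) for l in masked.values())
    r, acc = random.random(), 0.0
    for i, l in masked.items():
        acc += math.exp(l) / z
        if r <= acc:
            return i
    return next(iter(masked))

state, out = "start", []
while state != "done":
    logits = [random.gauss(0, 1) for _ in vocab]  # stand-in for the LLM forward pass
    token_id = constrained_sample(logits, fsm[state]["allowed"])
    out.append(vocab[token_id])
    state = fsm[state]["next"][token_id]

print("".join(out))  # {"name":"Alice"}</code></pre>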
Mechanistically, I think this library takes the simple idea of masking part of the vocabulary space at each step in time and implements it efficiently. Great!<p>I am curious, however, for those who have played around with such libraries wrapping base LLMs with output structure: do base models like Llama2 work very well?
My experience says "hell no!" and you do need a fair bit of instruction-tuning for specific use cases to actually get things to work.<p>And even then, it seems very counter-intuitive to me that, given an instruction-tuned model, post-hoc masking of the state space during generation just amounts to changing the generation distribution, which is potentially detrimental to the instruction-tuning?
I can make GPT4 return valid JSON simply by providing examples in the system message. This works nine times out of ten.<p>But it's still probabilistic, and nine times out of ten isn't good enough.<p>Occasionally it will hallucinate responses like this:<p>{"key1": "value1", "key2": "value2" for i in range(n)}<p>Re-prompting with the parsing error message is usually enough to get it on the second try.<p>But escaping double-quotes and newline characters is less reliable. Even after giving it multiple examples, it correctly escapes only about half the time.<p>Re-prompting for escaping errors still yields a ~50% success rate.
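For what it's worth, the retry loop described above is easy to sketch; call_llm here is a hypothetical stand-in for whatever chat-completion call you use:<p><pre><code>import json

def call_llm(messages):
    """Hypothetical wrapper around your chat-completion API of choice."""
    raise NotImplementedError

def get_json(prompt, max_attempts=3):
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_attempts):
        reply = call_llm(messages)
        try:
            return json.loads(reply)
        except json.JSONDecodeError as err:
            # Feed the parser error back and ask for a corrected response.
            messages += [
                {"role": "assistant", "content": reply},
                {"role": "user", "content": f"That was not valid JSON ({err}). Please try again."},
            ]
    raise ValueError("no valid JSON after retries")</code></pre>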
A major part of the power of an LLM is the calibrated probability distribution in its responses, and this technique probably throws that ability away. Why is it good enough?<p>As a brief example, suppose the only possible LLM outputs were "hello world", "food", "hello", and "good day" (and that they're all equally probable with no prompting). Suppose your grammar requires a space in the output somewhere and has no other constraints. If you sampled LLM outputs till something passed the grammar, you'd receive "hello world" and "good day" with equal probability. If you apply the website's technique you'll receive "hello world" twice as frequently as "good day".<p>The core problem is that an answer prefix might have been extremely unlikely to yield a valid response, but the technique (probably -- assuming it succeeds -- my example assumed retries would eventually succeed) constructs a valid response from it regardless. Assuming enough independence in the right places everything is fine and dandy still, but correlated errors compound quickly in autoregressive models.<p>As a brief JSON-specific question, is an LLM more or less likely to make factual errors (hallucinations, truncated strings, missing main characters, ...) when it produces a response failing to adhere to a schema? If factual error rate relates nontrivially to schema error rate then this path is more perilous than it seems. Given the outsized impact certain words or schmooshed together word-phrases seem to have on LLM output, I'd be surprised if details like schema adherence didn't bleed into other characteristics of the output.
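To spell out the arithmetic in this example (a toy enumeration; treating the first whitespace-free chunk as one sampling step is of course a simplification of real tokenization):<p><pre><code>from fractions import Fraction as F

# Unconstrained distribution over whole outputs (all equally likely).
p = {"hello world": F(1, 4), "food": F(1, 4), "hello": F(1, 4), "good day": F(1, 4)}
valid = {s for s in p if " " in s}

# Rejection sampling: condition the whole-sequence distribution on validity.
z = sum(p[s] for s in valid)
print({s: p[s] / z for s in valid})  # hello world: 1/2, good day: 1/2

# Left-to-right masking: the first chunk is drawn from the unconstrained
# marginal, and the constraint then forces whatever follows to be valid.
first_chunk_mass = {
    "hello": p["hello world"] + p["hello"],  # 1/2, always completed to "hello world"
    "good":  p["good day"],                  # 1/4, completed to "good day"
    "foo":   p["food"],                      # 1/4, forced into some other valid string
}
print(first_chunk_mass["hello"] / first_chunk_mass["good"])  # 2, i.e. twice as frequent</code></pre>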
Thanks for building this. The mechanics are such an obvious idea that it's astounding that the first-party platforms haven't done this yet. I would be interested to see how this could be used for other tasks outside of JSON that require structured input.
I'm not sure how this is different from:<p><a href="https://github.com/1rgs/jsonformer">https://github.com/1rgs/jsonformer</a><p>or<p><a href="https://github.com/newhouseb/clownfish">https://github.com/newhouseb/clownfish</a><p>or<p><a href="https://github.com/mkuchnik/relm">https://github.com/mkuchnik/relm</a><p>or<p><a href="https://github.com/ggerganov/llama.cpp/pull/1773">https://github.com/ggerganov/llama.cpp/pull/1773</a><p>or<p><a href="https://github.com/Shopify/torch-grammar">https://github.com/Shopify/torch-grammar</a><p>Overall there are a <i>ton</i> of these logit-based guidance systems; the reason they don't get much traction is that the SOTA models are behind REST APIs that don't enable this fine-grained approach.<p>Those models perform so much better that people generally settle for just re-requesting until they get the correct format (and with GPT-4 that ends up being a fairly rare occurrence in my experience).
So to explain this another way:<p>After each token generated by the LLM you update the logit bias “mask” to only allow the next token to be a valid json token?<p>Very slick!
Is this Brandon Willard the breakdancer from Detroit Brandon Willard?<p>Edit: It is! <a href="https://brandonwillard.github.io/" rel="nofollow noreferrer">https://brandonwillard.github.io/</a>
Hi, remilouf. You say that your background is in "probabilistic, relational and symbolic programming". In that case I suspect you understand that it is no problem to generate text from a regular or context-free grammar, or really any level of grammar. For example, you can do that very easily in Prolog (a relational language) given a grammar in Definite Clause Grammars notation.<p>As far as I can tell your approach requires a grammar to be given by a user. In that case, what is the advantage of using an LLM to generate text? Why can't you just run your grammar as a generator and generate the text you want? That would save you the considerable trouble and cost of training an LLM in the first place. And why would you need an LLM, a model of natural language, if all you want is to generate structured text, anyway?
This is exciting: we built a similar tool[1] recently, specifically targeted at constraining llama output to match a TypeScript interface.<p>I firmly believe that output format guarantees are going to be important for real (non-toy) use cases for LLMs.<p>[1] <a href="https://github.com/ggerganov/llama.cpp/discussions/2494">https://github.com/ggerganov/llama.cpp/discussions/2494</a>
Are there temperature or sampling parameters for generate.regex? I'm poking around trying to generate password mnemonics (<a href="https://rmmh.github.io/abbrase/" rel="nofollow noreferrer">https://rmmh.github.io/abbrase/</a>), and it really doesn't like actually giving me proper words:<p><pre><code> >> model = models.transformers("gpt2-medium")
>> generate.regex(model, r"Rea[a-z']{,10} lik[a-z']{,10} acr[a-z']{,10} ene[a-z']{,10} sta[a-z']{,10}\.", max_tokens=30)("A memorable phrase is:")
'Rearmingandme like acrowetteanda eneatubootank stackfishkies.'</code></pre>
One potential drawback I can see is if the viable tokens are far down the list of predictions. In that case, filtering down to just those tokens is a distribution shift, with the resulting output being less stable / less sensible.
Looks interesting! How would you say it compares to Microsoft's TypeChat (beyond the obvious Python/TypeScript difference)?<p><a href="https://microsoft.github.io/TypeChat/blog/introducing-typechat/" rel="nofollow noreferrer">https://microsoft.github.io/TypeChat/blog/introducing-typech...</a>
OpenAI has this capability built in with functions[0], I believe! Building my own project[1] I have implemented functions in combination with guidance[2] and haven’t had a hiccup yet! I have a JSON parser function there, just in case, but it seems to be working reliably.<p>Here’s a bit more of a description of using the functions API for JSON returns: <a href="https://yonom.substack.com/p/native-json-output-from-gpt-4" rel="nofollow noreferrer">https://yonom.substack.com/p/native-json-output-from-gpt-4</a><p>[0] <a href="https://openai.com/blog/function-calling-and-other-api-updates" rel="nofollow noreferrer">https://openai.com/blog/function-calling-and-other-api-updat...</a><p>[1] <a href="https://resgen.app" rel="nofollow noreferrer">https://resgen.app</a><p>[2] <a href="https://github.com/guidance-ai/guidance">https://github.com/guidance-ai/guidance</a>
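For reference, a minimal sketch of that flow with the 0.27-era openai-python client; the model name and schema below are placeholders:<p><pre><code>import json
import openai

resp = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",
    messages=[{"role": "user", "content": "Extract: Alice is 31 years old."}],
    functions=[{
        "name": "record_person",
        "description": "Record a person's name and age.",
        "parameters": {  # a JSON Schema
            "type": "object",
            "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
            "required": ["name", "age"],
        },
    }],
    function_call={"name": "record_person"},  # force this particular function
)

# The arguments come back as a JSON string (usually, but not provably, valid),
# hence the fallback parser mentioned above.
args = json.loads(resp["choices"][0]["message"]["function_call"]["arguments"])
print(args)</code></pre>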
OK, you get syntactically valid JSON, but does it contain the correct info? This is effectively a polisher, like spell check, which gives the output superficially correct form but doesn't understand the content. Right?
For complex tasks like coding, my experience is that asking for a complex output format hurts performance on the underlying task. This showed up clearly in code editing benchmarks of GPT-3.5 and GPT-4:<p><a href="https://aider.chat/docs/benchmarks.html" rel="nofollow noreferrer">https://aider.chat/docs/benchmarks.html</a><p>I’m curious if you have measured whether the “constrained generation” that you’re doing suffers from similar downsides?
I really hope OpenAI add something like this to their endpoints soon.<p>Being able to pass up some kind of grammar (a regular expression, or a JSON schema, or some other format) and have this trick run during their token sampling process to ensure the output was compliant would be incredibly useful.
As a more general comment, the repo README provides examples that all use gpt2. It would be nice to see at least one example that invokes llama2, since that would reassure the reader that this library works with more modern and interesting models.
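For instance, something along these lines should work, mirroring the gpt2 snippet shown elsewhere in the thread (the Llama-2 checkpoint name and import paths here are assumptions; you also need access to the weights on the Hub and enough VRAM):<p><pre><code>import outlines.models as models
import outlines.text.generate as generate

# Assumes you have been granted access to the meta-llama weights on the Hub.
model = models.transformers("meta-llama/Llama-2-7b-hf")

answer = generate.regex(
    model,
    r"(Positive|Negative)",
    max_tokens=5,
)("Review: This movie was fantastic.\nSentiment:")
print(answer)</code></pre>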
A few thoughts: you're effectively creating representations that convert to JSON (kudos!)<p>I can't go into how we did it (there are a lot of public patents, if interested), but back in 2018 we had a way to generate synthetic data (statistically and structurally similar) off any dataset - <a href="https://medium.com/capital-one-tech/why-you-dont-necessarily-need-data-for-data-science-48d7bf503074" rel="nofollow noreferrer">https://medium.com/capital-one-tech/why-you-dont-necessarily...</a>. You could also design datasets if you wanted.<p>It kept similar relations and worked pretty darn well. Not the exact same data, but it always produced valid JSON.
Enforcing JSON schemas, regexes and grammars is very useful. But how can we enforce decoding spans from a document? The decoded text should be copied from a list of spans in the input document. That would be useful for extractive tasks.
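One way to phrase that within the regex machinery described in the post (a sketch, not a feature of the library): build an alternation of the escaped candidate spans and use it as the constraint.<p><pre><code>import re

# Candidate spans extracted from the input document (toy example).
spans = ["Barack Obama", "44th president", "United States"]

# Constrain the output to be exactly one of the allowed spans; this pattern
# could then be handed to a regex-constrained generator like the one above.
span_regex = "(" + "|".join(re.escape(s) for s in spans) + ")"
print(span_regex)</code></pre>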
Generating an FSM over the vocabulary is a really interesting approach to guided sampling! I'm hacking on a structured inference library (<a href="https://github.com/gsuuon/ad-llama">https://github.com/gsuuon/ad-llama</a>) - I also tried to add a vocab preprocessing step to generate a valid-tokens mask (just with regex or static strings initially) but discovered that doing so would cause unlikely / unnatural tokens to be masked rather than the token which represents the natural encoding given the existing sampled tokens.<p>Given the stateful nature of tokenizers, I decided that trying to preprocess the individual token ids was a losing battle. Even in the simple case of whitespace, tokenizer merges can really screw up generating a static mask: e.g., we expect a space next, but a token decodes to 'foo' when it is actually a '_foo' that would have decoded with a whitespace had it followed a valid pair. When I go to construct the static vocab mask, it would then end up matching against 'foo' instead of ' foo'.<p>How did you work around this for the FSM approach? Does it somehow include information about merges / whitespace / tokenizer statefulness?
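For what it's worth, here is one way to see the issue and the usual workaround (a sketch with the GPT-2 tokenizer; Outlines' internals may differ):<p><pre><code>from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

ids = tok.encode(" hello")
# The raw token strings carry a byte-level space marker ("Ġ")...
print(tok.convert_ids_to_tokens(ids))  # e.g. ['Ġhello']
# ...while decoding the same ids yields the surface form with the real space.
print(repr(tok.decode(ids)))           # ' hello'

# So a static mask is best built from the decoded surface form of every
# individual token id, in one pass over the vocabulary. (Sentencepiece
# tokenizers need extra care: decoding a lone "▁foo"-style token can
# silently drop its leading space.)
surface_forms = {i: tok.decode([i]) for i in range(len(tok))}
print(sum(1 for t in surface_forms.values() if t.startswith(" ")), "tokens start with a space")</code></pre>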
I have a noob thought on the potential of these in formal path planning.
Specifically given a set of functions that basically map {State -> Actions} given preconditions, transition functions (heavily paraphrasing STRIPS[1]) can a correct and optionally "realistic" plan be generated[2]? I am quite interested in this. It seems clear that the issue is that there is no "guidance" like DFA on what is the correct next symbol for a Plan, but perhaps the AI can generate some kind of a probability or order on what is the best step and one can go from there...<p>Are you guys thinking about this direction?<p>[1] <a href="https://en.wikipedia.org/wiki/Stanford_Research_Institute_Problem_Solver#Complexity" rel="nofollow noreferrer">https://en.wikipedia.org/wiki/Stanford_Research_Institute_Pr...</a><p>[2] Formal Planning decision problem(plan exists) given STRIPS spec is at least NP-Complete[1]. There are several mathematical, logical and statistical "tricks"(e.g. [3]) that are used to bring down the complexity and try find a plan using heuristics(thinking MDPs, POMDPs here). This is not new, everyone in LLM research knows this.<p>[3] "Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning": <a href="https://www.sciencedirect.com/science/article/pii/S0004370299000521?ref=pdf_download" rel="nofollow noreferrer">https://www.sciencedirect.com/science/article/pii/S000437029...</a>
FYI llama.cpp has been able to do that for a "while" now: <a href="https://github.com/ggerganov/llama.cpp/pull/1773">https://github.com/ggerganov/llama.cpp/pull/1773</a><p>Somebody is also working on a whisper.cpp version, which is maybe even more interesting, because with a grammar you can speak not only JSON but also code (or anything).
This is amazing! For production and rapid-development use cases, though, we just use XML for information extraction. It's extremely easy to parse with regex, and the models rarely make mistakes since the start and end tokens are uncommon. At least, this is just with the OpenAI models, which are different from the use cases in this Show HN.
Having played around with this sort of thing in the llama.cpp ecosystem when they added it a few weeks ago, I will say that it also helps if your model a) is tuned to output JSON and b) is prompted to do so. Anything you can do to help the output fit the grammar helps.
How does this compare in terms of latency, cost, and effectiveness to jsonformer? <a href="https://github.com/1rgs/jsonformer">https://github.com/1rgs/jsonformer</a>
That looks intriguing. Managing that interface has proven challenging - especially on data cleaning tasks where the model ends up talking rather than doing. A bit more guardrails would be helpful there.
Would love to have a tutorial on how to install and run this locally with a nice model, for those of us who are behind the 8-ball with torch, transformers, diffusers, llama2 etc.
I feel like I'm missing something very basic here, but is this library intended to be used with an existing model? If so, could you point to an example?
Are there edge cases here due to context length?<p>1. I have a JSON schema with required fields. I complete the JSON, but do not include the required fields.<p>2. I run out of tokens from the model before I finish the JSON object because I'm in the middle of some deep, nested structure.<p>These seem solvable, just edge cases to control for by either reserving tokens, randomly generating required tokens until completing the JSON, or something more sophisticated.
I've spent two days trying to make this work with anything other than gpt2 and I just can't get it to work.<p>GPT2 doesn't seem to take instruction well. I've tried llama, gpt-medium, etc.<p>They all either pick up a different language or freeze.<p>EDIT: I see tons of activity and work in the GitHub issues, so ignore this for now.<p>Super excited for when I'll be able to have this working for myself!
Ok so:<p>- for what energy/processing cost per validation?<p>- how much of the input space was tested (unicode chars, escaped chars, newlines, etc)?<p>- are you doing this as a service? We've seen LLMs already evolve negatively in some capabilities over time, so do you have a constant "ping" test suite validating the LLM's performance?
it still blows my mind that OpenAI exposes an API with Functions calling, and yet <i>does not guarantee the model will call your function correctly</i>, in fact, it does not even guarantee the output will be valid JSON.<p>When this is, really, a solved problem. I've been using github.com/microsoft/guidance for weeks, and it genuinely, truly guarantees correct output, because <i>it simply does not sample from tokens that would be invalid.</i><p>It just seems so obvious, I still have no clue why OpenAI does not do this. Like, why fuss around with validating JSON after the fact, when you can simply guarantee it is correct in the first place, by only sampling tokens <i>if they conform to the grammar you are trying to emit?</i>
<a href="https://github.com/newhouseb/clownfish">https://github.com/newhouseb/clownfish</a><p>Which I've been using for a while now, also restricts the sampling space to force correct generation, but does so as the result of a different process than yours.
I tried slight modifications from the example pydantic model and it's incredibly slow. Maybe I'm doing something wrong but I've a hefty box and a 3090, an example using gpt-2 doesn't seem like it should be that taxing.
It says "Outlines 〰 is compatible with all models.".
But does this actually work with gpt3.5-turbo or gpt4?
I was using guidance before and you only get value when using davinci due to the constraints of chat api based models.
This is what we did at Trex (<a href="https://github.com/automorphic-ai/trex">https://github.com/automorphic-ai/trex</a>). The tricky part is doing it quickly and efficiently.
It does seem inapt to claim this “eliminates” hallucinations in your blog post. Sort of like unnamed FP languages claiming to eliminate bugs.<p>Both eliminate a subclass of failures, but don’t preclude failure categorically.
Notable that you can't seem to use this trick to have an LLM create JSON that has JSON embedded in it. Which... happens far more often than it probably should. :(
How is this different from generating such things without an LLM? In other words picking random valid tokens from the grammar via fuzzing or similar techniques.
You should probably look into Guidance [1](previously Microsoft Guidance but looks like it’s been separated from their main organization), which is a language for controlling the output of LLMs (so you can, among many other things, output JSON in a deterministic way)<p>[1]: <a href="https://github.com/guidance-ai/guidance">https://github.com/guidance-ai/guidance</a>
<p><pre><code> print(guided)
# What is the IP address of the Google DNS servers?
# 2.2.6.1
</code></pre>
correctly formatted wrong answers are still wrong answers.
> LLMs can generate valid JSON 100% of the time<p>If that seems surprising, it is worth doing a course like Karpathy's zero-to-hero NN series and having all the magic peeled away a layer at a time.<p>The reason you can do this is that LLMs don't just generate the next word or token: they produce a probability distribution over all tokens. A JSON parser can give you a list of next valid tokens. The tokens in each case might be from a different set, e.g. the LLM thinks of " The" whereas the JSON parser might think of "{", so you need some conversion there. But if you sample randomly from only the valid tokens, the output must be valid JSON.<p>What you can't build a parser for, though, is ... the truth! You may still be told lies or made-up stuff.
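A minimal sketch of that sampling step (toy numbers; valid_ids stands for whatever the JSON parser says is allowed next, after mapping parser symbols to tokenizer ids):<p><pre><code>import numpy as np

def sample_constrained(logits, valid_ids):
    """Zero out everything the parser forbids, renormalize, and sample."""
    mask = np.full_like(logits, -np.inf)
    mask[list(valid_ids)] = 0.0
    masked = logits + mask
    probs = np.exp(masked - masked.max())
    probs /= probs.sum()
    return int(np.random.choice(len(logits), p=probs))

# Toy example: a 6-token vocabulary where the parser only allows tokens 0 and 4.
logits = np.array([1.0, 3.0, 0.5, 2.0, 1.5, -1.0])
print(sample_constrained(logits, {0, 4}))  # always 0 or 4</code></pre>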
Regex-constrained GPT, what is a mnemonic for pi?<p>> It's a word, a short statement or phrase which you learn.<p>Can you make a good one?<p>> Man, I wish I could recommend an answer. You're not gonna remember something, because, obviously, pi's so big. Actually, let's forget pi. There's only one way: Googling for it.<p>(count the letters)
I also released a hosted version of my open-source libraries ReLLM and ParserLLM that already supports APIs for<p>* Regex completion for LLMs<p>* Context-free grammar completion for LLMs<p><a href="https://thiggle.com/" rel="nofollow noreferrer">https://thiggle.com/</a><p>[0] <a href="https://github.com/r2d4/rellm">https://github.com/r2d4/rellm</a><p>[1] <a href="https://github.com/r2d4/parserllm">https://github.com/r2d4/parserllm</a><p>[2] <a href="https://github.com/thiggle/api">https://github.com/thiggle/api</a><p>There's also another API on Thiggle that I've built that supports classification via a similar logit-based strategy.
The “trick” seems to blatantly rip off FlashText without citing it?<p><a href="https://arxiv.org/pdf/1711.00046.pdf" rel="nofollow noreferrer">https://arxiv.org/pdf/1711.00046.pdf</a><p>I’m a fan of the approach. I normally wouldn’t care if this was just another LLM library taking inspiration, but if you’re going to go out of your way to put a paper on the ArXiv, feels like doing a literature review is a good step?