I've thought about building this for a while, glad it's out there!<p>Not only does this guarantee your output is JSON, it lowers your generation cost and latency by filling in many of the repetitive schema tokens without passing them through the LLM.<p>For the very common case of "extracting multiple structured fields from a piece of unstructured text," I believe there's an even stronger optimization possible that would further decrease cost and latency, and potentially even improve accuracy.<p>Assuming the fields you want to extract are independent (and they often are), you don't <i>need</i> to generate them all in one go autoregressively. E.g. instead of running the following pseudo-prompt:<p><pre><code> "Input: 'It's sunny and cold today'
Output schema: {"sunny": boolean, "temperature": string}"
</code></pre>
You could instead run the following two:<p><pre><code> "Input: 'It's sunny and cold today'
Output schema: {"sunny": boolean}"
"Input: 'It's sunny and cold today'
Output schema: {"temperature": string}"
</code></pre>
We don't do that today because when done naively it's very inefficient -- you'd be tokenizing, passing to the GPU, and computing the KV cache of the shared part of the prompt twice. But a library with the right abstraction could run those two queries in a batch in parallel and reuse the same tokenization and KV cache for both of them. It would actually be <i>more</i> efficient than generating both fields in one go, since when you factor out the shared prefixes both the generated text and its context are shorter!<p>I mentioned above that this could also improve accuracy. Of course it doesn't do that by default (except that by excluding all the irrelevant fields it makes self-attention's job easier). But what it <i>does</i> do is give you an independent prompt for each field you're interested in. And so for particularly tricky fields you're trying to extract, you have the flexibility to, e.g., add several examples to make the generation N-shot.
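A rough sketch of the prefix reuse I mean, assuming a HuggingFace-style causal LM (sequential here for clarity, where a real implementation would batch the per-field continuations; the model name and token budget are just placeholders):<p><pre><code> import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

shared = "Input: 'It's sunny and cold today'\n"
suffixes = ['Output schema: {"sunny": ', 'Output schema: {"temperature": ']

# Tokenize and run the shared prefix once, keeping its KV cache.
prefix_ids = tokenizer(shared, return_tensors="pt").input_ids
with torch.no_grad():
    prefix_out = model(prefix_ids, use_cache=True)

for suffix in suffixes:
    # Copy the cached prefix so each field continues from the same state.
    past = copy.deepcopy(prefix_out.past_key_values)
    ids = tokenizer(suffix, return_tensors="pt").input_ids
    pieces = []
    with torch.no_grad():
        out = model(ids, past_key_values=past, use_cache=True)
        for _ in range(8):  # a few greedy steps per field, for illustration
            next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
            pieces.append(next_id.item())
            out = model(next_id, past_key_values=out.past_key_values, use_cache=True)
    print(suffix + tokenizer.decode(pieces))
</code></pre>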
Oh nice! I built a similar system a few weeks ago: <a href="https://github.com/newhouseb/clownfish">https://github.com/newhouseb/clownfish</a><p>I think the main differentiating factor here is that this is better if you have a simpler JSON schema without enums or oneOf constraints. If you do have these constraints, e.g. let's say you wanted an array of different types that represented items on a menu { kind: pizza, toppings: [pepperoni] } or { kind: ice_cream, flavor: vanilla | strawberry } (sketched as a schema at the end of this comment), then you would need something more sophisticated like clownfish that can ask the LLM to pick specific properties (and an ability to do some backtracking so you can do proper beam search).<p>For completeness, another common approach can be found here: <a href="https://github.com/ShreyaR/guardrails">https://github.com/ShreyaR/guardrails</a> which essentially boils down to "provide the schema in the prompt and ask the LLM to correct things if it fails to get the schema right the first time."
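Sketched as a schema (written as a Python dict purely for illustration, not taken from either library), the menu example is the kind of thing a field-by-field filler can't handle on its own, because it can't know whether to emit "toppings" or "flavor" until a branch has been chosen:<p><pre><code> # Hypothetical oneOf schema for the menu example.
menu_item_schema = {
    "oneOf": [
        {"type": "object", "properties": {
            "kind": {"const": "pizza"},
            "toppings": {"type": "array", "items": {"type": "string"}},
        }},
        {"type": "object", "properties": {
            "kind": {"const": "ice_cream"},
            "flavor": {"enum": ["vanilla", "strawberry"]},
        }},
    ]
}
</code></pre>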
> Bulletproof JSON generation: Jsonformer ensures that the generated JSON is always syntactically correct and conforms to the specified schema.<p>This is an important definition to take note of: "bulletproof" doesn't mean that you'll get good or correct data. It only means that it'll be valid JSON and in a particular schema that you specify (because the LLM isn't building the JSON in the first place, the library is).<p>It's an interesting idea. But it's not clear if they've validated the heuristics they use, to see how well it performs in terms of accuracy against, say, some kind of BeautifulSoup-like attempt to make sense of the JSON-ish that the LLM produces and correct that to be valid JSON, or any other approach to the problem.
Love to see further work on constrained decoding like this and other systems introduced in the comments!<p>See my work and the paper about it. I've got a lot of y'all beat on this (constrained decoding, not the templating and structuring) by about a year:<p><a href="https://github.com/hellisotherpeople/constrained-text-generation-studio">https://github.com/hellisotherpeople/constrained-text-genera...</a>
Seen a lot of things trying to do this by pressure testing the outputs, but all feel like anti-patterns. This is the first that seems like the "right" way to do it. Better to manage how the model is generating vs creating one more potentially faulty "glue" layer.
I found it rather strange that the new Andrew Ng course about prompting, which features an OpenAI employee, says nothing about templated output.<p>To me this is a killer feature of GPT: being able to turn a document into JSON or any other template.<p>This kind of prompt is just amazing for GPT (try it with a blog post, document or any other thing):
"Analyze this document and transform it into the following format:<p><title><p><summary (text conciseness: 5/10)><p><content bullet points (text conciseness 3/10)><p><content_item 1><p><content_item 2><p><content_item N>"<p>Also you can ask the same prompt in a json and GPT will gladly transform a PDF into a JSON.
I know of a similar one called GPTyped; I just posted it on HN <a href="https://news.ycombinator.com/item?id=35793056#35793057" rel="nofollow">https://news.ycombinator.com/item?id=35793056#35793057</a>
How about going one step further and constraining transformer output with a context-free grammar? That way you can generate output that conforms to a language's syntax, such as Python or C.
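A toy sketch of what that could look like in the decoding loop, assuming a HuggingFace-style model; the balanced-parentheses check is a stand-in for a real CFG/Earley prefix test:<p><pre><code> import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def is_valid_prefix(text):
    # Placeholder "grammar": parentheses never close more than they open.
    depth = 0
    for ch in text:
        depth += (ch == "(") - (ch == ")")
        if depth < 0:
            return False
    return True

def constrained_step(input_ids, text_so_far):
    # Score all tokens, then take the best one that keeps the output grammatical.
    with torch.no_grad():
        logits = model(input_ids).logits[0, -1]
    for tok in torch.argsort(logits, descending=True)[:200]:
        piece = tokenizer.decode([int(tok)])
        if is_valid_prefix(text_so_far + piece):
            return int(tok), piece
    raise RuntimeError("no grammatical continuation in the top candidates")
</code></pre>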
Has anyone seen a tool like this that uses Node rather than Python? I have this exact problem in a GPT-based web application I am building and have had to resort to some “creative” solutions. At the very least I am glad to see people are tackling this problem.
Nice tool, will check it out. I had to go through a painstaking trial and error process to generate valid and deterministic JSON for my AI presentation tool called Slide Genie (<a href="https://slidegenie.vercel.app/" rel="nofollow">https://slidegenie.vercel.app/</a>). The hard part was making it work when temperature > 0.
Nice, this codifies something similar to what I've been doing in my prompts! Will be using this instead.<p>What I currently have been doing:<p>The JSON template for your response is provided below. The parts to fill out are capitalized. Please do not modify the template.
Please fill in the template with one of the above options for your response.<p><pre><code> <result>
 {
   "rating": "N. RATING",
   "reason": "REASON"
 }
 </result>
</code></pre>
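And roughly how I pull the result back out afterwards (the regex-and-parse step is just a sketch of my own, not part of any library):<p><pre><code> import json
import re

def parse_result(reply):
    # Pull the filled-in JSON back out of the <result> tags in the model's reply.
    match = re.search(r"<result>(.*?)</result>", reply, re.DOTALL)
    if match is None:
        raise ValueError("model did not return a <result> block")
    return json.loads(match.group(1))

reply = '<result>\n{\n  "rating": "3. NEUTRAL",\n  "reason": "Mixed feedback"\n}\n</result>'
print(parse_result(reply))  # {'rating': '3. NEUTRAL', 'reason': 'Mixed feedback'}
</code></pre>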
I actually did this with a silly little app I made that generates fake social media profiles (<a href="https://lookface.app" rel="nofollow">https://lookface.app</a>). I gave it a prompt telling it what to generate and an example JSON. As long as you say it must be in JSON, I haven't had any problems with it generating bad JSON.
Nice job - I've tried to massage the outputs to be structured and sometimes it works, but sometimes it fails badly. Having a more specific set of constraints around it will definitely make it more effective.
I wanted to see the opposite - parsing JSON and YAML generated from LLMs. It doesn't happen much with GPT-4 but lesser models might mess up the format and then you can't simply parse it.
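A rough sketch of the kind of salvage I have in mind; the heuristics are purely illustrative, not an existing library:<p><pre><code> import json

def lenient_json(reply):
    try:
        return json.loads(reply)
    except json.JSONDecodeError:
        pass
    # Common failure: extra prose around the object; keep only the outermost braces.
    start, end = reply.find("{"), reply.rfind("}")
    if start != -1 and end > start:
        candidate = reply[start:end + 1]
        # Another common failure: trailing commas (naive fix, ignores whitespace).
        candidate = candidate.replace(",}", "}").replace(",]", "]")
        return json.loads(candidate)
    raise ValueError("could not recover JSON from reply")

print(lenient_json('Sure! Here you go: {"sunny": true, "temperature": "cold",}'))
</code></pre>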
Something like this should be integrated with a library like <a href="https://fakerjs.dev/" rel="nofollow">https://fakerjs.dev/</a>
With LLM-based (or more generally AI-based) generation of the fake data, it could be more diverse and generalize to lots more applications, which would help developers.
My bad if faker already has AI-based generation and I'm just unaware of it, but afaik it does not right now.
I like the idea of getting ChatGPT to return something easily parse-able by a program. I've been using an XML derivative for that. <a href="https://github.com/ColinRyan/Chat-Markup-Language">https://github.com/ColinRyan/Chat-Markup-Language</a><p>Never thought to use json schema. I'll check this out!
I might be reading the code wrong, but it looks like it crawls the schema, making a generation per primitive type. While that’s a clever way to ensure valid JSON, I don’t know if I’d go as far as to describe it as efficient.<p>That said, if the model is unable to generate JSON due to its training/fine-tuning, this is indeed a clever solution!
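For anyone skimming, a stripped-down sketch of what "a generation per primitive type" means here (my paraphrase of the idea, not Jsonformer's actual code):<p><pre><code> def generate_value(prompt, json_type):
    # Stand-in for one constrained LLM call per primitive (string/number/boolean).
    return {"string": "example", "number": 0, "boolean": True}[json_type]

def fill(schema, prompt, path="$"):
    # Walk the schema; every leaf costs one model call.
    if schema["type"] == "object":
        return {key: fill(sub, prompt, path + "." + key)
                for key, sub in schema["properties"].items()}
    return generate_value(prompt + "\nValue for " + path + ":", schema["type"])

schema = {"type": "object", "properties": {
    "sunny": {"type": "boolean"}, "temperature": {"type": "string"}}}
print(fill(schema, "It's sunny and cold today"))  # two calls for two leaves
</code></pre>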
<p><pre><code> Efficiency: By generating only the content tokens and filling in the fixed tokens, Jsonformer is more efficient than generating a full JSON string and parsing it.
</code></pre>
I was excited to try this in Replit... and realized it required pytorch. Ouch. Replit was not happy about that!
Is there a way to do something like this but with fine-tuning? For example, I want to train a LoRA to become an email spam classifier. I have training data where the prompt is the email and the response is {Boolean: True/False}.
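For context, the kind of training data I have in mind, purely as an illustration (file name, field names and prompt wording are all made up):<p><pre><code> import json

emails = [("You won a FREE cruise!!!", True), ("Lunch tomorrow?", False)]

# Shape each example so the completion is the exact JSON wanted at inference time.
with open("spam_train.jsonl", "w") as f:
    for body, is_spam in emails:
        f.write(json.dumps({
            "prompt": "Classify this email as spam.\nEmail: " + body + "\nOutput:",
            "completion": json.dumps({"spam": is_spam}),
        }) + "\n")
</code></pre>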
It's not very hard through prompting. You can just ask the LLM to generate based on these parameters. I did this exact same thing and never wrote any code for it.
I hope that this is new to no one generating JSON using an LLM, because it felt like the obvious first thing to do when I implemented that kind of stuff. That being said, it's nice to have it ready to go as a library.