You're comparing apples to oranges - structured output (a capability) with structured output + CoT (a technique) - and saying that structured output isn't good for reasoning. Well, it's not supposed to "reason", and you didn't apply CoT to it!

Overall I didn't like the post; I think it will confuse people, and it shouldn't be presented as best practice:

1. "To ensure a reasonable amount of variation, its temperature was set to 1.0."

Why would you use any temperature other than 0 when you are asserting the correctness of the data extraction and the "reasoning" of the LLM? You don't want variation.

2. true_answer = (50*29) + (1.7*50*9)

Why are you using the LLM to do the math? If the data was extracted correctly (with structured output or function calling), let the model write the formula or return the numbers, and evaluate it in code. Something like the sketch below.
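A minimal sketch of that split, assuming the extraction step has already run (the field names are mine, not from the article; only the numbers would come from the model):

```python
# Hypothetical output of the extraction step (structured output or
# function calling). Only these values come from the model.
extracted = {
    "base_rate": 50.0,          # dollars per hour
    "base_hours": 29,           # hours billed at the base rate per week
    "overtime_multiplier": 1.7,
    "hours_worked": 38,
}

# The math happens in plain code, deterministically.
base = min(extracted["hours_worked"], extracted["base_hours"])
overtime = max(extracted["hours_worked"] - extracted["base_hours"], 0)
true_answer = (
    extracted["base_rate"] * base
    + extracted["overtime_multiplier"] * extracted["base_rate"] * overtime
)
print(true_answer)  # (50*29) + (1.7*50*9) = 2215.0
```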
3. "The new Structured Outputs in the API feature from OpenAI is a significant advancement, but it's not a silver bullet." / "there's more to this story than meets the eye"

The new API is just a nicer built-in way to extract structured data. Previously (and it's still valid) you had to use function calling and pass it a "returnResult" function whose payload was typed to your expected schema. This is one of the most powerful and effective tools we have for working with LLMs; if it's used for what it's meant to do, we shouldn't avoid it just because it doesn't "reason" as well.
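Roughly what that pattern looks like, as a sketch with the OpenAI Python SDK (the "returnResult" name follows the pattern above; the schema fields and prompt are my assumptions):

```python
import json
from openai import OpenAI

client = OpenAI()

PROBLEM_TEXT = (
    "John Doe is a freelance software engineer. He charges a base rate of "
    "$50 per hour for the first 29 hours of work each week. For any "
    "additional hours, he charges 1.7 times his base hourly rate. This "
    "week, John worked on a project for 38 hours."
)

# A single "returnResult" tool whose parameters are the schema we expect.
tools = [{
    "type": "function",
    "function": {
        "name": "returnResult",
        "description": "Return the billing data extracted from the text.",
        "parameters": {
            "type": "object",
            "properties": {
                "base_rate": {"type": "number", "description": "Base hourly rate in dollars"},
                "base_hours": {"type": "integer", "description": "Hours billed at the base rate per week"},
                "overtime_multiplier": {"type": "number", "description": "Multiplier applied beyond base_hours"},
                "hours_worked": {"type": "integer", "description": "Hours actually worked this week"},
            },
            "required": ["base_rate", "base_hours", "overtime_multiplier", "hours_worked"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,  # extraction should be deterministic
    messages=[{"role": "user", "content": PROBLEM_TEXT}],
    tools=tools,
    # Force the model to "call" returnResult, i.e. to emit the typed payload.
    tool_choice={"type": "function", "function": {"name": "returnResult"}},
)
extracted = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
```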
> each week. For any additional hours, he charges 1.7
And the sample scenario is something I wouldn't use LLMs for anyway. Nevertheless, CoT can still be applied with structured output; I'd like to see your structured-output reasoning CoT to figure out why it didn't work. A sketch of how that usually looks is below.
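A minimal sketch of CoT inside structured output, assuming the SDK's parse helper (the class and field names are mine): put a reasoning field before the answer field, so the model writes out its steps before committing to a number.

```python
from openai import OpenAI
from pydantic import BaseModel, Field

client = OpenAI()

# Field order matters: "reasoning" is generated before "total_owed",
# so the model gets to think step by step before answering.
class Invoice(BaseModel):
    reasoning: str = Field(description="Work through the billing rules step by step before answering.")
    total_owed: float = Field(description="Final amount owed in dollars.")

completion = client.beta.chat.completions.parse(
    model="gpt-4o",
    temperature=0,
    messages=[{"role": "user", "content": PROBLEM_TEXT}],  # from the sketch above
    response_format=Invoice,
)
print(completion.choices[0].message.parsed)
```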
Structured outputs also let you extend the prompt by providing descriptions for the fields you expect (as in the sketch above), which makes it much more effective than other solutions, and you can implement features like self-healing loops on top (e.g. to get rid of the ~1% chance that gpt-4o doesn't reply with data following the schema), etc. The paper's authors used plain "JSON mode", which is useless; glad to see you did it better.

---

Anyway, here's GPT-4o with function calling (not even structured output) solving the issue correctly every time: https://gist.ro/gpt-reason.mp4

As you can see, it's super consistent with GPT-4o at temp 0, with a silly simple prompt. If someone worked on the prompts / split it into a "multi-step pipeline" (come on, is this what we call fn(fn(x)) now?) they would achieve the same result with 4o-mini.
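For the record, that "multi-step pipeline" really is just composition. A sketch under the same assumptions as above (extract and compute are my names, and the tool definition is reused from the earlier sketch):

```python
import json
from openai import OpenAI

client = OpenAI()

def extract(problem_text: str) -> dict:
    # Step one: pull the numbers out at temp 0, reusing the
    # "returnResult" tools definition from the sketch above.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[{"role": "user", "content": problem_text}],
        tools=tools,
        tool_choice={"type": "function", "function": {"name": "returnResult"}},
    )
    return json.loads(response.choices[0].message.tool_calls[0].function.arguments)

def compute(fields: dict) -> float:
    # Step two: plain arithmetic, no model involved.
    base = min(fields["hours_worked"], fields["base_hours"])
    overtime = max(fields["hours_worked"] - fields["base_hours"], 0)
    return fields["base_rate"] * (base + fields["overtime_multiplier"] * overtime)

answer = compute(extract(PROBLEM_TEXT))  # fn(fn(x))
print(answer)  # 2215.0
```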