Thanks for sharing, I've followed these authors for a while and they're great.<p>Some notes from my own experience using LLMs for NLP problems:<p>1) The output schema is usually more impactful than the text part of the prompt.<p>a) Field order matters a lot. At inference, the tokens generated earlier influence the tokens that follow.<p>b) Just make the CoT a field in the schema too.<p>c) A PotentialField/ActualField pair lets the LLM generate some broad options and then select the best, which somewhat mitigates the fact that it can't backtrack. If you have human evaluation in your process, this also makes it easier for reviewers to correct mistakes.<p>`'PotentialThemes': ['Surreal Worlds', 'Alternate History', 'Post-Apocalyptic'], 'FinalThemes': ['Surreal Worlds']`<p>d) Most well-defined problems should be solvable zero-shot on a frontier model. Before rushing off to add examples, really check that you're solving the right problem in the best way.<p>2) Defining the schema as TypeScript types is flexible, reliable, and takes up minimal tokens. The output JSON structure is pretty much always correct (as long as it fits in the context window); the only issue is that the model can pick values outside the schema, but that's easy to validate in post.<p>3) "Evaluating LLMs can be a minefield." Yeah, it's a pain in the ass.<p>4) Adding too many examples increases the token cost per item a lot. I've found it's possible to process several items in one prompt and, despite seeming silly and inefficient, it works reliably and cheaply.<p>5) Example selection is not trivial and can cause very subtle errors.<p>6) Structuring your inputs with XML works very well. Even if you want JSON output, XML input seems to work better. (I haven't extensively tested this, because eval is hard.)
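<p>To make 1a-1c and 2 concrete, here's a rough sketch of the kind of schema-plus-validation I mean (the type names and fields are hypothetical examples, not from the article; the type definition is what gets pasted into the prompt, the function is the post-hoc check):

```typescript
// Hypothetical schema, sent to the model as plain TypeScript text in the prompt.
// Field order is deliberate: the CoT comes first (1b), then broad options,
// then the final pick (1c), so earlier tokens can steer later ones (1a).
type Theme = 'Surreal Worlds' | 'Alternate History' | 'Post-Apocalyptic';

type Analysis = {
  reasoning: string;        // chain-of-thought lives in the schema itself
  potentialThemes: Theme[]; // broad candidates generated first...
  finalThemes: Theme[];     // ...then narrowed to the best
};

// The JSON structure is almost always valid; the real failure mode is
// out-of-schema values, so validate those in post:
const ALLOWED = ['Surreal Worlds', 'Alternate History', 'Post-Apocalyptic'];

function validate(raw: unknown): raw is Analysis {
  const a = raw as Analysis;
  return typeof a?.reasoning === 'string'
    && Array.isArray(a?.potentialThemes)
    && Array.isArray(a?.finalThemes)
    && a.potentialThemes.every(t => ALLOWED.includes(t))
    && a.finalThemes.every(t => ALLOWED.includes(t));
}
```

The union type doubles as documentation for the model and as the allow-list for the validator, which is part of why the types-as-schema approach is cheap to maintain.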