The "Strategies" section looks valuable.<p>Here are a few more great resources from my notes (including one from Lilian Weng who leads Applied Research at OpenAI):<p>- <a href="https://lilianweng.github.io/posts/2023-03-15-prompt-engineering" rel="nofollow">https://lilianweng.github.io/posts/2023-03-15-prompt-enginee...</a><p>- <a href="https://www.promptingguide.ai" rel="nofollow">https://www.promptingguide.ai</a> (check the "Techniques" section for several research-vetted approaches)<p>- <a href="https://learnprompting.org/docs/intro" rel="nofollow">https://learnprompting.org/docs/intro</a>
Are there established best practices for "engineering" prompts systematically, rather than through trial-and-error?<p>Editing prompts is like playing whack-a-mole: once you clear an edge case, a new problem pops up elsewhere. I'd really like to be able to say, "this new prompt performs 20% better across all our test cases".<p>Because I haven't found a better way, I am building <a href="https://github.com/typpo/promptfoo">https://github.com/typpo/promptfoo</a>, a CLI that outputs a matrix view for quickly comparing outputs across multiple prompts, variables, and models. Good luck to everyone else out there tuning prompts :)
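To make the "20% better across all our test cases" idea concrete, here's a minimal sketch of matrix-style prompt evaluation. The prompts, test cases, and pass criterion are placeholder examples of mine, not promptfoo's actual API; plug in whatever LLM client you use where `call_model` is stubbed:

```python
# Minimal sketch: score every prompt against every test case, so a claim
# like "20% better across all our test cases" is actually measurable.

def call_model(prompt: str) -> str:
    # Placeholder: wire this up to your LLM API of choice.
    raise NotImplementedError("plug in your model call here")

prompts = {
    "v1": "Classify the sentiment of: {text}",
    "v2": "Is the following text positive or negative? Answer with one word: {text}",
}

test_cases = [
    {"text": "I love this product", "expected": "positive"},
    {"text": "Terrible experience, would not recommend", "expected": "negative"},
]

for name, template in prompts.items():
    passed = 0
    for case in test_cases:
        output = call_model(template.format(text=case["text"]))
        if case["expected"] in output.lower():
            passed += 1
    print(f"{name}: {passed}/{len(test_cases)} test cases passed")
```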
Why are we calling this "engineering"?<p>Isn't engineering the application of science to solve problems? (math, definitive logic, etc.)<p>Maybe one day we'll have instruments that let us reason about the connections between prompts and the exact state of the AI, so that we can understand the mechanics of causation, but until then, I would not think that being good at asking questions is "engineering"<p>Are most 10 year olds veteran "search engineers"?<p>Btw I'm asking this slightly tongue-in-cheek, as a discussion point. For example plenty of computer system hacks are done by way of "social engineering", so clearly that term is malleable even within the tech community.
Is it just me, or is the bot's output in the section "Give a Bot a Fish" incorrect? It states that the most recent receipt is from Mar 5th, 2023, but there are two receipts after that date. This is what worries me about using ChatGPT - the possibility of errors in financial matters, which won't go down well, I fear.
Thanks very much for posting this! I haven't yet finished reading the whole thing, but even just the first section about the history of LLMs, explaining some of the basic concepts, etc., I found to be very well-written and useful, and it was really nice that it linked out to source material. So many times when you go into reading stuff about the latest AI technique or feature, it can feel like you need to do a ton of background reading just to understand what they're talking about (especially as the field moves so quickly), so having a nice simple primer at the beginning of this doc was most appreciated!
The suggestion to use markdown tables was quite interesting. It makes a lot of sense, and I haven't seen it described elsewhere.<p>I have been getting good results by asking GPT to produce semi-structured responses based on other aspects of (GitHub) markdown.<p>In general, I find it very helpful to find an already popular format that suits your problem. The model is probably already fluent in rendering that output format, so you spend less time trying to teach it the output syntax.
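For example (my own illustration, not from the guide), a prompt leaning on a format the model already knows might look like:

```python
# Illustrative prompt: ask for GitHub-flavored markdown rather than
# inventing a custom output syntax the model has never seen.
prompt = """List the expenses below as a GitHub-flavored markdown table
with columns Date, Merchant, and Amount.

2023-03-02 Starbucks $4.50
2023-03-05 AWS $120.00
2023-03-07 Uber $18.25
"""

# The model will typically reply with something like:
#
#   | Date       | Merchant  | Amount  |
#   |------------|-----------|---------|
#   | 2023-03-02 | Starbucks | $4.50   |
#   ...
#
# which is trivial to parse by splitting each line on "|".
```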
Worryingly, I am not sure the people working on this really understand what a Transformer is.<p>Quote from them:<p>“There is still active research in non-transformer based language models though, such as Amazon’s AlexaTM 20B which outperforms GPT-3”<p>Quote from said paper:<p>“For AlexaTM 20B, we used the standard Transformer model architecture”<p>(It's just an encoder-decoder transformer)
This reflects astonishingly poorly on Brex. What customer wants to hear that Brex is using "a non-deterministic model" for "production use cases" like "staying on top of your expenses"? I don't see them acknowledge the downsides of that non-determinism anywhere, let alone hallucination, even though they mention the latter. Hallucinating an extra expense, or missing one, could have serious consequences.<p>This is also potentially terrible from a privacy standpoint. That "staying on top of your expenses" example suggests that you upload "a list of the entire [receipts] inbox" to the model. It _seems_ like they're using OpenAI's API, which doesn’t use customer data for training (unlike ChatGPT), but they should be crystal clear about this. Even if OpenAI doesn't retain/reuse the data, would Brex's customers be happy with this 3rd-party sharing?<p>The expenses example seems like sloppy engineering too—there's no reason to share expense amounts with the model if you just want it to count the number of expenses. Merchant names could be redacted too, replaced with identifiers that Brex would map back to the real data. These suggestions would save on tokens too.<p>Despite Brex saying they're using this in production, I suspect it's mostly a recruiting exercise. It's still a very bad look for their engineering.
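A sketch of the kind of redaction I mean (the identifier-mapping scheme is my own suggestion, not something Brex describes):

```python
# Sketch: pseudonymize merchant names before the data ever leaves your side,
# and keep the mapping locally so names can be restored in the model's answer.
expenses = [
    {"merchant": "Starbucks", "amount": 4.50},
    {"merchant": "United Airlines", "amount": 412.00},
]

id_to_merchant = {}
redacted = []
for i, expense in enumerate(expenses):
    merchant_id = f"MERCHANT_{i}"
    id_to_merchant[merchant_id] = expense["merchant"]
    # If the task is just counting expenses, omit the amounts entirely.
    redacted.append({"merchant": merchant_id})

# Send `redacted` to the model, then substitute IDs in its answer back
# through id_to_merchant before showing anything to the user.
print(redacted, id_to_merchant)
```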
This seems overall well-written and well-explained, but I'm curious about the piece on fine-tuning. The article only recommends it as a last resort. That makes sense for a casual user, but if you're a company seriously using LLMs to provide services to your customers, wouldn't the cost of training data be offset by the potential gains, and by the edge cases you'd automatically cover by fine-tuning instead of trying to whack-a-mole predict every single way the prompt can fail?
YAML is just as effective as JSON at communicating data structure to the model while using ~50% fewer tokens. I now convert all my JSON to YAML before feeding it to the GPT APIs.
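The exact savings depend on your data, so it's worth measuring on your own payloads. A quick sketch, assuming the pyyaml and tiktoken packages are installed:

```python
# Compare token counts for the same data serialized as JSON vs. YAML.
import json

import tiktoken
import yaml

data = {
    "expenses": [
        {"merchant": f"Merchant {i}", "amount": 10.0 + i, "date": "2023-03-02"}
        for i in range(20)
    ]
}

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
as_json = json.dumps(data, indent=2)
as_yaml = yaml.safe_dump(data)

print("JSON tokens:", len(enc.encode(as_json)))
print("YAML tokens:", len(enc.encode(as_yaml)))  # usually noticeably fewer
```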
One thing I haven't heard much discussion about is the fact that ChatGPT is constantly being updated.<p>This means that if you build a prompt for classification and become confident that you've whacked all of the moles so that it is pretty solid with all of the edge cases, it can later start breaking again.<p>Some solutions I can think of: 1) pin a fixed model version to test against, though those get deprecated over time (see the sketch below), or 2) perhaps fine-tuning might help.
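For option 1, the OpenAI API does let you pin a dated snapshot. A sketch using the openai 0.x Python client that's current as I write this ("gpt-3.5-turbo-0301" is a real snapshot name today, but it will eventually be retired):

```python
# Pin a dated model snapshot instead of the auto-updating alias, so the
# model underneath your tested prompts doesn't silently change.
import openai

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0301",  # pinned snapshot, not "gpt-3.5-turbo"
    messages=[{"role": "user", "content": "Classify this expense: Starbucks $4.50"}],
    temperature=0,  # reduces (but doesn't eliminate) run-to-run variance
)
print(response.choices[0].message.content)
```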
I've been playing Gandalf in the last few days, it does a great job at giving an intuition for some of the subtleties of prompt engineering: <a href="https://gandalf.lakera.ai" rel="nofollow">https://gandalf.lakera.ai</a><p>Thanks for putting this together!
I'm working on the idea of features instead of prompts: <a href="https://inventai.xyz" rel="nofollow">https://inventai.xyz</a>