Hey HN!

In my experiments with prompt engineering, I ran into a problem: most of the prompts I design can't be tested quantitatively because they have no right/wrong answer (e.g. providing essay feedback, or deciphering corporate-speak in meeting minutes). That means I can't run evals, super-powerful tools like ChainForge [1] are too high-overhead, and running one prompt at a time in ChatGPT... sucks.

I built Prompt Octopus [2] to evaluate as many prompts as I want, side by side, and it has sped up my workflow dramatically. You can plug in an API key online or self-host (I added Python and Node.js boilerplates in the repo). Click the octopus icon in the top right to change your model, see your history, and set the number of prompt-response boxes you're working with. I'm open-sourcing it here and want your feedback, both on the UX and the self-hosting experience!

This week I hope to add diff checking, batched API calls to speed things up, and support for more LLMs.
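If you're curious what the core idea boils down to, here's a minimal sketch of the fan-out in Python: one input, several prompt variants, responses printed side by side. It assumes the official OpenAI Python client; the model name and prompt texts are illustrative, not the actual repo code.

    # Sketch: run one input through several prompt variants and
    # print the responses side by side. Assumes the official OpenAI
    # Python client and OPENAI_API_KEY in the environment.
    import os
    from concurrent.futures import ThreadPoolExecutor

    from openai import OpenAI

    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

    # Illustrative prompt variants to compare.
    PROMPTS = {
        "blunt": "Give direct, unsparing feedback on this essay excerpt.",
        "gentle": "Give encouraging, constructive feedback on this essay excerpt.",
    }
    ESSAY = "In conclusion, synergy was leveraged across all verticals."

    def run(system_prompt: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": ESSAY},
            ],
        )
        return resp.choices[0].message.content

    # Fire the variants concurrently so N prompts don't take N times as long.
    with ThreadPoolExecutor() as pool:
        for name, reply in zip(PROMPTS, pool.map(run, PROMPTS.values())):
            print(f"--- {name} ---\n{reply}\n")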
[1] https://chainforge.ai/

[2] https://promptoctopus.com