Hi HN,

I built this because I'm tuning a bunch of prompts and didn't have a good way to do it systematically.

This CLI tool helps you pick the best prompt and model by letting you configure multiple prompts and variables. It shows "before" and "after" outputs so you can compare LLM responses side-by-side and see whether a prompt change has improved the quality of each example.

Example use cases:

- Deciding whether it's worth using GPT-4 over GPT-3.5

- Evaluating quality improvements to your prompt across a large range of examples

- Catching regressions in edge cases as you iterate on your prompt

It supports a handful of useful output formats (console, HTML table view, CSV, JSON, YAML), so you can integrate it into your workflow as needed. It can also be used as a library rather than a CLI.

I'm interested in hearing your thoughts and suggestions on how to improve this tool further. Thanks!
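For a quick sense of the workflow, a run that compares two models over the same prompts and test variables looks roughly like this (simplified sketch; the exact flag names are in the repo's README):

    # Simplified example run; exact flag names are in the README.
    # Compares two models across a prompt file and a set of test variables,
    # writing the side-by-side results to an HTML file.
    npx promptfoo eval \
      --prompts prompts.txt \
      --vars vars.csv \
      --providers openai:gpt-3.5-turbo openai:gpt-4 \
      --output results.html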
This is a good idea. I used Gradio and Streamlit to list outputs from different models and check them manually, but a CLI makes more sense for running many use cases and evaluating them.

There are a lot of steps to run, so I would suggest:

1. Create a config file (YAML or JSON) to define the prompts, variables, models, and output file (a rough sketch is at the end of this comment).

2. Create an init command that creates empty files with the required structure. For example, `promptfoo init` would output:

    config.yaml
    var.json
    prompts.json

Good luck!
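To make that concrete, the generated config.yaml could look something like this. The field names are only a guess based on the pieces above (prompts, variables, models, and an output file), not an existing format:

    # config.yaml: hypothetical structure that `promptfoo init` could generate
    prompts: prompts.json      # prompt templates to compare
    vars: var.json             # variables substituted into each prompt
    models:                    # models to evaluate against each other
      - gpt-3.5-turbo
      - gpt-4
    output: results.html       # console, html, csv, json, or yaml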