Synthetic data generation is an essential step in training and evaluating LLMs, agents, and RAG pipelines, but the tooling around it is still lacking. We're introducing Curator, an open-source library designed to streamline the data curation process.<p>While there are many libraries for prompting LLMs, generating synthetic data has different requirements from one-off prompting. For example, we need to process a large number of prompts (sometimes millions or more) while tolerating some failures, chain several stages of prompting, incorporate human feedback, and filter out bad data using verifiers and heuristics.<p>Curator addresses these challenges:
1. Generates data efficiently across several API providers and local models.
2. Recovers from failures and caches previous outputs.
3. Uses structured outputs to make complex data generation pipelines programmable.
4. Visualizes your data generation in real time.<p>We are working on many more features (such as verifiers, diversity and data quality indicators, and calling external tools to generate data). We hope to help the community create high-quality datasets to train great bespoke models!
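<p>To make the failure-recovery, caching, and structured-output ideas above concrete, here is a minimal sketch in plain Python. It does not use Curator's API and is not how Curator is implemented. Every name in it (`generate_all`, `cache_key`, `flaky_model`, the `"answer"` field) is a hypothetical stand-in, and the "model" is a local stub that fails on purpose so the retry path is exercised.

```python
import hashlib
import json


def cache_key(prompt: str) -> str:
    """Stable key for a prompt, so a rerun can reuse earlier outputs."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def generate_all(prompts, call_model, cache, max_retries=2):
    """Run call_model over prompts, reusing cached outputs and tolerating failures.

    Returns (results, failed): results maps prompt -> parsed dict,
    failed lists the prompts that never produced valid structured output.
    """
    results, failed = {}, []
    for prompt in prompts:
        key = cache_key(prompt)
        if key in cache:  # cached from an earlier run: skip the model call
            results[prompt] = cache[key]
            continue
        for _ in range(max_retries + 1):
            try:
                parsed = json.loads(call_model(prompt))
                # Structured-output check (assumed schema): an object with "answer".
                if not isinstance(parsed, dict) or "answer" not in parsed:
                    raise ValueError("missing 'answer' field")
            except (RuntimeError, ValueError):
                continue  # provider error or malformed output: try again
            cache[key] = parsed
            results[prompt] = parsed
            break
        else:
            failed.append(prompt)  # retries exhausted: record it and move on
    return results, failed


# Hypothetical stub standing in for a real provider call.
calls = {}


def flaky_model(prompt: str) -> str:
    calls[prompt] = calls.get(prompt, 0) + 1
    if prompt == "always fails":
        raise RuntimeError("provider error")
    if calls[prompt] == 1:
        return "not json"  # first attempt returns malformed output
    return json.dumps({"answer": prompt.upper()})


cache = {}
results, failed = generate_all(["a", "always fails"], flaky_model, cache)
```

The design choice worth noting is that one failed prompt is recorded and skipped rather than aborting the batch, and a second call with the same `cache` does not hit the model again for prompts that already succeeded.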