Show HN: Curator – an open-source library for synthetic data generation

13 points by madiator 4 months ago
Synthetic data generation is an essential step in training and evaluating LLMs/Agents/RAG pipelines, but tooling around this is still lacking. We're introducing Curator, an open-source library designed to streamline the data curation process.

While there are many libraries to prompt LLMs, the semantics of generating synthetic data are different from prompting. For example, we need to process a large number of prompts (sometimes millions or more) while accepting some failures, utilize several stages of prompting, incorporate human feedback, and filter out bad data using verifiers and heuristics.

Curator addresses these challenges:
1. It supports efficient data generation with several API providers and local models.
2. It recovers from failures and caches previous output.
3. It uses structured outputs to enable programming complex data generation pipelines.
4. It lets you visualize your data generation in real time.

We are working on many more features (such as adding verifiers, diversity and data quality indicators, calling external tools to generate data, etc.). We hope to help the community create high-quality datasets to train great bespoke models!
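To make the "accept some failures, cache previous output" idea concrete, here is a minimal Python sketch. It is not Curator's API: `generate_batch`, `cache_key`, and `make_fake_generator` are hypothetical names made up for illustration, and the in-memory dict stands in for whatever persistent cache a real pipeline would use.

```python
import hashlib
from typing import Callable, Dict, List, Optional, Tuple


def cache_key(prompt: str) -> str:
    """Stable key for a prompt, so reruns can reuse earlier outputs."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def generate_batch(
    prompts: List[str],
    generate: Callable[[str], dict],
    cache: Dict[str, dict],
    max_retries: int = 2,
) -> List[Optional[dict]]:
    """Run `generate` over many prompts, tolerating individual failures.

    Cached prompts are not regenerated. A prompt that still fails after
    all retries yields None instead of aborting the whole batch.
    """
    results: List[Optional[dict]] = []
    for prompt in prompts:
        key = cache_key(prompt)
        if key in cache:
            results.append(cache[key])
            continue
        output: Optional[dict] = None
        for _ in range(max_retries + 1):
            try:
                output = generate(prompt)
                break
            except Exception:
                continue  # retry; give up once attempts are exhausted
        if output is not None:
            cache[key] = output
        results.append(output)
    return results


def make_fake_generator() -> Tuple[Callable[[str], dict], Dict[str, int]]:
    """Hypothetical stand-in for an LLM call; fails on prompts containing 'bad'."""
    calls = {"n": 0}

    def generate(prompt: str) -> dict:
        calls["n"] += 1
        if "bad" in prompt:
            raise RuntimeError("simulated provider failure")
        # Structured (dict) output rather than free text.
        return {"prompt": prompt, "answer": prompt.upper()}

    return generate, calls
```

Calling `generate_batch` a second time with the same `cache` skips every prompt that already succeeded, which is the behavior you want when a long run is interrupted partway through.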

2 comments

trungtvu 4 months ago
Hey, one of the creators of the library here! Would love to hear your feedback on our library :)
overu589 4 months ago
How is this not LSD for LLMs?