Promptwright: Generate large synthetic datasets using a local LLM

3 points by trickleup 7 months ago

1 comment

trickleup 7 months ago
Hey folks!

We at Stacklok needed a way to generate large synthetic datasets using a local LLM rather than OpenAI or another cloud service. So we built Promptwright, a Python library that lets you generate synthetic datasets using local models via Ollama.

Why we built it:

* We were using OpenAI's API for dataset generation, but the costs were getting too high for large-scale experiments.
* We looked at existing solutions like pluto, but they only worked with OpenAI. This project started as a fork of [pluto](https://github.com/redotvideo/pluto), but we soon extended and changed it so much that it was practically new. Still, kudos to the redotvideo folks for the idea.
* We wanted something that could run entirely locally, which means no concerns about leaking private information.
* We wanted the flexibility to use any model we needed.

What it does:

* Runs entirely on your local machine using Ollama (works great with llama2, mistral, etc.)
* Super simple Python interface for dataset generation
* Configurable instructions and system prompts
* Outputs clean JSONL that's ready for training
* Direct integration with Hugging Face Hub for sharing datasets

We've been using it internally for a few projects, and it's been working well. You can process thousands of samples without worrying about API costs or rate limits. Plus, since everything runs locally, you don't have to worry about sensitive data leaving your environment.

Check out the examples/* folder for examples of generating code, scientific writing, or creative writing.

We'd love to get feedback from the community. If you're doing any kind of synthetic data generation for ML, give it a try and let us know what you think!

GitHub: https://github.com/StacklokLabs/promptwright
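
For anyone who wants a feel for the workflow before opening the repo, here is a rough sketch of the general idea: prompt a local model through Ollama and write the results as JSONL. This is not Promptwright's actual interface (see the examples/* folder for that). It calls Ollama's HTTP endpoint directly, and the helper names (`ollama_generate`, `build_dataset`, `write_jsonl`), the `prompt`/`completion` record layout, and the default model are assumptions made up for illustration.

```python
import json
import urllib.request
from typing import Callable, Dict, Iterable, List

# Default address of a locally running Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"


def ollama_generate(prompt: str, system: str = "", model: str = "mistral") -> str:
    """Send one non-streaming request to local Ollama and return the generated text."""
    body = {"model": model, "prompt": prompt, "stream": False}
    if system:
        # Only override the model's built-in system prompt when one is given.
        body["system"] = system
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return json.loads(reply.read())["response"]


def build_dataset(
    topics: Iterable[str],
    instruction: str,
    generate: Callable[[str, str], str] = ollama_generate,
    system: str = "",
) -> List[Dict[str, str]]:
    """Build one {"prompt", "completion"} record per topic (hypothetical layout)."""
    records = []
    for topic in topics:
        prompt = f"{instruction}\n\nTopic: {topic}"
        records.append({"prompt": prompt, "completion": generate(prompt, system)})
    return records


def write_jsonl(records: Iterable[Dict[str, str]], path: str) -> None:
    """Write one JSON object per line, the JSONL layout most training tools accept."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


# Example (needs a running Ollama server with the model already pulled):
#   rows = build_dataset(["binary search", "hash maps"],
#                        "Write a short Python exercise with its solution.",
#                        system="You are a careful programming tutor.")
#   write_jsonl(rows, "dataset.jsonl")
```

Passing the generator in as a parameter keeps the loop testable without a model running, and makes it easy to swap in a different local backend.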