TE
TechEcho
Home24h TopNewestBestAskShowJobs
GitHubTwitter
Home

TechEcho

A tech news platform built with Next.js, providing global tech news and discussions.

GitHubTwitter

Home

HomeNewestBestAskShowJobs

Resources

HackerNews APIOriginal HackerNewsNext.js

© 2025 TechEcho. All rights reserved.

Promptwright: Generate large synthetic datasets using a local LLM

3 pointsby trickleup7 months ago

1 comment

trickleup7 months ago
Hey Folks!<p>Us folks over at Stacklok, needed a means to generate large synthetic datasets using a local LLM, over say OpenAI or a cloud service. So we built Promptwright, a Python library that lets you generate synthetic datasets using local models via Ollama<p>Why we built it:<p>* We were using OpenAI&#x27;s API for dataset generation, but the costs were getting expensive for large-scale experiments. * We looked at existing solutions like pluto, but they were only capable of running on OpenAI. This project started as a fork of [pluto](<a href="https:&#x2F;&#x2F;github.com&#x2F;redotvideo&#x2F;pluto">https:&#x2F;&#x2F;github.com&#x2F;redotvideo&#x2F;pluto</a>), but we soon started to extend and change it so much, it was practically new - still kudos to the redotvideo folks for the idea. * We wanted something that could run entirely locally and would means no concerns about leaking private information. * We wanted the flexibility of using any model we needed to.<p>What it does:<p>* Runs entirely on your local machine using Ollama (works great with llama2, mistral, etc.) * Super simple Python interface for dataset generation * Configurable instructions and system prompts * Outputs clean JSONL format that&#x27;s ready for training * Direct integration with Hugging Face Hub for sharing datasets<p>We&#x27;ve been using it internally for a few projects, and it&#x27;s been working well. You can process thousands of samples without the worry of API costs or rate limits. Plus, since everything runs locally, you don&#x27;t have to worry about sensitive data leaving your environment.<p>Checkout the examples&#x2F;* folder , for examples for generating code, scientific or creative writing<p>We&#x27;d love to get feedback from the community, if you&#x27;re doing any kind of synthetic data generation for ML, give it a try and let us know what you think!<p>GitHub: <a href="https:&#x2F;&#x2F;github.com&#x2F;StacklokLabs&#x2F;promptwright">https:&#x2F;&#x2F;github.com&#x2F;StacklokLabs&#x2F;promptwright</a>