Hey Folks!<p>We at Stacklok needed a way to generate large synthetic datasets using a local LLM rather than OpenAI or another cloud service, so we built Promptwright, a Python library that lets you generate synthetic datasets using local models via Ollama.<p>Why we built it:<p>* We were using OpenAI's API for dataset generation, but the costs were adding up for large-scale experiments.
* We looked at existing solutions like pluto, but they only worked with OpenAI. This project started as a fork of [pluto](<a href="https://github.com/redotvideo/pluto">https://github.com/redotvideo/pluto</a>), but we extended and changed it so much that it became practically new - kudos to the redotvideo folks for the idea, all the same.
* We wanted something that could run entirely locally, with no concerns about leaking private information.
* We wanted the flexibility to use any model we needed.<p>What it does:<p>* Runs entirely on your local machine using Ollama (works great with llama2, mistral, etc.)
* Super simple Python interface for dataset generation
* Configurable instructions and system prompts
* Outputs clean JSONL format that's ready for training
* Direct integration with Hugging Face Hub for sharing datasets<p>We've been using it internally for a few projects, and it's been working well. You can process thousands of samples without worrying about API costs or rate limits, and since everything runs locally, sensitive data never leaves your environment.<p>Check out the examples/ folder for examples of generating code, scientific, or creative writing datasets.<p>We'd love feedback from the community. If you're doing any kind of synthetic data generation for ML, give it a try and let us know what you think!<p>GitHub: <a href="https://github.com/StacklokLabs/promptwright">https://github.com/StacklokLabs/promptwright</a>
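For anyone unfamiliar with JSONL: each line of the output file is a standalone JSON record, which is why it drops straight into most training pipelines. A minimal sketch of consuming such a file with the standard library (the field names here are illustrative, not Promptwright's exact schema):

```python
import json

# Hypothetical JSONL output: one JSON object per line. The
# "instruction"/"response" field names are assumed for illustration,
# not taken from Promptwright's actual output schema.
sample_lines = [
    '{"instruction": "Explain photosynthesis.", "response": "Plants convert light into chemical energy."}',
    '{"instruction": "Write a haiku about rain.", "response": "Soft drops on the roof..."}',
]

# Parse each line independently; a real file would be iterated with
# `for line in open("dataset.jsonl")`.
records = [json.loads(line) for line in sample_lines]

for rec in records:
    print(rec["instruction"])
```

Because each line parses on its own, you can stream or shard large generated datasets without loading the whole file into memory.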