
Show HN: PromptTools – open-source tools for evaluating LLMs and vector DBs

211 points by krawfy, almost 2 years ago
Hey HN! We're Kevin and Steve. We're building PromptTools (https://github.com/hegelai/prompttools): open-source, self-hostable tools for experimenting with, testing, and evaluating LLMs, vector databases, and prompts.

Evaluating prompts, LLMs, and vector databases is a painful, time-consuming, but necessary part of the product engineering process. Our tools let engineers do it in a lot less time.

By "evaluating" we mean checking the quality of a model's response for a given use case, which is a combination of testing and benchmarking. As examples:

- For generated JSON, SQL, or Python, you can check that the output is actually JSON, SQL, or executable Python.
- For generated emails, you can use another model to assess the quality of the generated email against some requirements, like whether or not the email is written professionally.
- For a question-answering chatbot, you can check that the actual answer is semantically similar to an expected answer.

At Google, Steve worked with HuggingFace and Lightning to support running the newest open-source models on TPUs. He realized that while the open-source community was contributing incredibly powerful models, it wasn't so easy to discover and evaluate them. It wasn't clear when you could use Llama or Falcon instead of GPT-4. We began looking for ways to simplify and scale this evaluation process.

With PromptTools, you can write a short Python script (as short as 5 lines) to run such checks across models, parameters, and prompts, and pass the results into an evaluation function to get scores. All of this can be executed on your local machine without sending data to third parties. Then we help you turn those experiments into unit tests and CI/CD that track your model's performance over time.

Today we support all of the major model providers, including OpenAI, Anthropic, Google, HuggingFace, and even LlamaCpp, as well as vector databases like ChromaDB and Weaviate. You can evaluate responses via semantic similarity, auto-evaluation by a language model, or structured output validations like JSON and Python. We even have a notebook UI for recording manual feedback.

Quickstart:

    pip install prompttools
    git clone https://github.com/hegelai/prompttools.git
    cd prompttools && jupyter notebook examples/notebooks/OpenAIChatExperiment.ipynb

For detailed instructions, see our documentation at https://prompttools.readthedocs.io/en/latest/.

We also have a playground UI, built in Streamlit, which is currently in beta: https://github.com/hegelai/prompttools/tree/main/prompttools/playground. Launch it with:

    pip install prompttools
    git clone https://github.com/hegelai/prompttools.git
    cd prompttools && streamlit run prompttools/ui/playground.py

We'd love it if you tried our product out and let us know what you think! We just got started a month ago, and we're eager to get feedback and keep building.
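To make the structured-output checks above concrete, here is a minimal, library-agnostic Python sketch of the kind of evaluation functions a batch of model responses could be passed through. The function names and the scoring helper are illustrative assumptions, not PromptTools' actual API; see the documentation linked in the post for the real interface.

    # Illustrative sketch only -- not PromptTools' API. Shows the kind of
    # structured-output checks described in the post, using only the stdlib.
    import ast
    import json


    def is_valid_json(output: str) -> bool:
        """Return True if the model output parses as JSON."""
        try:
            json.loads(output)
            return True
        except json.JSONDecodeError:
            return False


    def is_valid_python(output: str) -> bool:
        """Return True if the model output is syntactically valid Python."""
        try:
            ast.parse(output)
            return True
        except SyntaxError:
            return False


    def pass_rate(responses, check):
        """Fraction of responses that pass a structural check."""
        return sum(check(r) for r in responses) / len(responses)


    # Example: score a small batch of generated outputs.
    generated = ['{"name": "Ada", "role": "engineer"}', "definitely not json"]
    print(pass_rate(generated, is_valid_json))  # 0.5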

10 comments

fatso784, almost 2 years ago
I like the support for vector DBs and LLaMA-2. I'm curious about what influences shaped PromptTools and how it differs from other tools in this space. For context, we've also released a prompt-engineering IDE, ChainForge, which is open-source and has many of the features here, such as querying multiple models at once, prompt templating, evaluating responses with Python/JS code and LLM scorers, plotting responses, etc. (https://github.com/ianarawjo/ChainForge, with a playground at http://chainforge.ai).

One big problem we're seeing in this space is over-trust in LLM scorers as 'evaluators'. I've personally seen that minor tweaks to a scoring prompt can sometimes produce vastly different evaluation 'results'. Given recent debacles (https://news.ycombinator.com/item?id=36370685), I'm wondering how we can design LLMOps tools for evaluation that both support the use of LLMs as scorers and caution users about their results. Are you thinking about this question as well, or have you seen usability testing that points to over-trust in 'auto-evaluators' as an emerging problem?
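To illustrate the scorer-sensitivity concern raised here, the following is a rough sketch of how one might measure how often two slightly different scoring prompts agree on the same responses. It is not code from ChainForge or PromptTools: `call_llm` is a hypothetical placeholder for whatever chat-completion client you use, and the two scoring prompts are made up.

    # Hypothetical sketch: quantify how sensitive an LLM "scorer" is to small
    # wording changes in the scoring prompt. `call_llm` is a placeholder, not a
    # real API from either tool.
    SCORING_PROMPTS = [
        "Rate this email from 1 to 5 for professionalism. Reply with only the number.\n\n{email}",
        "On a scale of 1-5, how professional is this email? Answer with a single digit.\n\n{email}",
    ]


    def call_llm(prompt: str) -> str:
        """Placeholder: send `prompt` to your model of choice and return its reply."""
        raise NotImplementedError


    def scorer_agreement(emails):
        """Fraction of emails on which both scoring-prompt variants give the same score."""
        agree = 0
        for email in emails:
            scores = [call_llm(p.format(email=email)).strip() for p in SCORING_PROMPTS]
            agree += scores[0] == scores[1]
        return agree / len(emails)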
robszumski, almost 2 years ago
I'll put in a friendly request for a Dockerfile in the repo.

I've been trying out AI tools as test cases for our supply-chain security platform and had to cobble a Dockerfile together to get this running easily. Really cool tool overall!

Across 200+ transitive dependencies in prompttools, risk prioritization can remove 97% of the security investigation in my quick test, and most of these came from a thick base image. I'd love one curated by y'all.
catlover76, almost 2 years ago
Super cool. The need for tooling like this is something one realizes pretty quickly when starting to build apps that leverage LLMs.
esafak, almost 2 years ago
I'd like to see support for qdrant.
politelemon, almost 2 years ago
A similar tool I was about to look at: https://github.com/promptfoo/promptfoo

I've seen this in both tools but wasn't able to understand it: in the screenshot with feedback, I see thumbs-up and thumbs-down options. Where do those values go, and what's their purpose? Do they get preserved across runs? It's just not clicking in my head.
neelm, almost 2 years ago
Something like this is going to be needed to evaluate models effectively. Evaluation should be integrated into automated pipelines/workflows that can scale across models and datasets.
mmaia, almost 2 years ago
I like that it's not limited to single prompts and lets you use chat messages. It would be great if `OpenAIChatExperiment` could also handle OpenAI's function calling.
tikkun, almost 2 years ago
This looks great, thanks.

See also this related tool: https://news.ycombinator.com/item?id=36907074
8awake, almost 2 years ago
Great work! We will make use of that with https://www.formula8.ai
pk19238, almost 2 years ago
This is super cool man!