科技回声

Show HN: Opik, an open source LLM evaluation framework

86 points | by calebkaiser | 8 months ago

Hey HN! I'm Caleb, one of the contributors to Opik, a new open source framework for LLM evaluations.

Over the last few months, my colleagues and I have been working on a project to solve what we see as the most painful parts of writing evals for an LLM application. For this initial release, we've focused on a few core features that we think are the most essential:

- Simplifying the implementation of more complex LLM-based evaluation metrics, like Hallucination and Moderation.

- Enabling step-by-step tracking, such that you can test and debug each individual component of your LLM application, even in more complex multi-agent architectures.

- Exposing an API for "model unit tests" (built on Pytest), to allow you to run evals as part of your CI/CD pipelines.

- Providing an easy UI for scoring, annotating, and versioning your logged LLM data, for further evaluation or training.

It's often hard to feel like you can trust an LLM application in production, not just because of the stochastic nature of the model, but because of the opaqueness of the application itself. Our belief is that with better tooling for evaluations, we can meaningfully improve this situation, and unlock a new wave of LLM applications.

You can run Opik locally, or with a free API key via our cloud platform. You can use it with any model server or hosted model, but we currently have a built-in integration with the OpenAI Python library, which means it automatically works not just with OpenAI models, but with any model served via a compatible model server (Ollama, vLLM, etc.). Opik also currently has out-of-the-box integrations with LangChain, LlamaIndex, Ragas, and a few other popular tools.

This is our initial release of Opik, so if you have any feedback or questions, I'd love to hear them!
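The "model unit tests" idea described above can be sketched in plain Pytest style. Note that this is a conceptual illustration only: the toy context-overlap metric below is a stand-in and is not Opik's actual Hallucination metric (which the post describes as LLM-based); all names here are hypothetical.

```python
# Conceptual sketch of an LLM "unit test" in the Pytest style the post
# describes. The metric is a toy grounding check: what fraction of the
# model output's tokens also appear in the retrieved context. A real
# hallucination metric (like Opik's) would use an LLM judge instead.

def context_overlap_score(output: str, context: str) -> float:
    """Return the fraction of output tokens that also occur in the context."""
    out_tokens = set(output.lower().split())
    ctx_tokens = set(context.lower().split())
    if not out_tokens:
        return 0.0
    return len(out_tokens & ctx_tokens) / len(out_tokens)


def test_summary_stays_grounded():
    # In CI, `output` would come from the model under test; here it is
    # hard-coded so the test is deterministic and runnable offline.
    context = "Opik is an open source framework for LLM evaluations."
    output = "Opik is an open source framework for evaluations."
    assert context_overlap_score(output, context) >= 0.9
```

Running a file like this under `pytest` in a CI pipeline gives a pass/fail gate on eval scores, which is the workflow the post's "model unit tests" feature targets.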

6 comments

itssteadyfreddy | 8 months ago

Ran through some colabs. Signed up for a key and tested the Ollama colab. Got a little error on cell 2, "ConnectError: [Errno 99] Cannot assign requested address", but the traces went through, which was fine. Just a little heads up.

I am using Arize Phoenix and trying to see the difference. Can you highlight it?
tcsizmadia | 8 months ago

It looks very promising! Congratulations, great tool! I can't wait to start experimenting with it. I plan to use it locally, with Ollama.
trolan | 8 months ago

I'm in a university course related to AI testing and quality assurance. This is something I'll definitely bring up and see how it can be used.

With OpenAI compatibility, I'm hoping it supports OpenRouter out of the box, which would mean it supports Anthropic and Google too, along with a host of open models hosted elsewhere.
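For context on that hope: OpenRouter exposes an OpenAI-compatible endpoint, so tooling built on the OpenAI Python client can typically target it just by overriding the base URL. A minimal sketch of the request shape follows; the model name and API key are placeholders, and the helper only builds the request rather than sending it.

```python
# Sketch of routing an OpenAI-style chat completion through OpenRouter.
# OpenRouter's API is OpenAI-compatible, so the same request shape works;
# only the base URL, key, and model identifier change.

OPENROUTER_BASE = "https://openrouter.ai/api/v1"


def chat_request(model: str, prompt: str, api_key: str) -> tuple[str, dict, dict]:
    """Return (url, headers, json_body) for an OpenAI-style chat completion."""
    url = f"{OPENROUTER_BASE}/chat/completions"
    headers = {"Authorization": f"Bearer {api_key}"}
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return url, headers, body


# Placeholder model and key; any model OpenRouter hosts can be named here.
url, headers, body = chat_request("anthropic/claude-3.5-sonnet", "Hello", "sk-...")
```

The same base-URL override is how OpenAI-client integrations generally pick up Ollama, vLLM, and other compatible servers.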
smcleod | 8 months ago

Looks interesting; great to see it specifically calls out supporting LLM servers as first-class citizens!

I see some of the code is Java, which strikes me as an interesting choice. Is there a reason behind that, or was it simply the language the devs were already familiar with?
hrpnk | 8 months ago

Is there a reason you didn't just implement OpenTelemetry (OTel) straight away? Curious about the trade-offs of opting for home-grown telemetry inspired by OTel instead.
yu3zhou4 | 8 months ago

Hello! How does it compare to DeepEval (open source)?