We are the developers of Phoenix, which we released in April of this year with the goal of bringing LLM observability to the notebook. In the time since, the growth of LLM frameworks and increasingly complex agent workflows has led us to add support for LLM spans and traces and to introduce a simple Eval harness for testing the data from those spans.<p>The latest Traces & Spans release of Phoenix offers:
-Out-of-the-box tracing for LlamaIndex and LangChain (see the sketch at the end of this post)
-Fully local execution: no data is sent anywhere outside of your own LLM calls
-A common dataframe format across frameworks that you can pull back into a notebook for Evals
-A code-based LLM Eval harness: light, simple, and fast
-Benchmarking scripts for retrieval setup: chunk size, K, and retrieval approach<p><a href="https://github.com/Arize-ai/phoenix">https://github.com/Arize-ai/phoenix</a><p>We'd love to hear more from the community about what kinds of LLM applications you are building, whether you are using a framework or building from scratch, and how you are running/measuring LLM Evals today. We're thinking a lot about what “non-framework” integrations should look like.<p>We're also genuinely interested in people's opinions on LLM spans and traces versus OTEL. Is the divergence due to something intrinsic to LLM applications, or are we as a community reinventing the wheel?<p>Colab if you want to test it out:
<a href="https://colab.research.google.com/github/Arize-ai/phoenix/blob/main/tutorials/tracing/llama_index_tracing_tutorial.ipynb" rel="nofollow noreferrer">https://colab.research.google.com/github/Arize-ai/phoenix/bl...</a>
Hey, Michael here, CTO of Arize and minor contributor to Phoenix.<p>The ability to trace a piece of software makes a night-and-day difference for understanding it. Once you know it's possible to have this level of visibility into what your code is doing, it's impossible to go back. I'm really excited to see what people learn by applying Phoenix Traces to their applications.<p>I've spent a ton of time debating the data model of the span information we collect, as well as many different instrumentation options. We started with the LlamaIndex and LangChain callback systems as the hook for instrumentation, since those frameworks are a common way for developers to get started with LLMs. We will add support for custom instrumentation, allowing users to manage the creation of spans themselves and avoid framework lock-in. When it comes to custom instrumentation, I'm curious where the community lands: are people planning to use OTEL for this, or is the expectation that LLM spans are different enough to warrant a different approach?
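To make the OTEL question concrete, here is roughly what hand-instrumenting an LLM pipeline with vanilla OpenTelemetry looks like. The span and attribute names (llm.model, retrieval.document_count, etc.) are invented for illustration, since OTEL has no semantic convention for LLM spans, and retrieve_documents/call_llm stand in for your own functions:

    from opentelemetry import trace

    tracer = trace.get_tracer("my-llm-app")  # hypothetical app name

    def answer(question: str) -> str:
        # The retrieval step and the LLM call each get their own span.
        with tracer.start_as_current_span("retrieve") as span:
            docs = retrieve_documents(question)  # your retrieval function
            span.set_attribute("retrieval.document_count", len(docs))

        with tracer.start_as_current_span("llm_call") as span:
            # Invented attribute keys; agreeing on keys like these is
            # exactly the open question between OTEL and LLM-specific tracing.
            span.set_attribute("llm.model", "gpt-4")
            response = call_llm(question, docs)  # your LLM client
            span.set_attribute("llm.completion", response)
            return response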