In LLM applications, a common pattern is: the browser sends a request to the application's backend API, the backend calls the LLM provider's API (e.g. OpenAI), and streams the response back to the browser.

I've noticed this creates deployment challenges that not many people talk about: a streaming response can stay open for several minutes (especially with reasoning models), which is quite different from traditional API requests that complete in a few seconds. At the same time, we don't want in-flight requests to be interrupted when deploying a new version.

How do you guys handle this?
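To make the setup concrete, here's a minimal sketch of that pattern, assuming a FastAPI backend and the official OpenAI Python SDK; the endpoint path, model name, and request shape are just placeholders, not a specific production setup:

```python
# Minimal sketch of the browser -> backend -> LLM provider streaming pattern.
# Assumes FastAPI + the OpenAI Python SDK; names below are illustrative only.
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI
from pydantic import BaseModel

app = FastAPI()
client = OpenAI()  # reads OPENAI_API_KEY from the environment


class ChatRequest(BaseModel):
    prompt: str


@app.post("/api/chat")  # placeholder route
def chat(req: ChatRequest):
    def token_stream():
        # Upstream call to the LLM provider with streaming enabled.
        stream = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model
            messages=[{"role": "user", "content": req.prompt}],
            stream=True,
        )
        # Relay each chunk to the browser as it arrives.
        for chunk in stream:
            delta = chunk.choices[0].delta.content
            if delta:
                yield delta

    return StreamingResponse(token_stream(), media_type="text/plain")
```

The relevant detail for deployment is that this HTTP response stays open for the entire generation, which can be several minutes, so restarting the old backend process mid-stream cuts the connection.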