Ask HN: Best practice for deploying LLM API with streaming

1 point by wonderfuly 4 months ago
In LLM applications, a common pattern is: the browser sends a request to the application's backend API, the backend calls the LLM provider's API (e.g. OpenAI), and streams the response back to the browser.

I've noticed this brings new challenges to deployment that not many people are talking about: a streaming response can sometimes last several minutes (especially with reasoning models), which is quite different from traditional API requests that complete in a few seconds. At the same time, we don't want ongoing requests to be interrupted when deploying a new version.

How did you guys do it?
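
To make the setup concrete, here's a rough sketch of the proxy plus one possible drain-on-deploy approach, in TypeScript on plain Node. It assumes Node 18+ (global fetch) and the standard OpenAI chat completions endpoint; the route, port, and model name are just placeholders, and the SIGTERM handling at the bottom is one idea, not a recommendation:

  // streaming-proxy.ts — rough sketch, not production code.
  // Assumes Node 18+ (global fetch) and OPENAI_API_KEY set in the environment.
  import http from "node:http";

  let inFlight = 0;     // streams currently being relayed to browsers
  let draining = false; // flipped once SIGTERM arrives during a deploy

  const server = http.createServer(async (req, res) => {
    if (req.method !== "POST" || req.url !== "/api/chat") {
      res.writeHead(404).end();
      return;
    }
    if (draining) {
      // Refuse new work while shutting down; the new version picks it up.
      res.writeHead(503, { "Retry-After": "1" }).end();
      return;
    }

    inFlight++;
    try {
      // Forward to the LLM API with streaming enabled.
      // (Reading the browser's request body is omitted for brevity.)
      const upstream = await fetch("https://api.openai.com/v1/chat/completions", {
        method: "POST",
        headers: {
          Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
          "Content-Type": "application/json",
        },
        body: JSON.stringify({
          model: "gpt-4o-mini", // placeholder model
          stream: true,
          messages: [{ role: "user", content: "Hello" }],
        }),
      });

      // Relay the SSE bytes to the browser as they arrive.
      res.writeHead(upstream.status, { "Content-Type": "text/event-stream" });
      for await (const chunk of upstream.body as any) {
        res.write(chunk);
      }
    } finally {
      res.end();
      inFlight--;
    }
  });

  server.listen(3000);

  // On deploy: stop accepting connections, let in-flight streams finish, then exit.
  // Only helps if the platform's termination grace period outlasts the longest stream.
  process.on("SIGTERM", () => {
    draining = true;
    server.close(); // no new connections
    const timer = setInterval(() => {
      if (inFlight === 0) {
        clearInterval(timer);
        process.exit(0);
      }
    }, 1000);
  });

The catch is that last block: draining only works if the platform actually waits that long before killing the old process, and with streams that run for several minutes I'm not sure most setups do.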

no comments
