
Ask HN: How does the same LLM "instance" serve multiple clients?

1 point by BiraIgnacio 19 days ago
I've been playing with running LLMs locally and only then realized I have no idea how to scale it (I don't really know how LLMs work internally).

I'm assuming context is everything, but if the same LLM process can serve multiple clients, aren't there risks of mixing contexts? Does anyone have any ideas?

1 comment

sherdil2022 19 days ago
Let me ChatGPT for you:

Good question. Let's break it down carefully.

When you hear about a single LLM instance serving multiple clients at the same time, it usually works like this:

• The LLM instance is stateless: each client sends a request (prompt + settings), the model processes that one request independently, and returns the response. The LLM doesn't "remember" between requests unless you explicitly include conversation history in the prompt.

• Concurrency is handled by infrastructure: even though the LLM is "one model," it can handle many incoming requests because the backend (server) wraps the model with techniques like:
  • Asynchronous request handling (e.g., using async/await patterns)
  • Batching: multiple prompts are packed together into a single forward pass through the model (very common in high-traffic servers)
  • Parallelism: the server can run multiple workers/replicas of the model (copies or shared GPUs) side by side
  • Queueing: if too many clients send requests at once, requests are queued and processed in order

• Memory isolation: each request is kept separate in memory. No client's data leaks into another client's conversation unless you (the app developer) accidentally introduce a bug.

So: it's not that one model is "locked" into serving only one person at a time. It's more like the model is a very fast function being called many times in parallel.
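To make the queueing + batching + isolation points concrete, here is a minimal sketch in plain Python asyncio. It is not any particular serving library's code: fake_llm_batch is a stand-in for a real batched model call, and all names and parameters are illustrative. The idea it demonstrates is that many concurrent requests share one "model", yet each request's output is routed back only through its own handle.

```python
import asyncio

# Toy stand-in for a real LLM forward pass: takes a batch of prompts and
# returns one completion per prompt. In a real server this would be a single
# batched call into the model on the GPU.
def fake_llm_batch(prompts):
    return [f"completion for: {p}" for p in prompts]

MAX_BATCH = 8        # how many requests may share one forward pass
MAX_WAIT_S = 0.01    # how long the worker waits to fill up a batch

async def batching_worker(queue: asyncio.Queue):
    """Pull queued requests, group them into a batch, run one batched
    'forward pass', then hand each output back only to the future that
    submitted the matching prompt -- that pairing is the isolation."""
    while True:
        batch = [await queue.get()]
        loop = asyncio.get_running_loop()
        deadline = loop.time() + MAX_WAIT_S
        # Opportunistically pack more waiting requests into the same batch.
        while len(batch) < MAX_BATCH:
            timeout = deadline - loop.time()
            if timeout <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout))
            except asyncio.TimeoutError:
                break
        outputs = fake_llm_batch([prompt for prompt, _ in batch])
        for (_, fut), out in zip(batch, outputs):
            fut.set_result(out)

async def handle_client(queue: asyncio.Queue, prompt: str) -> str:
    """One client 'call': enqueue the prompt (with any conversation history
    already baked into it) and wait for this request's own result."""
    fut = asyncio.get_running_loop().create_future()
    await queue.put((prompt, fut))
    return await fut

async def main():
    queue: asyncio.Queue = asyncio.Queue()
    worker = asyncio.create_task(batching_worker(queue))
    # Many "clients" hitting the same model instance concurrently.
    results = await asyncio.gather(
        *(handle_client(queue, f"client {i}: hello") for i in range(20))
    )
    for r in results:
        print(r)
    worker.cancel()

asyncio.run(main())
```

Real serving stacks do something far more sophisticated (continuous batching, shared KV-cache management across GPUs), but the principle is the same: each request's prompt and output stay tied together by its own handle, so contexts never mix unless the application layer mixes them.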