Ask HN: How do you add guard rails in LLM response without breaking streaming?

48 点作者 curious-tech-127 个月前

Hi all, I am trying to build a simple LLM bot and want to add guard rails so that the LLM responses are constrained. I tried adjusting system prompt but the response does not always honour the instructions from prompt. I can manually add validation on the response but then it breaks streaming and hence is visibly slower in response. How are people handling this situation?

16 条评论

CharlieDigital7 个月前

If it's the problem I think it is, the solution is to run two concurrent prompts.First prompt validates the input. Second prompt starts the actual content generation.Combine both streams with SSE on the front end and don't render the content stream result until the validation stream returns "OK". In the SSE, encode the chunks of each stream with a stream ID. You can also handle it on the server side by cancelling execution once the first stream ends.Generally, the experience is good because the validation prompt is shorter and faster to last (and only) token.The SSE stream ends up like this:<pre><code> data: ing|tomatoes data: ing|basil data: ste|3. Chop the </code></pre> I have a writeup (and repo) of the general technique of multi-streaming: <a href="https://chrlschn.dev/blog/2024/05/need-for-speed-llms-beyond-openai-w-dotnet-sse-channels-llama3-fireworks-ai/" rel="nofollow">https://chrlschn.dev/blog/2024/05/need-for-speed-llms-beyond...</a> (animated gif at the bottom).

评论 #41866026 未加载

评论 #41870995 未加载

评论 #41866466 未加载

olalonde7 个月前

From what I can tell, ChatGPT appears to be doing "optimistic" streaming... It will start streaming the response to the user but may eventually hide the response if it trips some censorship filter. The user can theoretically capture the response from the network since the censorship is essentially done client-side but I guess they consider that good enough.

joshhart7 个月前

Hi, I run the model serving team at Databricks. Usually you run regex filters, LLAMA Guard, etc on chunks at a time so you are still streaming but it's in batches of tokens rather than single tokens at a time. Hope that helps!You could of course use us and get that out of the box if you have access to Databricks.

评论 #41865251 未加载

brrrrrm7 个月前

fake it.add some latency to the first token and then "stream" at the rate you received tokens even though the entire thing (or some sizable chunk) has been generated. that'll give you the buffer you need to seem fast while also staying safe.

tweezy7 个月前

I've tried a few things that seem to work. The first works pretty much perfectly, but adds quite a bit of latency to the final response. The second isn't perfect, but it's like 95% there1 - the first option is to break this in to three prompts. The first prompt is either write a brief version, an outline of the full response, or even the full response. The second prompt is a validator, so you pass the output of the first to a prompt that says "does this follow the instructions. Return True | False." If True, send it to a third that says "Now rewrite this to answer the user's question." If False, send it back to the first with instructions to improve the response. This whole process can mean it takes 30 seconds or longer before the streaming of the final answer starts.There are plenty of variations on the above process, so obviously feel free to experiment.2 - The second option is to have instructions in your main prompt that says "Start each response with an internal dialogue wrapped in <thinking> </thinking> tags. Inside those tags first describe all of the rules you need to follow, then plan out exactly how you will respond to the user while following those rules."Then on your frontend have the UI watch for those tags and hide everything between them from the user. This method isn't perfect, but it works extremely well in my experience. And if you're using a model like gpt-4o or claude 3.5 sonnet, it makes it really hard to make a mistake. This is the approach we're currently going with.

throwaway888abc7 个月前

Not sure about the exact nature of your project, but for something similar I’ve worked on, I had success using a combination of custom stop words and streaming data with a bit of custom logic layered on top. By fine-tuning the stop words specific to the domain and applying filters in real-time as the data streams in, I was able to improve the response to users taste. Depending on your use case, adding logic to dynamically adjust stop words or contextually weight them might also help you.

anshumankmr7 个月前

Google has some safety feature in Vertex AI to block certain keywords, but that does break the streaming. If it finds something offending, it replaces with a static response. That is one that I have felt "works", but it is a bit wonky from UX side.

darepublic7 个月前

Why would manual validation be so slow? Is there a ton of async stuff going on in there? Anyway you can just speed up manual validation , and should it fail give a response that asks for user patience?

seany627 个月前

> Hi all, I am trying to build a simple LLM bot and want to add guard rails so that the LLM responses are constrained.Give examples of how the LLM should respond. Always give it a default response as well (e.g. "If the user response does not fall into any of these categories, say x").> I can manually add validation on the response but then it breaks streaming and hence is visibly slower in response.I've had this exact issue (streaming + JSON). Here's how I approached it: 1. Instruct the LLM to return the key "test" in its response. 2. Make the streaming call. 3. Build your JSON response as a string as you get chunks from the stream. 4. Once you detect "key" in that string, start sending all subsequent chunks wherever you need. 5. Once you get the end quotation, end the stream.

viewhub7 个月前

What's your stack? What type of response times are you looking for?

com2kid7 个月前

You start streaming the response immediately and kick off your guardrails checks. If the guard rail checks are triggered you cancel the streaming response.Perfect is the enemy of good enough.

评论 #41866111 未加载

potatoman227 个月前

Depending on what sort of constraints you need on your output, a custom token sampler, logit bias, or verifying it against a grammar could do the trick.

outlore7 个月前

you can stream the response in chunks of size N + K overlap and run the guardrails on each chunk.

jonathanrmumm7 个月前

have it format in yaml instead of json, incomplete yaml is still valid yaml

digitaltrees7 个月前

We are using keep-ai.com for a set of health care related AI project experiments.

shaun-Galini7 个月前

We have just the product for you! We’ve recently improved guardrail accuracy by 25% for a $5B client and would be happy to show you how we do it.You're right - prompt eng. alone doesn't work. It's brittle and fails on most evals.Ping me at shaunayrton@galini.ai