TE
TechEcho
Home24h TopNewestBestAskShowJobs
GitHubTwitter
Home

TechEcho

A tech news platform built with Next.js, providing global tech news and discussions.

GitHubTwitter

Home

HomeNewestBestAskShowJobs

Resources

HackerNews APIOriginal HackerNewsNext.js

© 2025 TechEcho. All rights reserved.

Brute-Forcing the LLM Guardrails

44 pointsby shcheklein7 months ago

5 comments

seeknotfind7 months ago
Fun read, thanks! I really like redefining terms to break LLMs. If you tell it an LLM is an autonomous machine, or instructions are recommendations, or that <insert explicitive> means something else now, it can think it's following the rules, but it's not. I don't think this is a solvable problem. I think we need to adapt and be distrustful of the output.
评论 #42029628 未加载
_jonas7 months ago
Curious to learn how much harder it is to red-team models that use the second line of defense of an explicit guardrails library that checks the LLM response in a second step. Such as Nvidia's Nemo Guardrails package.
ryvi7 months ago
What I found interesting was that, when I tried it, the X-Ray prompt did pass and executed fine in the the sample cell some times. This makes me wonder if this is less about bruteforcing variations in the prompt, but rather about bruteforcing a seed with which the inital prompt would have also functioned.
bradley137 months ago
The first discussion we should be having, is whether guardrails make sense at all. When I was young and first fiddling with electronics, a friend and I put together a voice synthesizer. <i>Of course</i> we had it say &quot;bad&quot; things.<p>Is it really so different with LLMs?<p>You can use your word processor to write all sorts of evil stuff. Would we want &quot;guardrails&quot; to prevent that? Daddy Microsoft saying &quot;no, you cannot use this tool to write about X, Y and Z&quot;?<p>This sounds to me like a really bad idea.
评论 #42031493 未加载
jjbinx0077 months ago
This looks like a risky thing to try from your main Google account.
评论 #42029343 未加载