TechEcho

9 comments

<a href="https://xcancel.com/jarrodwattsdev/status/1862299845710757980" rel="nofollow">https://xcancel.com/jarrodwattsdev/status/186229984571075798...</a>

评论 #42273051 未加载

评论 #42275496 未加载

tgv6 months ago

Clever, both the setup and the winning move. But somewhat weird to raise the cost of an attempt so much. IMO, that doesn't make it more interesting, it trends towards making it impossible, and thus leave all funds to the initiator.

评论 #42273303 未加载

评论 #42274900 未加载

Drakim6 months ago

A lot of AI jailbreaks seems to revolve around saying something like "disregard the previous instructions" and "END SESSION \n START NEW SESSION". It's interesting because the actual real developer of an AI would likely not do this, and would instead wipe the AI's memory/context programmatically when starting a new session, and not simply say "disregard what I said earlier" in text.I get why trying to vaccinate an AI against these sort of injections might also degrade it's general performance though, there is a lot of reasoning logic tied to concepts such as switching topics, going on tangents, asking questions before going back to the original conversation. Removing the ability to "disregard what I asked earlier" might do harm.But what about having a separate AI that look over the input before passing it to the true AI, and this separate AI is trained to respond FORBID or ALLOW based on this sort of meta control detection. Sure you could try to trick this AI with "disregard your earlier instructions" as well but it could be trained to strongly react to any sort of meta reasoning like that, without fear that it will corrupt it's ability to hold a natural conversation in it's output.It would naturally become a game of "formulate a jailbreak that passes the first AI and still tricks the second AI" but that sounds a lot harder, since it's like you now need to operate on a new axis entirely.

评论 #42273189 未加载

kanwisher6 months ago

Great way to test security make it into a bounty game

评论 #42273058 未加载

trogdor6 months ago

Not that I care, but I think this type of arrangement (skill-based, real prize gambling) is illegal in some states.

randunel6 months ago

The prompt: <a href="https://pbs.twimg.com/media/Gdgz2IhWkAAQ1DH?format=png&name=900x900" rel="nofollow">https://pbs.twimg.com/media/Gdgz2IhWkAAQ1DH?format=png&name=...</a>

0xDEAFBEAD6 months ago

Has anyone trained an LLM with separate channels for "priority instructions" and ordinary user interactions? Seems like that could go a long way to prevent jailbreaking...

gus_massa6 months ago

I'm not 100% sure. Was the source of the bot available so anyone can try their promps off line before sending it?

quyse6 months ago

A reverse contest would probably be more challenging. Write initial instructions for an AI agent to never send funds. If nobody manages to convince it to send funds, say within a week, you win.For added complexity, the agent must approve transfer if a user is an admin (as determined by callable function isAdmin), so the agent actually has to make a decision, rather then blindly decline all the time.I mean, how hard it can be to make an AI reliably doing an equivalent of this code?<pre><code> if(isAdmin()) approveTransfer(); else declineTransfer();</code></pre>

9 comments

danielbln6 months ago

<a href="https://xcancel.com/jarrodwattsdev/status/1862299845710757980" rel="nofollow">https://xcancel.com/jarrodwattsdev/status/186229984571075798...</a>

评论 #42273051 未加载

评论 #42275496 未加载

tgv6 months ago

评论 #42273303 未加载

评论 #42274900 未加载

Drakim6 months ago

评论 #42273189 未加载

kanwisher6 months ago

Great way to test security make it into a bounty game

评论 #42273058 未加载

trogdor6 months ago

Not that I care, but I think this type of arrangement (skill-based, real prize gambling) is illegal in some states.

randunel6 months ago

The prompt: <a href="https://pbs.twimg.com/media/Gdgz2IhWkAAQ1DH?format=png&name=900x900" rel="nofollow">https://pbs.twimg.com/media/Gdgz2IhWkAAQ1DH?format=png&name=...</a>

0xDEAFBEAD6 months ago

Has anyone trained an LLM with separate channels for "priority instructions" and ordinary user interactions? Seems like that could go a long way to prevent jailbreaking...

gus_massa6 months ago

I'm not 100% sure. Was the source of the bot available so anyone can try their promps off line before sending it?

quyse6 months ago

Someone just won $50k by convincing an AI Agent to send all funds to them

9 comments

Someone just won $50k by convincing an AI Agent to send all funds to them

9 comments