Well, maybe we could limit this by having a list of preset actions that the LLM can take and those actions can contain canned responses based on templates. This way we can make a chat bot with a LLM model that never sends its output to the user. For some applications, this might be enough, since you still get the amazing interpretation abilities of a LLM.
If I’m understanding correctly, the technique basically injects malicious instructions in the content that is stored and retrieved?<p>Sounds like an easy fix, if it’s possible to detect direct prompt injection attacks then the same techniques can be applied to the data staged for retrieval.
The headline got me, but the paper lost me.<p>Isn't this saying what most people already knew - user content should never be trusted?<p>These attacks are no different than old school SQL injection attacks when people didn't understand the importance of escaping. Even if a user can't do SQL injection directly, they can get data stored that's injects into some other system. Much harder to pull off, but the exact same concept.
I've managed a few "prompt injections", nearly all benign. It is funny to me that SEO garbage works on resume/CV AI.<p>I wonder how linked "organic search engine results polluted with SEO nonsense" and prompt injection are, as problems.<p>Google can hire me and i'll figure it out.
TLDR: With these vulnerabilities, we show the following is possible:<p>- Remote control of chat LLMs<p>- Persistent compromise across sessions<p>- Spread injections to other LLMs<p>- Compromising LLMs with tiny multi-stage payloads<p>- Leaking/exfiltrating user data<p>- Automated Social Engineering<p>- Targeting code completion engines<p>There is also a repo: <a href="https://github.com/greshake/llm-security">https://github.com/greshake/llm-security</a>
and another site demonstrating the vulnerability against Bing as a real-world example: <a href="https://greshake.github.io/" rel="nofollow">https://greshake.github.io/</a><p>These issues are not fixed or patched, and apply to most apps or integrations using LLMs. And there is currently no good way to protect against it.
We keep on having to relearn this principle over and over again: mixing instructions and data on the same channel leads to disaster. For example, phone phreaking were people were able to whistle into the phone and place long distance calls. SQL injection attacks. Buffer overflow code injections. And now LLM prompt injections.<p>We will probably end up with the equivalent of prepared LLM statements like we have for SQL that will separate out the instruction and data channels.
Didn't read through the whole thing yet, but this seems to be the key idea:<p>"With LLM-integrated applications, adversaries
could control the LLM, without direct access, by indirectly
injecting it with prompts placed within sources retrieved at
inference time."
My proposal for fixing indirect prompt injection:<p><a href="https://news.ycombinator.com/item?id=35929145" rel="nofollow">https://news.ycombinator.com/item?id=35929145</a>