Hi HN ! I'm Alex, a tech enthusiast. I have an idea that I can't test and that concerns an area in which I am not an expert. I am making this post to find out to what extent this idea is relevant to the state of the art.<p>From what little I know, raw user inputs are not directly submitted to LLMs. Typically, user input is carefully wrapped in a special format before being sent to the LLM. The format usually has tags, including special tags to tell the AI, for example, which topic is prohibited.<p>As with SQL injection, an attacker can craft malicious user input by introducing special tags. Input sanitization can be seen as a solution, but it seems that it isn't enough. Anyway, it doesn't seem very intuitive, I think a document intended to be read by an LLM should also be very human-readable. I also wonder what happens when an attacker uses obscure Unicode characters to forge a string that looks like a special tag.<p>Instead of using an XML-like language, my idea is to use a format that seamlessly interweave human-readable structured data with prose within a single document. Also, the format must natively support indentation to remove the need for input sanitization, thereby eliminating an entire class of injection attacks.<p>I am the author of Braq, a data format that seems to be a good candidate.<p>The idea to better structure a prompt is described in this Markdown section: https://github.com/pyrustic/braq?tab=readme-ov-file#ai-prompts<p>And here, ChatML from OpenAI: https://news.ycombinator.com/item?id=34988748<p>As mentioned above, I can't test this idea. Therefore, I'm asking to you: Can we solve AI prompt injection attacks with an indented data format ?