Hi. With burgeoning AI, I don't particularly like the idea of my persona being unwittingly scraped into an AI corpus.<p>Would replacing characters with Unicode look-alikes help prevent AI from matching my text in a prompt? For example, changing "The quick brown fox" to "𝓣𝓱𝓮 𝓺𝓾𝓲𝓬𝓴 𝓫𝓻𝓸𝔀𝓷 𝓯𝓸𝔁" or "apple" to "ÁÞÞlé". Since the obfuscated strings tokenize differently, they wouldn't match in a prompt, correct? And although normalizing the strings back is possible, would it be practical to do at LLM-corpus scale?<p>Note that I'm not suggesting that an AI couldn't <i>produce</i> obfuscated Unicode; it can. This question is only about preventing one's text from aiding a corpus.
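On the scalability question: for the first class of example (mathematical-script letters), undoing the obfuscation is a single standard-library call, so it scales trivially as a corpus-cleaning pass. A minimal sketch using Python's `unicodedata`; the `deobfuscate` name is mine, and note it deliberately does not handle true homoglyph substitutions like Þ→p, which would need a lookup table such as Unicode's confusables data:

```python
import unicodedata

def deobfuscate(text: str) -> str:
    # NFKC folds "compatibility" characters, including the mathematical
    # script letters (e.g. 𝓣), back to plain ASCII letters.
    folded = unicodedata.normalize("NFKC", text)
    # NFKD then splits accented letters into base letter + combining mark;
    # dropping the marks (category Mn) turns "é" into "e".
    decomposed = unicodedata.normalize("NFKD", folded)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

print(deobfuscate("𝓣𝓱𝓮 𝓺𝓾𝓲𝓬𝓴 𝓫𝓻𝓸𝔀𝓷 𝓯𝓸𝔁"))  # The quick brown fox
print(deobfuscate("ÁÞÞlé"))  # AÞÞle (Þ is a distinct letter, not a decorated "p")
```

So the "apple"→"ÁÞÞlé" style survives plain normalization only where the substitute is a genuinely different letter; a scraper that also applies a confusables map would recover that too.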
I was working on foundation models for business, and we had done work on character embeddings that would counteract that back in 2017.<p>Pro tip: people whose ideas were worth stealing were worried about Google's web scraping, and about how unfair and exploitative the whole economy around it was, ten years ago. Suddenly the people whose ideas aren't worth stealing are up in arms about it.<p>Think more about having ideas that are worth stealing (i.e. <i>leading</i> the herd, not <i>following</i> it) than about getting your ideas stolen.