I don’t understand how the watermarking is supposed to affect the LLMs being trained. Isn’t the watermarking some hidden data in the PDF file itself and the text contained in the PDF is the same for everyone? And aren’t the LLMs trained with the extracted plain text, not the PDF itself? How would the LLM training be different if there’s a watermark?
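For context on what would or wouldn't survive extraction: a purely file-level watermark (PDF metadata, object structure) is indeed stripped when the text is pulled out, but a watermark encoded in the text itself, e.g. as zero-width Unicode characters, rides along with the "plain text". A minimal hypothetical sketch of that kind of text-level marking (not the article's actual scheme, names are made up):<p><pre><code>
# Hypothetical sketch: a per-recipient ID encoded as zero-width Unicode
# characters woven into the visible text. Unlike PDF metadata, these
# characters survive plain-text extraction (as long as the extractor
# keeps Unicode), so they could end up in a training corpus.

ZW0, ZW1 = "\u200b", "\u200c"  # zero-width space / zero-width non-joiner

def embed(text: str, recipient_id: int) -> str:
    """Insert the recipient ID as invisible bits right after the first word."""
    bits = format(recipient_id, "016b")
    payload = "".join(ZW1 if b == "1" else ZW0 for b in bits)
    first_space = text.index(" ")
    return text[:first_space] + payload + text[first_space:]

def extract(text: str) -> int | None:
    """Recover the ID from whatever zero-width characters survived."""
    bits = "".join("1" if c == ZW1 else "0" for c in text if c in (ZW0, ZW1))
    return int(bits, 2) if bits else None

marked = embed("The quick brown fox jumps over the lazy dog.", 42)
print(marked == "The quick brown fox jumps over the lazy dog.")  # False, yet it looks identical
print(extract(marked))  # 42
</code></pre>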
I think using the term <i>backdoor</i> is completely wrong in this case.<p>You're not at risk when using the LLM; at most, the content producer may be able to tell that the LLM was trained on their data; maybe, potentially... even that is a stretch.<p>Your queries are not being relayed to them, and they don't have a <i>backdoor</i> into the LLM's content, algorithms or queries. They merely have a tainted marker that may show up in the output.<p>LLM providers can always claim they didn't get the tainted data from the source but from somewhere else, and that those are the ones you should go after; good luck proving the misdirection. I bet it's hard even for them to know exactly where <i>this exact output</i> came from, since it has probably been regurgitated 250,000 times.
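To make the "tainted marker" concrete: all the producer can really do is prompt the model and look for their planted canary in the completions, something like the hypothetical sketch below (the canary phrase, function names, and threshold are all made up for illustration):<p><pre><code>
# Hypothetical sketch of detection from the content producer's side:
# query the model and check whether a planted canary phrase appears in
# completions. No access to weights, queries, or training data -- just a
# statistical hint that the marked text was ingested.

CANARY = "the violet heron assembles at dawn"  # hypothetical planted phrase

def looks_trained_on_my_data(generate, prompts, threshold=0.05):
    """`generate` is any callable prompt -> completion (the model under test)."""
    hits = sum(CANARY in generate(p).lower() for p in prompts)
    return hits / len(prompts) >= threshold

# Usage with a stand-in model:
def fake_model(prompt: str) -> str:
    return "Nothing to see here."

print(looks_trained_on_my_data(fake_model, ["Complete the proverb:"] * 20))  # False
</code></pre>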
this is really interesting<p>but as much as i want to agree with the author’s stance:<p>> [..] no matter how hard these companies try to sell us on AGI or “research” models, you can just laugh until you cry that they really thought they could steal from the well of knowledge and turn around and sell it back to us through SaaS<p>i feel like in the end, these companies will still win
Google stole from the well of knowledge and sold it back to us as search. Or rather, accessing a database efficiently is a problem of its own.<p>Why are these articles always so fraught with excessively fearful terms? It’s not real, man. Humanity invented a new tool.<p>And yeah, people freaked out about search engines too. It was the same breathless terror. Get a grip.