I find it fascinating that, at the end of the article, the author alludes to something I've started to become aware of:<p>There is a zone of illegal thoughts that becomes definable through model training: a physical boundary in n-dimensional concept space. An "aligned" or "safe" AI system knows where this boundary is and does not reach inside it. Vectors (embeddings) that would probe it should instead intersect the surface, like a ray trace in graphics, and return the embedded concept at minimum distance to the safe-idea boundary.<p>Intuitively, we all know what this zone is. It's the difference between being a wild barbarian and a gentleman, or between being chill and being antisocial. Seeing it in pure math is pretty awesome.
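<p>To make the "return the concept at minimum distance to the boundary" idea concrete, here is a toy sketch. It assumes the forbidden zone is a simple ball (a center and a radius) in embedding space, which is purely my simplification. A real boundary would be far messier, and `project_to_safe` is a made-up name, not anything from the article. It also does a nearest-point projection rather than a true ray intersection.

```python
import math

def project_to_safe(v, center, radius):
    """Toy model: the forbidden zone is a ball (assumed shape, for illustration only).

    Returns v unchanged if it lies on or outside the ball; otherwise returns
    the point on the ball's surface closest to v.
    """
    diff = [a - b for a, b in zip(v, center)]
    dist = math.sqrt(sum(d * d for d in diff))

    # Already outside (or on) the boundary: nothing to do.
    if dist >= radius:
        return list(v)

    if dist == 0.0:
        # Exactly at the center there is no unique nearest surface point,
        # so pick the first axis arbitrarily.
        direction = [1.0] + [0.0] * (len(v) - 1)
    else:
        direction = [d / dist for d in diff]

    # Push the vector out along its own direction until it touches the surface.
    return [c + radius * u for c, u in zip(center, direction)]


if __name__ == "__main__":
    # A probe inside the zone gets snapped to the boundary.
    print(project_to_safe([0.5, 0.0], [0.0, 0.0], 1.0))
    # A vector outside the zone is returned as-is.
    print(project_to_safe([3.0, 4.0], [0.0, 0.0], 1.0))
```

<p>For a ball, the nearest surface point is just the vector rescaled along its own direction from the center. For any non-convex boundary that shortcut no longer holds, and "closest safe concept" becomes a real optimization problem.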