> <i>The basic idea is as follows: there is evidence that today's proteins emerged out of an ancient peptidic soup, one that may have left its mark on the evolutionary record. I.e., the proteins we see today may in some sense be formed out of primordial peptides. As proteins grew in size and complexity, it would have been advantageous to reuse existing components, to build bigger proteins from existing protein parts. We already know this is true on the level of protein domains, in that larger proteins are often composed of smaller globular domains chained together. But the phenomenon of reuse may go further, where even smaller protein fragments (a handful of residues to dozens) may reflect an underlying evolutionary pressure to reuse working parts, fragments that fold in tried-and-tested ways (from the perspective of evolution). If this is the case, then the space of naturally occurring proteins may occupy a very special "manifold", one that exhibits a hierarchical organization spanning small fragments to entire domains. Other evolutionary pressures could further drive the reuse phenomenon. For example, once a protein-protein or protein-DNA interface is established, presumably through some sort of structural motif, reusing that motif would present an efficient way for the cell to rewire its cellular circuitry. The end result of all this would be the emergence of something resembling a linguistic structure, a grammar that defines the reusable parts and how these parts can be combined to form larger assemblies. Given that this is biology, it’s unlikely to be rigid or minimal. It would be messy and hacky, with many exceptions and ad hoc evolutionary optimizations. But the manifold would be there, potentially discoverable and learnable.</i><p>Instead of characters -> 'byte-pair-encoding'-like sequences -> words -> sentences, think primordial peptides -> simple protein parts -> more complicated protein components -> proteins.
If this "protein linguistic hypothesis" is correct, I see no reason why the manifold wouldn't be discoverable and learnable with modern SGD techniques.
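<p>To make the byte-pair-encoding analogy concrete, here is a minimal sketch of plain greedy BPE run on amino-acid strings instead of text. The function names (`learn_merges`, `merge_pair`) and the three toy sequences are made up for illustration; they are not real proteins, and this is not a claim about how any actual protein model tokenizes its input. It only shows how frequently reused fragments get promoted to single "parts".

```python
from collections import Counter


def merge_pair(tokens, left, right):
    """Replace every adjacent (left, right) occurrence with one fused token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
            merged.append(left + right)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_merges(sequences, num_merges):
    """Greedy BPE: repeatedly fuse the most frequent adjacent token pair."""
    # Start from single residues, the analogue of characters in text.
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        (left, right), count = pair_counts.most_common(1)[0]
        if count < 2:
            break  # no pair is reused any more, so stop merging
        merges.append((left, right))
        corpus = [merge_pair(tokens, left, right) for tokens in corpus]
    return merges, corpus


if __name__ == "__main__":
    # Toy, invented sequences that happen to share the fragment "GKT".
    toy_sequences = ["GKTAAGKT", "LLGKTPE", "AGKTLL"]
    merges, corpus = learn_merges(toy_sequences, num_merges=2)
    for original, tokens in zip(toy_sequences, corpus):
        print(original, "->", tokens)
```

After two merges the shared fragment "GKT" ends up as a single token in all three toy sequences, which is the "simple protein parts" level of the hierarchy. Running more merges on a large corpus would build progressively longer parts, though whether those parts correspond to anything structurally meaningful is exactly the open question.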