I just tried the demo on the homepage and I don’t know what kind of sorcery this is, but it’s blowing my mind.<p>I input a bunch of completely made-up words (Quastral Syncing, Zarnix Meshing, HIBAX, Bilxer), used them in a sentence, and the model transcribed them perfectly, zero-shot!<p>It’s so counterintuitive to me that this would work. I would have bet that you’d have to provide at least one audio sample for the model to recognize a word it was never trained on.<p>Being given a word only as text and then recognizing it in audio must be an emergent property.