I think audio models will be much more sensitive to input issues than text or art. Humans are very good at picking up the nuances in audio and also process it very quickly. I wonder how far we are from being able to manipulate the emotion of how something sounds; in my opinion, that's the Turing test for any audio generative AI. Native speakers will immediately know when something is AI-generated or adjusted, for the same reason they immediately detect accents.

I am curious what kind of audio-repair AI models are being worked on to help make outputs sound more natural. This research feels like progress towards that goal as well.
Possibly weird question, but have there been any attempts at this sort of audio model where tokens aren't defined by the audio itself, but instead by the movement of the tongue/mouth/lips/vocal cords, etc.?
It's off-topic (or maybe not?) but I get a very strong "ChatGPT wrote the first draft of this" vibe from a lot of the introductory prose in this post.
More examples on the AudioLM page. Some are pretty impressive (assuming they are cherry-picked).

https://google-research.github.io/seanet/audiolm/examples/