The main limitation for such codecs is CPU/battery life - and I like how they sparsely applied ML here and there, combining it with the classic approach (non-ML algos) to get a better CPU-vs-quality tradeoff. E.g. for better low-bitrate quality (LACE): "we went for a different approach: start with the tried-and-true postfilter idea and sprinkle just enough DNN magic on top of it." The key was not feeding raw audio samples to the NN - "The audio itself never goes through the DNN. The result is a small and very-low-complexity model (by DNN standards) that can run even on older phones."<p>Looks like the right direction for embedded algos, and it seems to be a pretty unexplored one compared to the current fashion of doing ML end-to-end.
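To make the "audio never goes through the DNN" split concrete, here's a minimal hypothetical sketch of the pattern in C - not the actual LACE architecture; predict_coeffs stands in for whatever tiny model maps decoder-side features to filter parameters:

    /* Hypothetical sketch of the hybrid pattern: a small model (stubbed
       below) sees only decoder-side features and emits filter coefficients;
       classic DSP is the only thing that touches the samples. */
    #define TAPS 4

    /* Placeholder for the tiny DNN: features in, per-frame coefficients out.
       Returning zeros makes the postfilter a pass-through. */
    static void predict_coeffs(const float *feat, int n_feat, float c[TAPS])
    {
        (void)feat; (void)n_feat;
        for (int k = 0; k < TAPS; k++) c[k] = 0.0f;
    }

    /* Classic FIR postfilter, in place; this is the entire audio path. */
    void enhance_frame(float *x, int n, const float *feat, int n_feat)
    {
        float c[TAPS];
        predict_coeffs(feat, n_feat, c);      /* the model never sees x[] */
        for (int i = n - 1; i >= TAPS; i--) { /* backwards keeps FIR in-place */
            float acc = x[i];
            for (int k = 0; k < TAPS; k++)
                acc += c[k] * x[i - 1 - k];
            x[i] = acc;
        }
        /* cross-frame filter history omitted for brevity */
    }

The audio loop stays cheap and fixed-cost; all the learned behavior is squeezed into a per-frame coefficient prediction, which is presumably why it fits on older phones.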
I'm using Opus as one of the main codecs in my peer-to-peer audio streaming library (<a href="https://git.iem.at/cm/aoo/" rel="nofollow">https://git.iem.at/cm/aoo/</a> - still alpha), so this is very exciting news!<p>I'll definitely play around with these new ML features!
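For anyone else wiring it up: the loss-robustness knobs sit behind the usual ctl interface. A minimal sketch with the long-standing FEC controls only - per the release notes, the new 1.5 ML features are off by default and have to be enabled at build time, so nothing below is specific to them:

    #include <opus.h>
    #include <stdio.h>

    int main(void)
    {
        int err;
        OpusEncoder *enc = opus_encoder_create(48000, 1,
                                               OPUS_APPLICATION_VOIP, &err);
        if (err != OPUS_OK) {
            fprintf(stderr, "%s\n", opus_strerror(err));
            return 1;
        }
        opus_encoder_ctl(enc, OPUS_SET_BITRATE(16000));       /* low-rate voice */
        opus_encoder_ctl(enc, OPUS_SET_INBAND_FEC(1));        /* in-band FEC */
        opus_encoder_ctl(enc, OPUS_SET_PACKET_LOSS_PERC(20)); /* expected loss % */

        /* ... call opus_encode() per 20 ms frame and ship the packets ... */
        opus_encoder_destroy(enc);
        return 0;
    }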
I find the interplay between audio codecs, speech synthesis, and speech recognition fascinating. Advancements in one usually result in advancements in the others.
I wonder: did they address common ML ethics questions? Specifically: do the ML algorithms perform better or worse on male than on female speech? How about different languages or dialects? Are they tuned specifically for speech at all, or do they also work well for music or birdsong?<p>That said, the examples are impressive and I can't wait for this level of intelligibility to become standard in my calls.
How about adding a text "subtitle" stream to the mix? The encoder could use ML to perform speech-to-text, and the decoder could then use that text, along with the audio surrounding the dropouts, to feed a conditional text-to-speech DNN. This way the network doesn't have to learn the harder problem of blindly interpolating across the dropouts from the audio alone. The text stream is low-bitrate, so it can carry substantial redundancy to increase the likelihood that any given (text) message is received - see the sketch below.
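Hypothetical wire layout, just to make the redundancy part concrete (all names and sizes made up): each audio packet piggybacks the last few transcript fragments, so a short loss burst still delivers the text even when the audio is gone.

    #include <stdint.h>

    #define HISTORY  4   /* each packet repeats the last 4 fragments */
    #define MAX_TEXT 32

    struct text_fragment {
        uint32_t frame_index;     /* audio frame this fragment aligns with */
        char     utf8[MAX_TEXT];  /* short ASR output, NUL-terminated */
    };

    struct media_packet {
        uint16_t opus_len;                  /* Opus payload length in bytes */
        struct text_fragment text[HISTORY]; /* newest first; older repeats */
        /* followed by opus_len bytes of Opus payload */
    };

Since every fragment rides along in HISTORY consecutive packets, a burst of up to HISTORY-1 lost packets still gets each fragment through at least once, and the whole side channel costs a few dozen bytes per packet.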
Very cool. Seems like they addressed the problem of hallucination. It would be interesting to see an example of it hallucinating without redundancy, then corrected with redundancy.
Love how Opus 1.5 is now actually transparent at 16 kbps for voice, and 96 kbps still beats 192 kbps MP3. Meanwhile xHE-AAC still feels like it was farted out, since its 96-256 kbps range is legit worse than AAC-LC (Apple, FDK) at ~160 kbps.
What if there was a profile or setting that helped re-encode existing lossy formats without introducing too many more artifacts? Any sizeable collection runs into this issue if you don't have (easily accessible) lossless masters.<p>I'd be very interested in moving a variety of MP3s, AACs, and Vorbis files to Opus if I knew the additional quality loss was minimal.
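There's no "minimal generation loss" profile that I know of; the usual hedge is just to transcode at a generous bitrate so the added loss stays small. Something like this (160k for stereo music is a conservative guess, not a tested threshold):

    ffmpeg -i input.mp3 -c:a libopus -b:a 160k output.opus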
Why the hell is Opus still not in Bluetooth? Well, I know - sweet, sweet license fees.<p>(aKKtually, there IS an Opus Bluetooth codec, supported by Pixel phones - Google made it for VR/AR stuff. No one uses it; there's maybe one headphone out there with Opus support.)
>That's why most codecs have packet loss concealment (PLC) that can fill in for missing packets with plausible audio that just extrapolates what was being said and avoids leaving a hole in the audio<p>...How far can ML PLC "hallucinate" audio? A sound, a syllable, a whole word, half a sentence?<p>Can I even trust what I hear anymore?
Isn't it a strange coincidence that this shows up on HN while Claude Opus was also announced today and is on the HN front page? I mean, what are the odds of seeing the word "Opus" twice in a day on one internet page?