The performance in the examples is phenomenal. And at 3 kbps? It just blows Opus and Speex out of the water.<p>Really excited to see if it holds up when they roll it out in Duo. I remember noticing the ML-based improvements in Duo kicking in when talking to my Dad a while back (US<->EU video call and he was using mobile data). It was even more impressive seeing it work in the wild.
libopus 1.3-26-ge85ed772 has a huge jump in quality on the 'noisy' sample at 9 kbps (CVBR or true VBR), because it moves from a 4 kHz lowpass to 8 kHz. In 2010 Nokia's listening test found [1] that SILK (at the time an independent speech codec, later incorporated into Opus) gained a quality benefit from reaching 8 kHz vs. the ITU-T and 3GPP codecs that would top out at 7 kHz for comparable modes and bitrates. So any gain in bandwidth in the 0-8 kHz range makes an appreciable difference, especially when there's distracting background noise in the lower bands and you can't filter it out, so you have to encode it along with your signal.<p>While Opus can go as low as 6 kbps, at that bitrate it very clearly sounds like the narrowband audio we're used to from telephones. Frustrating, but not an unfamiliar kind of degradation.<p>Speex behaves like a classic CELP codec and will get robotic at its low end; 3 kbps is just a cruelly low bitrate for a codec whose advertised range is 2-44 kbps.<p>Lyra does sound richer and wider-band than both Opus and Speex, but there's also a peculiar style transfer going on that's most apparent to me in the chocolate bread sample. Opus clearly sounds like a low-quality encode of the original -- it would benefit from some background noise reduction prior to the encode.<p>But the Lyra version exaggerates the pronunciation of the phrase 'with chocolate' in a way that meaningfully differs from the speaker's original. It weakens the voiced 'th' to nothingness, overshoots both the lead consonant and first vowel of 'choc', and then proceeds to wash the entire rest of the sentence with a peculiar brightened voice that's high, lacks consonant definition, and is close to ringing.<p>I'm guessing it's actually style transfer, because although the result sounds little like the speaker's original, it is reminiscent of the speech pattern and accent that people with East Asian and Southeast Asian ancestry adopt when speaking American English.
It was surprising, given that the speaker doesn't sound like that in the original. Does anyone else hear this too?<p>[1] Rämö, Anssi & Toukomaa, Henri. (2010). Voice quality evaluation of recent open source codecs.
Related and recently:<p>Satin: Microsoft’s latest AI-powered audio codec for real-time communications (microsoft.com)<p>13 points by panabee 5 days ago | 2 comments<p><a href="https://news.ycombinator.com/item?id=26218002" rel="nofollow">https://news.ycombinator.com/item?id=26218002</a><p>Direct link: <a href="https://techcommunity.microsoft.com/t5/microsoft-teams-blog/satin-microsoft-s-latest-ai-powered-audio-codec-for-real-time/ba-p/2141382" rel="nofollow">https://techcommunity.microsoft.com/t5/microsoft-teams-blog/...</a><p><i>...Satin can deliver super wide band speech starting at a bitrate of 6 kbps, and full-band stereo music starting at a bitrate of 17 kbps, with progressively higher quality at higher bitrates. Satin has been designed to provide great audio quality even under high packet loss...</i>
I suppose this is subjective, but listening to the short samples, I found the 6 kbps Opus audio easier on the ears than Lyra. In the first sample ("pot of gold"), Lyra made the speaker sound like they had a condition that was disrupting their ability to form sounds naturally (including a slight slur). Granted, it sounded a lot better than Speex, which sucks.
What is the purpose of putting effort into such low bitrate audio?<p>A 3G connection on a decade-old phone with low signal strength might only get 50 kbps, but it tends to be bursty (i.e., offline for a few seconds, then a few hundred kilobytes arriving all at once).<p>That makes it impractical for audio conferencing. It might be useful for streaming YouTube, but for that it would need to be able to encode music and sound effects reasonably well too.
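To put rough numbers on the burstiness argument above, here is a minimal sketch. The 3-second outage and 200 kB burst are made-up illustrative values (the comment only says "a few seconds" and "a few hundred kilobytes"), and the helper names are hypothetical.

```python
def bytes_per_second(bitrate_bps: float) -> float:
    """Payload bytes produced per second of audio at a given bitrate."""
    return bitrate_bps / 8


def buffer_bytes_for_outage(bitrate_bps: float, outage_s: float) -> float:
    """Bytes the receiver must already hold to keep playing through an outage."""
    return bytes_per_second(bitrate_bps) * outage_s


def audio_seconds_in_burst(burst_bytes: float, bitrate_bps: float) -> float:
    """Seconds of audio that fit into one burst of received data."""
    return burst_bytes * 8 / bitrate_bps


# Assumed values, for illustration only.
OUTAGE_S = 3.0          # "offline for a few seconds"
BURST_BYTES = 200_000   # "a few hundred kilobytes arriving all at once"

for bitrate in (3_000, 6_000, 50_000):
    need = buffer_bytes_for_outage(bitrate, OUTAGE_S)
    fits = audio_seconds_in_burst(BURST_BYTES, bitrate)
    print(f"{bitrate:>6} bps: {need:8.0f} bytes to ride out {OUTAGE_S:.0f} s, "
          f"one burst holds {fits:6.1f} s of audio")
```

Whatever the bitrate, riding out the outage means buffering that many seconds of audio first, so the added delay is the same. That is the part that hurts a live call; the bitrate only changes how many bytes the buffer takes.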
As someone who has a tangential interest in audio codecs, are there any go-to books for learning about them and how they work, including the math involved and the physics/biology of sound? I’m dealing with some very simple stuff (like G.711) and just working with ffmpeg, but I’d like to learn more about the subject in general.
For some reason the first Lyra example is disturbingly loud at the 1-second mark on my mobile phone speaker. Speex, for example, doesn't have that problem.
With ML, we ought to be able to bridge the gap between sending audio (here 3000 bps to be usable) and sending a compressed transcription (20 bps to get words across at a similar rate).<p>Surely there is some middle ground where we dedicate, say, 100 bps to get the words over, together with a small bit of info about the emphasis, accent, tone and timing of the words.
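A back-of-the-envelope sketch of where a figure like 20 bps could come from. The speaking rate, word length and bits per character below are my own assumptions (roughly conversational English and a well-compressed text stream), not numbers from the Lyra post.

```python
def text_bitrate_bps(words_per_minute: float,
                     chars_per_word: float,
                     bits_per_char: float) -> float:
    """Bitrate needed to send a transcript at speaking speed."""
    chars_per_second = words_per_minute * chars_per_word / 60.0
    return chars_per_second * bits_per_char


# Assumptions, for illustration only.
WORDS_PER_MINUTE = 150   # typical conversational pace
CHARS_PER_WORD = 6       # average word plus the following space
BITS_PER_CHAR = 1.3      # English text after strong compression

AUDIO_BPS = 3000         # the Lyra bitrate discussed in the thread
MIDDLE_GROUND_BPS = 100  # the budget suggested above

transcript_bps = text_bitrate_bps(WORDS_PER_MINUTE, CHARS_PER_WORD, BITS_PER_CHAR)
prosody_budget_bps = MIDDLE_GROUND_BPS - transcript_bps

print(f"transcript: {transcript_bps:.1f} bps")
print(f"audio is {AUDIO_BPS / transcript_bps:.0f}x the transcript bitrate")
print(f"left for emphasis/accent/tone/timing: {prosody_budget_bps:.1f} bps")
```

Under these assumptions the words alone come to about 20 bps, so a 100 bps budget would leave roughly 80 bps for everything that makes the speech sound like a particular person.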