Zimtohrli: A New Psychoacoustic Perceptual Metric for Audio Compression

95 points by judiisis about 1 year ago

12 comments

Dave_Rosenthal about 1 year ago
A few comments:

- My understanding is that a gammachirp is the established filter to use for an auditory filter bank. Any reason you chose an elliptic filter instead?
- I didn't look too closely, but it seems like you are analyzing the output of the filter bank as real numbers. I highly recommend you convolve with a complex representation of the filter and keep all of the math in the complex domain until you collapse to loudness.
- I'd not bucket to discrete 100 Hz time slices; instead, just convolve the temporal masking function with the full time resolution of the filter bank output.
- You want to think about some volume normalization step that would give the final minimized Zimtohrli distance metric between A and B*x, where x is a free variable for volume. Otherwise, a perceptual codec that just tends to make things a bit quieter might get a bad score.
- For Fletcher-Munson, I assume you are just using a curve at a high-ish volume? If so, good :)
- Not sure how you are spacing filter bank center frequencies relative to ERB size, but I'd recommend oversampling by a factor of 2-3. (That is, a few filters per ERB.)

Apologies if any of these are off base. I just took a quick look.
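The volume-normalization suggestion above (minimize the distance between A and B*x over a free gain x) could be sketched roughly as follows. This is only an illustration: `placeholder_distance` is a plain mean squared error standing in for a perceptual metric, not the Zimtohrli distance, and the function names, the search range in dB, and the use of golden-section search are all assumptions. The search also assumes the cost has a single minimum over the gain range.

```python
import math

def placeholder_distance(a, b):
    """Stand-in metric (mean squared error). NOT the Zimtohrli distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def gain_normalized_distance(a, b, distance=placeholder_distance,
                             lo_db=-12.0, hi_db=12.0, tol_db=0.01):
    """Minimize distance(a, gain * b) over a scalar gain.

    Uses golden-section search over the gain expressed in dB and
    returns (minimum distance, best gain in dB). Assumes the cost is
    unimodal on [lo_db, hi_db].
    """
    def cost(gain_db):
        g = 10.0 ** (gain_db / 20.0)  # dB to linear amplitude gain
        return distance(a, [g * s for s in b])

    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    lo, hi = lo_db, hi_db
    c = hi - inv_phi * (hi - lo)
    d = lo + inv_phi * (hi - lo)
    fc, fd = cost(c), cost(d)
    while hi - lo > tol_db:
        if fc < fd:
            # Minimum lies in [lo, d]; reuse c as the new upper probe.
            hi, d, fd = d, c, fc
            c = hi - inv_phi * (hi - lo)
            fc = cost(c)
        else:
            # Minimum lies in [c, hi]; reuse d as the new lower probe.
            lo, c, fc = c, d, fd
            d = lo + inv_phi * (hi - lo)
            fd = cost(d)
    best_db = (lo + hi) / 2.0
    return cost(best_db), best_db
```

With a metric like this, a codec output that is simply a quieter copy of the reference would score close to zero after normalization instead of being penalized for the level change.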
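On the filter-spacing point in Dave_Rosenthal's comment, "a few filters per ERB" can be made concrete with the Glasberg and Moore ERB formulas. The sketch below is an assumption about how one might lay out center frequencies, not a description of what Zimtohrli does; the function names and the default of 3 filters per ERB are illustrative.

```python
import math

def erb_bandwidth_hz(f_hz):
    """Equivalent rectangular bandwidth at f_hz (Glasberg & Moore, 1990)."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

def hz_to_erb_number(f_hz):
    """Map frequency in Hz to the ERB-number scale."""
    return 21.4 * math.log10(1.0 + 0.00437 * f_hz)

def erb_number_to_hz(e):
    """Inverse of hz_to_erb_number."""
    return (10.0 ** (e / 21.4) - 1.0) / 0.00437

def center_frequencies(f_lo, f_hi, filters_per_erb=3):
    """Center frequencies spaced 1/filters_per_erb ERB apart from f_lo up to f_hi."""
    e_lo, e_hi = hz_to_erb_number(f_lo), hz_to_erb_number(f_hi)
    n = int(math.floor((e_hi - e_lo) * filters_per_erb)) + 1
    return [erb_number_to_hz(e_lo + i / filters_per_erb) for i in range(n)]
```

Calling `center_frequencies(50.0, 16000.0, filters_per_erb=3)` gives a list that starts at 50 Hz and steps by one third of an ERB, so neighboring filters overlap heavily, which is the oversampling the comment recommends.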
givinguflac about 1 year ago
I looked through the deeper explanation and found this interesting:

“Performing a simple experiment where we have 5 separate components

- 1000 Hz sine probe 57 dB SPL
- 750 Hz sine masker A at 71dB SPL
- 800 Hz sine masker B at 71 dB SPL
- 850 Hz sine masker C at 67 dB SPL
- 900 Hz sine masker D at 65 dB SPL

I record the following data

When playing probe + masker A through D individually I experience the probe approximately as intensely as a 1000Hz tone at 53dB SPL. When playing probe + all maskers I experience the probe approximately as intensely as a 1000Hz tone at 48dB SPL.”

I would be very interested in understanding more about their testing methodology and hardware setup especially.

Is the perceiver a trained listener? Are they using headphones or speakers or some other transducer method?

It's awfully difficult to say that there is equivalent perceived SPL for different frequency domains, even as a trained listener. Especially given the different frequency response for different listening setups.

The average user has no chance; hence my curiosity about their specific credentials, considering they’re building an entirely new perceptual model based on that.
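For reference, the combined physical level of the four maskers in the quoted experiment can be worked out by adding their powers, assuming the tones are uncorrelated. This is only dB arithmetic on the quoted numbers; it says nothing about perceived loudness or masking, which is what the experiment is actually about. The function name is illustrative.

```python
import math

def combined_level_db(levels_db):
    """Power-sum of uncorrelated sources whose levels are given in dB SPL."""
    return 10.0 * math.log10(sum(10.0 ** (level / 10.0) for level in levels_db))

# Maskers A through D from the quoted experiment, in dB SPL.
maskers_db = [71.0, 71.0, 67.0, 65.0]
total_db = combined_level_db(maskers_db)
print(f"combined masker level: {total_db:.1f} dB SPL")
```

The result is a few dB above the loudest single masker, which gives a sense of how much acoustic energy sits just below the 57 dB SPL probe.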
Thoreandan about 1 year ago
Interesting, if hard to understand.

It would be nice to see ELI5 explanations for items like this, akin to Monty's 'A Digital Media Primer for Geeks' ( https://people.xiph.org/~xiphmont/demo/#:~:text=Xiph )
formerly_proven about 1 year ago
I'm guessing the name is meant to allude to cinnamon pig ears (https://en.wikipedia.org/wiki/Palmier).
DoctorOetker about 1 year ago
Are there any associated scientific articles and/or datasets that back up the experimental claim/insinuation of matching JNDs or perceptual differences?

Is this a proposal without experimental verification?
Lerc about 1 year ago
This seems to be targeted at signals that are already quite close. Is there anything similar for broad ballpark similarity?

Whenever I have searched for such things I have more often encountered techniques designed to detect re-use for copyright reasons.

I have played around with generating instrument sounds from a blend of very few basic waveforms with attack, decay, sustain, release, pitch sliding and bell modulation.

While it is quite fun just trying to make things by tweaking parameters, your ear/perception drifts as you hear the same thing over and over.

It would be really nice to have an automated "how close is this abomination?". I'd even give evolution a go to try and make some more difficult matches.
yalok about 1 year ago
It’d be very interesting to see the results of this metric for the existing audio and voice codecs (like AAC, AAC-LD, MP3, Opus), and how it compares to the existing metrics for them.

Couldn’t find it in their paper.
ant6n about 1 year ago
This says it works on just-noticeable differences. Would this work well if the quality of the compressed audio is very poor? Could one for example compare two speech codecs at 8 kHz, 4-bit against the original source to find out which one sounds better?

Or should one just... I dunno, calculate the mean squared error in some sort of continuous frequency domain, perhaps weighted by some hearing curve.
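The fallback ant6n describes (mean squared error in the frequency domain, weighted by a hearing curve) might look something like the sketch below. The choice of A-weighting as the "hearing curve", the comparison of DFT magnitudes only (phase is ignored), and all function names are assumptions made for illustration. A naive DFT keeps the example dependency-free, so it is only practical for short frames.

```python
import cmath
import math

def a_weight_db(f_hz):
    """A-weighting in dB for f_hz > 0 (close to 0 dB at 1 kHz)."""
    f2 = f_hz * f_hz
    num = (12194.0 ** 2) * f2 * f2
    den = ((f2 + 20.6 ** 2)
           * math.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
           * (f2 + 12194.0 ** 2))
    return 20.0 * math.log10(num / den) + 2.0

def dft_magnitudes(x):
    """Magnitudes of DFT bins 1 .. n/2 - 1 (skips DC and Nyquist)."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
            for k in range(1, n // 2)]

def weighted_spectral_mse(ref, test, sample_rate):
    """Mean squared difference of DFT magnitudes, A-weighted per bin.

    ref and test must be equal-length sample sequences.
    """
    n = len(ref)
    ref_mag, test_mag = dft_magnitudes(ref), dft_magnitudes(test)
    total = 0.0
    for i, (r, s) in enumerate(zip(ref_mag, test_mag)):
        f_hz = (i + 1) * sample_rate / n      # center frequency of this bin
        w = 10.0 ** (a_weight_db(f_hz) / 20.0)  # dB weight to linear factor
        total += (w * (r - s)) ** 2
    return total / len(ref_mag)
```

A measure like this penalizes an error at 1 kHz more than an equal-amplitude error at 100 Hz, but it has no notion of masking, which is the gap that perceptual metrics try to close.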
marcodiego about 1 year ago
Can it be used to make LAME even better? I mean, I'm still fond of MP3, especially now that it is patent/royalty free and there are literally billions of compatible devices.
iamnotsure about 1 year ago
Lossy compression may be a bad idea; brains may not support it very well.
bbstats about 1 year ago
Very useful. I find that a lot of audio SR (compression) algos sound really bad, likely just because the loss functions and/or eval metrics are 'inhuman'.
p0nce about 1 year ago
How does it compare to ViSQOL v3?