After testing Whisper (small model) on real recordings, some of them containing only noise such as chairs moving or doors being shut, I have seen that it sometimes invents crazy things. It infers URLs from what, to my ear, is pure noise, and produces phrases that may have come from movies (like "GUNSHOT" or "The storm becomes stronger"). Since these inventions often have correct grammar, it can be hard to identify which parts are erroneous.
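For anyone hitting the same issue, here is a minimal sketch of one way to flag suspect segments, assuming the open-source openai-whisper Python package: each segment it returns carries a no_speech_prob and an avg_logprob, and very confident-sounding hallucinations often show up as low avg_logprob or high no_speech_prob. The filename and thresholds below are illustrative, not tuned.

```python
# Minimal sketch using the openai-whisper package (pip install openai-whisper).
# Thresholds (0.6, -1.0) are illustrative starting points, not tuned values.
import whisper

model = whisper.load_model("small")
result = model.transcribe("recording.wav")  # hypothetical input file

for segment in result["segments"]:
    # no_speech_prob: the model's own estimate that the segment is not speech.
    # avg_logprob: mean token log-probability; very low values often mark
    # hallucinated text that nonetheless reads as grammatical.
    suspect = segment["no_speech_prob"] > 0.6 or segment["avg_logprob"] < -1.0
    tag = "[suspect]" if suspect else "[ok]     "
    print(f"{tag} {segment['start']:6.1f}s: {segment['text'].strip()}")
```

This won't catch everything, but in my experience hallucinations on pure noise tend to cluster at the extremes of those two scores, so even a crude cutoff helps triage.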
Great breakdown… with some interesting results and a ton of effort.

Are there any open benchmarks like this for all models that are actually runnable, like the data exposed in https://github.com/syhw/wer_are_we, but with some of your additional metrics?