This may call out to ffmpeg for pre-processing. If you're reluctant to run that directly on your Mac, you can use this wrapper script to have ffmpeg run in a Docker container: <a href="https://gist.github.com/ndurner/636d37fd83aed4b875cdb66653017ae7" rel="nofollow">https://gist.github.com/ndurner/636d37fd83aed4b875cdb6665301...</a><p>However, I found that Whisper is thrown off by background music in a podcast and does not recover. (That was with the mlx-community/whisper-large-v3-mlx checkpoint; OP uses distil-whisper-large-v3.) I concluded for myself that Whisper is probably meant to be used within larger processing pipelines that handle such cases. Can someone provide insights about that? The podcast I used it on was <a href="https://www.heise.de/news/KI-Update-Deep-Dive-Was-taugen-KI-Suchmaschinen-9850904.html" rel="nofollow">https://www.heise.de/news/KI-Update-Deep-Dive-Was-taugen-KI-...</a>.<p>I ended up using Google Gemini, which handled it well. (Blog post: <a href="https://ndurner.github.io/mlx-whisper-gemini" rel="nofollow">https://ndurner.github.io/mlx-whisper-gemini</a>)
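<p>For anyone curious what the ffmpeg-in-Docker idea from the first paragraph boils down to, here is a rough Python sketch. It is not the gist's actual code, just an illustration of the general approach: mount the working directory into a container and pass the ffmpeg arguments through. The image name, the <code>/work</code> mount point, the helper name <code>docker_ffmpeg_cmd</code> and the file names are all assumptions of mine; any image with ffmpeg as its entrypoint should work the same way.

```python
import os
import shlex

# Assumption: an image whose entrypoint is ffmpeg. The name below is only an
# example; substitute whichever ffmpeg image you trust.
FFMPEG_IMAGE = "jrottenberg/ffmpeg"


def docker_ffmpeg_cmd(ffmpeg_args, workdir=None, image=FFMPEG_IMAGE):
    """Build a `docker run` command line that runs ffmpeg on files in workdir.

    Hypothetical helper: it only assembles the argument list, it does not
    start Docker itself.
    """
    workdir = os.path.abspath(workdir or os.getcwd())
    return [
        "docker", "run", "--rm",
        "-v", f"{workdir}:/work",  # expose the host directory to the container
        "-w", "/work",             # so relative file names resolve inside it
        image,
        *ffmpeg_args,              # passed through to ffmpeg unchanged
    ]


if __name__ == "__main__":
    # Hypothetical file names: convert a podcast episode to 16 kHz mono WAV.
    cmd = docker_ffmpeg_cmd(
        ["-i", "episode.mp3", "-ar", "16000", "-ac", "1", "episode.wav"]
    )
    print(shlex.join(cmd))
```

To actually run it, hand the list to <code>subprocess.run(cmd, check=True)</code>. Only files under the mounted directory are visible to ffmpeg, which is rather the point of sandboxing it, but it also means absolute host paths will not resolve inside the container.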