The only thing Whisper misses is speaker diarization. I'm currently working on a pipeline that uses Whisper + pyannote to transcribe interviews and also detect who is speaking. It's working, but damn, it takes so long.
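A minimal sketch of how such a pipeline can be wired together (the model sizes, the file name, and the overlap-based speaker assignment are illustrative assumptions, not necessarily the exact setup described above):

    # Transcribe with Whisper, diarize with pyannote, then label each Whisper
    # segment with the speaker whose diarization turn overlaps it the most.
    import whisper
    from pyannote.audio import Pipeline

    AUDIO = "interview.wav"  # placeholder input file

    asr_model = whisper.load_model("medium")
    asr_result = asr_model.transcribe(AUDIO)

    diarization_pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization",
        use_auth_token="HF_TOKEN",  # pyannote models require a Hugging Face token
    )
    diarization = diarization_pipeline(AUDIO)

    def dominant_speaker(start, end):
        # Return the speaker label with the largest total overlap against [start, end].
        overlaps = {}
        for turn, _, speaker in diarization.itertracks(yield_label=True):
            overlap = min(end, turn.end) - max(start, turn.start)
            if overlap > 0:
                overlaps[speaker] = overlaps.get(speaker, 0.0) + overlap
        return max(overlaps, key=overlaps.get) if overlaps else "UNKNOWN"

    for seg in asr_result["segments"]:
        print(f"[{dominant_speaker(seg['start'], seg['end'])}] {seg['text'].strip()}")

Most of the time goes into running both models over the full audio; the matching step itself is cheap.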
By the way, there is also another project called whisper.cpp:

https://github.com/ggerganov/whisper.cpp

It uses 8x less memory than the Python implementation for the tiny model. It's worth keeping an eye on, since Python bindings are planned on the roadmap:

https://github.com/ggerganov/whisper.cpp#bindings
I understand this is self-hosting the OpenAI Whisper model (which I see is fully MIT-licensed, weights and all), so it's not calling any OpenAI APIs the way other GPT-related tools do.

Am I correct about this? The README is not explicit.
People interested in this might also be interested in transcribe-anything [1]. It automates video fetching and uses Whisper to generate .srt, .vtt, and .txt files.

[1] https://github.com/zackees/transcribe-anything
Whisper-UI is also looking really nice lately, though I think it's still pretty early in development. The ability to click on the transcript and hear the audio from that particular moment is great.
<a href="https://github.com/hayabhay/whisper-ui">https://github.com/hayabhay/whisper-ui</a>
I run this locally for a few work-related tasks. One useful feature is being able to provide your own 'jargon' in the initial prompt, which improves recognition quality (--initial_prompt "jargon1 jargon2 ...").
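The same option is exposed in the Python API as well; a minimal sketch (the model size, file name, and jargon terms below are just placeholders):

    # Bias Whisper toward domain vocabulary by seeding the decoder with an initial prompt.
    import whisper

    model = whisper.load_model("base")  # placeholder model size
    result = model.transcribe(
        "meeting.wav",  # placeholder audio file
        initial_prompt="Kubernetes, Terraform, kubectl, CI/CD",  # your own jargon terms
    )
    print(result["text"])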
Is there an open source speech recognition model which can be restricted to a smaller, domain-specific dictionary?

Use case: I want to transcribe my poker hands while playing, e.g. "Flop was 2 of spades, 3 of diamonds and King of spades", "Button raised to $20", etc.

When I tried using Whisper and some other models, the recognition accuracy was atrocious, and it kept finding non-poker words that sounded similar to poker words. I want to restrict its search space to my own list of poker words, which should significantly increase the accuracy (theoretically).

Any suggestions on how to go about this?
This looks really good, thanks!
Really appreciate this and all the other Whisper implementations in this thread, as I am sorting out transcriptions for my 120+ podcast episodes.
That's very interesting. I've been using Whisper via pip as well, but I'm surprised you haven't sought to optimize it at all. I've been looking at using torch's compilation, but I haven't been successful yet; without it, it can take a while to run.
<a href="https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html" rel="nofollow">https://pytorch.org/tutorials/intermediate/torch_compile_tut...</a>
Related/off topic: is there a documented way to improve the model's accuracy for a particular language? Say we can put in the effort to collect thousands of verified/transcribed samples of a language that currently scores poorly (WER). What steps would I have to take to get those improvements into the system?
Very cool - I have a homegrown setup where a script scans my iCloud audio notes directory and generates transcriptions for any new notes. Works like a charm.
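A minimal sketch of that kind of watcher (the iCloud folder path, model size, and the skip-if-transcript-exists convention are illustrative assumptions, not the exact script described above):

    # Scan a folder of audio notes and transcribe anything that doesn't have a transcript yet.
    from pathlib import Path
    import whisper

    # Assumed location of an iCloud Drive folder; adjust to wherever your notes sync.
    NOTES_DIR = Path.home() / "Library/Mobile Documents/com~apple~CloudDocs/AudioNotes"
    model = whisper.load_model("base")

    for audio in sorted(NOTES_DIR.glob("*.m4a")):
        transcript = audio.with_suffix(".txt")
        if transcript.exists():
            continue  # already transcribed
        result = model.transcribe(str(audio))
        transcript.write_text(result["text"].strip() + "\n")
        print(f"Transcribed {audio.name}")

It could be run from cron or launchd so new notes get picked up automatically.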
Looks interesting. I noticed that the README says "containe" or "containes" several times, where I think you mean "container(s)".