I’ve been following along with whisper.cpp’s incredible progress.<p>It is a high-quality piece of software which performs better on its intended hardware than any other implementation. It can easily be embedded anywhere. This is truly remarkable. A big shoutout to Georgi for this.<p>We should remind ourselves that he is choosing to give this away by open-sourcing it, and he has gone through a lot of effort to make it easy to use and understand (just look at the documentation). Georgi, to me, personifies every open-source author who puts in their sweat and toil towards something that benefits our entire community.<p>Thank you Georgi. Salut, my friend!
I was recently laid off and am currently trying to build some apps that could generate enough revenue to cover my costs for a few weeks. I built a transcription and dictation app for Mac [0] using whisper.cpp; the small model works really well on a 2019 MBP and an M1 for streaming (dictation). It was really straightforward to use. However, the built-in streaming algorithm isn't ready for production, so I implemented my own approach using VAD (sketched below). I believe that at this pace that will get fixed as well.<p>[0] <a href="https://apple.co/3j2k8E7" rel="nofollow">https://apple.co/3j2k8E7</a>
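The VAD idea boils down to: buffer audio while the frame energy stays above a threshold, then hand the whole utterance to whisper_full() once the speaker pauses. A rough sketch against the whisper.cpp C API follows; the energy threshold, frame sizes, and the read_frame() microphone stub are illustrative placeholders, not the app's actual code.

    // Rough sketch of VAD-gated dictation with whisper.cpp (illustrative only).
    // Assumes 16 kHz mono float PCM; read_frame() is a stand-in for real
    // microphone capture (Core Audio, SDL, etc.), not a whisper.cpp function.
    #include "whisper.h"

    #include <cmath>
    #include <cstdio>
    #include <vector>

    // crude energy-based voice activity check over one short frame
    static bool is_speech(const std::vector<float> & frame, float threshold = 0.01f) {
        double energy = 0.0;
        for (float s : frame) energy += s * s;
        return !frame.empty() && std::sqrt(energy / frame.size()) > threshold;
    }

    // hypothetical audio source; replace with actual capture code
    static bool read_frame(std::vector<float> & frame) { (void) frame; return false; }

    // run one buffered utterance through the model and print the text
    static void transcribe(whisper_context * ctx, const std::vector<float> & pcm) {
        whisper_full_params params = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
        params.print_progress = false;
        if (whisper_full(ctx, params, pcm.data(), (int) pcm.size()) == 0) {
            for (int i = 0; i < whisper_full_n_segments(ctx); ++i) {
                printf("%s", whisper_full_get_segment_text(ctx, i));
            }
            printf("\n");
        }
    }

    int main() {
        whisper_context * ctx = whisper_init_from_file("ggml-small.bin");
        if (!ctx) return 1;

        std::vector<float> frame;     // one short frame of samples from the mic
        std::vector<float> utterance; // speech accumulated since the last pause
        int silent_frames = 0;

        while (read_frame(frame)) {
            if (is_speech(frame)) {
                utterance.insert(utterance.end(), frame.begin(), frame.end());
                silent_frames = 0;
            } else if (!utterance.empty() && ++silent_frames > 20) {
                transcribe(ctx, utterance); // a run of silent frames ends the utterance
                utterance.clear();
                silent_frames = 0;
            }
        }

        whisper_free(ctx);
        return 0;
    }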
Seemed impressive enough to me, but I don't know what the current best-in-class looks like these days. Can anybody working in this area explain if this is a significant milestone and what opportunities it might unlock? The consumer value proposition of basic speech-to-text input seems to be well-handled by most major OS's, but I appreciate that's proprietary tech and only one use case.
Wow, near-perfect transcription on desktop Firefox! Didn't seem to work on Android Chrome, though.<p>I wonder if this can be sped up using WebGPU...
I highly recommend trying out <a href="https://whisper.ggerganov.com/talk/" rel="nofollow">https://whisper.ggerganov.com/talk/</a>. It lets you talk to GPT-2 using your voice, all running locally in your browser. Holy cow.
This is incredible! Thank you for sharing! Did OpenAI release these pretrained models, or was the training done separately along with this project?<p>If OpenAI released the pretrained models, why would we use their service?
Would be interesting to see this connected to YouTube, to improve upon their auto-generated transcripts. There is this command-line version using youtube-dl and OpenAI's Whisper model <a href="https://simonwillison.net/2022/Sep/30/action-transcription/" rel="nofollow">https://simonwillison.net/2022/Sep/30/action-transcription/</a>
Running in the latest safari iPhone browser I get the error:<p>failed to asynchronously prepare wasm: CompileError: WebAssembly.Module doesn't parse at byte 5: can't get Function local's type in group 1, in function at index 9
Aborted(CompileError: WebAssembly.Module doesn't parse at byte 5: can't get Function local's type in group 1, in function at index 9)
My understanding is that each inference run processes a fixed 30-second window, so anything shorter than 30 seconds is padded out with silence.<p>To my knowledge, nobody's been able to work around this, and it may not be possible without work upstream.
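Concretely, Whisper's encoder operates on a fixed 30-second window, so a short clip gets zero-padded before inference. A minimal sketch of what that padding amounts to (illustrative, not the library's internal code):

    #include <vector>

    static const int kSampleRate    = 16000;             // Whisper expects 16 kHz mono
    static const int kWindowSamples = 30 * kSampleRate;  // fixed 30-second window

    // pad a short clip with silence so it fills the full inference window
    static void pad_to_window(std::vector<float> & pcm) {
        if ((int) pcm.size() < kWindowSamples) {
            pcm.resize(kWindowSamples, 0.0f);
        }
    }

Which is why, as far as I can tell, even a two-second clip costs roughly a full window of compute.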
If you want to self-host, you can also try this decent web interface: <a href="https://codeberg.org/pluja/web-whisper" rel="nofollow">https://codeberg.org/pluja/web-whisper</a><p>I'm not the creator, just a fan.
I think we should make a standard browser API for transcription; otherwise, every website wanting to implement private voice recognition will need to download 500 MB of data.
Three clicks to find out what it is:<p>1: “Minimal whisper.cpp example running fully in the browser”<p>2: “Port of OpenAI's Whisper model in C/C++”<p>3: “Whisper is a general-purpose speech recognition model.”