I'm building a product that requires speech-to-text. I'm thinking of going with Whisper, as it seems cheap ($0.006/min) and I've heard the transcription quality is good. Are there any better alternatives?
- AssemblyAI was the winner in the tests we did some months ago: very reliable and accurate.<p>- Deepgram also looks interesting. They recently released a new model (Nova), and they also offer Whisper at a cheaper price ($0.0048/min). I've only played with it briefly, but the DX looked a bit bad. They're also offering $200 in credits right now.<p>- If you're on a really tight budget, most browsers [1] support the SpeechRecognition API [2], which lets you transcribe for free. Quality depends on the browser; for example, it works excellently in Google Chrome, as the browser actually sends the audio to the cloud (it probably uses GCP's Google Cloud Speech to Text).<p>[1] <a href="https://developer.mozilla.org/en-US/docs/Web/API/SpeechRecognition#browser_compatibility" rel="nofollow">https://developer.mozilla.org/en-US/docs/Web/API/SpeechRecog...</a>
[2] <a href="https://developer.mozilla.org/en-US/docs/Web/API/SpeechRecognition" rel="nofollow">https://developer.mozilla.org/en-US/docs/Web/API/SpeechRecog...</a>
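For a rough sense of what the per-minute prices quoted above mean at volume, here is a minimal sketch. It uses only the two rates mentioned in this thread ($0.006/min and $0.0048/min); the function names and the `free_credit` parameter are hypothetical, and real pricing may have tiers or minimums not modeled here.

```python
# Per-minute prices as quoted in this thread (they may have changed since).
PRICES_PER_MIN = {
    "OpenAI Whisper API": 0.006,
    "Deepgram (hosted Whisper)": 0.0048,
}


def monthly_cost(minutes, price_per_min, free_credit=0.0):
    """Cost in USD for `minutes` of audio, after subtracting any free credit."""
    return max(0.0, minutes * price_per_min - free_credit)


def compare(minutes):
    """Return the cost per provider for the given number of minutes, rounded to cents."""
    return {
        name: round(monthly_cost(minutes, price), 2)
        for name, price in PRICES_PER_MIN.items()
    }


if __name__ == "__main__":
    for name, cost in compare(10_000).items():
        print(f"{name}: ${cost:.2f} for 10,000 minutes")
```

At these rates the gap is about $12 per 10,000 minutes, so accuracy and developer experience will probably matter more than price unless your volume is very large.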
I've experimented with Whisper. I don't know of a way to do commands without parsing dictation. Bottom line: to my knowledge, the model has to be passed 30 seconds of audio. So if your utterance is 5 seconds, you'll need 25 seconds of silence.<p>Depending on the platform you're targeting, this might be interesting to you:<p><a href="https://github.com/dictation-toolbox/dragonfly">https://github.com/dictation-toolbox/dragonfly</a>
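On the 30-second window: the silence doesn't have to be recorded, it can be appended in code. Below is a rough sketch of that padding, assuming 16 kHz mono audio held as a plain list of float samples. The helper here is a hypothetical stand-in written for illustration; as far as I know the open-source `whisper` package ships its own `pad_or_trim` that does the equivalent on arrays.

```python
SAMPLE_RATE = 16_000          # samples per second (assumed: 16 kHz mono)
WINDOW_SECONDS = 30           # fixed window length discussed above
WINDOW_SAMPLES = SAMPLE_RATE * WINDOW_SECONDS


def pad_or_trim(samples, length=WINDOW_SAMPLES):
    """Return exactly `length` samples: cut long input, zero-pad short input."""
    if len(samples) >= length:
        return list(samples[:length])
    # Zeros are digital silence, so a short utterance is followed by silence.
    return list(samples) + [0.0] * (length - len(samples))


if __name__ == "__main__":
    utterance = [0.1] * (5 * SAMPLE_RATE)   # stand-in for a 5-second utterance
    window = pad_or_trim(utterance)
    padding_seconds = (len(window) - len(utterance)) / SAMPLE_RATE
    print(len(window), padding_seconds)
```

The padding is cheap to add, but the model still processes the full window, so short voice commands may not be any faster to transcribe than longer dictation.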
I've tried a few:<p>Whisper is the cheapest<p>AssemblyAI and Google Cloud Speech to Text are more accurate<p>Overall, I wouldn't recommend Whisper unless the transcription accuracy doesn't need to be high. I'm hoping they release the "GPT-4" equivalent of Whisper.
I've been using Whisper since it came out. It's also open source, so I know I can host my own. I'd say I get about 95% accuracy, possibly more.<p>I'm interacting with GPT, so it usually doesn't care about the mistakes; it normally interprets them as what they were supposed to be.
If your decision is cost-oriented, then the Whisper API is the cheapest, at least based on what other API companies advertise on their websites.<p>However, depending on what you're building, you may want to consider local speech-to-text, i.e. running the model on the user's device, so you don't pay for the cloud at all.<p>You should also figure out whether you'll need model adaptation, like adding custom industry jargon. That might be challenging with Whisper.
You can use TranscribeMe, it's for Telegram and WhatsApp; it's totally free! <a href="https://transcribeme.app" rel="nofollow">https://transcribeme.app</a>