This project is the result of a one-year learning process in speech recognition and speech synthesis.

The original task was to automate the testing of a voice-enabled IVR system. We started with real audio recordings, but it quickly became clear that this approach is not feasible for a non-trivial app and that satisfying test coverage would be impossible to reach. On the other hand, we had to find a way to transcribe the voice app's responses to text for our automated assertions.

As cloud-based solutions were not an option (company policy), we quickly got frustrated: there was no "get shit done" open source stack available for medium-quality text-to-speech and speech-to-text conversion. We learned how to train and use Kaldi, which according to some benchmarks is the best available system out there, but it mainly targets academic users and research. We got the heavyweight MaryTTS to synthesize speech in reasonable quality.

Finally, we packaged all of this in a DevOps-friendly HTTP/JSON API with a Swagger definition.

As always, feedback and contributions are welcome!
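For illustration only, here is a minimal sketch of what talking to such an HTTP/JSON API from a test script could look like. The base URL, routes, and response shape below are assumptions, not the project's actual Swagger contract; check the published Swagger definition for the real endpoints.

    # Hypothetical client for an STT/TTS HTTP/JSON API like the one described above.
    # Base URL, paths, and JSON fields are assumptions -- consult the Swagger spec.
    import requests

    BASE_URL = "http://localhost:56000/api"  # assumed address, not from the project docs

    def speech_to_text(wav_path, language="en"):
        """POST a WAV file and return the recognized text (assumed response shape)."""
        with open(wav_path, "rb") as f:
            resp = requests.post(
                f"{BASE_URL}/stt/{language}",
                data=f.read(),
                headers={"Content-Type": "audio/wav"},
            )
        resp.raise_for_status()
        return resp.json().get("text")

    def text_to_speech(text, language="en", out_path="out.wav"):
        """Request synthesized speech and write the returned WAV to disk."""
        resp = requests.get(f"{BASE_URL}/tts/{language}", params={"text": text})
        resp.raise_for_status()
        with open(out_path, "wb") as f:
            f.write(resp.content)
        return out_path

    if __name__ == "__main__":
        text_to_speech("Welcome to the IVR test.")
        print(speech_to_text("prompt.wav"))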
I built something quite similar for my own product. Is there any interest in adding more STT/TTS backends to the software? Think services like Lyrebird or Trint.

I could contribute that, since I have done it before.

Thank you for building this!
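As a rough sketch of what "adding more backends" could mean, something like the interface below is one way to slot in hosted services behind a common contract. The class and method names are made up for illustration and are not taken from this project.

    # Hypothetical pluggable STT backend interface; names are illustrative only.
    from abc import ABC, abstractmethod

    class SttBackend(ABC):
        """Minimal contract a contributed backend (e.g. a hosted service) would fulfil."""

        @abstractmethod
        def transcribe(self, audio: bytes, language: str) -> str:
            """Return the transcript for a chunk of WAV/PCM audio."""

    class DummyBackend(SttBackend):
        def transcribe(self, audio: bytes, language: str) -> str:
            return ""  # a real backend would call out to the external service here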
Here's a sample wav output from using their swagger endpoint: <a href="https://drive.google.com/file/d/15y83NSXOCrEW9v9eQVCy6oHcWJ8DXGE0/view?usp=sharing" rel="nofollow">https://drive.google.com/file/d/15y83NSXOCrEW9v9eQVCy6oHcWJ8...</a><p>Why does the voice/pronunciation have such drastic volume spikes and dips?
Could you explain, what's the difference to<p>- <a href="https://github.com/gooofy/zamia-speech#asr-models" rel="nofollow">https://github.com/gooofy/zamia-speech#asr-models</a><p>- <a href="https://github.com/mpuels/docker-py-kaldi-asr-and-model" rel="nofollow">https://github.com/mpuels/docker-py-kaldi-asr-and-model</a><p>in regards of speech recognition except the fact that its easier to use?
Can anyone in the space expand on why it's increasingly rare to see people using/building on Sphinx[0]? Do people avoid it simply because of an impression that it won't be good enough compared to deep-learning-driven approaches?

[0]: https://cmusphinx.github.io/
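For what it's worth, the barrier to just trying Sphinx is low: the SpeechRecognition Python package wraps PocketSphinx behind a single call (requires the `SpeechRecognition` and `pocketsphinx` pip packages). A small example of the kind of offline recognition people compare against the deep-learning stacks:

    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile("test.wav") as source:        # 16 kHz mono WAV works best
        audio = recognizer.record(source)

    try:
        print(recognizer.recognize_sphinx(audio))   # offline, CMU Sphinx models
    except sr.UnknownValueError:
        print("Sphinx could not understand the audio")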
Any recommendations for a real-time solution?

I maintain a platform featuring live video events that we'd like to add captioning to, and so far I can only see IBM Watson providing a WebSockets interface for near-real-time STT.
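The general shape of that WebSocket pattern (stream audio chunks up, read interim transcripts back) looks roughly like the sketch below. The URL, message format, and chunking are placeholders, not any provider's real API.

    import asyncio
    import json
    import websockets

    async def stream_stt(wav_path, url="wss://example-stt.invalid/v1/recognize"):
        async with websockets.connect(url) as ws:
            await ws.send(json.dumps({"action": "start", "content-type": "audio/wav"}))
            with open(wav_path, "rb") as f:
                while chunk := f.read(4096):
                    await ws.send(chunk)           # binary audio frames
            await ws.send(json.dumps({"action": "stop"}))
            async for message in ws:               # interim + final hypotheses
                print(json.loads(message))

    asyncio.run(stream_stt("live_audio.wav"))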
Is MaryTTS still as good as it gets for free TTS? I've been researching this topic and it seems like there are some open-source implementations of Tacotron, but the quality isn't necessarily great.
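If you want to judge MaryTTS quality yourself, a locally running MaryTTS server exposes an HTTP interface (by default on port 59125) with a /process endpoint. A small example follows; the voice name is just one of the commonly installed voices, swap in whatever you have.

    import requests

    params = {
        "INPUT_TEXT": "Is MaryTTS still as good as it gets for free TTS?",
        "INPUT_TYPE": "TEXT",
        "OUTPUT_TYPE": "AUDIO",
        "AUDIO": "WAVE_FILE",
        "LOCALE": "en_US",
        "VOICE": "cmu-slt-hsmm",   # example voice; use whatever is installed
    }
    resp = requests.get("http://localhost:59125/process", params=params)
    resp.raise_for_status()
    with open("marytts_sample.wav", "wb") as f:
        f.write(resp.content)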
If MaryTTS is so good, why are many Linux distros still using https://en.wikipedia.org/wiki/Festival_Speech_Synthesis_System as the default TTS system?