I've been working on an idea for an MVP that leverages speech recognition, for which there are a few viable APIs. However, I'm interested not only in speech-to-text, but also in determining the timing of each spoken word relative to the input audio. Unfortunately, I haven't been able to find any good resources on how to accomplish this.<p>Any ideas on where to start?