It's great to see the whole "speech to text to model to text to speech" chain in action. The sheer amount of computation involved shows in the delay during the API calls.<p>Before this is actually usable in a game or product, the models need to become cheaper to run and smaller in size.
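To make the point about delay concrete, here is a minimal sketch of that three-stage chain with per-stage timing. The stage functions (`speech_to_text`, `generate_reply`, `text_to_speech`) are hypothetical stand-ins, not the demo's actual API calls; in a real setup each one would wrap a network request, and that is where the latency would show up.

```python
import time
from typing import Dict, Tuple


def speech_to_text(audio: bytes) -> str:
    # Hypothetical stub: a real version would call a speech recognition service.
    return audio.decode("utf-8")


def generate_reply(prompt: str) -> str:
    # Hypothetical stub: a real version would call a language model.
    return f"NPC heard: {prompt}"


def text_to_speech(text: str) -> bytes:
    # Hypothetical stub: a real version would call a speech synthesis service.
    return text.encode("utf-8")


def run_pipeline(audio: bytes) -> Tuple[bytes, Dict[str, float]]:
    """Run the three stages in order and record how long each one takes."""
    stages = (
        ("stt", speech_to_text),
        ("llm", generate_reply),
        ("tts", text_to_speech),
    )
    timings: Dict[str, float] = {}
    value = audio
    for name, stage in stages:
        start = time.perf_counter()
        value = stage(value)
        timings[name] = time.perf_counter() - start
    return value, timings


if __name__ == "__main__":
    _, timings = run_pipeline(b"Where can I buy food?")
    for name, seconds in timings.items():
        print(f"{name}: {seconds * 1000:.3f} ms")
```

Because the stages run strictly one after another, the total delay is the sum of all three, so any single slow call is felt directly by the player.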
That’s a cool demo (starts at 4:24). I assume these NPCs don’t actually know anything about the game, so there is no real “City Hall Street Location” in it, and you can’t really pay 10 coins and get a sweet hot dog. What would it take to hook this up to the reality of the gameplay?
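One possible answer, sketched under heavy assumptions: put the real game facts into the prompt, and never let the model's words change the game directly. Anything it proposes goes through ordinary game logic first. `GameState`, `build_prompt`, and `try_purchase` are hypothetical names for illustration, and the sample data is made up, not taken from the demo.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class GameState:
    # Hypothetical stand-in for whatever the game engine really tracks.
    locations: Set[str] = field(default_factory=set)
    prices: Dict[str, int] = field(default_factory=dict)
    player_coins: int = 0


def build_prompt(state: GameState, player_line: str) -> str:
    """Prefix the player's line with facts the NPC is allowed to mention."""
    items = ", ".join(
        f"{item} ({cost} coins)" for item, cost in sorted(state.prices.items())
    )
    return "\n".join([
        "Known locations: " + ", ".join(sorted(state.locations)),
        "Items for sale: " + items,
        "Only mention the locations and items listed above.",
        f"Player: {player_line}",
    ])


def try_purchase(state: GameState, item: str) -> bool:
    """Apply a purchase only if the item exists and the player can afford it."""
    cost = state.prices.get(item)
    if cost is None or state.player_coins < cost:
        return False
    state.player_coins -= cost
    return True


if __name__ == "__main__":
    state = GameState(
        locations={"Market Square"},
        prices={"hot dog": 10},
        player_coins=15,
    )
    print(build_prompt(state, "Where can I get something to eat?"))
    print(try_purchase(state, "hot dog"))  # item exists and is affordable
    print(try_purchase(state, "sword"))    # item is not in the game
```

The prompt only nudges the model toward real locations and prices; the validation step is what keeps a hallucinated offer from turning into an actual transaction.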