How would this actually work in practice? Do I ask the user to utter specific words and then train on those? How is this different from traditional speech recognition that I need to 'train' to work better on my voice?

The Holy Grail would be to train the model while using it, without any friction. I don't think these methods support that, though.
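One low-friction version I could imagine: treat every transcript the user manually corrects as an (audio, corrected text) pair and take a small gradient step on it in the background. Purely a sketch of that idea, not anything from the article: it assumes Hugging Face transformers plus a LoRA adapter, and the model choice, hyperparameters, and the train_on_correction helper are all made up for illustration.

    # Sketch: one incremental fine-tuning step on a user correction.
    import torch
    from transformers import WhisperProcessor, WhisperForConditionalGeneration
    from peft import LoraConfig, get_peft_model

    processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
    model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

    # A small LoRA adapter keeps the trainable parameter count tiny,
    # which is what makes on-device updates plausible at all.
    model = get_peft_model(model, LoraConfig(r=8, target_modules=["q_proj", "v_proj"]))
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    def train_on_correction(audio_array, corrected_text, sampling_rate=16000):
        # Encode the audio the user just spoke and the transcript they fixed.
        inputs = processor(audio_array, sampling_rate=sampling_rate, return_tensors="pt")
        labels = processor.tokenizer(corrected_text, return_tensors="pt").input_ids
        loss = model(input_features=inputs.input_features, labels=labels).loss
        loss.backward()            # one gradient step; a real system would
        optimizer.step()           # batch or replay corrections instead
        optimizer.zero_grad()
        return loss.item()

The friction then reduces to corrections the user was going to make anyway, though you'd still need to guard against drift from a handful of noisy samples.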
This is cool. This might be a silly question, but in what scenarios is it useful to fine-tune on the edge with small devices? I get inference on the edge, and I'm curious about metrics for that with Whisper, but isn't it better to fine-tune on beefier infrastructure and then deploy the model to the edge for inference?
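For concreteness, the metric I have in mind is just wall-clock transcription latency for a small checkpoint on the target device. A crude timing sketch (whisper-tiny and the silent audio are placeholders, not numbers from the article):

    import time
    import torch
    from transformers import WhisperProcessor, WhisperForConditionalGeneration

    processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
    model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny").eval()

    # Five seconds of silence at 16 kHz as a stand-in for a real recording.
    audio = torch.zeros(16000 * 5).numpy()
    inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

    start = time.perf_counter()
    with torch.no_grad():
        ids = model.generate(input_features=inputs.input_features)
    elapsed = time.perf_counter() - start
    print(processor.batch_decode(ids, skip_special_tokens=True))
    print(f"latency: {elapsed:.2f}s")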