5 comments

gronky_ about 1 month ago

I just tried the demo on the homepage and I don’t know what kind of sorcery this is but it’s blowing my mind.

I input a bunch of completely made up words (Quastral Syncing, Zarnix Meshing, HIBAX, Bilxer) and used them in a sentence and the model zero-shotted perfect speech recognition!

It’s so counterintuitive for me that this would work. I would have bet that you have to provide at least one audio sample in order for the model to recognize a word it was never trained on.

Providing it to the model in text modality and it being able to recognize it in the audio modality must be an emergent property.

suchire about 1 month ago

Is their WER graph just completely made up? It’s comically bad

four_fifths about 1 month ago

so if i understand this correctly: you want the speech recognition model to identify a vocabulary of specific terms that it wasn't trained on. instead of fine-tuning with training data that includes the new vocabulary, you input the full vocabulary at test time as a list of words and the model is able to generate transcripts that include words from the vocabulary.

seems like it could be very useful but it really comes down to the specifics.

you can prompt whisper with context. how does this compare?

how large of a vocabulary can it work with? if it's a few dozen words it's only gonna help for niche use cases. if it can handle 100s-1000s with good performance that could completely replace fine-tuning for many uses

FloatArtifact about 1 month ago

How does this keyword spotting compare with grammar or intent approaches for speech recognition commands with dictation?

How does keyword spotting handle complex phrases as commands?

htrp about 1 month ago

perhaps it's using openai advanced voice or another tts to create waveforms for comparison?

Jargonic: Industry-Tunable ASR Model
