Has anyone had any luck with a free, offline, open-source, real-time speech-to-speech translation app on under-powered devices (e.g., older smartphones)?<p>* <a href="https://github.com/ictnlp/StreamSpeech">https://github.com/ictnlp/StreamSpeech</a><p>* <a href="https://github.com/k2-fsa/sherpa-onnx">https://github.com/k2-fsa/sherpa-onnx</a><p>* <a href="https://github.com/openai/whisper">https://github.com/openai/whisper</a><p>I'm looking for a simple app that can listen for English, translate it into Korean (and other languages), then perform speech synthesis on the translation. Basically, a Babel fish that doesn't stick in the ear. Although real-time would be great, a max 5-second delay is manageable.<p>RTranslator is awkward (I couldn't get it to perform speech-to-speech using a single phone). 3PO sprouts errors like dandelions and requires an online connection.<p>Any suggestions?
It's not exactly what OP wants out-of-the-box, but if anyone is considering building one, I suggest taking a look at this.¹ It is really easy to tinker with and can run either on-device or in a client-server model.
It has the required speech-to-text and text-to-speech endpoints, with multiple options for each built in. If you can get the LLM assistant stage of the pipeline to perform translation to a degree you're comfortable with, this could be a solution.<p>¹ <a href="https://github.com/huggingface/speech-to-speech">https://github.com/huggingface/speech-to-speech</a>
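To make the pipeline shape concrete, here's a minimal sketch of the STT → LLM-translate → TTS chain. The three stage functions are stubs standing in for real models; none of the names here come from the huggingface/speech-to-speech API, and the canned strings are placeholders for actual model output:

```python
# Minimal sketch of a speech-to-speech translation pipeline.
# Each stage is a stub; a real build would swap in an STT model
# (e.g. Whisper), a translation LLM, and a TTS model.

def speech_to_text(audio: bytes) -> str:
    """Stub STT stage: pretend we transcribed the input audio."""
    return "hello, how are you?"

def translate(text: str, target_lang: str) -> str:
    """Stub translation stage: a real one would call an LLM or MT model."""
    canned = {"ko": "안녕하세요, 잘 지내세요?"}  # hard-coded example output
    return canned.get(target_lang, text)

def text_to_speech(text: str) -> bytes:
    """Stub TTS stage: return placeholder 'audio' bytes."""
    return text.encode("utf-8")

def speech_to_speech(audio: bytes, target_lang: str = "ko") -> bytes:
    """Chain the three stages; this is the entire architecture."""
    text = speech_to_text(audio)
    translated = translate(text, target_lang)
    return text_to_speech(translated)

out = speech_to_speech(b"\x00\x01", target_lang="ko")
print(out.decode("utf-8"))  # → 안녕하세요, 잘 지내세요?
```

For real-time use, each stage would have to run on streaming audio chunks rather than whole utterances, which is exactly what makes low-latency translation hard on weak hardware.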
> free<p>> offline<p>> real-time<p>> speech-to-speech translation app<p>> on under-powered devices<p>I genuinely don't think the technology is there.<p>I can't even find a half-good real-time "speech to second language text" tool, not even with "paid/online/on powerful device" options.
It’s not free, but I’ve had some success using ChatGPT’s Advanced Voice mode for sequential interpreting between English and Japanese. I found I had to first explain the situation to the model and tell it what I wanted it to do. For example: “I am going to have a conversation with my friend Taro. I speak English, and he speaks Japanese. Translate what I say into Japanese and what he says into English. Only translate what we say. Do not add any explanations or commentary.”<p>We had to be careful not to talk over each other or the model, and the interpreting didn’t work well in a noisy environment. But once we got things set up and had practiced a bit, the conversations went smoothly. The accuracy of the translations was very good.<p>Such interpreting should get even better once the models have live visual input so that they can “see” the speakers’ gestures and facial expressions. Hosting on local devices, for less latency, will help as well.<p>In business and government contexts, professional human interpreters are usually provided with background information in advance so that they understand what people are talking about and know how to translate specialized vocabulary. LLMs will need similar preparation for interpreting in serious contexts.
It is impossible to interpret accurately with a max 5-second delay. The structure of some languages requires the interpreter to occasionally wait for the end of a statement before interpretation is possible. Korean and Japanese, for example, typically put the verb at the end of the sentence.
‘Meanwhile, the poor Babel fish, by effectively removing all barriers to communication between different races and cultures, has caused more and bloodier wars than anything else in the history of creation.’
It only seems to cover half of what you're asking for... I starred this the other day and haven't gotten around to trying it out:<p><a href="https://github.com/usefulsensors/moonshine">https://github.com/usefulsensors/moonshine</a>
A friend recommends SayHi, which does near-realtime speech-to-speech translation (<a href="https://play.google.com/store/apps/details?id=com.sayhi.app&hl=en-US">https://play.google.com/store/apps/details?id=com.sayhi.app&...</a>). Unfortunately it's not offline though.
I've developed a macOS app, BeMyEars, which does real-time speech-to-text translation. It first transcribes and then translates between languages. All of this works on-device.
If you only want a smartphone app, you can also try YPlayer; it also works on-device.
Both can be downloaded from the App Store.
I've been looking for something like this (not for Korean, though), and I'd even be happy to pay - though I'd prefer to pay by usage rather than a standing subscription fee.
So far, no luck, but watching this thread!
>Although real-time would be great, a max 5-second delay is manageable.<p>Humans can't even do this in immediate real time, so what makes you think a computer can? Some of the best real-time translators, those who work at the UN or for governments, still need a short delay to correctly interpret and translate for accuracy and context. Working in true real time actually impedes the translator, especially between languages that have different grammatical structures. Even in languages that are effectively congruent (think Latin derivatives), this is hard, if not outright impossible, to do in real time.<p>I worked in the field of language education and computer science. The tech you're hoping would be free and able to run on older devices is easily a decade away at the very best. As for it being offline: yeah, no. Not going to happen, because an accurate real-time translation model covering even the 20 most common languages on earth would probably be a few terabytes at the very least.
Is this possible to do smoothly with languages that have an extremely different grammar to English? If you need to wait until the end of the sentence to get the verb, for instance, then that could take more than five seconds, particularly if someone is speaking off the cuff with hesitations and pauses (Or you could translate clauses as they come in, but in some situations you'll end up with a garbled translation because the end of the sentence provides information that affects your earlier translation choices).<p>AFAIK, humans who do simultaneous interpretation are provided with at least an outline, if not full script, of what the speaker intends to say, so they can predict what's coming next.
> * <a href="https://github.com/openai/whisper">https://github.com/openai/whisper</a><p>I would be very concerned about any LLM being used for "transcription", since it may inject things that nobody said, as in this recent item:<p><a href="https://news.ycombinator.com/item?id=41968191">https://news.ycombinator.com/item?id=41968191</a>
This phone has been around for ages, and does the job. It's well weapon!
<a href="https://www.neatoshop.com/product/The-Wasp-T12-Speechtool" rel="nofollow">https://www.neatoshop.com/product/The-Wasp-T12-Speechtool</a>