A GGUF version created by "isaiahbjork", compatible with LM Studio and the llama.cpp server, is available at: https://github.com/isaiahbjork/orpheus-tts-local/

To run the llama.cpp server:
    llama-server -m C:\orpheus-3b-0.1-ft-q4_k_m.gguf -c 8192 -ngl 28 --host 0.0.0.0 --port 1234 --cache-type-k q8_0 --cache-type-v q8_0 -fa --mlock
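Once the server is up, a quick sanity check against llama.cpp's native completion endpoint looks like this (the orpheus-tts-local script handles the actual prompt formatting and SNAC audio decoding; this only confirms the server responds, and the prompt is illustrative):

    curl http://localhost:1234/completion \
      -H "Content-Type: application/json" \
      -d '{"prompt": "hello", "n_predict": 16}'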
I'm always a bit skeptical of these demos, and indeed I think they didn't put much effort into getting the most out of ElevenLabs. In the demo, they used the Brian voice. For the first example, I can get this in ElevenLabs [1]. Stability was set to 20 here and all the other settings were at their defaults. Having stability at the default of 50 sounds more like what is in the demo on the site [2]. (A rough API sketch for reproducing this follows after the links.)

Having said that, I'm fully in favor of open source and am a big proponent of open source models like this. ElevenLabs in particular has the highest quality (I tested a lot of models for a tool I'm building [3]), but it is also 400 times more expensive than the rest: you easily pay multiple dollars per minute of text-to-speech generation. For people interested, the best audio quality I could get so far is [4]. Someone told me he wouldn't be able to tell that the voice was not real.

[1]: https://elevenlabs.io/app/share/3NyQKlL6EeOHpIDtL5pA

[2]: https://elevenlabs.io/app/share/TUx4yluXtV3pFTHr7Cl7

[3]: https://github.com/transformrs/trv

[4]: https://youtu.be/Ni-dKlCpnb4
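For reference, the stability slider in the ElevenLabs UI maps to a 0-1 value in their API, so the stability-20 setting above corresponds to roughly this request (the voice ID is a placeholder and the request shape is from memory of the API; check their current docs):

    curl -X POST "https://api.elevenlabs.io/v1/text-to-speech/VOICE_ID" \
      -H "xi-api-key: $ELEVENLABS_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{"text": "Your text here", "voice_settings": {"stability": 0.2, "similarity_boost": 0.75}}' \
      --output out.mp3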
I'm looking forward to having an end-to-end "docker compose up" solution for self-hosted ChatGPT-style conversational voice mode. This is probably possible today, with enough glue code, but I haven't seen a neatly wrapped solution yet on par with Ollama's.
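The rough shape of such a stack would presumably be something like the following sketch (every image name besides ollama/ollama is a hypothetical placeholder, not an existing project):

    # docker-compose.yml sketch: mic -> STT -> LLM -> TTS, with a web UI as glue
    cat > docker-compose.yml <<'EOF'
    services:
      stt:
        image: whisper-server        # placeholder: speech-to-text
      llm:
        image: ollama/ollama         # real image: LLM backend
        ports: ["11434:11434"]
      tts:
        image: orpheus-tts-server    # placeholder: text-to-speech
        ports: ["1234:1234"]
      ui:
        image: voice-chat-ui         # placeholder: mic capture + orchestration
        ports: ["8080:8080"]
    EOF
    docker compose up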
Impressive for a small model.

Two questions / thoughts:

1. I stumbled for a while looking for the license on your website before finding the Apache 2.0 mark on the Hugging Face model. That's big! Advertising that on your website and the GitHub repo would be nice. Though what's the business model?

2. Given the Llama 3 backbone, what's the lift to make this runnable in other languages and inference frameworks? (Specifically asking about MLX, but also llama.cpp, Ollama, etc.)
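On the second point, since the backbone is plain Llama 3 architecture, the text-token side should convert with standard tooling; something like this ought to work for MLX (model path assumed from the GGUF filename; the SNAC audio decoder would still need separate handling):

    pip install mlx-lm
    mlx_lm.convert --hf-path canopylabs/orpheus-3b-0.1-ft -q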
It sounds like reading from a script, or like an influencer. In that sense it's quite good: I could buy that this is human.

However, it's not a very *good* reading of the script, in human terms. It feels even more forced and phony than the aforementioned influencers.
Impressive for a small model, though individual phrases sound like they were recorded separately: subtle differences in sound quality and no natural transitions between words keep it from sounding realistic. These should be fixable as we figure out how to fine-tune on (and thus normalize) recording characteristics.
A couple things I noticed:

- In the prompt "SO serious" it pronounces each letter as "ess oh" instead of emphasizing the word "so".

- There are no breathing sounds or natural breathing-based pauses.

Choosing which words in a sentence to emphasize can completely change its meaning. This model doesn't appear to be able to do that.

Still, huge progress over where we were just a couple of years ago.
What is the difference between small and large models in the case of TTS? For language models I understand that the thinking quality differs, but what does scale buy you for TTS?
Has anyone used small models in a production use case?