A GGUF version created by "isaiahbjork", compatible with LM Studio and the llama.cpp server, is available at: https://github.com/isaiahbjork/orpheus-tts-local/

To run the llama.cpp server:
    llama-server -m C:\orpheus-3b-0.1-ft-q4_k_m.gguf -c 8192 -ngl 28 --host 0.0.0.0 --port 1234 --cache-type-k q8_0 --cache-type-v q8_0 -fa --mlock
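Once the server is up, a quick sanity check against llama.cpp's native completion endpoint looks like this (the orpheus-tts-local script handles the actual prompt formatting and SNAC audio decoding; this only confirms the server responds, and the prompt is illustrative):

    curl http://localhost:1234/completion \
      -H "Content-Type: application/json" \
      -d '{"prompt": "hello", "n_predict": 16}'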
I'm always a bit skeptical of these demos, and indeed I think they didn't put much effort into getting the most out of ElevenLabs. In the demo, they used the Brian voice. For the first example, I can get this in ElevenLabs [1]. Stability was set to 20 here and all the other settings were at their defaults. Having stability at the default of 50 sounds more like what is in the demo on the site [2]. (A rough API sketch for reproducing this follows after the links.)

Having said that, I'm fully in favor of open source and am a big proponent of open source models like this. ElevenLabs in particular has the highest quality (I tested a lot of models for a tool I'm building [3]), but it is also 400 times more expensive than the rest: you easily pay multiple dollars per minute of text-to-speech generation. For people interested, the best audio quality I could get so far is [4]. Someone told me he wouldn't be able to tell that the voice was not real.

[1]: https://elevenlabs.io/app/share/3NyQKlL6EeOHpIDtL5pA

[2]: https://elevenlabs.io/app/share/TUx4yluXtV3pFTHr7Cl7

[3]: https://github.com/transformrs/trv

[4]: https://youtu.be/Ni-dKlCpnb4
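For reference, the stability slider in the ElevenLabs UI maps to a 0-1 value in their API, so the stability-20 setting above corresponds to roughly this request (the voice ID is a placeholder and the request shape is from memory of the API; check their current docs):

    curl -X POST "https://api.elevenlabs.io/v1/text-to-speech/VOICE_ID" \
      -H "xi-api-key: $ELEVENLABS_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{"text": "Your text here", "voice_settings": {"stability": 0.2, "similarity_boost": 0.75}}' \
      --output out.mp3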
I'm looking forward to having an end-to-end "docker compose up" solution for self-hosted ChatGPT-style conversational voice mode. This is probably possible today, with enough glue code, but I haven't seen a neatly wrapped solution yet on par with Ollama's.
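The rough shape of such a stack would presumably be something like the following sketch (every image name besides ollama/ollama is a hypothetical placeholder, not an existing project):

    # docker-compose.yml sketch: mic -> STT -> LLM -> TTS, with a web UI as glue
    cat > docker-compose.yml <<'EOF'
    services:
      stt:
        image: whisper-server        # placeholder: speech-to-text
      llm:
        image: ollama/ollama         # real image: LLM backend
        ports: ["11434:11434"]
      tts:
        image: orpheus-tts-server    # placeholder: text-to-speech
        ports: ["1234:1234"]
      ui:
        image: voice-chat-ui         # placeholder: mic capture + orchestration
        ports: ["8080:8080"]
    EOF
    docker compose up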
Impressive for a small model.

Two questions / thoughts:

1. I stumbled for a while looking for the license on your website before finding the Apache 2.0 mark on the Hugging Face model. That's big! Advertising that on your website and the GitHub repo would be nice. Though what's the business model?

2. Given the Llama 3 backbone, what's the lift to make this runnable in other languages and inference frameworks? (Specifically asking about MLX, but also llama.cpp, Ollama, etc.)
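On the second point, since the backbone is plain Llama 3 architecture, the text-token side should convert with standard tooling; something like this ought to work for MLX (model path assumed from the GGUF filename; the SNAC audio decoder would still need separate handling):

    pip install mlx-lm
    mlx_lm.convert --hf-path canopylabs/orpheus-3b-0.1-ft -q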
It sounds like reading from a script, or like an influencer. In that sense it's quite good: I could buy that this is human.

However, it's not a very *good* reading of the script, in human terms. It feels even more forced and phony than the aforementioned influencers.
Impressive for a small model, though individual phrases sound like they were recorded separately: subtle differences in sound quality and no natural transitions between words keep it from sounding realistic. These should be fixable as we figure out how to fine-tune on (and thus normalize) recording characteristics.
A couple things I noticed:

- In the prompt "SO serious" it pronounces each letter as "ess oh" instead of emphasizing the word "so".

- There are no breathing sounds or natural breathing-based pauses.

Choosing which words in a sentence to emphasize can completely change its meaning. This model doesn't appear to be able to do that.

Still, huge progress over where we were just a couple of years ago.
What is the difference between small and large models in the case of TTS? For language models I understand that the thinking quality differs, but what does scale buy you for TTS?
Has anyone used small models in a production use case?