科技回声

A tech news platform built with Next.js, providing global tech news and discussion.

PlayAI's new Dialog model achieves 3:1 preference in human evals

95 points by legofan94 4 months ago

15 comments

tkgally 4 months ago

I tried it with a paragraph of English taken from a formal speech, and it sounded quite good. I would not have been able to distinguish it from a skilled human narrator.

But then I tried a paragraph of Japanese text, also from a formal speech, with the language set to Japanese and the narrator set to Yumiko Narrative. The result was a weird mixture of Korean, Chinese, and Japanese readings for the kanji and kana, all with a Korean accent, and numbers read in English with an American accent. I regenerated the output twice, and the results were similar. Completely unusable.

I tried the same paragraph on ElevenLabs. The output was all in Japanese and had natural intonation, but there were two or three misreadings per sentence that would render it unusable for any practical purpose. Examples: 私の生の声 was read as watashi no koe no koe when it should have been watashi no nama no koe. 公開形式 was read as kōkai keiji instead of kōkai keishiki. Neither kanji misreading would be correct in any context. Even weirder, the year 2020 was read as 2021. Such misreadings would confuse and mislead any listeners.

I know that Japanese text-to-speech is especially challenging because kanji can often be read many different ways depending on the context, the specific referent, and other factors. But based on these tests, neither PlayAI nor ElevenLabs should be offering Japanese TTS services commercially yet.
quickgist 4 months ago

For some reason, most of these (and other narration AIs) sound like someone reading off a teleprompter rather than like natural speaking voices. I'm not sure exactly what it is, but I'm left feeling that the speaker isn't really sure what the next words are, and the stresses between the words are all over the place. It's like the emphasis across a sentence doesn't really match how humans sound.
masto 4 months ago

I love the tech. I hate that it gets used to fill YouTube with zero-effort slop. I don't have a solution.
wilg 4 months ago

I wanted to use this to generate temp voices for a video game project. Not realtime; just creating the audio at build time, basically. However, the pricing model is not conducive to that: you cannot pay per use, and on top of that none of the lower-cost plans support more than 10 requests per minute, so it's difficult to use for batch operations. $299/mo seemed steep for my use case of infrequent bursts, and they couldn't help me with a custom plan, so I ended up just using Azure AI Text-to-Speech. (Which is also much faster to render.)
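For a batch job constrained by a 10-requests-per-minute cap like the one described above, a client-side throttle is the usual workaround. The sketch below is generic (PlayAI's actual API is not shown in this thread, so the class and parameter names are this example's own); the clock and sleep functions are injectable so the limiter can be tested without real waiting.

```python
import time
from collections import deque

class MinuteRateLimiter:
    """Allow at most `limit` requests in any rolling 60-second window."""

    def __init__(self, limit=10, clock=time.monotonic, sleep=time.sleep):
        self.limit = limit
        self.clock = clock    # injectable for testing
        self.sleep = sleep    # injectable for testing
        self.sent = deque()   # timestamps of requests inside the window

    def acquire(self):
        """Block (via self.sleep) until a request may be sent, then record it."""
        now = self.clock()
        # Evict timestamps that have aged out of the 60-second window.
        while self.sent and now - self.sent[0] >= 60:
            self.sent.popleft()
        if len(self.sent) >= self.limit:
            # Wait until the oldest request leaves the window.
            self.sleep(60 - (now - self.sent[0]))
            now = self.clock()
            while self.sent and now - self.sent[0] >= 60:
                self.sent.popleft()
        self.sent.append(now)
```

A batch driver would call `limiter.acquire()` before each synthesis request; with `limit=10` the loop never exceeds the plan's stated per-minute ceiling, at the cost of sleeping between bursts.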
thot_experiment 4 months ago

I've been messing with the open-source side of audio generation, and expressiveness still takes work, but it's getting there. Roughly summarized, my findings are:

- Zero-shot voice cloning isn't there yet.

- gpt-sovits is the best at non-word vocalizations, but the overall quality is bad when just using zero shot; finetuning helps.

- F5 and fish-speech are both good as well.

- xtts for me has had the best stability (I can rely on it not to hallucinate too much; the others I have to cherry-pick more to get good outputs).

- Finetuning an xtts model for a few epochs on a particular speaker does wonders. If you have a good utterance library with emotions, conditioning a finetuned xtts model with that speaker expressing a particular emotion yields something very usable.

- You can do speech-to-speech on the final output of xtts to get to something that (anecdotally) fools most of the people I've tried it on.

- Non-finetuned XTTS zero shot -> seed-vc generates something that's okay also, especially if your conditioning audio is really solid.

- Really creepy voice clones of arbitrary people, indistinguishable at a casual listen, are possible with as little as 30 minutes of speech. The resultant quality captures mannerisms and pacing eerily well, and it's easy to get clean input data from YouTube videos/podcasts using de-noising/vocal-extraction neural nets.

TL;DR: use XTTS and pipe it into seed-vc. The end-to-end on that pipeline on my machine is something like 2x realtime and generates very highly controllable, natural-sounding voices; you have to manually condition emotive speech.
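The XTTS -> seed-vc recipe above can be sketched as a small batch driver. In the sketch below, `chunk_text` is a generic helper (XTTS tends to be less stable on long inputs); the `TTS.api` call follows Coqui TTS's documented zero-shot XTTS usage, but the seed-vc command line is a placeholder assumption, not the project's real CLI, so check its README before using it.

```python
import re
import subprocess

def chunk_text(text, max_chars=250):
    """Split text at sentence boundaries so each chunk stays under max_chars.
    A single sentence longer than max_chars is kept whole."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) + 1 > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks

def synthesize_chunk(text, speaker_wav, out_path, language="en"):
    """Zero-shot XTTS synthesis via the Coqui `TTS` package (heavy:
    downloads the model on first use, so the import is kept lazy)."""
    from TTS.api import TTS
    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
    tts.tts_to_file(text=text, speaker_wav=speaker_wav,
                    language=language, file_path=out_path)

def convert_voice(src_wav, target_wav, out_dir):
    """Pipe an XTTS output through seed-vc. Script name and flags here are
    placeholders; consult the seed-vc repository for the actual interface."""
    subprocess.run(["python", "inference.py", "--source", src_wav,
                    "--target", target_wav, "--output", out_dir], check=True)
```

Driving it is then a loop: chunk the script, synthesize each chunk with a conditioning `speaker_wav`, and run each result through `convert_voice` with clean target audio, which is where the comment says most of the perceived quality comes from.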
jpkw 4 months ago

Testing with "Banana... Banana???? Banana!!!!!" yields interesting results each time, and none so far the way a human would read it.
adriand 4 months ago
Do these services restrict the content that their AIs give voice to? If so, what are the typical restrictions? Like do they seek to prevent their tech being used for scamming, erotica, hate speech, etc? Or is it pretty much anything goes?
jesperwe 4 months ago

Maybe launched a bit too quickly? You can select "Swedish" for the non-Swedish voices, but the results are very poor, far from usable. And there is no Swedish voice. So that language-support claim is made a bit too soon, I would say.

Also, I found no way to filter/sort the voice-selection modal by language, so I have to visually search the entire list.
pj_mukh 4 months ago

To the founders: I would love to share my audio files with my team before we commit to a payment plan. Is there any way to share audio files I've generated?

Our whole team is on ElevenLabs and a switch is significant work, but I think the results are worth it! Super awesome work!
hsbauauvhabzb 4 months ago
I have a 3:1 preference for services that don’t unnecessarily require a mobile number to sign up.
WhitneyLand 4 months ago

Interesting that OpenAI is left out of the comparison. I know they don't do cloning, but there's definitely overlap for narration.

I wonder which comparison they hope to avoid: quality or price?
vessenes 4 months ago
So, this is really impressive. Expressivity and pacing are wayyy better. Eleven Labs has been tops for some time, but the difference is pretty remarkable!
SeanAnderson 4 months ago

Wow! The preview is amazing! I would've 100% assumed those were human narrations if I hadn't been given leading context.
chzp94 4 months ago
Awesome!
visarga 4 months ago

I signed up, and right from the start I have -3 words left. LOL