
ESpeak-ng: speech synthesizer with more than one hundred languages and accents

256 points by nateb2022 about 1 year ago

24 comments

retrac about 1 year ago
Classic speech synthesis is interesting in that relatively simple approaches produce useful results. Formant synthesis takes relatively simple sounds and modifies them according to the various distinctions the human vocal tract can make. The basic vowel quality can be modelled as two sine waves that change over time. (Nothing more complex than what's needed to generate touch-tone dialing tones, basically.) Add a few types of buzzing or clicking noises before or after that for consonants, and you're halfway there. The technique predates computers; it's basically the same technique used by the original Voder [1], just under computer control.

Join that with algorithms which can translate English into phonetic tokens with relatively high accuracy, and you have speech synthesis. Make the dictionary big enough, add enough finesse and a few hundred rules about transitioning from phoneme to phoneme, and it produces relatively understandable speech.

Part of me feels that we are losing something, moving away from these classic approaches to AI. It used to be that, to teach a machine how to speak or translate, the designer of the system had to understand how language worked. Sometimes these models percolated back into broader thinking about language. Formant synthesis ended up being an inspiration for some ideas about how the brain recognizes phonemes. (Or maybe that worked in both directions.) It was thought that further advances would come from better theories about language, better abstractions. Deep learning has produced far better systems than the classic approach, but they also offer little in terms of understanding or simplifying.

[1] https://en.wikipedia.org/wiki/Voder
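The two-sine-wave idea above can be sketched directly. Below is a toy illustration only, not espeak-ng's actual code (real formant synthesizers excite resonant filters with a glottal pulse train rather than summing sines); the formant values are rough textbook approximations.

```python
import math

SAMPLE_RATE = 8000  # Hz; a low rate is fine for a toy vowel

def vowel(f1_start, f1_end, f2_start, f2_end, duration=0.3):
    """Sum two sine 'formants' whose frequencies glide over time.

    Two moving frequency peaks (F1 and F2) carry most of a vowel's
    identity; here each peak is crudely stood in for by a sine wave.
    """
    n = int(SAMPLE_RATE * duration)
    samples = []
    for i in range(n):
        t = i / SAMPLE_RATE
        frac = i / n  # 0 -> 1 across the vowel
        f1 = f1_start + (f1_end - f1_start) * frac
        f2 = f2_start + (f2_end - f2_start) * frac
        # F1 is usually stronger than F2, so weight it more heavily
        s = 0.6 * math.sin(2 * math.pi * f1 * t) \
            + 0.4 * math.sin(2 * math.pi * f2 * t)
        samples.append(s)
    return samples

# A rough /ai/ diphthong: F1 falls ~700->300 Hz, F2 rises ~1200->2300 Hz
wave = vowel(700, 300, 1200, 2300)
```

Writing `wave` out as 8 kHz PCM produces a recognisably vowel-like glide; consonants would be added as short bursts of noise before or after, as the comment describes.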
miki123211 about 1 year ago
Blind person here, ESpeak-ng is literally what I use on all of my devices for most of my day, every day.

I switched to it in early childhood, at a time when human-sounding synthesizers were notoriously slow and noticeably unresponsive, and I just haven't found anything better since. I used Vocalizer for a while, which is what iOS and macOS ship with, but then third-party synthesizer support was added and I switched right back.
mewse-hn about 1 year ago
No example output? Here's a YouTube video where he plays with this software:

https://www.youtube.com/watch?v=493xbPIQBSU
jryb about 1 year ago
When speaking Chinese, it says the tone number in English after each character. So "你好" is pronounced "ni three hao three". Am I using this wrong? I'm running `espeak-ng -v cmn "你好"`.

If this is just how it is, the "more than one hundred languages" claim is a bit suspect.
fisian about 1 year ago
I used it on Android and it seems to be one of very few apps that can replace the default Google text-to-speech engine.

However, I wasn't satisfied with the speech quality, so now I'm using RHVoice. RHVoice seems to produce more natural/human-sounding output to me.
nmstoker about 1 year ago
I always feel sympathy for the devs on this project, as they get so many issues raised by people who are largely lazy (since the solution is documented and/or they left out obvious detail) or plain wrong. I suspect it's a side effect of espeak-ng sitting behind various other tools, and in particular being critical to many screen readers; you can see why those individuals need help, even if they struggle to ask for it effectively.
vlovich123 about 1 year ago
Anyone know why the default voice is set to be so bad?
bArray about 1 year ago
I think it would be good if they provided some samples in the README. It would be good, for example, if their list of languages/accents could be sampled [1].

[1] https://github.com/espeak-ng/espeak-ng/blob/master/docs/languages.md

> eSpeak NG uses a "formant synthesis" method. This allows many languages to be provided in a small size. The speech is clear, and can be used at high speeds, but is not as natural or smooth as larger synthesizers which are based on human speech recordings. It also supports Klatt formant synthesis, and the ability to use MBROLA as backend speech synthesizer.

I've been using eSpeak for many years now. It's superb for resource-constrained systems.

I always wondered whether it would be possible to have a semi-context-aware, but not neural-network, approach.

I quite like the sound of Mimic 3, but it seems to be mostly abandoned: https://github.com/MycroftAI/mimic3
liotier about 1 year ago
I hoped "-ng" would stand for Nigeria, which would have been most fitting, considering Nigeria's linguistic diversity!
SoftTalker about 1 year ago
Can I get my map navigation prompts in the voice of Yoda please?

"At the roundabout, the second exit take."

"At your destination, arrived have you."
readmemyrights about 1 year ago
I'm quite surprised to find this on HN; synthesizers like espeak and Eloquence (IBM TTS) have fallen out of favor these days. I'm a blind person who uses espeak on all my devices except my MacBook, where unfortunately I can't install the speech synthesizer because it apparently only supports macOS 13 (installing the library itself works fine, though).

Most times I try to use modern "natural-sounding" voices, they take a while to initialize, and when you speed them up past a certain point the words mix together into meaningless noise, while Eloquence and espeak handle the same rate just great, well, for me at least.

I was thinking about this a few days back while I was trying out piper-tts [0]: how supposedly "more advanced" synthesizers powered by AI use up more RAM and CPU and disk space to deliver a voice which doesn't sound much better than something like RHVoice and gets things like inflection wrong. And that's the English voice; the voice for my language (Serbian) makes espeak sound human, and according to piper-tts it's "medium".

Funny story about synthesizers taking a while to initialize: there's a local IT company here that specializes in speech synthesis, and their voices take so long to load they had to say "<company> Mary is initializing..." whenever you start your screen reader or such. Was annoying, but in a fun way. Their newer Serbian voices also have this "feature" where they try to pronounce some English words they come upon properly. They also have another "feature" where they try to pronounce words right that were spelled without accent marks or such, and like with most of these kinds of "features" they combine badly and hilariously. For example, if you asked them to pronounce "topic" it would come out as "topich", which was fun while browsing forums or such.

[0] https://github.com/rhasspy/piper
sandbach about 1 year ago
Anyone interested in formants and speech synthesis should have a look at Praat [0], a marvellous piece of free software that can do all kinds of speech analysis, synthesis, and manipulation.

[0] https://www.fon.hum.uva.nl/praat/
deknos about 1 year ago
Is this better than the classic espeak which is available in open-source repositories?

I would be very glad if there were a truly open-source, locally hosted text-to-speech tool that produced good human-sounding speech, in both female and male voices, for German/English/French/Spanish/Russian/Arabic...
replete about 1 year ago
I listen to ebooks with TTS. On Android via F-Droid, the speech packs in this software are extremely robotic.

There aren't many options for de-Googled Android users. In the end I settled for Google Speech Services, disabled its network access, and used the default voice. GSS has its issues and voices don't download properly, but the default voice is tolerable in this situation.
droopyEyelids about 1 year ago
Another project falls victim to the tragic "ng" relative naming, leaving it without options for future generations.
spdustin about 1 year ago
Now I just want DECtalk ported to macOS. The original Stephen Hawking voice.

I have an Emic 2 board (driven over UART, so my ESP32 can send commands to it) and I use Home Assistant to send notifications to it. My family are science nerds like me, so when the voice of Stephen Hawking tells us there is someone at the door, it brings a lot of joy to us.
dheera about 1 year ago
Why is the quality of open source TTS so horribly, horribly, horribly behind the commercial neural ones? This is nowhere near the quality of Google, Microsoft, or Amazon TTS, yet for image generation and LLMs almost everything outside of OpenAI seems to be open-sourced.
devinprater about 1 year ago
ESpeak is pretty great, and now that Piper is using it, hopefully strange issues, like it saying "nineteen hundred eighty-four" for the year 1984, can be fixed.
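The 1984 complaint is a text-normalization problem rather than a synthesis one: English reads four-digit years in two pairs. A minimal sketch of that rule follows; the function name and thresholds are my own illustration, not Piper's or espeak-ng's actual normalizer, and a real one also needs context to decide whether a number is a year at all.

```python
def year_words(year: int) -> str:
    """Read a four-digit year in two pairs, the way English speakers do.

    e.g. year_words(1984) -> "nineteen eighty-four",
    not "nineteen hundred eighty-four".
    """
    ONES = ["zero", "one", "two", "three", "four", "five", "six",
            "seven", "eight", "nine", "ten", "eleven", "twelve",
            "thirteen", "fourteen", "fifteen", "sixteen", "seventeen",
            "eighteen", "nineteen"]
    TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty",
            "seventy", "eighty", "ninety"]

    def pair(n: int) -> str:
        # Spell out a number below 100 ("eighty-four", "nineteen", ...)
        if n < 20:
            return ONES[n]
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")

    hi, lo = divmod(year, 100)
    if lo == 0:
        return pair(hi) + " hundred"         # 1900 -> "nineteen hundred"
    if lo < 10:
        return pair(hi) + " oh " + ONES[lo]  # 1905 -> "nineteen oh five"
    return pair(hi) + " " + pair(lo)         # 1984 -> "nineteen eighty-four"
```

The interesting part is the branching: round hundreds and "oh" decades are exceptions to the plain two-pair reading, which is presumably why generic number spell-out gets years wrong.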
synergy20 about 1 year ago
Just used it a few days ago; the quality is honestly subpar.

I use Chrome's 'Read Aloud' extension, which is as natural as you can get.
iamleppert about 1 year ago
SORA AI should integrate this into their LLM.
zambonidriver about 1 year ago
Is it an LLM? What base model does it use?
manzanarama about 1 year ago
hugging face?
follower about 1 year ago
Based on my own recent experience [0] with espeak-ng, IMO the project is currently in a really tough situation [3]:

* the project seems to provide real value to a huge number of people who rely on it for reasons of accessibility (even more so for non-English languages); and,

* the project is a valuable trove of knowledge about multiple languages, collected & refined over multiple decades by both linguistic specialists and everyday speakers/readers; but...

* the project's code base is very much of "a different era", reflecting its mid-90s origins (on RISC OS, no less :) ) and a somewhat piecemeal development process over the following decades, due in part to the complex Venn diagram of skills, knowledge & familiarity required to make modifications to it.

Perhaps the prime example of the last point is that `espeak-ng` has a *hand-rolled XML parser*, which attempts to handle both valid & invalid SSML markup, and markup parsing is interleaved with internal language-related parsing in the code. And this is implemented in C.

[Aside: Due to this I would *strongly* caution against feeding "untrusted" input to espeak-ng in its current state, but unfortunately that's what most people who rely on espeak-ng for accessibility purposes inevitably do while browsing the web.]

[TL;DR: More detail/repros/observations on espeak-ng issues here:

* https://gitlab.com/RancidBacon/floss-various-contribs/-/blob/main/espeak/_issues.md

* https://gitlab.com/RancidBacon/floss-various-contribs/-/blob/main/espeak/mirror/notes-on-espeak-codebase.md

* https://gitlab.com/RancidBacon/notes_public/-/blob/main/notes/notes--text-to-speech.md#related-espeak-ng--ssml-issues

]

Contributors to the project are not unaware of the issues with the code base (which are exacerbated by the difficulty of even tracing the execution flow in order to understand how the library operates), nor that it would benefit from a significant refactoring effort.

However, as is typical with such projects which greatly benefit individual humans but don't offer an opportunity to generate significant corporate financial return, a lack of developers with sufficient skill/knowledge/time to devote to a significant refactoring means a "quick workaround" for a specific individual issue is often all that can be managed.

This is often exacerbated by outdated/unclear/missing documentation.

IMO there are two contribution approaches that could help the project moving forward while requiring the least amount of specialist knowledge/experience:

* Improve visibility into the code by adding logging/tracing to make it easier to see why a particular code path gets taken.

* Integrate an existing XML parser as a "pre-processor" to ensure that only valid/"sanitized"/cleaned-up XML is passed through to the SSML parsing code. This would increase robustness/safety and facilitate future removal of XML parsing-specific workarounds from the code base (leading to less tangled control flow) and potentially future removal/replacement of the entire bespoke XML parser.

Of course, the project is not short on ideas/suggestions for how to improve the situation but, rather, direct developer contributions, so... *shrug*

In light of this, last year when I was developing the personal project [0] which made use of a dependency that in turn used espeak-ng, I wanted to try to contribute something more tangible than just "ideas", so I began to write up & create reproductions for some of the issues I encountered while using espeak-ng and at least document the current behaviour/issues I found.

Unfortunately, while doing so I kept encountering *new* issues, each of which would lead to the start of yet another round of debugging to try to understand what was happening in the new case.

Perhaps inevitably this effort eventually stalled, due to a combination of available time, a need to prioritize income-generation opportunities, and the downsides of living with ADHD, before I was able to share the fruits of my research. (Unfortunately I seem to be way better at discovering & root-causing bugs than I am at writing up the results...)

However, I just now used the espeak-ng project being mentioned on HN as a catalyst to at least upload some of my notes/repros to a public repo (see links in the TL;DR section above) in the hope that maybe they will be useful to someone who might have the time/inclination to make a more direct code contribution to the project. (Or, you know, prompt someone to offer to fund my further efforts in this area... :) )

[0] A personal project to "port" my "Dialogue Tool for Larynx Text To Speech" project [1] to use the more recent Piper TTS [2] system, which makes use of espeak-ng for transforming text to phonemes.

[1] https://rancidbacon.itch.io/dialogue-tool-for-larynx-text-to-speech & https://gitlab.com/RancidBacon/larynx-dialogue/-/tree/feature-piper-port

[2] https://github.com/rhasspy/piper

[3] Very much no shade toward the project intended.
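The pre-processor idea above, letting a battle-tested XML parser decide validity before the SSML code ever sees the input, can be sketched in a few lines. This is an illustration of the suggestion only, not espeak-ng code; `sanitize_ssml` and its fallback behaviour (treat anything that fails to parse as literal text) are my assumptions.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def sanitize_ssml(markup: str) -> str:
    """Pass through well-formed SSML; otherwise fall back to escaped text.

    A hand-rolled parser that tries to recover from invalid markup must
    interleave error handling with everything else; rejecting bad input
    up front means the downstream synthesizer only ever sees valid XML.
    """
    try:
        ET.fromstring(markup)  # a real parser decides well-formedness
        return markup
    except ET.ParseError:
        # Invalid XML: speak the whole input as literal, escaped text.
        return "<speak>" + escape(markup) + "</speak>"

# Well-formed SSML passes through untouched:
ok = sanitize_ssml('<speak>Hello <break time="300ms"/> world</speak>')
# Broken or hostile markup degrades to safe plain text:
fixed = sanitize_ssml('2 < 3 & counting')
```

A fuller version would also whitelist the SSML element set, but even this minimal gate would let the C code drop its invalid-markup recovery paths, which is where much of the tangled control flow the comment describes comes from.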
webprofusion about 1 year ago
&quot;More than hundred&quot;