Common Voice

323 points by oblib over 1 year ago

15 comments

sxp over 1 year ago

FF's TTS is an important project for anyone who wants a trivial-to-use text-to-speech system. It's built into the browser, so you can just run
<pre><code>const wss = window.speechSynthesis;
// Fetch the voice list once; it can be empty until the browser has loaded its voices.
const voices = wss.getVoices();
for (let i = 0; i < voices.length; ++i) {
  const str = `Voice ${i} is ${voices[i].name}`;
  const s = new SpeechSynthesisUtterance(str);
  s.voice = voices[i]; // speak the sentence with the voice it describes
  wss.speak(s);
  console.log(str);
}</code></pre>
in the console to get various TTS examples. For some browsers, this can be done offline, while others use a cloud-based TTS system.
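A note on the snippet above: in some browsers `getVoices()` returns an empty list until the `voiceschanged` event has fired, so the loop may print nothing on the first try. The following is a minimal sketch of one way to wait for the list, assuming a hypothetical helper named `whenVoicesReady` (not part of the Web Speech API). It also logs each voice's `localService` flag, which hints at whether a voice runs offline or through a remote service.

```javascript
// Sketch only: `whenVoicesReady` is a hypothetical helper, not a browser API.
// It calls `callback(voices)` once the synthesizer reports a non-empty voice list.
function whenVoicesReady(synth, callback) {
  const voices = synth.getVoices();
  if (voices.length > 0) {
    callback(voices); // list already populated
    return;
  }
  // Some browsers load voices asynchronously and then fire `voiceschanged`.
  const onChange = () => {
    const loaded = synth.getVoices();
    if (loaded.length === 0) return; // still empty, keep waiting
    synth.removeEventListener("voiceschanged", onChange);
    callback(loaded);
  };
  synth.addEventListener("voiceschanged", onChange);
}

// Browser usage; skipped when there is no `window` (for example under Node).
if (typeof window !== "undefined" && window.speechSynthesis) {
  whenVoicesReady(window.speechSynthesis, (voices) => {
    voices.forEach((v, i) => {
      const where = v.localService ? "local" : "remote";
      console.log(`Voice ${i} is ${v.name} (${v.lang}, ${where})`);
    });
  });
}
```

Passing the synthesizer in as a parameter keeps the helper testable with a stub object outside a browser.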

CoBE10 over 1 year ago

I'd like to give a shout-out to Common Voice Android: <a href="https://github.com/Sav22999/common-voice-android">https://github.com/Sav22999/common-voice-android</a>

It's a handy app for those interested in contributing to the project. You can record voices for the languages you speak and validate other users' contributions. I used to be a frequent contributor about two years ago, and this app had a much more user-friendly design compared to the official website version.

Additionally, check out the official Common Voice Matrix channel: <a href="https://chat.mozilla.org/#/room/#common-voice:mozilla.org" rel="nofollow noreferrer">https://chat.mozilla.org/#/room/#common-voice:mozilla.org</a>

pimlottc over 1 year ago

With recent events in AI and deepfake technology, I would need to see some assurances before I agreed to “donate my voice” to something like this. It seems like the project is for voice recognition, not generation, but it’s not immediately clear.

spadufed over 1 year ago

Crowdsourced datasets like this and the ones produced by the OpenAssistant project could easily become the ONLY way to build foundational models if the courts decide that what OpenAI and co are doing is not fair use. I don't think I would call this scenario unlikely, either.

imjonse over 1 year ago

While this dataset is orders of magnitude smaller than what recent speech models like Whisper and Seamless were trained on, and while it is meant for supervised as opposed to self-supervised learning, where data is more abundant, it can still be useful for fine-tuning an existing model to improve its score on a specific language.

user_7832 over 1 year ago

Didn't Mozilla also have a related speech-to-text software that got canned/moved to a different company? Or was that different?

nojvek over 1 year ago

Amazing.

One of my hopes for OpenAI was that they were going to be truly open.

Open datasets, open code, open models, open evaluation.

But it is now a Microsoft puppet running on corporate profit goals.

This and HuggingFace are great to see. I hope HuggingFace isn’t acquired by Microsoft like GitHub was.

jeena over 1 year ago

Why then is the text2speech in reader mode (which is otherwise excellent) in Firefox on Linux so extremely bad? Much worse than Stephen Hawking's text2speech.

dang over 1 year ago

Related. Others?

Mozilla Common Voice Adds 16 New Languages and 4,600 New Hours of Speech - <a href="https://news.ycombinator.com/item?id=28073016">https://news.ycombinator.com/item?id=28073016</a> - Aug 2021 (170 comments)
Firefox Voice - <a href="https://news.ycombinator.com/item?id=24096082">https://news.ycombinator.com/item?id=24096082</a> - Aug 2020 (154 comments)
Firefox Voice: Browse the web with your voice - <a href="https://news.ycombinator.com/item?id=23902560">https://news.ycombinator.com/item?id=23902560</a> - July 2020 (2 comments)
Mozilla Common Voice Dataset: More data, more languages - <a href="https://news.ycombinator.com/item?id=23695377">https://news.ycombinator.com/item?id=23695377</a> - June 2020 (41 comments)
The Common Voice Project by Mozilla reached its first goal: 1k hours in englisch - <a href="https://news.ycombinator.com/item?id=23051756">https://news.ycombinator.com/item?id=23051756</a> - May 2020 (1 comment)
Common Voice: A Massively-Multilingual Speech Corpus - <a href="https://news.ycombinator.com/item?id=21887693">https://news.ycombinator.com/item?id=21887693</a> - Dec 2019 (9 comments)
Common Voice – Mozilla's initiative to help teach machines how real people speak - <a href="https://news.ycombinator.com/item?id=21268579">https://news.ycombinator.com/item?id=21268579</a> - Oct 2019 (49 comments)
Mozilla releases the largest to-date public domain transcribed voice dataset - <a href="https://news.ycombinator.com/item?id=19270646">https://news.ycombinator.com/item?id=19270646</a> - Feb 2019 (61 comments)
Mozilla Overhauls Speech-To-Text Contribution Interface - <a href="https://news.ycombinator.com/item?id=17436958">https://news.ycombinator.com/item?id=17436958</a> - July 2018 (42 comments)
Initial Release of Mozilla’s Open Source Speech Recognition Model and Voice Data - <a href="https://news.ycombinator.com/item?id=15808124">https://news.ycombinator.com/item?id=15808124</a> - Nov 2017 (88 comments)
Project Common Voice - <a href="https://news.ycombinator.com/item?id=14794654">https://news.ycombinator.com/item?id=14794654</a> - July 2017 (57 comments)
Mozilla: Project Common Voice - <a href="https://news.ycombinator.com/item?id=14786881">https://news.ycombinator.com/item?id=14786881</a> - July 2017 (1 comment)

arcastroe over 1 year ago

How many people here have a different "reading voice" vs their normal conversational voice? Can conversational models be trained even if much of the training data sounds "scripted"?

moron4hire over 1 year ago

> Voice datasets also underrepresent: non-English speakers, people of colour, disabled people, women and LGBTQIA+ people.

How does being gay change your voice?

skrebbel over 1 year ago

I’m sad that this is English only. I’d love to contribute lots of voice recordings for a Dutch TTS from a nonprofit org like Mozilla.

vidarh over 1 year ago

I submitted a request for Norwegian Bokmål, and realised a complication which I'm sure must affect other languages too:

Norway has two separate official languages. They are unusually close - one is relatively close to Danish, and the other started as a collection of dialects, but technically they are written languages, especially Bokmål, which basically means "book language".

I'm unusual in that I speak close to "pure" Bokmål. Thanks to expectations at school etc., a lot of speakers who write Bokmål will adjust or tone down their dialect if asked to read a text that is written in grammatically and orthographically correct Bokmål, but will otherwise speak in a manner that can deviate fairly significantly from the written language.

As such, depending on whether your goal is text to speech or speech recognition, the pronunciation you will need is very different.

E.g. people I know who write Bokmål might say something like "hva erredu ser på a?" ("what are you looking at?") with hardly any gaps between words, while I would stick close to the written "hva er det du ser på?" with clear gaps. In recognition you need to handle both (and many other variations), while for generation you'd at least by default usually want the latter unless there are indications the text is written in dialect.

It strikes me you'd really want people to write more detail about what it is they are speaking and/or let people tag/label data with additional info about accents. Not just for this, but for other multi-lingual speakers as well. E.g. it'd be helpful to have many foreign accents in the English (and other languages) dataset for recognition, but as much as I want speech recognition to understand me, I'm not particularly interested in teaching it to speak English with a strong Norwegian accent.

That is less of an issue than the dialects in some languages that can involve much more than just speaking the same words differently.

To take another example, "Jeg åpnet døren og gikk ut i solen" and "Jeg åpna døra og gikk ut i sola" are both valid Bokmål. Depending on context a reader may stick strictly to the text or swap åpnet<->åpna, døren<->døra, solen<->sola, and every permutation is valid. Which exact set you use differs, and some speakers will write one but use the other when speaking. E.g. I would say åpna, døra, sola, but write åpnet, døren, solen. The latter is more formal and/or old-fashioned in some parts of the country, but the perception of that also varies by region. And this totally leaves out all the dialect variations used by people who'd say their language is Bokmål, and would be recognized as such by Norwegian speakers, but who use variants of words or conjugations that aren't technically recognized as valid Bokmål.

The former (åpna, døra, sola) is more "modern" (several of the forms are only valid Bokmål as a result of successive language reforms), more common in the Eastern part of Norway outside of the posher parts of Oslo and other wealthy regions, and (weirdly) more common among 1970s radical left-wing academics (especially people involved with the Maoist Workers Communist Party/AKP-ML) as an affectation/sociolect, with each of these groups also deviating in other aspects...

If you want to maximize the utility of a dataset like this, you really would want to let each speaker at least assign a lot of tags/labels to their profile; even if you don't want to deal with the hornet's nest of trying to figure out all the distinctions, even unstructured labels would be a start, and ideally allowing people to tag individual recordings as well, because there are a lot more variations than just "language" and "accent" here.
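The tagging idea in this comment can be sketched as a small data structure: unstructured labels on a speaker's profile, extra labels on individual recordings, and a filter that picks different subsets for recognition and for generation. Everything here is an illustrative assumption; the field names, label strings, and the helpers `effectiveLabels` and `selectRecordings` are hypothetical and do not come from Common Voice.

```javascript
// Hypothetical record shapes; none of these field names come from Common Voice.
const speakers = {
  s1: { language: "nb", labels: ["eastern-norway", "close-to-written"] },
  s2: { language: "nb", labels: ["dialect"] },
};

const recordings = [
  { id: "r1", speaker: "s1", labels: ["read-aloud"] },
  { id: "r2", speaker: "s2", labels: ["conversational"] },
  // A dialect speaker who toned the dialect down for this one reading.
  { id: "r3", speaker: "s2", labels: ["read-aloud", "close-to-written"] },
];

// Effective labels = profile-level labels plus per-recording labels, deduplicated.
function effectiveLabels(recording, speakerTable) {
  const profile = speakerTable[recording.speaker] ?? { labels: [] };
  return [...new Set([...profile.labels, ...recording.labels])];
}

// Keep recordings that carry every label in `required` and none in `excluded`.
function selectRecordings(all, speakerTable, { required = [], excluded = [] } = {}) {
  return all.filter((rec) => {
    const labels = effectiveLabels(rec, speakerTable);
    return (
      required.every((label) => labels.includes(label)) &&
      !excluded.some((label) => labels.includes(label))
    );
  });
}

// Recognition wants every variation; a default TTS voice might only want
// speech that stays close to the written form.
const forRecognition = selectRecordings(recordings, speakers);
const forGeneration = selectRecordings(recordings, speakers, {
  required: ["close-to-written"],
});

console.log(forRecognition.map((r) => r.id)); // [ 'r1', 'r2', 'r3' ]
console.log(forGeneration.map((r) => r.id)); // [ 'r1', 'r3' ]
```

Free-form string labels keep the sketch close to the "even unstructured labels would be a start" suggestion; a real dataset would likely need some agreed vocabulary on top.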