I would like to use stuff like this as a side project: buy an Nvidia GeForce GPU, stick it into my 24/7 server, and play around with it in my free time to see what can be done.<p>The issue with all these AI models is that there's no information on which GPU is enough for which task. I'm absolutely clueless whether a single RTX 4000 SFF with its 20GB of VRAM and only 70W of max power usage would be a waste of money, or really something great to experiment on: do some ASR with Whisper, generate images with Stable Diffusion, load an LLM onto it, or run this project here from Facebook.<p>Renting a GPU in the cloud doesn't seem to be a solution for this use case, where you just want to let something run for a couple of days and see if it's useful for something.
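As a rough sanity check (a back-of-the-envelope sketch only, assuming fp16 weights at 2 bytes per parameter and a generic overhead factor for activations; real memory usage varies a lot), you can at least estimate whether a model's weights fit in a given amount of VRAM:

```
def fits_in_vram(num_params_billions: float, vram_gb: float,
                 bytes_per_param: int = 2, overhead_factor: float = 1.2) -> bool:
    """Estimate whether a model's weights (plus some headroom) fit in VRAM."""
    weight_gb = num_params_billions * 1e9 * bytes_per_param / 1024**3
    return weight_gb * overhead_factor <= vram_gb

# Examples for a 20 GB card like the RTX 4000 SFF (fp16 weights assumed):
print(fits_in_vram(1.55, 20))  # Whisper large (~1.55B params): True
print(fits_in_vram(7, 20))     # 7B LLM: True
print(fits_in_vram(13, 20))    # 13B LLM: False -- would need quantization
```

Batch size, sequence length, and diffusion resolution can easily dominate beyond the weights themselves, so treat this as a lower bound rather than a guarantee.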
ASR: "Automatic Speech Recognition"; also known as "Speech to Text" (STT)<p>TTS: "Text to Speech"<p>LID: "Language Identification"<p>In case anyone else was confused about what the acronyms mean.
Imagine if we used these kinds of models for like 500 years and they locked vocabulary in time, disallowing any further language blending; then somehow the servers turned off and nobody could communicate across language barriers anymore.<p>Someone should write that down in some sort of short story involving a really tall structure.
I just wanted to test out the TTS locally on a powerful Ubuntu 22.04 machine, but the process for setting it up seems pretty broken and poorly documented. After 20 minutes of trying I finally gave up since I couldn't get the VITS dependency to build (despite having a fully updated machine with all required compilers). It seems like they never really bother to see if the stuff works on a fresh machine starting from scratch. Somehow for my own projects I'm always able to start from a fresh git clone and then directly install everything using this block of code:<p>```
python3 -m venv venv
source venv/bin/activate
python3 -m pip install --upgrade pip
python3 -m pip install wheel
pip install -r requirements.txt
```<p>But whenever I try using these complicated ML models, it's usually an exercise in futility and endless mucking around with conda and other nonsense. It ends up not being worth it and I just move on. But it does feel like it doesn't need to be like this.
I was super excited about this, but digging through the release [0] one can see the following [1]. While using Bible translations is indeed better than nothing, I don't think the stylistic choices in the Bible are representative of how people actually speak the language, in any of the languages I can speak (i.e. that I am able to evaluate personally).<p>Religious recordings tend to be liturgical, so even the pronunciation might differ from everyday language. They do address something related, although more from a vocabulary perspective as I understand it [2].<p>So one of their stated goals, to enable people to talk to AI in their preferred language [3], might be closer now, but it's certainly a stretch to achieve with their chosen dataset.<p>[0]: <a href="https://about.fb.com/news/2023/05/ai-massively-multilingual-speech-technology/amp/" rel="nofollow">https://about.fb.com/news/2023/05/ai-massively-multilingual-...</a><p>[1]: > These translations have publicly available audio recordings of people reading these texts in different languages. As part of the MMS project, we created a dataset of readings of the New Testament in more than 1,100 languages, which provided on average 32 hours of data per language. By considering unlabeled recordings of various other Christian religious readings, we increased the number of languages available to more than 4,000. While this data is from a specific domain and is often read by male speakers, our analysis shows that our models perform equally well for male and female voices. And while the content of the audio recordings is religious, our analysis shows that this doesn’t bias the model to produce more religious language.<p>[2]: > And while the content of the audio recordings is religious, our analysis shows that this doesn’t bias the model to produce more religious language.<p>[3]: > This kind of technology could be used for VR and AR applications in a person’s preferred language and that can understand everyone’s voice.
> The MMS code and model weights are released under the CC-BY-NC 4.0 license.<p>Huge bummer. Prevents almost everyone from using this and recouping their costs.<p>I suppose motivated teams could reproduce the paper in a clean room, but that might also be subject to patents.
So many so-called overnight AI gurus are hyping their snake-oil products and screaming 'Meta is dying' [0] and 'It is over for Meta', but few of them actually do research in AI and drive the field forward. This once again shows that Meta has been a consistent contributor to AI research, especially in vision systems.<p>All we can do is take, take, take the code. But this time, the code's license is CC-BY-NC 4.0. Which simply means:<p>Take it, but no grifting allowed.<p>[0] <a href="https://news.ycombinator.com/item?id=31832221" rel="nofollow">https://news.ycombinator.com/item?id=31832221</a>
According to [1] in the accompanying blog post, this brings WER down from Whisper's 44.3 to 18.7, although it’s unclear to me how much better this is at primarily English speech recognition. I’d love to see a full comparison of accuracy improvements, as well as a proper writeup of how much more power it takes to run this in production or on mobile vs. something like Whisper.<p>[1]: <a href="https://scontent-sjc3-1.xx.fbcdn.net/v/t39.8562-6/346801894_261088246476193_7395499802717483754_n.png?_nc_cat=103&ccb=1-7&_nc_sid=6825c5&_nc_ohc=vHCf-i1COVcAX9ryJQX&_nc_ht=scontent-sjc3-1.xx&oh=00_AfAXj-l6r2rNadAc_0aMqQTpcUS_FrXzoO9Otxx_XglqXg&oe=6471A113" rel="nofollow">https://scontent-sjc3-1.xx.fbcdn.net/v/t39.8562-6/346801894_...</a>
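For anyone comparing numbers: WER (word error rate) is word-level edit distance divided by the number of reference words. A minimal illustrative sketch of the metric behind those 44.3 vs. 18.7 figures (not the evaluation script Meta actually used):

```
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # ~0.167
```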
I assume this "competes" directly with <a href="https://sites.research.google/usm/" rel="nofollow">https://sites.research.google/usm/</a> -- would be cool to see side-by-side benchmarks sometime! Maybe I should make those. I requested access to USM but have not been granted any access yet.
I'm trying to use this on a 3M mp3 file to test ASR with language code deu, CPU only, and I keep getting this error -- are there limits to the MMS inference?<p><pre><code> File "fairseq/data/data_utils_fast.pyx", line 30, in fairseq.data.data_utils_fast.batch_by_size_vec
assert max_tokens <= 0 or np.max(num_tokens_vec) <= max_tokens, (
AssertionError: Sentences lengths should not exceed max_tokens=4000000
Traceback (most recent call last):
File "/home/xxx/fairseq/examples/mms/asr/infer/mms_infer.py", line 52, in <module>
process(args)
File "/home/xxx/fairseq/examples/mms/asr/infer/mms_infer.py", line 44, in process</code></pre>
Wonder how this compares with Deepgram's offering. Has anyone used/tried/compared them, or read enough of the literature to compare? The WER rates Deepgram reports are still better than the largest MMS model, and the use-case-specific fine-tuned models (Zoom meetings, financial calls, etc.) probably make a bigger difference. WDYT?
I tried the English TTS example and the result is quite underwhelming (compared to Bark or Polly/Azure TTS). It sounds like TTS systems from one or two decades ago. Would those language-specific TTS models need to be fine-tuned?
Maybe one of those models has figured out a way to tell Zuck that the whole Metaverse concept is nonsense, hopefully it'll be graceful about letting him down.
I checked the language coverage for "Kashmiri":<p><a href="https://www.ethnologue.com/language/kas/" rel="nofollow">https://www.ethnologue.com/language/kas/</a><p>It lists "Himachal Pradesh state:"<p>This is obviously wrong (Kashmiri is spoken mainly in Jammu and Kashmir), so I don't know what else is wrong.
If I have an extra MBP 16" 2020 hanging around (16GB RAM, quad-core i7), can I run this? I'd like to try the TTS capabilities! LMK if you've got any guides or instructions online I can check out :)