I would like to use stuff like this as a side project: buy an Nvidia GeForce GPU, stick it into my 24/7 server, and play around with it in my free time to see what can be done.<p>The issue with all these AI models is that there's no information on which GPU is enough for which task. I'm absolutely clueless whether a single RTX 4000 SFF with its 20GB of VRAM and only 70W of max power usage would be a waste of money, or really something great to experiment on: do some ASR with Whisper, generate images with Stable Diffusion, load an LLM onto it, or run this project here from Facebook.<p>Renting a GPU in the cloud doesn't seem to be a solution for this use case, where you just want to let something run for a couple of days and see if it's useful for something.
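As a rough sanity check (a back-of-the-envelope sketch only, assuming fp16 weights at 2 bytes per parameter and a generic overhead factor for activations; real memory usage varies a lot), you can at least estimate whether a model's weights fit in a given amount of VRAM:

```
def fits_in_vram(num_params_billions: float, vram_gb: float,
                 bytes_per_param: int = 2, overhead_factor: float = 1.2) -> bool:
    """Estimate whether a model's weights (plus some headroom) fit in VRAM."""
    weight_gb = num_params_billions * 1e9 * bytes_per_param / 1024**3
    return weight_gb * overhead_factor <= vram_gb

# Examples for a 20 GB card like the RTX 4000 SFF (fp16 weights assumed):
print(fits_in_vram(1.55, 20))  # Whisper large (~1.55B params): True
print(fits_in_vram(7, 20))     # 7B LLM: True
print(fits_in_vram(13, 20))    # 13B LLM: False -- would need quantization
```

Batch size, sequence length, and diffusion resolution can easily dominate beyond the weights themselves, so treat this as a lower bound rather than a guarantee.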
ASR: "Automatic Speech Recognition"; also known as "Speech to Text" (STT)<p>TTS: "Text to Speech"<p>LID: "Language Identification"<p>In case anyone else was confused about what the acronyms mean.
Imagine if we used these kinds of models for like 500 years and they locked vocabulary in time, disallowing any further language blending; then somehow the servers turned off and nobody could communicate across language barriers anymore.<p>Someone should write that down in some sort of short story involving a really tall structure.
I just wanted to test out the TTS locally on a powerful Ubuntu 22.04 machine, but the process for setting it up seems pretty broken and poorly documented. After 20 minutes of trying I finally gave up since I couldn't get the VITS dependency to build (despite having a fully updated machine with all required compilers). It seems like they never really bother to see if the stuff works on a fresh machine starting from scratch. Somehow for my own projects I'm always able to start from a fresh git clone and then directly install everything using this block of code:<p>```
python3 -m venv venv
source venv/bin/activate
python3 -m pip install --upgrade pip
python3 -m pip install wheel
pip install -r requirements.txt
```<p>But whenever I try using these complicated ML models, it's usually an exercise in futility and endless mucking around with conda and other nonsense. It ends up not being worth it and I just move on. But it does feel like it doesn't need to be like this.
I was super excited about this, but digging through the release [0] one can see the following [1]. While using Bible translations is indeed better than nothing, I don't think the stylistic choices in the Bible are representative of how people actually speak the language, in any of the languages I can speak (i.e. that I am able to evaluate personally).<p>Religious recordings tend to be liturgical, so even the pronunciation might differ from everyday language. They do address something related, although more from a vocabulary perspective as I understand it [2].<p>So one of their stated goals, to enable people to talk to AI in their preferred language [3], might be closer now, but it's certainly a stretch to achieve with their chosen dataset.<p>[0]: <a href="https://about.fb.com/news/2023/05/ai-massively-multilingual-speech-technology/amp/" rel="nofollow">https://about.fb.com/news/2023/05/ai-massively-multilingual-...</a><p>[1]: > These translations have publicly available audio recordings of people reading these texts in different languages. As part of the MMS project, we created a dataset of readings of the New Testament in more than 1,100 languages, which provided on average 32 hours of data per language. By considering unlabeled recordings of various other Christian religious readings, we increased the number of languages available to more than 4,000. While this data is from a specific domain and is often read by male speakers, our analysis shows that our models perform equally well for male and female voices. And while the content of the audio recordings is religious, our analysis shows that this doesn’t bias the model to produce more religious language.<p>[2]: > And while the content of the audio recordings is religious, our analysis shows that this doesn’t bias the model to produce more religious language.<p>[3]: > This kind of technology could be used for VR and AR applications in a person’s preferred language and that can understand everyone’s voice.
> The MMS code and model weights are released under the CC-BY-NC 4.0 license.<p>Huge bummer. Prevents almost everyone from using this and recouping their costs.<p>I suppose motivated teams could reproduce the paper in a clean room, but that might also be subject to patents.
So many so-called overnight AI gurus are hyping their snake-oil products and screaming 'Meta is dying' [0] and 'It is over for Meta', but few of them actually do research in AI and drive the field forward. This once again shows that Meta has been a consistent contributor to AI research, especially in vision systems.<p>All we can do is take, take, take the code. But this time, the code's license is CC-BY-NC 4.0. Which simply means:<p>Take it, but no grifting allowed.<p>[0] <a href="https://news.ycombinator.com/item?id=31832221" rel="nofollow">https://news.ycombinator.com/item?id=31832221</a>
According to [1] in the accompanying blog post, this brings WER down from Whisper's 44.3 to 18.7, although it’s unclear to me how much better this is at primarily English speech recognition. I’d love to see a full comparison of accuracy improvements, as well as a proper writeup of how much more power it takes to run this in production or on mobile vs. something like Whisper.<p>[1]: <a href="https://scontent-sjc3-1.xx.fbcdn.net/v/t39.8562-6/346801894_261088246476193_7395499802717483754_n.png?_nc_cat=103&ccb=1-7&_nc_sid=6825c5&_nc_ohc=vHCf-i1COVcAX9ryJQX&_nc_ht=scontent-sjc3-1.xx&oh=00_AfAXj-l6r2rNadAc_0aMqQTpcUS_FrXzoO9Otxx_XglqXg&oe=6471A113" rel="nofollow">https://scontent-sjc3-1.xx.fbcdn.net/v/t39.8562-6/346801894_...</a>
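For anyone comparing numbers: WER (word error rate) is word-level edit distance divided by the number of reference words. A minimal illustrative sketch of the metric behind those 44.3 vs. 18.7 figures (not the evaluation script Meta actually used):

```
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # ~0.167
```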
I assume this "competes" directly with <a href="https://sites.research.google/usm/" rel="nofollow">https://sites.research.google/usm/</a> -- would be cool to see side-by-side benchmarks sometime! Maybe I should make those. I requested access to USM but have not been granted any access yet.
I'm trying to use this on a 3M mp3 file to test ASR with language code deu, CPU only, and I keep getting this error -- are there limits to the MMS inference?<p><pre><code> File "fairseq/data/data_utils_fast.pyx", line 30, in fairseq.data.data_utils_fast.batch_by_size_vec
assert max_tokens <= 0 or np.max(num_tokens_vec) <= max_tokens, (
AssertionError: Sentences lengths should not exceed max_tokens=4000000
Traceback (most recent call last):
File "/home/xxx/fairseq/examples/mms/asr/infer/mms_infer.py", line 52, in <module>
process(args)
File "/home/xxx/fairseq/examples/mms/asr/infer/mms_infer.py", line 44, in process</code></pre>
Wonder how this compares with Deepgram's offering. Has anyone used/tried/compared them, or read enough of the literature to compare? The WER rates Deepgram reports are still better than the largest MMS model, and the use-case-specific fine-tuned models (Zoom meetings, financial calls, etc.) probably make a bigger difference. WDYT?
I tried the English TTS example and the result is quite underwhelming (compared to Bark or Polly/Azure TTS). It sounds like TTS systems from one or two decades ago. Would those language-specific TTS models need to be fine-tuned?
Maybe one of those models has figured out a way to tell Zuck that the whole Metaverse concept is nonsense, hopefully it'll be graceful about letting him down.
I checked the language coverage for "Kashmiri":<p><a href="https://www.ethnologue.com/language/kas/" rel="nofollow">https://www.ethnologue.com/language/kas/</a><p>It lists "Himachal Pradesh state:"<p>This is obviously wrong (Kashmiri is spoken mainly in Jammu and Kashmir), so I don't know what else is wrong.
If I have an extra MBP 16" 2020 hanging around (16GB RAM, quad-core i7), can I run this? I'd like to try the TTS capabilities! LMK if you've got any guides or instructions online I can check out :)