I'm from Algeria. The language spoken on a daily basis by almost everybody is a weird mix of different languages: French, Arabic, English, etc.<p>I was thinking of grabbing data from tweets to fine-tune the model. I may be able to figure out other sources, but it's not gonna be much better than that. Just short-form text for the most part.<p>I was thinking of potentially leveraging the smaller models I came across recently (nanoGPT for example) or something similar.<p>I'm tech-savvy enough to make this work, but I'd like some feedback from people more knowledgeable than me before I put time and effort into this.<p>Thanks!
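<p>To make that plan a bit more concrete, here is roughly the preprocessing I have in mind, a minimal sketch in the style of nanoGPT's data-preparation scripts (the tweets.txt file and the choice of the GPT-2 tokenizer are just assumptions at this stage, not a finished pipeline):

  import numpy as np
  import tiktoken

  # read the raw tweet dump (one tweet per line, already scraped and cleaned)
  with open("tweets.txt", "r", encoding="utf-8") as f:
      data = f.read()

  # simple 90/10 split into training and validation text
  n = len(data)
  train_data = data[: int(n * 0.9)]
  val_data = data[int(n * 0.9):]

  # encode with the GPT-2 BPE tokenizer (nanoGPT's default); a tokenizer trained
  # on the dialect itself would likely work better, this is just to get started
  enc = tiktoken.get_encoding("gpt2")
  train_ids = enc.encode_ordinary(train_data)
  val_ids = enc.encode_ordinary(val_data)

  # write the token ids to the .bin files nanoGPT's training script expects
  np.array(train_ids, dtype=np.uint16).tofile("train.bin")
  np.array(val_ids, dtype=np.uint16).tofile("val.bin")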
I'd suggest starting by building a high-quality data set with text from a variety of domains, and publishing that first. Maybe even develop some related tech, like adding the dialect to language-identification packages. Another key thing might be to build a nicely curated word list for the dialect, and to make sure there's good documentation for researchers wanting to work on the language.<p>Partly I'm feeling inspired by Google's machine translation paper about scaling to the next hundred or thousand languages. Some links here: <a href="https://ai.googleblog.com/2023/01/google-research-2022-beyond-language.html?m=1" rel="nofollow">https://ai.googleblog.com/2023/01/google-research-2022-beyon...</a><p>But also, where this kind of effort has been successful, it has been the work of many different researchers. And it usually starts with data.<p>Training a language model on top of it is definitely doable even for individuals; you just might not be able to train on a huge data set, or you might hit a wall in terms of the perplexity you can reasonably reach.
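<p>To give an idea of the language-identification piece, here is a minimal sketch of training a dialect classifier with fastText (the file names, labels, and example sentences are made up; you would need at least a few thousand labeled lines per class for anything usable):

  import fasttext

  # train.txt holds one labeled example per line in fastText's format, e.g.:
  #   __label__dz wesh rak khouya, ça va ou quoi
  #   __label__fr bonjour, comment ça va ?
  #   __label__msa <a sentence in Modern Standard Arabic>
  model = fasttext.train_supervised(input="train.txt", lr=0.5, epoch=25, wordNgrams=2)

  # evaluate on a held-out file: returns (number of examples, precision, recall)
  print(model.test("valid.txt"))

  # predict the label of a new sentence
  labels, probs = model.predict("makanch mushkil, nkamlou ghedwa")
  print(labels[0], probs[0])

  model.save_model("dialect_id.bin")

Cleaning and labeling the data for something like this could also double as a starting point for the curated word list.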
Sadly, the best datasets that seem to be publicly available are for the Gulf Arabic dialect (where the money is) [1].<p>I suggest you contact <a href="https://www.icompass.tn/" rel="nofollow">https://www.icompass.tn/</a>, a (Tunisian) <i>startup specialized in Natural Language Processing... that processes Arabic dialects and African languages</i>.<p>On a general note, I believe this kind of work should be (urgently) nationally funded, because these countries will otherwise be forced to use second languages like French, or literary Arabic, when AI/NLP becomes the dominant computing paradigm (bots, prompts...). A model in this respect is what Sweden is doing [2]. For mostly "oral" dialects (like Algerian, I guess), collaborating with big names on adapting the best transcription models (like Whisper) to them first is the key, IMO.<p>[1] <a href="https://nyuad.nyu.edu/en/research/faculty-labs-and-projects/computational-approaches-to-modeling-language-lab/research/arabic-natural-language-processing.html" rel="nofollow">https://nyuad.nyu.edu/en/research/faculty-labs-and-projects/...</a><p>[2] <a href="https://news.ycombinator.com/item?id=34492572" rel="nofollow">https://news.ycombinator.com/item?id=34492572</a>
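<p>As a rough illustration of the Whisper point above: bootstrapping an audio-plus-transcript corpus could start from off-the-shelf Whisper output that native speakers then correct. A minimal sketch with the openai-whisper package (the "small" checkpoint and the "ar" language hint are only guesses at a reasonable starting point, since there is no Algerian-dialect option):

  import whisper

  # load a pretrained checkpoint; "small" trades some accuracy for speed
  model = whisper.load_model("small")

  # there is no Algerian option, so hint Modern Standard Arabic and treat the
  # output as a first draft to be corrected by native speakers
  result = model.transcribe("clip_0001.mp3", language="ar")
  print(result["text"])

  # the corrected (audio, transcript) pairs can later serve as fine-tuning data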
If all you want is an LM and it doesn't need to be trained by you or run on infrastructure you control, you could check whether ChatGPT already understands the dialect well enough. A Tunisian friend of mine told me that he asked it to tell a joke in Tunisian Arabic and it worked, only the joke wasn't funny.<p>If you want or need to train on your own data, social media is a good bet for colloquial language. You could try exporting your own data to get something to play with without having to write a crawler. Or try building a language classifier first and using it to filter <a href="https://commoncrawl.org/" rel="nofollow">https://commoncrawl.org/</a>
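<p>A minimal sketch of that filtering step, assuming you already have some dialect classifier (a fastText model here) and a downloaded Common Crawl WET file; the file names, the __label__dz label, and the 0.9 confidence threshold are placeholders:

  import fasttext
  from warcio.archiveiterator import ArchiveIterator

  # dialect_id.bin is a hypothetical fastText classifier with a __label__dz class
  model = fasttext.load_model("dialect_id.bin")

  kept = []
  # WET files contain extracted plain text, one "conversion" record per page
  with open("example.warc.wet.gz", "rb") as stream:
      for record in ArchiveIterator(stream):
          if record.rec_type != "conversion":
              continue
          text = record.content_stream().read().decode("utf-8", errors="ignore")
          for line in text.splitlines():
              line = line.strip()
              if len(line) < 20:
                  continue
              labels, probs = model.predict(line)
              if labels[0] == "__label__dz" and probs[0] > 0.9:
                  kept.append(line)

  with open("cc_dz.txt", "w", encoding="utf-8") as out:
      out.write("\n".join(kept))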
<a href="https://huggingface.co/alger-ia/dziribert" rel="nofollow">https://huggingface.co/alger-ia/dziribert</a><p>This is DziriBERT, a BERT-family model designed for the Algerian dialect; there is also a paper describing the authors' methods.
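<p>If it helps, loading it for a downstream task is only a few lines with the transformers library (the two-label head and the sample sentence are made up for illustration; the classification head starts untrained and would need fine-tuning):

  from transformers import AutoTokenizer, AutoModelForSequenceClassification

  # load the pretrained DziriBERT checkpoint from the Hugging Face Hub
  tokenizer = AutoTokenizer.from_pretrained("alger-ia/dziribert")
  model = AutoModelForSequenceClassification.from_pretrained(
      "alger-ia/dziribert", num_labels=2  # e.g. a two-class task; head is randomly initialized
  )

  # a sample sentence in Algerian dialect written in Latin script
  inputs = tokenizer("rani fer7an bezzaf", return_tensors="pt")
  outputs = model(**inputs)
  print(outputs.logits)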
You could do data augmentation: automatically translate into your language from closely related languages (there are open-source models to do so), and train your model on that data.
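<p>A minimal sketch of that idea with a MarianMT checkpoint from the Helsinki-NLP OPUS-MT collection; the French-to-Arabic model is a stand-in, since nothing off the shelf targets Algerian dialect directly, so the output would at best be raw material to post-edit rather than real dialect text:

  from transformers import MarianMTModel, MarianTokenizer

  # a generic French->Arabic OPUS-MT checkpoint, used only as a stand-in
  name = "Helsinki-NLP/opus-mt-fr-ar"
  tokenizer = MarianTokenizer.from_pretrained(name)
  model = MarianMTModel.from_pretrained(name)

  french_sentences = [
      "Il fait très chaud aujourd'hui.",
      "On se voit demain au marché.",
  ]

  batch = tokenizer(french_sentences, return_tensors="pt", padding=True)
  translated = model.generate(**batch)
  augmented = tokenizer.batch_decode(translated, skip_special_tokens=True)
  print(augmented)  # candidate sentences to post-edit into dialect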