Ask HN: I want to train a LM on my home country's dialect, how can I do it?

24 points by the_generalist over 2 years ago
I'm from Algeria. The language spoken on a daily basis by almost everybody is a weird mix of different languages: French, Arabic, English, etc.

I was thinking of grabbing data from tweets to fine-tune the model. I may be able to figure out other sources, but it's not gonna be much better than that. Just short-form text for the most part.

I was thinking of potentially leveraging the smaller models I came across recently (nanoGPT for example) or something similar.

I'm tech-savvy enough to make this work but I'd like some feedback from people more knowledgeable than me before I put time and effort into this.

Thanks!

5 comments

ktrnka over 2 years ago
I'd suggest starting with just building a high-quality data set with text from a variety of domains, and starting off by publishing that. Maybe even developing some related tech, like adding the dialect to language ID packages. Another key thing might be to build a nicely curated word list for the dialect, and make sure there's good documentation for researchers wanting to work in the language.

Partly I'm feeling inspired by Google's machine translation paper about scaling to the next hundred or thousand languages. Some links in here: https://ai.googleblog.com/2023/01/google-research-2022-beyond-language.html?m=1

But also, when it's been successful, it's been an effort of many different researchers. And it usually starts with data.

Training a language model on top of it is definitely doable even for individuals; you just might not be able to train on a huge data set, or you might hit a wall in terms of the perplexity you can reasonably reach.
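To make the perplexity point concrete, here is a minimal sketch of how it is computed, using a character bigram model with add-one smoothing rather than a neural LM. The function names and the toy corpus are assumptions made for illustration; only the definition (the exponential of the average negative log-probability) carries over to larger models.

```python
import math
from collections import Counter


def train_char_bigram(text):
    """Count character bigrams and how often each character appears as left context."""
    bigrams = Counter(zip(text, text[1:]))
    contexts = Counter(text[:-1])
    vocab = set(text)
    return bigrams, contexts, vocab


def perplexity(model, text):
    """Perplexity of `text` under an add-one-smoothed character bigram model."""
    if len(text) < 2:
        raise ValueError("need at least two characters to score a bigram")
    bigrams, contexts, vocab = model
    v = len(vocab) + 1  # +1 reserves some probability mass for unseen characters
    log_prob = 0.0
    n = 0
    for prev, cur in zip(text, text[1:]):
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + v)
        log_prob += math.log(p)
        n += 1
    # exp of the average negative log-likelihood per predicted character
    return math.exp(-log_prob / n)


# Toy corpus, purely illustrative; a real run would use held-out dialect text.
model = train_char_bigram("wach rak khoya labas " * 20)
in_domain = perplexity(model, "wach rak labas")
out_of_domain = perplexity(model, "zzzzqqqq xxxx")
```

Lower is better: text that looks like the training data should score a lower perplexity than unrelated text, and the "wall" is the point where more training on the same small data set stops lowering the held-out number.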
LunarAurora over 2 years ago
Sadly, the very best datasets that seem publicly available are for the Gulf Arabic dialect (where the money is) [1].

I suggest you contact https://www.icompass.tn/, a (Tunisian) "startup specialized in Natural Language Processing...that process Arabic dialects and African languages".

On a general note, I believe this kind of work should be (urgently) nationally funded, because these countries will be forced to use second languages like French or literary Arabic when AI/NLP becomes the dominant computing paradigm (bots, prompts...). A model in this respect is what Sweden is doing [2]. For mostly "oral" dialects (like Algerian, I guess), collaborating with big names on adapting the best transcription models (like Whisper) to them first is the key IMO.

[1] https://nyuad.nyu.edu/en/research/faculty-labs-and-projects/computational-approaches-to-modeling-language-lab/research/arabic-natural-language-processing.html

[2] https://news.ycombinator.com/item?id=34492572
yorwba over 2 years ago
If all you want is a LM and it doesn't need to be trained by you or run on infrastructure you control, you could try to see whether ChatGPT already understands the dialect well enough. A Tunisian friend of mine told me that he asked it to tell a joke in Tunisian Arabic and it worked, only the joke wasn't funny.

If you want or need to train on your own data, social media is a good bet for colloquial language. You could try exporting your own data to get something to play with without having to write a crawler. Or try building a language classifier first and use it to filter https://commoncrawl.org/
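The classifier-then-filter idea could be sketched roughly as follows: a character-trigram Naive Bayes model with add-one smoothing and uniform class priors, trained on a handful of labeled lines. The class name, the labels ("dz", "fr", "en") and the toy samples are all assumptions for illustration, and the lines being filtered stand in for text already extracted from a crawl; a real filter would need far more labeled data.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Character n-grams of the lowercased text, padded with spaces at both ends."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramLanguageClassifier:
    """Multinomial Naive Bayes over character n-grams (add-one smoothing, no priors)."""

    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(Counter)  # label -> n-gram counts
        self.totals = Counter()             # label -> total n-grams seen
        self.vocab = set()

    def fit(self, samples):
        for text, label in samples:
            grams = char_ngrams(text, self.n)
            self.counts[label].update(grams)
            self.totals[label] += len(grams)
            self.vocab.update(grams)
        return self

    def score(self, text, label):
        v = len(self.vocab) + 1  # +1 leaves room for unseen n-grams
        return sum(
            math.log((self.counts[label][g] + 1) / (self.totals[label] + v))
            for g in char_ngrams(text, self.n)
        )

    def predict(self, text):
        return max(self.counts, key=lambda label: self.score(text, label))


def filter_lines(lines, clf, target):
    """Keep only the lines the classifier assigns to the target label."""
    return [line for line in lines if clf.predict(line) == target]


# Toy training data, illustrative only.
samples = [
    ("wach rak khoya labas", "dz"),
    ("rani rayeh l dar", "dz"),
    ("ma nich fahem wach tgoul", "dz"),
    ("bonjour comment allez vous", "fr"),
    ("je ne comprends pas", "fr"),
    ("il fait beau aujourd hui", "fr"),
    ("how are you doing today", "en"),
    ("i do not understand", "en"),
    ("the weather is nice", "en"),
]
clf = NgramLanguageClassifier().fit(samples)
kept = filter_lines(
    ["wach rak labas", "comment allez vous", "i do not understand"], clf, "dz"
)
```

Because the dialect mixes French, Arabic and English, a hard per-line label is a simplification; thresholding on the score margin between the top two labels may be a better fit for code-switched text.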
enoreyes over 2 years ago
https://huggingface.co/alger-ia/dziribert

There is this model, which also has a paper describing their methods for a BERT-family model designed for the Algerian dialect.
tooltitude over 2 years ago
You could do data augmentation. You could automatically translate text from close-enough languages into your language (there are open-source models that do this), and train your model on that data.
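A rough sketch of that augmentation loop, with the translator passed in as a plain callable so any real MT model could be plugged in. Everything here is hypothetical: `toy_translate` is a word-substitution stand-in rather than a real model, and the tiny lexicon is illustrative, not a vetted word list.

```python
from typing import Callable, Iterable

# Toy French -> dialect lexicon, for illustration only (not a vetted word list).
TOY_LEXICON = {"bonjour": "salam", "merci": "saha", "oui": "ih", "non": "lala"}


def toy_translate(sentence: str) -> str:
    """Hypothetical stand-in for a real MT model: word-by-word substitution."""
    return " ".join(TOY_LEXICON.get(w, w) for w in sentence.lower().split())


def augment(sentences: Iterable[str], translate: Callable[[str], str]) -> list[dict]:
    """Translate each source sentence and tag the result as synthetic."""
    seen = set()
    records = []
    for src in sentences:
        out = translate(src).strip()
        if not out or out in seen:
            continue  # drop empty and duplicate translations
        seen.add(out)
        records.append({"source": src, "text": out, "synthetic": True})
    return records


synthetic = augment(["Bonjour", "Oui merci", "bonjour"], toy_translate)
```

Keeping the `synthetic` flag on each record makes it possible to cap the share of machine-translated text later, or to exclude it from evaluation sets so that perplexity is measured on real dialect text only.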