I see Dutch performs badly. I wouldn't be surprised if that's because of bad/noisy training data. Dutch web content contains an awful lot of English, which pollutes recognition. Cross-check the Dutch tokens with an English dictionary to be sure (although there is quite some overlap for frequent words, e.g. "is", "we", "are", "have", "bent", "had", "brief", etc., and rare ones like "keeshond").<p>BTW, the test statistic for recognizing individual words isn't interesting unless you sample/weight by word frequency.
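For what it's worth, that cross-check is cheap to script. A minimal Go sketch, assuming a plain-text English word list and a file of Dutch test tokens, one word per line (both file names are hypothetical):

    package main

    import (
        "bufio"
        "fmt"
        "os"
        "strings"
    )

    // loadWordSet reads one word per line into a lookup set.
    func loadWordSet(path string) (map[string]bool, error) {
        f, err := os.Open(path)
        if err != nil {
            return nil, err
        }
        defer f.Close()
        words := make(map[string]bool)
        scanner := bufio.NewScanner(f)
        for scanner.Scan() {
            words[strings.ToLower(strings.TrimSpace(scanner.Text()))] = true
        }
        return words, scanner.Err()
    }

    func main() {
        english, err := loadWordSet("english_words.txt") // hypothetical word list
        if err != nil {
            panic(err)
        }
        dutch, err := loadWordSet("dutch_test_tokens.txt") // hypothetical test tokens
        if err != nil {
            panic(err)
        }
        overlap := 0
        for token := range dutch {
            if english[token] {
                overlap++
            }
        }
        fmt.Printf("%d of %d Dutch test tokens also occur in the English list\n",
            overlap, len(dutch))
    }

Weighting by word frequency would need a frequency list instead of a plain set, but the idea is the same.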
> This engine first determines the alphabet of the input text and searches for characters which are unique in one or more languages. If exactly one language can be reliably chosen this way, the statistical model is not necessary anymore.<p>Couldn't this be a problem? If a text in Language_A includes names/words from Language_B, relying only on special characters would wrongly classify the entire text as Language_B.
What's a good way to detect languages in mixed-language passages? What's the state of the art here?<p>For example, given <i>"'I think, therefore I am' is the first principle of René Descartes's philosophy that was originally published in French as je pense, donc je suis."</i>, is there a library that would tell me the main passage is in English but contains fragments in French?
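Short of a proper multilingual segmenter, a crude baseline is to run a single-language detector over word windows and report per-window results. A rough Go sketch assuming the lingua-go API as shown in its README (NewLanguageDetectorBuilder, DetectLanguageOf); the window size and language set are arbitrary choices, not recommendations:

    package main

    import (
        "fmt"
        "strings"

        "github.com/pemistahl/lingua-go"
    )

    func main() {
        // Restrict to the languages we expect; fewer candidates tends to give
        // more reliable results on very short fragments.
        detector := lingua.NewLanguageDetectorBuilder().
            FromLanguages(lingua.English, lingua.French).
            Build()

        text := "'I think, therefore I am' is the first principle of René Descartes's " +
            "philosophy that was originally published in French as je pense, donc je suis."

        words := strings.Fields(text)
        const window = 5 // arbitrary window size
        for i := 0; i < len(words); i += window {
            end := i + window
            if end > len(words) {
                end = len(words)
            }
            chunk := strings.Join(words[i:end], " ")
            if lang, ok := detector.DetectLanguageOf(chunk); ok {
                fmt.Printf("%-8v %s\n", lang, chunk)
            }
        }
    }

The majority vote over all windows gives the main language of the passage, and windows that disagree hint at embedded fragments. It's noisy, but it at least localizes a stretch like the French quote in the example.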
I appreciate how directly and clearly this library states what it does and who it is for. I have no need for it now, but after one paragraph of reading I'll remember its name for later.
> Most libraries only use n-grams of size 3 (trigrams) which is satisfactory for detecting the language of longer text fragments consisting of multiple sentences. For short phrases or single words, however, trigrams are not enough.<p>I only dabbled in language detection at a workshop at a conference years ago, but I was very impressed by how well such models work on short text with only bigrams.<p>Maybe bi- and trigrams do fall short once you expand to over 70 languages, but I just wanted to say that this is a use case where very simple models can get you really far.<p>If you see a blog post where a language detection problem is solved with deep learning, chances are the author doesn't know what they are doing (Towards Data Science, I'm looking at you!) or it's a tutorial for working with an NN framework.
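To illustrate that point, here is a toy character-bigram scorer (not taken from any of the libraries discussed): it counts character bigrams per language from tiny inline samples and picks the language with the highest log-likelihood. Real systems train on far more data and smooth properly, but the structure really is this simple:

    package main

    import (
        "fmt"
        "math"
        "strings"
    )

    type bigramModel struct {
        counts map[string]float64
        total  float64
    }

    // train counts character bigrams in the given sample text.
    func train(sample string) bigramModel {
        m := bigramModel{counts: make(map[string]float64)}
        runes := []rune(strings.ToLower(sample))
        for i := 0; i+1 < len(runes); i++ {
            m.counts[string(runes[i:i+2])]++
            m.total++
        }
        return m
    }

    // score sums log-probabilities of the input's bigrams under the model,
    // with crude +1 smoothing so unseen bigrams don't zero everything out.
    func (m bigramModel) score(text string) float64 {
        runes := []rune(strings.ToLower(text))
        logProb := 0.0
        for i := 0; i+1 < len(runes); i++ {
            logProb += math.Log((m.counts[string(runes[i:i+2])] + 1) / (m.total + 1))
        }
        return logProb
    }

    func main() {
        // Far too little training data for real use; illustration only.
        models := map[string]bigramModel{
            "english": train("the quick brown fox jumps over the lazy dog and this is plain english text"),
            "dutch":   train("de snelle bruine vos springt over de luie hond en dit is een nederlandse tekst"),
        }

        input := "dit is een korte zin"
        best, bestScore := "", math.Inf(-1)
        for lang, model := range models {
            if s := model.score(input); s > bestScore {
                best, bestScore = lang, s
            }
        }
        fmt.Printf("%q scored highest for %s\n", input, best)
    }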
How does it actually compare to fastText [1] in performance?
Building an interface to that in Go shouldn't be too complicated.
The claim that all language identification (LID) relies on n-grams is bold; there has been a shift toward purely neural-network-based approaches.<p>[1] <a href="https://fasttext.cc/docs/en/language-identification.html" rel="nofollow">https://fasttext.cc/docs/en/language-identification.html</a>
Great work, thanks for sharing!<p>I see <a href="https://github.com/google/cld3" rel="nofollow">https://github.com/google/cld3</a>, but how does this compare with <a href="https://github.com/CLD2Owners/cld2" rel="nofollow">https://github.com/CLD2Owners/cld2</a>, which is used by the large <a href="https://commoncrawl.org" rel="nofollow">https://commoncrawl.org</a> project to classify billions of samples from the whole internet?
Hello everyone,<p>I'm the author of Lingua. Thank you for sharing my work and making it known in the NLP world.<p>Apart from the Go implementation, I've implemented the library in Kotlin, Python and Rust. Just take a look at my profile if you are interested: <a href="https://github.com/pemistahl" rel="nofollow">https://github.com/pemistahl</a>
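For anyone who wants to try the Go version, here is a minimal usage sketch based on the lingua-go README at the time of writing (check the repository for the current API and more options):

    package main

    import (
        "fmt"

        "github.com/pemistahl/lingua-go"
    )

    func main() {
        // Restricting the candidate set improves accuracy on short inputs.
        languages := []lingua.Language{
            lingua.English,
            lingua.French,
            lingua.German,
            lingua.Spanish,
        }

        detector := lingua.NewLanguageDetectorBuilder().
            FromLanguages(languages...).
            Build()

        if language, exists := detector.DetectLanguageOf("languages are awesome"); exists {
            fmt.Println(language) // e.g. English
        }
    }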
In general, language detection is surprisingly hard. There is an LSTM-based implementation, <a href="https://github.com/AU-DIS/LSTM_langid" rel="nofollow">https://github.com/AU-DIS/LSTM_langid</a>, which should be better than n-grams.