Polyglot Word Embeddings Discover Language Clusters

27 points by shriphani over 5 years ago

2 comments

pattusk over 5 years ago
I read the title and got excited thinking this would be using embeddings to gather insights about language families. As in, if you ran k-means on the same corpus of n languages with k < n, how would, say, Finnish, Mongolian, Turkish and Japanese turn out in the clusters? I'm curious too as to whether it would be possible to interpret the results rigorously enough to draw scientifically valid linguistic conclusions.

Instead it looks like this just performs language detection. Is there a significant advantage to that method over reusing one of the many existing open-source solutions based on simpler models, such as [1], and retraining it on a corpus that includes the language(s) that weren't supported? You offer a comparative table for FastText & GCP. How do you explain FastText's abysmal precision on English? The value just seems way too low not to be a bug of some sort.

[1] https://code.google.com/archive/p/language-detection/
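
For readers who want to try the experiment described above, here is a minimal sketch of running k-means with k < n over per-document embeddings and then listing which languages end up sharing a cluster. Everything in it is an assumption for illustration: the vectors are synthetic 2-D points rather than real polyglot embeddings, the labels lang_a to lang_d are placeholders rather than real languages, and the deterministic farthest-point initialisation is just one simple choice. It says nothing about how actual languages would group.

```python
from collections import defaultdict


def dist2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return tuple(sum(component) / n for component in zip(*vectors))


def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means with deterministic farthest-point initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from every centroid chosen so far.
        centroids.append(
            max(points, key=lambda p: min(dist2(p, c) for c in centroids))
        )
    assignment = None
    for _ in range(iters):
        new_assignment = [
            min(range(k), key=lambda j: dist2(p, centroids[j])) for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = mean(members)
    return assignment


def cluster_languages(points, labels, k):
    """Map each cluster id to the sorted list of language labels found in it."""
    groups = defaultdict(set)
    for label, cluster in zip(labels, kmeans(points, k)):
        groups[cluster].add(label)
    return {cluster: sorted(langs) for cluster, langs in groups.items()}


# Synthetic stand-ins for document embeddings (NOT real data): four
# placeholder languages, with lang_a and lang_b deliberately placed close.
SYNTHETIC = {
    "lang_a": [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)],
    "lang_b": [(0.3, 0.2), (0.2, 0.3), (0.3, 0.3)],
    "lang_c": [(5.0, 5.0), (5.1, 4.9), (4.9, 5.1)],
    "lang_d": [(0.0, 5.0), (0.1, 5.1), (-0.1, 4.9)],
}
points = [vec for vecs in SYNTHETIC.values() for vec in vecs]
labels = [lang for lang, vecs in SYNTHETIC.items() for _ in vecs]

# n = 4 languages, k = 3 clusters, so at least two languages must share one.
print(cluster_languages(points, labels, k=3))
```

With real embeddings the interesting part is the interpretation step: a shared cluster could reflect script, vocabulary overlap, or corpus topic just as easily as genealogical relatedness, which is why the rigor question above matters.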
nl over 5 years ago
This is nice, but the blog post should point out that FastText has language identification built in [1].

The authors knew this, because the paper compares against it, but the post doesn't call it out!

Edit: just realised the link on 'popular "open source"' goes to the FastText post I linked below. Still, I think it would have been good to explicitly note this!

[1] https://fasttext.cc/blog/2017/10/02/blog-post.html
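
As a rough illustration of the built-in language identification mentioned above, the sketch below wraps a loaded fastText model's `predict` call. It assumes the `fasttext` Python package and one of the pretrained identification models (`lid.176.bin` or the smaller `lid.176.ftz`) downloaded separately; the helper names `normalise_label` and `detect_language` are made up for this example and are not part of fastText.

```python
LABEL_PREFIX = "__label__"  # fastText prefixes every predicted label with this


def normalise_label(raw_label):
    """Turn a fastText label such as '__label__en' into 'en'."""
    if raw_label.startswith(LABEL_PREFIX):
        return raw_label[len(LABEL_PREFIX):]
    return raw_label


def detect_language(model, text):
    """Return (language_code, probability) for the single best guess.

    `model` is expected to be the object returned by fasttext.load_model();
    only its predict(text, k=...) method is used here.
    """
    # predict() works on one line at a time, so embedded newlines are flattened.
    labels, probabilities = model.predict(text.replace("\n", " "), k=1)
    return normalise_label(labels[0]), float(probabilities[0])


# Typical usage, assuming the model file sits in the working directory:
#
#   import fasttext
#   model = fasttext.load_model("lid.176.ftz")
#   print(detect_language(model, "Ceci est une phrase en français."))
```

Passing the model in, rather than loading it inside the function, keeps the expensive load to a single call when many texts are classified.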