Show HN: Language detection as a service

38 points by mgaudin over 11 years ago

24 comments

davidjgraph over 11 years ago
I'll ask plainly what others are hinting at: is this actually your own built service, or are you a proxy for something like the Google Translate API [1]?

If it's your own built service, it's critical that you explain the hows and whys of your forecast availability and scalability numbers for your chosen architecture, given who you are competing with.

[1] https://developers.google.com/translate/v2/using_rest#detect-language
beering over 11 years ago
Alternatively, people can just download langid.py [1] and do language detection locally. This is not a particularly hard problem; I think it's doable in an undergrad ML or NLP class.

The tricky parts are usually political: are users going to be angry if you confuse Indonesian with Malaysian, and so on?

[1] https://github.com/saffsd/langid.py
chrismorgan over 11 years ago
The design is fine, but the language used on the page itself isn't quite right.

I see three spelling errors in your language list:

- Panjabi should be Punjabi;
- Teligu should be Telugu;
- Ukraininan should be Ukrainian.

There are also a few grammar problems earlier in the document, and style problems (e.g. English doesn't use a space before sentence-ending punctuation marks).
mdemare over 11 years ago
Hmm, it takes 5+ seconds to get a response, and it chokes on the same test phrase as Google, thinking "Ik hou van vette lettertypes." is Norwegian...
diasks2 over 11 years ago
Looks interesting. Why not have an input on the landing page where someone can try it out without even signing up? Then people could give it a spin before they give away their email address. Otherwise, the user just has to trust your 99% figure, which it would help to back up with some data, even if only in a footnote (on a corpus of x, over x period of time, etc.).

Also, I think it would be clearer if it said "A simple and scalable way to automatically classify text by language" instead of "A simple and scalable way to classify automatically text by language".

Design looks very clean though. Nice work.

EDIT: Also, your social media links at the bottom aren't hooked up yet.
captn3m0 over 11 years ago
For those who thought (like me) that this was a programming language detection service, you can take a look at github/linguist.
danieldk over 11 years ago
Also, for those who would like to know how you can implement a language guesser (sources + link to paper):

http://www.let.rug.nl/vannoord/TextCat/

Python version:

http://thomas.mangin.com/data/source/ngram.py

It's something that is fun to implement and doesn't take more than a few hours at most.
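
For illustration, here is a minimal sketch of the rank-order n-gram approach that TextCat is based on (Cavnar and Trenkle's "out-of-place" measure). It is not the code from either link above. The two training sentences, the profile size, the n-gram range, and all function names are assumptions made up for this example; real profiles are built from much larger corpora.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top_k=300):
    """Return the top_k most frequent character n-grams, most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries (assumed padding character)
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language get a fixed penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(doc_profile)
    )


def guess_language(text, lang_profiles):
    """Pick the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))


# Toy training data, far too small for real use (illustrative assumption).
TRAINING = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog sleeps in the sun",
    "nl": "de snelle bruine vos springt over de luie hond en dan slaapt de hond in de zon",
}
PROFILES = {lang: ngram_profile(sample) for lang, sample in TRAINING.items()}
```

Calling `guess_language(some_text, PROFILES)` returns the key of the closest profile. With profiles this small, results on short or unseen text are unreliable, which matches the complaints elsewhere in this thread about short inputs.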
mdemare over 11 years ago
Why is this better than the Google or Bing translate APIs, which also offer language detection?
redox_ over 11 years ago
You should also consider full, unambiguous words before trying trigrams. "marché" exists only in French, whereas "mar", "arc", ... occur in lots of languages. This should drastically improve your results.
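
A rough sketch of that suggestion, assuming a hand-picked table of words treated as distinctive for one language and some existing trigram classifier passed in as a fallback. The word lists, the function name `detect`, and the tie-breaking rule are all hypothetical; only "marché" comes from the comment above.

```python
import re
from collections import Counter

# Hypothetical lists of words assumed to be distinctive for a single language.
UNIQUE_WORDS = {
    "fr": {"marché", "aujourd", "beaucoup"},
    "nl": {"lettertypes", "vette"},
}


def detect(text, unique_words, trigram_fallback):
    """Vote with whole words first; defer to the trigram model when there is no clear winner."""
    tokens = re.findall(r"\w+", text.lower())
    votes = Counter(
        {lang: sum(token in words for token in tokens) for lang, words in unique_words.items()}
    )
    ranked = votes.most_common(2)
    has_hit = bool(ranked) and ranked[0][1] > 0
    is_clear = len(ranked) < 2 or ranked[0][1] > ranked[1][1]
    if has_hit and is_clear:
        return ranked[0][0]
    # No distinctive word matched, or two languages tied: use the trigram classifier.
    return trigram_fallback(text)
```

The fallback is any callable that takes the text and returns a language code, so the word check can sit in front of an existing detector without changing it.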
web64 over 11 years ago
I've used detectlanguage.com [1] in the past, which seems like a very similar service to getlang.io. With both of them it is hard to know what is behind the scenes...

[1] http://detectlanguage.com/
alexott over 11 years ago
And it looks like they are using the following library: http://code.google.com/p/language-detection/ - at least the number & list of languages is very similar :-)
jhull over 11 years ago
I wonder how this performs on short text posts like tweets. At my last gig, where we did social media text analysis, we used a few different packages (chromium, guess-language, and our own ngram classifier) and still had pretty low accuracy for tweets.
himal over 11 years ago
You guys might want to handle GET requests for the /try URL (https://getlang.io/try) as well. Currently it's returning "Server Error (500)" for GET requests.
martingordon over 11 years ago
Matthew Kirk spoke about a neural network language predictor at RubyConf a few weeks ago. Here are his slides and code: http://modulus7.com/rubyconf/
efeamadasun over 11 years ago
I don't know why I can't stand this sentence: "A simple and scalable way to classify automatically text by language". "Classify" and "automatically" need to switch places.
alexott over 11 years ago
Apache Tika (http://tika.apache.org/) also has a language detector, although it may not be as good as CLD...
razvvan over 11 years ago
If I were to implement this I'd rather use Google's Prediction API. At least with that you get a bit of control over what goes into the training data.
bkamapantula over 11 years ago
It's Telugu, not Teligu. By Panjabi, do you mean Punjabi?

As others already mentioned, it would be good to have users try examples before signup.
phpnode over 11 years ago
How does this compare in accuracy to Chromium's Compact Language Detector?

https://code.google.com/p/chromium-compact-language-detector/

https://github.com/mzsanford/cld
donutdan4114 over 11 years ago
"test it out" comes back as French...
RBerenguel over 11 years ago
Some day I have to rewrite whatlanguageis.com (currently not working) with all the great ideas I had to improve it...
m4tthumphrey over 11 years ago
curl -XPOST -d 'hello' 'https://getlang.io/get?token=...'
{ "code": "fi", "name": "suomi, suomen kieli", "name_en": "Finnish" }

O_O
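
For anyone scripting the same call, a hedged Python equivalent of the curl command above might look like the sketch below. It only builds the request and parses a response of the shape shown in the comment; it does not send anything. The endpoint and JSON field names are taken from the comment, while the function names are made up here, the token stays a caller-supplied placeholder, and there is no guarantee the service still exists or behaves this way.

```python
import json
import urllib.parse
import urllib.request


def build_detect_request(text, token):
    """Build (but do not send) a POST request with the raw text as the body."""
    query = urllib.parse.urlencode({"token": token})
    return urllib.request.Request(
        f"https://getlang.io/get?{query}",  # endpoint as quoted in the comment above
        data=text.encode("utf-8"),
        method="POST",
    )


def parse_detect_response(body):
    """Extract (code, English name) from a JSON body shaped like the example response."""
    payload = json.loads(body)
    return payload["code"], payload["name_en"]
```

Sending the request would be a matter of passing it to `urllib.request.urlopen` and feeding the decoded body to `parse_detect_response`.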
ssiddharth over 11 years ago
It might be mild OCD, but it'd be great if the list of supported languages were ordered in some logical way.
ismaelc over 11 years ago
Where's the login page? I need to get my token.