Show HN: Language detection as a service

38 points by mgaudin over 11 years ago

24 comments

davidjgraph over 11 years ago

I'll ask plainly what others are hinting at: Is this actually a service you built yourself, or are you a proxy for something like the Google Translate API [1]? If you built it yourself, it's critical that you explain the hows and whys of your forecast availability and scalability numbers for your chosen architecture, given who you are competing with.

[1] https://developers.google.com/translate/v2/using_rest#detect-language

beering over 11 years ago

Alternatively, people can just download langid.py [1] and do language detection locally. This is not a particularly hard problem - I think it's doable by undergrad ML or NLP classes.

The tricky parts are usually political - are users going to be angry if you confuse Indonesian with Malaysian, and so on?

[1] https://github.com/saffsd/langid.py

chrismorgan over 11 years ago

The design is fine, but the language used on the page itself isn't quite right.

I see three spelling errors in your language list:

- Panjabi should be Punjabi;
- Teligu should be Telugu;
- Ukraininan should be Ukrainian.

There are also a few grammar problems earlier in the document, and style problems (e.g. English doesn't use a space before sentence-ending punctuation marks).

mdemare over 11 years ago

Hmm, it takes 5+ seconds to get a response, and it chokes on the same test phrase as Google, thinking "Ik hou van vette lettertypes." is Norwegian...

diasks2 over 11 years ago

Looks interesting. Why not have an input on the landing page where someone can try it out without even signing up? Then people could give it a spin before they give away their email address. Otherwise, the user just has to trust your 99% figure, which it would help to back with some data, even if only in a footnote (on a corpus of x, over x period of time, etc.).

Also, I think it would be clearer if it said "A simple and scalable way to automatically classify text by language" instead of "A simple and scalable way to classify automatically text by language".

Design looks very clean though. Nice work.

EDIT: Also, your social media links at the bottom aren't hooked up yet.

captn3m0 over 11 years ago

For those who thought (like me) that this was a programming language detection service, you can take a look at github/linguist.

danieldk over 11 years ago

Also, for those who would like to know how you can implement a language guesser (sources + link to paper): http://www.let.rug.nl/vannoord/TextCat/

Python version: http://thomas.mangin.com/data/source/ngram.py

It's something that is fun to implement and doesn't take more than a few hours at most.
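
For a sense of what such a guesser involves, here is a minimal sketch of the rank-ordered character n-gram technique that TextCat is built on (the "out-of-place" measure). The function names, the two toy training sentences, the n-gram length limit of 3, and the profile size of 300 are assumptions made for illustration, not taken from TextCat or ngram.py. A usable guesser would need a much larger training corpus per language.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top_k=300):
    """Return the top_k most frequent character n-grams, most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries, as n-gram guessers commonly do
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Break frequency ties alphabetically so the ranking is deterministic.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))
    return ranked[:top_k]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from lang_profile get a fixed penalty."""
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else max_penalty
        for r, gram in enumerate(doc_profile)
    )


def guess_language(text, profiles):
    """Pick the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))


# Toy training data (hypothetical); real profiles come from large corpora.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then it runs away",
    "fr": "le renard brun rapide saute par-dessus le chien paresseux et puis il s'enfuit",
}
PROFILES = {lang: ngram_profile(text) for lang, text in SAMPLES.items()}

if __name__ == "__main__":
    print(guess_language("the dog runs over the fox", PROFILES))
```

With profiles this small the guess on unseen text is unreliable; the sketch only shows the mechanics of building profiles and comparing ranks.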

mdemare over 11 years ago

Why is this better than the Google or Bing translate APIs, which also offer language detection?

redox_ over 11 years ago

You should also consider full, unambiguous words before falling back to trigrams. "marché" only occurs in French, whereas "mar", "arc", ... occur in lots of languages. This should drastically improve your results.
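
A minimal sketch of that ordering, assuming a small hand-built lexicon of words treated as unique to one language. The lexicon contents, the function names, and the `trigram_fallback` callable are hypothetical; nothing here reflects how getlang.io actually works.

```python
from collections import Counter

# Hypothetical lexicon: words assumed to occur in only one language.
UNAMBIGUOUS_WORDS = {
    "marché": "fr",
    "lettertypes": "nl",
}


def detect_by_words(text, lexicon=UNAMBIGUOUS_WORDS):
    """Return the language with the most unambiguous-word hits, or None if there are none."""
    votes = Counter()
    for token in text.lower().split():
        word = token.strip(".,;:!?\"'()")  # drop surrounding punctuation
        if word in lexicon:
            votes[lexicon[word]] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


def detect(text, trigram_fallback):
    """Try whole-word lookup first; use the trigram classifier only when it finds nothing."""
    return detect_by_words(text) or trigram_fallback(text)


if __name__ == "__main__":
    # The lambda stands in for a real trigram classifier.
    print(detect("Je vais au marché.", lambda text: "unknown"))
```

The word pass is cheap and decisive when it fires, so the statistical classifier only has to handle text that contains no lexicon hits.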

web64 over 11 years ago

I've used detectlanguage.com [1] in the past, which seems like a very similar service to getlang.io. With both of them it is hard to know what is behind the scenes...

[1] http://detectlanguage.com/

alexott over 11 years ago

And it looks like they are using the following library: http://code.google.com/p/language-detection/ - at least the number & list of languages is very similar :-)

jhull over 11 years ago

I wonder how this performs on short text posts like tweets. At my last gig where we did social media text analysis we used a few different packages (chromium, guess-language, and our own ngram classifier) and still had pretty low accuracy for tweets.

himal over 11 years ago

You guys might want to handle GET requests for the /try URL (https://getlang.io/try) as well. Currently it's returning "Server Error (500)" for GET requests.

martingordon over 11 years ago

Matthew Kirk spoke about a neural network language predictor at RubyConf a few weeks ago. Here are his slides and code: http://modulus7.com/rubyconf/

efeamadasun over 11 years ago

I don't know why I can't stand this sentence "A simple and scalable way to classify automatically text by language". "Classify" and "automatically" need to switch places.

alexott over 11 years ago

Apache Tika (http://tika.apache.org/) also has a language detector, although it may not be as good as CLD...

razvvan over 11 years ago

If I were to implement this I'd rather use google's prediction api. At least with that you get a bit of control over what goes into the training data.

bkamapantula over 11 years ago

It's Telugu not Teligu. By Panjabi, do you mean Punjabi?

As others already mentioned, it would be good to have users try examples before signup.

phpnode over 11 years ago

How does this compare in accuracy to Chromium's Compact Language Detector?

https://code.google.com/p/chromium-compact-language-detector/
https://github.com/mzsanford/cld

donutdan4114 over 11 years ago

"test it out" comes back as french...

RBerenguel over 11 years ago

Some day I have to rewrite whatlanguageis.com (currently not working) with all the great ideas I had to improve it...

m4tthumphrey over 11 years ago

curl -XPOST -d 'hello' 'https://getlang.io/get?token=...'
{ "code": "fi", "name": "suomi, suomen kieli", "name_en": "Finnish" }

O_O

ssiddharth over 11 years ago

It might be mild OCD but it'd be great if the list of supported languages is ordered in some logical way.

ismaelc over 11 years ago

Where's the login page? I need to get my token
