
Ask HN: How to implement an NLP grammar parser for a new natural language?

48 points by alnitak over 8 years ago
As a novice with NLP, having tinkered before with some basic and naive models, I would like to learn it properly this time by creating a grammar parser for a language for which no model is currently publicly available. I can easily access a corpus of sentences for this language, and, speaking it myself, I am motivated enough to produce any data needed for this.

Where would you recommend starting for such a project, both in terms of the minimal theoretical and practical knowledge, and the engineering aspect of it? What open source libraries and software are available out there to speed up this process?

5 comments

amirouche over 8 years ago
Nobody mentioned SyntaxNet or Link Grammar. If you haven't read Chomsky's article about the two ways of doing AI, you should. Basically it says there are statistical methods and logic-based methods in AI. Most NLP libraries today use the statistical approach; the logic-based, rule-based approach was the more popular one before now. Anyway, that's what Link Grammar does. I recommend you start with the introduction (https://amirouche.github.io/link-grammar-website//documentation/dictionary/introduction.html) to get a feeling for what it means to assign meanings to sentences.

Also, nowadays word2vec is unrelated to the understanding of grammatical constructs in natural languages. It is, simply put, about coincidence or co-occurrence of words. A grammatical interpretation of a sentence must be seen as a general graph, whereas word2vec operates on the linear structure of sentences (one word after the other). If word2vec had to work on grammatical constructs, it would need to be able to ingest graph data. word2vec works on matrices, while the graphical representation of a sentence's grammar (POS tagging, dependencies, anaphora, probably others) is a graph, otherwise said a sparse matrix, or a matrix with a large number of dimensions. (It seems to me machine learning is always about dimension reduction with some noise.)

I am quite ignorant about the literature on machine learning applied to and from graph data structures.
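To make the linear-window point concrete, here is a toy co-occurrence counter in pure Python: a sketch of the order-based context that word2vec-style models see, not a real word2vec implementation.

```python
from collections import Counter

def cooccurrence(tokens, window=2):
    """Count word pairs appearing within `window` positions of each
    other: the linear, sequence-based context word2vec-style models use,
    as opposed to edges in a dependency graph."""
    counts = Counter()
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            # store pairs in sorted order so (a, b) and (b, a) merge
            counts[tuple(sorted((w, tokens[j])))] += 1
    return counts

tokens = "the cat sat on the mat".split()
pairs = cooccurrence(tokens, window=2)
```

Note that "sat" and "the" co-occur twice here purely because of word order; a dependency parse of the same sentence would instead link "sat" to its subject and prepositional modifier, which is the graph structure the comment contrasts with.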
Bitcoincadre over 8 years ago
First, North African languages are called Arabic. The proper written form of Arabic is the same in every country. The Berber language never had a written form or letters, and it only confuses the matter; it is a tool used to divide the people. Can you imagine Palestinians demanding Canaanite be included as an official language? The most common modern standard Arabic would be found in Syria, Lebanon, Jordan and Palestine, with the Egyptian and Iraqi dialects also well understood. The North African dialects need a major overhaul: in Morocco, they have borrowed even German words, and the pace is so fast that half the words are mumbled. Use modern standard Arabic as your focus, and perhaps Latin letters to make it easier on non-natives, while being able to transliterate it back to Arabic letters.
web64 over 8 years ago
I haven't tried it yet, but spaCy has a guide[1] for adding a new language to their Python NLP framework. Maybe it can be of use to you.

[1] https://spacy.io/docs/usage/adding-languages
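To give a feel for what "adding a language" involves, here is a rough pure-Python sketch of the kind of language data (stop words, tokenizer exceptions) such a guide asks you to supply. The names, the French-like entries, and the tokenizer are all illustrative; this is not spaCy's actual API.

```python
# Illustrative language data for a hypothetical new language
# (French-like entries chosen only for familiarity).
STOP_WORDS = set("le la les de un une et ou".split())

TOKENIZER_EXCEPTIONS = {
    # surface form -> its component tokens
    "du": ["de", "le"],
    "au": ["a", "le"],
}

def naive_tokenize(text):
    """Whitespace-split, lowercase, then expand known contractions,
    a stand-in for the per-language tokenizer rules a framework needs."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(TOKENIZER_EXCEPTIONS.get(word, [word]))
    return tokens
```

The point of collecting this as data rather than code is that the framework's shared pipeline (tagging, parsing) can stay generic while each language plugs in its own tables.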
probably_wrong over 8 years ago
If you want to go directly into coding, the Stanford NLP Parser lists some starting instructions for parsing a new language in point 5 of its FAQ[1].

If you can deal with the math, some papers such as [2] use corpora for existing languages as a tool to parse new languages for which not many resources are available.

In both cases, you can always contact the authors. They might know how to help with your project, and/or direct you to the right people.

[1] http://nlp.stanford.edu/software/parser-faq.shtml#d

[2] https://www.aclweb.org/anthology/Q/Q16/Q16-1022.pdf
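As a taste of what a from-scratch grammar parser looks like, here is a minimal CKY recognizer over a toy grammar in Chomsky normal form. The grammar and lexicon are invented for the example; a real parser like Stanford's induces its grammar from a treebank rather than hand-writing rules.

```python
# Toy CNF grammar: each rule maps a pair of children to a parent.
GRAMMAR = {
    ("NP", "VP"): "S",
    ("Det", "N"): "NP",
    ("V", "NP"): "VP",
}
LEXICON = {
    "the": "Det", "a": "Det",
    "dog": "N", "cat": "N",
    "saw": "V",
}

def cky_recognize(tokens):
    """Return True if the toy grammar derives the token sequence."""
    n = len(tokens)
    # table[i][j] holds the nonterminals spanning tokens[i..j] inclusive
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, w in enumerate(tokens):
        table[i][i].add(LEXICON[w])
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):  # split point between the two children
                for left in table[i][k]:
                    for right in table[k + 1][j]:
                        parent = GRAMMAR.get((left, right))
                        if parent:
                            table[i][j].add(parent)
    return "S" in table[0][n - 1]
```

For example, `cky_recognize("the dog saw a cat".split())` returns `True`, while `cky_recognize("saw the dog".split())` returns `False` because no rule builds an `S` from that order.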
franciscop over 8 years ago
Stanford's NLP course is a good place to start learning the theoretical background: https://youtube.com/watch?v=nfoudtpBV68

Then it highly depends on the language; for instance, tokenization (splitting a sentence into words) is really easy in English, Spanish, etc. compared to Japanese, Chinese, etc. So I would say a good starting point would be to try using an NLP parser for a *similar* language. What language is it? What kind of NLP analysis do you want to perform?
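To illustrate the tokenization gap: whitespace splitting handles English, while an unsegmented script needs dictionary lookup. The tiny dictionary and the greedy forward maximum-matching segmenter below are purely illustrative (real segmenters use much larger lexicons and statistical disambiguation).

```python
# Minimal word list for the example sentence only.
DICTIONARY = {"我", "喜欢", "猫"}

def segment_greedy(text, dictionary, max_len=4):
    """Greedy forward maximum matching: at each position take the
    longest dictionary word; fall back to a single character."""
    out, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in dictionary:
                out.append(piece)
                i += length
                break
    return out

english = "I like cats".split()            # whitespace is enough
chinese = segment_greedy("我喜欢猫", DICTIONARY)  # needs the lexicon
```

Greedy matching already shows why Chinese tokenization is harder: the output depends entirely on the lexicon, and ambiguous overlaps (where two valid segmentations compete) need statistics or context to resolve.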