NLTK has an active and growing developer community. We're grateful to Matthew Honnibal for permission to port his averaged perceptron tagger, and it's now included in NLTK 3.1.

Note that NLTK includes reference implementations for a range of NLP algorithms, supporting reproducibility and helping a diverse community to get into NLP. We provide interfaces for standard NLP tasks, and an easy way to switch from using pure Python implementations to using wrappers for external implementations such as the Stanford CoreNLP tools. We're adding "scaling up" sections to the NLTK book to show how this is done.

https://github.com/nltk/nltk | https://pypi.python.org/pypi/nltk | http://www.nltk.org/book_2ed/ch05.html#scaling-up
I hate this genre of post that basically follows the line: "I went to <established project> and attempted to educate them. When they didn't listen, I went and built something better. Now it's clear they should have listened to me, and you should all abandon their software."

Almost always the scope of the new project is much smaller, different, or much less mature than that of the project being bashed. Open source projects are not required to make changes to please any arbitrary user who wants them, even if the changes are technical improvements.

In NLTK's case, they have a whole book written around their project. Presumably significant changes to the project's structure and function would mean heavy documentation/writing work, and might not fit the goals of their project. Bashing them as a result just shows a complete lack of understanding of how and why people write and maintain software.
Very relatable post. Isn't NLTK primarily a teaching/demonstration tool, though?

I just checked their website, and the claim that "NLTK is a leading platform for building Python programs to work with human language data... a suite of libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning" does sound a little odd. But I think everyone in the industry knows NLTK's place and purpose -- you practically cannot avoid finding out quickly. NLTK's scope is clearly too broad for it to be meaningfully cutting edge at any one thing.

New libraries and implementations will always have an advantage: it's easier to tout "simplicity and leanness" when you don't have to carry all the baggage and backward compatibility accumulated over the years.

For that reason, an occasional "complexity reset" is expected, and if a library will not or cannot do it, another library will. Will spaCy's fate be different, 10 years down the road?
Regarding dead (as in unused) code, I keep noticing the guys on my UI development team commenting out code and then committing it to Git. I remind them periodically that they can just delete the code and if they ever need it, they can use Git to pull up historical versions of the file for reference.
As someone who contributed to NLTK quite a bit, I found it quite useful back in the day, especially when I had to teach NLP/CL to linguistics (non-CS) graduate students. I agree with Radim that NLTK has a purpose -- and it's not to implement the latest and greatest NLP algorithms. I'm glad NLTK exists, and although it is not what I use today, I'm pretty sure the tools I do use today (CoreNLP, gensim, etc.) will all be superseded by the next best thing a decade from now.
I've updated the NLTK issue tracker with information about how the model for NLTK's built-in POS tagger was trained:
<a href="https://github.com/nltk/nltk/issues/1063#issuecomment-138005116" rel="nofollow">https://github.com/nltk/nltk/issues/1063#issuecomment-138005...</a><p>The second edition of the book will include a "scaling up" section in most chapters, which shows how to transition from NLTK's pure Python implementations to NLTK's wrappers for the Stanford tools.
I put all dead code in a file called "deadcode.c" and am done with it. If I need it again, I can always copy from there. Easier than searching through Git history.
I like the gist of this post, but it feels somewhat incomplete: NLTK is Apache-licensed and spaCy is a dual-licensed (AGPL or money) commercial product. It's a good idea and an honest business, and I hope he succeeds, but I think it would've been more honest if the article had reflected that.
Can someone explain the following comments, for someone with some knowledge of ML but none of NLP?
"First, it's really much better to use Averaged Perceptron, or some other method which can be trained in an error-driven way. You don't want to do batch learning. Batch learning makes it difficult to train from negative examples effectively, and this makes a very big difference to accuracy"
I thought that it was typical for suitably regularized batch methods to modestly outperform or at least match (in terms of accuracy) online methods, whose main advantage is their speed.
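For what it's worth, here's a minimal sketch (not NLTK's or spaCy's actual code) of what "error-driven" means here: the perceptron adjusts its weights only when the current prediction is wrong, so every mistake feeds directly into the model. Weight averaging is omitted for brevity.

    # Toy error-driven (online) perceptron trainer for tagging.
    # Each training example is (features, gold_tag); weights change only on mistakes.
    def train_perceptron(examples, n_iters=5):
        weights = {}  # (feature, tag) -> weight
        tags = {tag for _, tag in examples}

        def score(features, tag):
            return sum(weights.get((f, tag), 0.0) for f in features)

        for _ in range(n_iters):
            for features, gold in examples:
                guess = max(tags, key=lambda t: score(features, t))
                if guess != gold:  # error-driven: no update on correct predictions
                    for f in features:
                        weights[(f, gold)] = weights.get((f, gold), 0.0) + 1.0
                        weights[(f, guess)] = weights.get((f, guess), 0.0) - 1.0
        return weights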
The theoretically "best" algorithm is not necessarily the one that best fits a particular task or set of constraints. It is presumptuous of the author to assume he knows what's best for every user of the toolkit.

I suggest that the author, being so wise in the ways of NLP science, channel this outrage and write "NLTK: The Good Parts" to save the rest of the world from stumbling blindly in the dark wilderness of ignorance.
So rather than jump in and start adding documentation, you blast the developers, who are offering this stuff for free and without warranty or implied fitness for any purpose?

You can contribute by adding documentation where you see it lacking, especially if you have domain-specific knowledge that would help others.

Or you can blast the entire project, not help, and go write your own. The thing that bothers me is that if you know enough, and it's mostly a teaching tool (my understanding from other comments), you could greatly improve the situation for the next guy by providing your enlightened input on the subject in the form of documentation. So the whole damn community loses out on your hard-earned understanding.

Meanwhile, 10 years from now, your project will be replaced, and if NLTK is really a teaching tool, you won't even be a footnote (because teaching tools don't die unless a whole field dies).

This smacks of the kind of "bubble" Silicon Valley entitlement that I can't quite wrap my head around (I know, the author isn't in SV; I just see this kind of crap coming from there).
NLTK will soon include a state-of-the-art, openly and "nicely" licensed implementation: https://github.com/nltk/nltk/issues/1110
New to NLP, we tried NLTK first for a toy project, and it was very slow and inaccurate. Luckily we found spaCy: switching to it sped things up 10x with better accuracy, and it was easier to use. Based on this experience, I tend to agree with the author.
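For anyone curious what that kind of switch looks like, a hedged sketch using today's APIs (not the 2015-era ones; assumes the en_core_web_sm model has been downloaded with `python -m spacy download en_core_web_sm`):

    # Same toy task -- POS tagging one sentence -- in both libraries.
    import nltk
    import spacy

    text = "The quick brown fox jumps over the lazy dog."

    # NLTK: tokenize, then tag (needs the punkt and averaged_perceptron_tagger data).
    nltk_tags = nltk.pos_tag(nltk.word_tokenize(text))

    # spaCy: one pipeline call; tags, lemmas, entities, etc. come along for free.
    nlp = spacy.load("en_core_web_sm")
    spacy_tags = [(token.text, token.tag_) for token in nlp(text)]

    print(nltk_tags)
    print(spacy_tags)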