Announcing SyntaxNet: The World’s Most Accurate Natural Language Parser

1083 points by cjdulberger about 9 years ago

36 comments

xigency about 9 years ago
Evidence that this is the most accurate parser is here; the previous approach mentioned is a March 2016 paper, "Globally Normalized Transition-Based Neural Networks": http://arxiv.org/abs/1603.06042

"On a standard benchmark consisting of randomly drawn English newswire sentences (the 20 year old Penn Treebank), Parsey McParseface recovers individual dependencies between words with over 94% accuracy, beating our own previous state-of-the-art results, which were already better than any previous approach."

From the original paper: "Our model achieves state-of-the-art accuracy on all of these tasks, matching or outperforming LSTMs while being significantly faster. In particular for dependency parsing on the Wall Street Journal we achieve the best-ever published unlabeled attachment score of 94.41%."

This is a narrower claim than the headline suggests: specifically, being the best parser of English on the Penn Treebank's Wall Street Journal newswire.

The statistics listed on the project's GitHub page actually contradict these claims: they show the original March 2016 implementation scoring higher than Parsey McParseface.
teraflop about 9 years ago
This is really cool, and props to Google for making it publicly available.

The blog post says this can be used as a building block for natural language understanding applications. Does anyone have examples of how that might work? Parse trees are cool to look at, but what can I do with them?

For instance, let's say I'm interested in doing text classification. I can imagine that the parse tree would convey more semantic information than just a bag of words. Should I be turning the edges and vertices of the tree into feature vectors somehow? I can think of a few half-baked ideas off the top of my head, but I'm sure other people have already spent a lot of time thinking about this, and I'm wondering if there are any "best practices".
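One common recipe, sketched below: flatten the tree's edges into sparse (head, relation, child) features and feed them to a linear classifier alongside the usual bag of words. This is a sketch, not anything SyntaxNet ships; the column positions assume CoNLL-X-style rows like those printed by syntaxnet/demo.sh, and the helper name is made up.

    # Sketch: turn dependency edges into bag-of-features for text classification.
    # Assumes CoNLL-X-style rows (ID, FORM, ..., HEAD, DEPREL) such as those
    # emitted by syntaxnet/demo.sh; exact columns are an illustrative assumption.
    from collections import Counter

    def edge_features(conll_rows):
        """Count (head_word, relation, child_word) features for one sentence."""
        words = {int(r[0]): r[1].lower() for r in conll_rows}
        feats = Counter()
        for r in conll_rows:
            child, head, rel = r[1].lower(), int(r[6]), r[7]
            feats[f"{words.get(head, 'ROOT')}~{rel}~{child}"] += 1
        return feats  # e.g. feed through sklearn's DictVectorizer to a classifier

Whether the extra syntax beats plain n-grams is task-dependent; recursive nets over the tree are the other standard answer.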
fpgaminer about 9 years ago
One of the projects I'd love to develop is an automated peer editor for student essays. My wife is an English teacher, and a large percentage of her time is taken up by grading papers. A large percentage of that time is then spent marking up grammar and spelling. What I envision is a website that handles the grammar/spelling bit. More importantly, I'd like it to be a tool that the students use freely prior to submitting their essays to the teacher. I want them to have immediate feedback on how to improve the grammar in their essays, so they can iterate and learn. By the time the essays reach the teacher, the teacher should only have to grade for content, composition, style, plagiarism, citations, etc. Hopefully this also helps reduce the amount of grammar that needs to be taught in class, freeing time for more meaningful discussions.

The problem is that while I have knowledge and experience on the computer vision side of machine learning, I lack experience in NLP. And to the best of my knowledge, NLP as a field has not come as far as vision, to the extent that such an automated editor would make too many mistakes. To be student-facing it would need to be really accurate. On top of that, it wouldn't be dealing with well-formed input. The input is by definition adversarial. So unlike SyntaxNet, which is built to deal with comprehensible sentences, this tool would need to deal with incomprehensible ones. According to the link, SyntaxNet only gets 90% accuracy on random sentences from the web.

That said, I might give SyntaxNet a try. The idea would be to use SyntaxNet to extract meaning from a broken sentence, and then work backwards from the meaning to identify how the sentence can be modified to better match that meaning.

Thank you, Google, for contributing this tool to the community at large.
jrgoj about 9 years ago
Now for the buffalo test [1]:

    echo 'Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo' | syntaxnet/demo.sh

    buffalo NN ROOT
     +-- buffalo NN nn
     |   +-- Buffalo NNP nn
     |   |   +-- Buffalo NNP nn
     |   |   +-- buffalo NNP nn
     |   +-- buffalo NN nn
     +-- Buffalo NNP nn
     +-- buffalo NNP nn

[1]: https://en.wikipedia.org/wiki/Buffalo_buffalo_Buffalo_buffalo_buffalo_buffalo_Buffalo_buffalo
deanclatworthy about 9 years ago
It's really nice to have access to these kinds of tools. I am sure some folks from Google are checking this, so thank you.

Analysis of the structure of a piece of text is the first step to understanding its meaning. IBM are doing some good work in this area: http://www.alchemyapi.com/products/demo/alchemylanguage

Anything in the pipeline for this project to help with classifying sentiment, emotion, etc. from text?
feral about 9 years ago
I'd love to hear Chomsky's reaction to this stuff (or from someone in his camp on the Chomsky vs. Norvig debate [0]).

My understanding is that Chomsky was against statistical approaches to AI as being scientifically un-useful: eventual dead ends which would reach a certain accuracy and plateau, as opposed to the purer logic/grammar approaches, which reductionistically/generatively decompose things into constituent parts in some *interpretable* way, and are hence more scientifically valuable and composable - easier to build on.

But now we're seeing these very successful blended approaches, where you've got a grammatical search, which is reductionist and produces an interpretable factoring of the sentence, but it's guided by a massive (comparatively uninterpretable) neural net.

It's like AlphaGo, which is still doing search in a very structured, rule-based, reductionist way, but leveraging the more black-box statistical neural network to make the search actually efficient and qualitatively more useful. Is this an emerging paradigm?

I used to have a lot of sympathy for the Chomsky argument, and thought Norvig et al. [the machine learning community] could be accused of talking up a more prosaic 'applied ML' agenda into being more scientifically worthwhile than it actually was.

But I think systems like this are evidence that gradual, incremental improvement of working statistical systems can eventually yield more powerful reductionist/logical systems overall. I'd love to hear an opposing perspective from someone in the Chomsky camp in the context of systems like this. (Which I am hopefully not strawmanning here.)

[0] Norvig's article: http://norvig.com/chomsky.html
mdip about 9 years ago
This looks fantastic. I've been fascinated with parsers ever since I got into programming in my teens (almost always centered around programming language parsing).

Curious: the parsing work I've done with programming languages was never done via machine learning, just the usual strict classification rules (which are used to parse ... code written to a strict specification). I'm guessing source code could be fed as training data to an engine like this, but I'm not sure what the value would be. Does anyone more experienced/smarter than me have any insights on something like that?

As a side point:

Parsey McParseface - well done. They managed to lob a gag over at NERC (Boaty McBoatface) and let them know that the world won't end because a product has a goofy name. Every time Google does things like this they send an unconscious reminder that they're a company that's 'still just a bunch of people like our users'. They've always been good at marketing in a way that keeps that "touchy-feely" sense about them, and they've taken a free opportunity to get attention for this product beyond just the small circle of programmers.

As NERC found out, a lot of people paid attention when the winning name was Boaty McBoatface (among other, more obnoxious/less tasteful choices). A story about a new ship isn't going to hit the front page of any general news site normally, and I always felt that NERC missed a prime opportunity to continue with that publicity and attention. It became a topic talked about by friends of mine who would otherwise have *never* paid attention to *anything* science related. It would have been comical, should Boaty's mission turn up a major discovery, to hear 'serious newscasters' say the name of the ship in reference to the breakthrough. And it would have been refreshing to see that organization stick to the original name with a "Well, we tried, you spoke, it was a mistake to trust the pranksters on the web, but we're not going to invoke the 'we get the final say' clause because that wasn't the spirit of the campaign. Our bad."
Someone about 9 years ago
For those wondering: the license appears to be Apache 2.0 (https://github.com/tensorflow/models)
syncro about 9 years ago
Dockerized version so you can try it without installing: https://hub.docker.com/r/brianlow/syntaxnet-docker/
TeMPOraL about 9 years ago
> *Humans do a remarkable job of dealing with ambiguity, almost to the point where the problem is unnoticeable; the challenge is for computers to do the same. Multiple ambiguities such as these in longer sentences conspire to give a combinatorial explosion in the number of possible structures for a sentence.*

Isn't the core observation about natural language that humans *don't parse it at all*? Grammar is a secondary, derived construct that we use to give language some stability; I doubt anyone reading "Alice drove down the street in her car" actually parsed the grammatical structure of that sentence, either explicitly or implicitly.

Anyway, some impressive results here.
ohitsdom about 9 years ago
I'm sure it's only a matter of time before someone puts this online in a format easily played with. Looking forward to that.
rspeer about 9 years ago
I'm glad they point out that we need to move on from Penn Treebank when measuring the performance of NLP tools. Most communication doesn't sound like the Penn Treebank, and the decisions that annotators made when labeling Penn Treebank shouldn't constrain us forever.

Too many people mistake "we can't make taggers that are better at tagging Penn Treebank" for "we can't make taggers better", when there are so many ways that taggers could be improved in the real world. I look forward to experimenting with Parsey McParseface.
weinzierl about 9 years ago
Say I wanted to use this for English text with a large amount of jargon. Do I have to train my own model from scratch, or is it possible to retrain Parsey McParseface?

How expensive is it to train a model like Parsey McParseface?
scarface74 about 9 years ago
I started working on a parser as a side project that could parse simple sentences, create a knowledge graph, and then you could ask questions based on the graph. I used http://m.newsinlevels.com at level 1 to feed it news articles, and then you could ask questions.

It worked pretty well, but I lost interest once I realized I would have to feed it tons of words. So could I use this to do something similar?

What programming language would I need to use?
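One hedged sketch of the usual first step for a knowledge graph like this: pull (subject, verb, object) triples out of each parsed sentence. The nsubj/dobj labels follow the Stanford-style dependencies the pretrained model appears to emit (the same label family as the "nn" arcs in the buffalo example above); the CoNLL-X row layout and helper name are illustrative assumptions. As for language, SyntaxNet itself is C++ driven through TensorFlow's tooling, so shelling out from (or binding to) Python is the path of least resistance.

    # Sketch: extract (subject, verb, object) triples from CoNLL-style
    # dependency rows to seed a knowledge graph. Relation names follow
    # Stanford-style conventions (nsubj, dobj); the row layout is illustrative.

    def extract_triples(conll_rows):
        form = {int(r[0]): r[1] for r in conll_rows}
        args = {}  # verb id -> {"nsubj": word, "dobj": word}
        for r in conll_rows:
            head, rel = int(r[6]), r[7]
            if rel in ("nsubj", "dobj"):
                args.setdefault(head, {})[rel] = r[1]
        return [(a["nsubj"], form[v], a["dobj"])
                for v, a in args.items() if "nsubj" in a and "dobj" in a]

    # "Alice drove her car" -> [("Alice", "drove", "car")]. Questions can then
    # be answered by parsing them the same way and matching (subj, rel, ?)
    # patterns against the stored triples.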
jventura about 9 years ago
As someone who has published work in the NLP area, I always take claimed results with a grain of salt. With that said, I will still have to read the paper for the implementation details. My problem with generic linguistic approaches such as this one seems to be is that they are usually hard to "port" to other languages.

For instance, the way they parse sequences of words may or may not be too specific to the English language. It is somewhat similar to what we call "overfitting" in the data-mining area, and it may invalidate this technique for other languages.

When I worked in this area (up to 2014), I worked mainly on language-independent statistical approaches. As with everything, it has its cons: you can extract information from more languages, but, in general, with less certainty.

But in general, it is good to see that the NLP area is still alive somewhere, as I can't seem to find any NLP jobs where I live! :)

Edit: I've skimmed the paper, and it is based on a neural network, so in theory, if it were trained on other languages, it could return good enough results as well. It is normal for English/American authors to include only English datasets, but I would like to see an application to another language. This is a very specialized domain of knowledge, so I'm quite limited in my analysis.
the_decider about 9 years ago
According to their paper (http://arxiv.org/pdf/1603.06042v1.pdf), the technique can also be applied to sentence compression. It would be cool if Google published that example code/training data as well.
neves about 9 years ago
Shouldn't the title be "The World's Most Accurate Natural Language Parser *for English*"?

It's impressive how Google's natural language features, from simple spell check on up, degrade when they work with languages other than English.
zodiac about 9 years ago
> It is not uncommon for moderate length sentences - say 20 or 30 words in length - to have hundreds, thousands, or even tens of thousands of possible syntactic structures.

Does "possible" mean "syntactically valid" here? If so, I'd be interested in a citation for it.

Also, I wonder what kind of errors it makes with respect to the classification in http://nlp.cs.berkeley.edu/pubs/Kummerfeld-Hall-Curran-Klein_2012_Analysis_paper.pdf
joosters about 9 years ago
I don't see how a linguistic parser can cope with all the ambiguities in human speech or writing. It's more than a problem of semantics: you also have to know things about the world in which we live in order to make sense of which syntactic structure is correct.

E.g., take a sentence like "The cat sat on the rug. It meowed." Did the cat meow, or did the rug meow? You can't determine that by semantics; you have to know that cats meow and rugs don't. So to parse language well, you need to know an awful lot about the real world. Simply training your parser on lots of text and throwing neural nets at the code isn't going to fix this problem.
aaron-santos about 9 years ago
I'd love to see the failure modes, especially relating to garden-path sentences. [1]

[1] https://en.wikipedia.org/wiki/Garden_path_sentence
mindcrash about 9 years ago
Anyone planning (or already busy) training Parsey with one of the alternative treebanks available from Universal Dependencies [1]? Would love to know your results when you have any :)

I am personally looking for a somewhat reliable NLP parser which can handle Dutch at the moment. Preferably one which can handle POS tagging without my hacking it in myself.

[1] http://universaldependencies.org/
hartator about 9 years ago
> At Google, we spend a lot of time thinking about how computer systems can read and understand human language in order to process it in intelligent ways.

There are six links in this sentence in the original post. I get that they can add context, but I think they actually make the text harder for a human to parse. It also feels like they hired a cheap SEO consultant to do some backlink integration.
jdp23 about 9 years ago
Parsey McParseface is a great name.
sourcd about 9 years ago
What would it take to build something like "wit.ai" using SyntaxNet? I.e., to extract "intent" and related attributes from a sentence, e.g.:

Input: "How's the weather today"

Output: {"intent": "weather", "day": "Use wit-ai/duckling", "location": "..."}
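A rough sketch of the rule-based route: map the parse's root word to an intent label and pick time modifiers off the tree. Purely illustrative; nothing here reflects wit.ai's internals, and the intent table, input format, and helper name are all made up.

    # Sketch: map a dependency parse to a wit.ai-style intent frame.
    # The INTENTS table and the flat (word, relation, head) input are
    # illustrative assumptions, not any real service's API.

    INTENTS = {"weather": "weather", "rain": "weather", "remind": "reminder"}

    def intent_frame(tokens):
        """tokens: list of (form, deprel, head_index) with 1-indexed heads."""
        root = next(t for t in tokens if t[1] == "ROOT")[0]
        frame = {"intent": INTENTS.get(root.lower(), "unknown")}
        for form, rel, _ in tokens:
            if rel == "tmod" or form.lower() in ("today", "tomorrow"):
                frame["day"] = form.lower()  # a date parser like duckling fits here
        return frame

    # intent_frame([("How", "advmod", 4), ("'s", "cop", 4), ("the", "det", 4),
    #               ("weather", "ROOT", 0), ("today", "tmod", 4)])
    # -> {"intent": "weather", "day": "today"}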
amelius about 9 years ago
How would you feed a sentence to a neural net? As I understand, the inputs are usually just floating point numbers in a small range, so how is the mapping performed? And what if the sentence is longer than the number of input neurons? Can that even happen, and pose a problem?
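The standard answer, sketched below with toy sizes: each word is looked up in a vocabulary and mapped to a row of a learned embedding matrix, so the net sees a block of floats; a fixed-size input layer is handled by padding or truncating to a window, while recurrent models instead consume one word per time step, so length isn't capped. The vocabulary and dimensions here are invented.

    # Sketch: how a sentence becomes fixed-size float input for a neural net.
    # Toy vocabulary and sizes; real systems learn the embedding matrix.
    import numpy as np

    vocab = {"<pad>": 0, "<unk>": 1, "alice": 2, "drove": 3, "down": 4,
             "the": 5, "street": 6}
    embed_dim, max_len = 8, 10
    rng = np.random.default_rng(0)
    embeddings = rng.normal(size=(len(vocab), embed_dim))  # learned in training

    def encode(sentence: str) -> np.ndarray:
        ids = [vocab.get(w, vocab["<unk>"]) for w in sentence.lower().split()]
        ids = (ids + [vocab["<pad>"]] * max_len)[:max_len]  # pad or truncate
        return embeddings[ids]  # shape (max_len, embed_dim) of floats

    x = encode("Alice drove down the street")  # longer sentences get truncated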
w_t_payne about 9 years ago
Cool. I reckon I'm going to try to use it to build a "linter" for natural language requirements specifications. (I'm a bit sad like that.)
WWKong about 9 years ago
Anyone know a tool that does Natural Language to SQL?
Animats about 9 years ago
This could lead to a fun WordPress plug-in: all postings must be parsable by this parser.

Surprisingly, this thing is written in C++.
zem about 9 years ago
One interesting use I can think of is new, improved readability scores that take into account words that are common or uncommon depending on part of speech (e.g., a text that used "effect" as a noun would be lower-level than one that used "effect" as a verb).
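A tiny sketch of that idea with an invented frequency table (the counts and the tagged-input format are assumptions, not from any real corpus): score rarity per (word, POS tag) pair, so "effect" used as a verb reads as harder than "effect" used as a noun.

    # Sketch: readability scoring over (word, POS) pairs instead of bare words.
    # Frequencies are invented; real tables would come from a large tagged corpus.
    import math

    FREQ = {("effect", "NN"): 120_000, ("effect", "VB"): 900,
            ("dog", "NN"): 400_000}

    def rarity(tagged_sentence, default_freq=100):
        """tagged_sentence: list of (word, pos_tag) pairs from any POS tagger."""
        scores = [math.log(1 + FREQ.get((w.lower(), t), default_freq))
                  for w, t in tagged_sentence]
        return -sum(scores) / len(scores)  # higher = rarer usage = harder text

    print(rarity([("effect", "NN")]) < rarity([("effect", "VB")]))  # True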
vicaya about 9 years ago

    1. WordNet
    2. ImageNet
    3. SyntaxNet
    ...
    n. SkyNet
instakill about 9 years ago
What are some use cases for this for hobbyists?
degenerate about 9 years ago
I'd love to let this loose on the comments section of worldstarhiphop or liveleak and see what it comes up with...
bertan about 9 years ago
Parsey McParseface <3
jweir about 9 years ago
Parsey McParseface? Nice touch, Google.

https://github.com/tensorflow/models/tree/master/syntaxnet/syntaxnet/models/parsey_mcparseface
scriptle about 9 years ago
Did I just read that as SkyNet?
PaulHoule about 9 years ago
Meh.

This kind of parser isn't all that useful anyway. Parts of speech are one of those things people use to talk about language, but you don't actually use them to understand language.