Apologies for the clunky question. We have a growing base of adult English students. Our teaching methodology is content-first: we find material of interest to our students, based on their interests, sector, and language goals, and build lesson plans and discussions around that 'centerpiece'. A lot of the work is curation and creating reusable discussion questions.<p>I have been searching for a tool that can scan a paragraph and extract the grammar tenses and features it uses (past simple, present continuous, passive voice, indirect questions), as this is a recurring question from our students. We have tools that tell us the approximate level, suggested vocabulary, and word count, but does a tool like this even exist (yet)? Thank you in advance.
Sadly, this doesn't exist yet.<p>You will find that most Natural Language Processing (NLP) tools conceptualize linguistic categories differently from how teachers do (language teaching isn't linguistics: simplifications are common, and schoolbooks get updated more slowly than linguistics evolves).<p>Examples:<p>* English verbs have only two tenses: PAST or NONPAST.
They can have PERFECTIVE aspect or not, and PROGRESSIVE aspect or not. Since these are three binary choices, there are eight different ways English verbs can be realized. I think there would be less confusion in school if a more linguistically accurate version were taught that separates tense from aspect.<p>* "Future" and "Present Perfect" (something I was still taught in school) don't exist for a proper linguist.<p>To build what you suggest, existing tools could be combined, but there would have to be a mapping layer on top of syntactic parsers like the Charniak parser, the Collins parser, or MaltParser. Another mismatch between school grammar and linguistics is single versus multiple theories: in school, people usually teach constituent trees, whereas in linguistics phrase structure (constituent) grammar is one theory among many. One alternative, valency/dependency grammar, does not rely on constituent trees but focuses on the relations between words, and it has recently gained a lot of traction in linguistic circles.
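To make the two-tense, two-aspect picture concrete, here's a tiny illustrative Python sketch enumerating the eight combinations (the example forms and school labels are hand-written, not computed):

```python
# The three binary choices (tense x perfective x progressive) yield
# eight surface realizations. Hand-written forms for "walk", 3rd person
# singular, purely for illustration; the comments give the school label.
FORMS = {
    # (tense, perfective, progressive): example
    ("nonpast", False, False): "she walks",
    ("nonpast", True,  False): "she has walked",        # school: present perfect
    ("nonpast", False, True):  "she is walking",        # school: present continuous
    ("nonpast", True,  True):  "she has been walking",  # school: present perfect continuous
    ("past",    False, False): "she walked",
    ("past",    True,  False): "she had walked",        # school: past perfect
    ("past",    False, True):  "she was walking",       # school: past continuous
    ("past",    True,  True):  "she had been walking",  # school: past perfect continuous
}

assert len(FORMS) == 8  # 2 ** 3 combinations
for (tense, perf, prog), example in sorted(FORMS.items()):
    print(f"{tense:7} perf={perf!s:5} prog={prog!s:5} -> {example}")
```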
I don't think there's an out-of-the-box library for things like detecting passive voice or indirect questions, but you should be able to build one yourself on top of the basic NLP toolkit: dependency parsing, POS tagging, lemmatization, named entity recognition.<p>I suggest you check out spaCy [0], a quick and easy-to-use Python library providing the above features. The software produced by the Stanford NLP Group is also great [1].<p>If you do not want to get your hands dirty with code, there are a number of API providers that offer the same features as the above libraries (TextRazor, Rosette Text Analytics...)<p>[0] <a href="https://spacy.io/" rel="nofollow">https://spacy.io/</a><p>[1] <a href="https://nlp.stanford.edu/software/" rel="nofollow">https://nlp.stanford.edu/software/</a>
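To give a feel for what such a rule layer could look like, here's a toy passive-voice check over (token, Penn Treebank tag) pairs. In practice the tags would come from spaCy or another tagger; they're hand-assigned here so the snippet stands alone:

```python
# Toy rule: a form of "be" followed by a past participle (VBN),
# allowing only intervening adverbs (RB), suggests passive voice.
# Input is (token, Penn-Treebank-tag) pairs as any POS tagger would emit;
# the tags below are hand-assigned for illustration only.
BE = {"am", "is", "are", "was", "were", "be", "been", "being"}

def looks_passive(tagged):
    for i, (tok, tag) in enumerate(tagged):
        if tok.lower() in BE:
            for tok2, tag2 in tagged[i + 1:]:
                if tag2 == "VBN":
                    return True
                if tag2 != "RB":  # only adverbs may intervene
                    break
    return False

active = [("The", "DT"), ("dog", "NN"), ("bit", "VBD"), ("me", "PRP")]
passive = [("The", "DT"), ("paper", "NN"), ("was", "VBD"),
           ("not", "RB"), ("signed", "VBN")]
print(looks_passive(active), looks_passive(passive))  # False True
```

A real version would need to handle get-passives, reduced relatives ("the paper signed yesterday"), and adjectival participles, which is exactly where a dependency parse starts to pay off over flat tags.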
There are several topics intertwined with the solution you seek: part-of-speech (PoS) tagging, reducing words to their lemma form, identifying the end of a sentence, etc.<p>After facing a similar learning curve, I put what I know into a lengthy document [0], written in 2018 and based on explorations over 2016-17. Following just the final section will get you deployed and operational quickly. The first section explains key concepts, using conventional ideas to introduce NLP jargon. Everything in between covers theory and practice for getting the most out of whatever tool you're likely to use in the end.<p>More general tools are probably available today, such as add-ons for Elasticsearch; I'd start looking there. Interesting items came up when searching DDG for: NLP elasticsearch.<p>[0] <a href="http://play.org/articles/introduction-to-natural-language-processing" rel="nofollow">http://play.org/articles/introduction-to-natural-language-pr...</a>
I actually (sort of) wrote one of these a while back, though I don't think I ever got around to implementing tenses; that might be somewhat easy to implement on top of what I already built, but maybe not. In any case, it handles:<p>> copulae verbs, linking verbs, terms that are often filtered (i.e. stop terms), question terms, time sensitive nouns, amplifiers, clauses, coordinating conjunctions, negations, conditionals (ORs), and contractions<p><a href="https://github.com/nyxtom/salient/" rel="nofollow">https://github.com/nyxtom/salient/</a>
As someone else alluded to, this is a task for multiple models. Fortunately, there are a lot of great NLP libraries that combine multiple pre-trained language models into a single pipeline you can interface with, like Stanza. From their docs, the vanilla pipeline breaks down the sentence "Barack Obama was born in Hawaii. He was elected president in 2008." as:<p>('Barack', '4', 'nsubj:pass')<p>('Obama', '1', 'flat')<p>('was', '4', 'aux:pass')<p>('born', '0', 'root')<p>('in', '6', 'case')<p>('Hawaii', '4', 'obl')<p>('.', '4', 'punct')<p>It should be very easy to deploy Stanza's pipeline as an API endpoint. Here is an example of such an NLP-library-as-API endpoint, albeit with Hugging Face's Transformers, deployed via Cortex: <a href="https://github.com/cortexlabs/cortex/blob/master/examples/pytorch/sentiment-analyzer/predictor.py" rel="nofollow">https://github.com/cortexlabs/cortex/blob/master/examples/py...</a>
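As a rough illustration of the endpoint idea, here is a standard-library-only sketch; the analyze() stub stands in for a real pipeline such as Stanza's, so no model download is needed to run it:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

def analyze(text):
    # Stub standing in for a real NLP pipeline (e.g. Stanza);
    # here we just whitespace-tokenize the input.
    return {"tokens": text.split()}

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers["Content-Length"])
        text = self.rfile.read(length).decode("utf-8")
        body = json.dumps(analyze(text)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0: pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# Exercise the endpoint once, then shut down.
url = f"http://127.0.0.1:{server.server_port}/"
req = urllib.request.Request(url, data=b"Barack Obama was born in Hawaii.")
with urllib.request.urlopen(req) as resp:
    result = json.loads(resp.read())
server.shutdown()
print(result["tokens"])
```

Swapping the stub for a loaded Stanza pipeline (and adding batching plus a worker pool) is the main production concern; the HTTP plumbing doesn't change.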
Example of passive voice detection using spaCy [1]: <a href="https://gist.github.com/armsp/30c2c1e19a0f1660944303cf079f831a" rel="nofollow">https://gist.github.com/armsp/30c2c1e19a0f1660944303cf079f83...</a><p>[1] <a href="https://spacy.io/" rel="nofollow">https://spacy.io/</a>
My IOCCC entry [0] detects English passive constructions. Don't be discouraged by its looks; it is a solid tool. Since both ioccc.org and its official mirrors (in the same domain) are down at the moment, you can look at its Wayback Machine cache [1].<p>[0] <a href="https://www.ioccc.org/2018/ciura/" rel="nofollow">https://www.ioccc.org/2018/ciura/</a>
[1] <a href="https://web.archive.org/web/20200224040340/https://www.ioccc.org/2018/ciura/" rel="nofollow">https://web.archive.org/web/20200224040340/https://www.ioccc...</a>
Some parsers like spaCy can give additional tense information for verbs, but it's probably not customizable enough for what you want.<p>Maybe you can give GPT-3 a try.<p>If you want to go the custom route, the easy way, though it consumes a lot of processing power and requires a tedious dataset-construction phase, is to use a neural network.<p>You build a dataset corresponding to your problem, and you train the neural network on it.<p>For inspiration, you can look at my colorify browser extension, which uses a neural network that learns at the same time to split sentences, predict POS tags, predict the root of the sentence, and predict the parse tree, all of which are then used to decorate the webpage.<p>What I did was programmatically build a dataset from the spaCy parser in order to build a custom JavaScript parser that does what I want. If I wanted to add information spaCy doesn't provide, like grammar tenses and features, I could extend my dataset manually and have the network predict all the decorations at the same time; this means it doesn't need a lot of samples, because the layers are shared.<p>You can probably build your dataset faster by interacting with your neural network as you annotate (active learning).<p>For the model, you can start with something simple like a residual convolutional architecture, and later move to transformers when you want to reach the state of the art.
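A minimal sketch of that dataset-distillation step might look like this; parse() is a stub standing in for a real parser like spaCy, and the extra 'tense' label shows how manually-added decorations could be merged into the same training example so one network learns all tasks:

```python
import json

def parse(sentence):
    # Stub standing in for parser output; a real setup would run the
    # sentence through spaCy and read off each token's attributes.
    lookup = {"The": "DET", "dog": "NOUN", "runs": "VERB", "fast": "ADV"}
    return [(tok, lookup.get(tok, "X")) for tok in sentence.split()]

def make_example(sentence, extra_labels=None):
    tokens = parse(sentence)
    example = {
        "text": sentence,
        "tokens": [t for t, _ in tokens],
        "pos": [p for _, p in tokens],
    }
    # Manually-added decorations (e.g. grammar-tense labels) are merged
    # into the same record, so shared layers can learn every task at once.
    if extra_labels:
        example.update(extra_labels)
    return example

# One JSONL line per sentence is a convenient training-set format.
ex = make_example("The dog runs fast", {"tense": "present simple"})
print(json.dumps(ex))
```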
You should look into the English Resource Grammar:<p><a href="http://moin.delph-in.net/ErgTop" rel="nofollow">http://moin.delph-in.net/ErgTop</a><p>Online demo:<p><a href="http://erg.delph-in.net/logon" rel="nofollow">http://erg.delph-in.net/logon</a><p>It has all that information in the generated feature structure -- even more than you can view in the web interface. There's a development environment you can download, as well as a headless linux tool called ACE you can use on a server. The ERG is complex, but far and away the most sophisticated tool in this space.
Check out UDify for dependency parsing with universal parts of speech and features.<p><a href="https://github.com/Hyperparticle/udify" rel="nofollow">https://github.com/Hyperparticle/udify</a>
This isn't really a programming solution but you should check out this webapp:<p><a href="http://www.hemingwayapp.com/" rel="nofollow">http://www.hemingwayapp.com/</a><p>It may not be the exact thing you're looking for but it can probably be helpful to your students.<p>I would also look at Python NLTK. I've only dabbled in the toolkit, so I'm not sure if it has what you're looking for exactly, but it's worth a look.<p><a href="http://www.nltk.org/" rel="nofollow">http://www.nltk.org/</a>
To do the grammar tense analysis, you can use spaCy or another syntactic parser. The parse tree won't directly give you the exact grammar tense; you will need to do some simple analysis of the conjugational form of the root verb and of the auxiliary verbs attached to it.<p>I've done extensive work in this area, including developing my own statistical parser from scratch. I'd be happy to chat more about this project; my email is daniel dot burfoot at gmail.com.
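For illustration, here is a minimal sketch of such a mapping, assuming you already have the auxiliary chain and the root verb's Penn Treebank tag from a parser. All inputs are hand-assembled, and a real version would need many more cases (modals, "going to" futures, do-support, passives):

```python
# Map an auxiliary chain plus root-verb tag (Penn Treebank style) to a
# school-grammar tense label. The chains below are hand-assembled; in
# practice they would come from a parser's aux / aux:pass children.
def tense_label(aux_forms, root_tag):
    past = root_tag == "VBD" or any(a in ("was", "were", "had", "did") for a in aux_forms)
    perfect = any(a in ("have", "has", "had") for a in aux_forms)
    continuous = root_tag == "VBG"
    time = "past" if past else "present"
    if perfect and continuous:
        return f"{time} perfect continuous"
    if perfect:
        return f"{time} perfect"
    if continuous:
        return f"{time} continuous"
    return f"{time} simple"

print(tense_label([], "VBD"))               # "she walked" -> past simple
print(tense_label(["has"], "VBN"))          # "she has walked" -> present perfect
print(tense_label(["was"], "VBG"))          # "she was walking" -> past continuous
print(tense_label(["had", "been"], "VBG"))  # "she had been walking" -> past perfect continuous
```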
Have you taken a look at Google's Natural Language API? Try out the demo and switch to the "Syntax" tab to see the output. More info: <a href="https://cloud.google.com/natural-language" rel="nofollow">https://cloud.google.com/natural-language</a>
I have played around with similar projects. A good starting point is Google's NLP sentence-parsing API. Be warned: the accuracy may not be good enough for your application.
I'm not sure if it has an API or any kind of integrations, but I know the Hemingway App[0] can detect passive voice, and possibly other features you're looking for.<p>[0]: <a href="http://www.hemingwayapp.com/" rel="nofollow">http://www.hemingwayapp.com/</a>
On recent Debian/Ubuntu releases, PoS tagging is just one apt away:<p><pre><code> $ sudo apt install -y apertium-eng
$ echo "I have been searching for a tool that can scan a paragraph" |apertium eng-disam|grep -v '^;'
"<I>"
"prpers" prn subj p1 mf sg
"<have>"
"have" vbhaver inf
"have" vbhaver pres
"<been>"
"be" vbser pp
"<searching for>"
"search# for" vblex ger SELECT:177
"<a>"
"a" det ind sg
"<tool>"
"tool" n sg
"<that>"
"that" cnjsub
"that" prn dem mf sg
"that" prn rel an mf sp
"<can>"
"can" vbmod pres SELECT:281
"<scan>"
"scan" vblex inf SELECT:140
"<a>"
"a" det ind sg
"<paragraph>"
"paragraph" n sg
"<.>"
"." sent
</code></pre>
(grepping out lines starting with ; since they just show the readings that <i>were</i> removed by the disambiguator, whereas the SELECT/REMOVE tags are trace info saying which rules applied. If multiple indented lines remain, the disambiguator didn't manage to fully disambiguate the analysis.)<p>If you want to e.g. mark passive, it's easy to write a Constraint Grammar rule to do this. Put the following into rules.cg3:<p><pre><code> DELIMITERS = sent ;
ADD (&PASSIVE) ("be") # Add the tag "&PASSIVE" to the word with lemma "be"
IF
(1* (pp) # There is a participle to the right
BARRIER (*) - (adv) # with nothing in between except perhaps adverbs
);
</code></pre>
and pipe it in after the above pipeline:<p><pre><code> $ echo "The paper is not signed by me" |apertium eng-disam |grep -v '^;'|vislcg3 -g rules.cg3
"<The>"
"the" det def sp
"<paper>"
"paper" n sg
"<is>"
"be" vbser pres p3 sg &PASSIVE
"<not>"
"not" adv
"<signed>"
"sign" vblex pp
"sign" vblex past
"signed" adj
"<by>"
"by" pr SELECT:470
"<me>"
"prpers" prn obj p1 mf sg
"<.>"
"." sent
</code></pre>
( <a href="https://wiki.apertium.org/wiki/Constraint_Grammar" rel="nofollow">https://wiki.apertium.org/wiki/Constraint_Grammar</a> for more info on CG )
Some tough love:<p>There will always be a gap between your judgement and the judgement baked into a model. Worse yet, if the model is very general and oriented towards cheap computation rather than expensive people, it will have vague and contradictory judgements inside it that make the results meaningless.<p>That is the language of failure; the structure of success looks like the following.<p>(1) The system works like a "magic magic marker": you mark up a lot of text (say 20,000 sentences) the way you think it should be marked up. This might be character-at-a-time or word-at-a-time. Character-at-a-time is real and eternal; word-at-a-time is not real, because there is not really such a thing as a "word" (e.g. "red ball" can fill slots that take "ball", you can smash together subwords to make words, people violate punctuation rules ("Amazon.com announced that..."), people call themselves n3pg34r, ...). So if you segment the text up front and segment it the wrong way, you may throw out essential information and choose to fail.<p>(2) You need some system to mark up the text manually and efficiently. It is a lot of work. A typical person can make about 2,000 up/down judgements a day; if a sentence counts for 10 decisions, then maybe you can annotate 200 sentences a day. If you can get students to do it and get teachers to review it, you might make short work of it.<p>This annotator<p><a href="http://brat.nlplab.org/" rel="nofollow">http://brat.nlplab.org/</a><p>ticks the requirements, but most people find it terribly hard to use and wind up building "easy to use" systems that don't align things right at level (1) and... fail.<p>Assuming you do (1) and (2), the odds are in your favor, but you now have to<p>(3) build models; it does not matter whether the model is a bunch of rules you cobbled together, a hidden Markov model, an LSTM, or a convolutional network.
Off the top of my head, I would train an LSTM to predict the next character on maybe 100M characters of text, then stick a simple model on top that takes the LSTM state as input and labels characters at the output (it could be an SVM, a random forest, a logit, or a 3-layer NN).<p>(4) Accept that the system is not going to be perfect, but have the ability to manually patch wrong results and improve the training data over time. I'd say this practice is more important than any particular approach to (3).<p>Some tools could give you (1-4) tied up in a bow;<p><a href="https://www.tagtog.net/" rel="nofollow">https://www.tagtog.net/</a><p>claims to. But (2) involves elbow grease that 90% of people aren't going to do. Some of the 10% of people who put in that elbow grease will succeed; the 90% who don't will fail.