I haven't looked at the code, but glancing at the results leaves me thinking it might need more work.

The output seems to me to be around the level a Markov chain might produce. Karpathy's RNN code produces much, much better results[1].

I wonder if manually extracting features and training the RNN on that is a mistake? RNNs tend to work well on text because they encode an understanding of the parse tree themselves.

[1] https://github.com/karpathy/char-rnn
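For reference, the kind of Markov-chain baseline I have in mind is only a handful of lines of Python. This is a generic word-level sketch, not anything from the linked repo:

    import random
    from collections import defaultdict

    def build_chain(text, order=2):
        """Map each `order`-word prefix to the words that follow it in the corpus."""
        words = text.split()
        chain = defaultdict(list)
        for i in range(len(words) - order):
            chain[tuple(words[i:i + order])].append(words[i + order])
        return chain

    def generate(chain, order=2, length=50):
        """Walk the chain from a random prefix, sampling uniformly among observed successors."""
        out = list(random.choice(list(chain.keys())))
        for _ in range(length):
            successors = chain.get(tuple(out[-order:]))
            if not successors:
                break
            out.append(random.choice(successors))
        return " ".join(out)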
this doesn't look like a neural net to me. from NeuralNetwork.py:

    from sklearn.neighbors import KNeighborsClassifier

    # Create a sperate neural network for each identifier
    for index in range(0, len(NaturalLanguageObject._Identifiers)):
        nn = KNeighborsClassifier()
        self._Networks.append(nn)
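For anyone unfamiliar with sklearn: KNeighborsClassifier is a k-nearest-neighbours model, not any kind of neural network. Roughly how it's normally used (toy data, nothing from the repo):

    from sklearn.neighbors import KNeighborsClassifier

    # Toy data: two features per sample, binary labels.
    X_train = [[0, 0], [0, 1], [1, 0], [1, 1]]
    y_train = [0, 0, 1, 1]

    # k-NN just memorises the training set and classifies new points by
    # majority vote among the k closest training samples: no weights,
    # no layers, no backprop anywhere.
    knn = KNeighborsClassifier(n_neighbors=3)
    knn.fit(X_train, y_train)
    print(knn.predict([[0.9, 0.2]]))  # -> [1]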
I am afraid this author has no idea what he is doing - and is loosely throwing around terms he does not understand. What the hell was his normalization procedure? This is dangerous for readers who don't know the field well and will come away confused.
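For what it's worth, "normalization" in this setting usually means something like scaling features to zero mean and unit variance before training. A standard sklearn sketch of what that normally looks like (my guess at what was meant, not what the post actually does):

    from sklearn.preprocessing import StandardScaler

    X_train = [[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]]
    X_test = [[2.5, 250.0]]

    # Fit the scaler on the training data only, then apply the same
    # transform to any later data, so each feature ends up with zero
    # mean and unit variance relative to the training set.
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)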
Fun hack. If anything, it highlights how compelling deep learning and RNNs are: no messing with NLP pipelines, no hand-building features or bolting together classifiers, etc. The manual feature engineering means it might work better on a smaller dataset, but even then probably not.

For comparison, training Andrej Karpathy's RNN code (http://karpathy.github.io/2015/05/21/rnn-effectiveness/) on the "HarryPotter(xxlarge).txt" (76K) file with the default hyperparameters and a batch size of 25 gets me:

    > But Atfa the loom proset! No contarin — mibll,’s just pucking to live
    > note left them hard and fitther, clooked of course little happered to
    > trige on the fistpened. Their knew Harry mear from the shind-beas
    > eveided, at Uncle Vernon’s thepped to spept were pelled and beadn
    > Harry, distine dy use. Harry had in a amalout, into the fish sfary door.
The difference here is tokenizing on words vs. letters: the RNN is trying to learn the structure of English completely from scratch, whereas the code here gets to work with well-formed words from the beginning (there's a small tokenization sketch after the sample below). But otherwise, the results in the linked post are about as silly semantically:

    > Input: "Harry don't look"
> Output: "Harry don't look , incredibly that a year for been parents in .
> followers , Harry , and Potter was been curse . Harry was up a year ,
> Harry was been curse "
</code></pre>
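To make the word-vs-letter point concrete, here's the tokenization sketch mentioned above (generic Python, not code from either project):

    text = "Harry looked at the door."

    # Character-level: the model has to learn spelling, spacing and
    # punctuation before it can even produce real words.
    char_tokens = list(text)
    # ['H', 'a', 'r', 'r', 'y', ' ', 'l', 'o', 'o', 'k', 'e', 'd', ...]

    # Word-level: every token is already a well-formed word, so the model
    # only has to learn which words go together.
    word_tokens = text.replace(".", " .").split()
    # ['Harry', 'looked', 'at', 'the', 'door', '.']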
EDIT: Updated the RNN output text. Was sampling from a checkpoint file for a different input corpus. Got confused by the long similar-looking filenames. Doesn't change the overall point though.
> I decided to use scikit's machine learning libraries. [...] The writer I create uses multiple SVM engines. One large neural network for the sentence structuring and multiple small networks for the algorithm which selects words from a vocabulary.

This person has no idea what they're talking about. sklearn has no neural network code whatsoever.

EDIT: this feels like a testament to sklearn's greatness, honestly.
I'd be interested to know if this could be turned into a tool that lets you know how well your writing (or coding) matches the "house style". (Mostly for technical documentation, requirements specs, etc.)

I'd be even more interested if it could be turned into a Sublime Text plugin that highlights words/phrases that deviate most strongly from the house style.
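A crude way to prototype the highlighting part would be to score each word by how surprising it is relative to a reference corpus of in-style documents. Rough unigram sketch below; "house_corpus.txt" is a hypothetical file of house-style text:

    import math
    import re
    from collections import Counter

    def word_freqs(text):
        """Relative frequency of each lowercased word token."""
        words = re.findall(r"[a-z']+", text.lower())
        counts = Counter(words)
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}, total

    def surprisal_scores(draft, house_text):
        """Score each draft word by -log probability under the house corpus,
        with a small floor for unseen words; higher = more off-style."""
        freqs, total = word_freqs(house_text)
        floor = 1.0 / (total + 1)
        return [(w, -math.log(freqs.get(w, floor)))
                for w in re.findall(r"[a-z']+", draft.lower())]

    house = open("house_corpus.txt").read()   # hypothetical reference corpus
    draft = "The widget shall leverage synergistic paradigms."
    for word, score in sorted(surprisal_scores(draft, house),
                              key=lambda ws: ws[1], reverse=True)[:5]:
        print(f"{score:6.2f}  {word}")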