It's worth keeping in mind that learning from few examples is not such a big deal. What is really hard to do (and a long-standing problem in machine learning) is learning a model that *generalises well to unseen data*.

So the question is: does the OP really show good generalisation?

It's hard to see how one would even begin to test this, in the case of the OP.
The OP describes an experiment where a few hundred instances were drawn from a set of 50k and used for both training and testing (by holding out a few instances for testing rather than cross-validating, if I got that right).
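(Cross-validation over a few hundred instances is cheap to run, for what it's worth. A minimal sketch, with invented toy data and a plain scikit-learn pipeline standing in for whatever the OP actually trained:)

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    # Toy stand-ins for the OP's few hundred labelled instances.
    texts = ["great service", "terrible food", "loved it", "awful wait"] * 25
    labels = ["pos", "neg", "pos", "neg"] * 25

    model = make_pipeline(TfidfVectorizer(), LogisticRegression())

    # 10-fold cross-validation: every instance gets tested exactly once,
    # instead of scoring on one small held-out handful.
    scores = cross_val_score(model, texts, labels, cv=10)
    print(f"accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
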
I guess one way to go about it is to use the trained model to label your unseen data (the rest of the 50k), then go through that model-labelled data by hand and try to figure out how well the model did.
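Roughly like the sketch below. The data and the pipeline are invented stand-ins; a scikit-learn classifier plays the part of the OP's model:

    import random

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Toy stand-ins: train_texts plays the few hundred labelled instances,
    # unseen_texts plays the rest of the 50k the model never saw.
    train_texts = ["great service", "terrible food", "loved it", "awful wait"]
    train_labels = ["pos", "neg", "pos", "neg"]
    unseen_texts = ["the food was great", "service was awful", "not bad at all"]

    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(train_texts, train_labels)

    # Label the unseen data with the trained model...
    predicted = model.predict(unseen_texts)

    # ...then pull a random sample and go through it by hand.
    for text, label in random.sample(list(zip(unseen_texts, predicted)), k=3):
        print(f"[{label}] {text}")

If most labels in a decent-sized sample look wrong, the model didn't generalise, whatever its score on the handful of held-out instances.
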
We're talking here about natural language, however, where the domain is so vast that even the full 50k instances are far too few to learn it well. That has nothing to do with the model being trained, deep or shallow. It has everything to do with the fact that you can say the same thing in 100k different ways and still not exhaust all the ways to say that one thing. So 50k examples are either not enough examples of the different ways to say the same thing, or not enough examples of the different things you can say, or, most probably, both.

It's also worth remembering that deep nets can overfit much worse than other methods, exactly because they are so good at memorising training data. It's very hard to figure out what a deep net is really learning, but it would not be at all surprising to find out that your "powerful" model is just a very expensive alternative to Ctrl+C.

It's just memorised your examples, see?
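One cheap way to probe for that is to compare the model against a baseline that literally copies the label of the most similar training example (1-nearest-neighbour over TF-IDF vectors). Same invented toy data as above, with a plain classifier again standing in for the deep net:

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline

    train_texts = ["great service", "terrible food", "loved it", "awful wait"]
    train_labels = ["pos", "neg", "pos", "neg"]
    unseen_texts = ["the food was great", "service was awful", "not bad at all"]

    # The model under suspicion (a stand-in for the deep net).
    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(train_texts, train_labels)

    # The most literal Ctrl+C there is: copy the label of the single
    # most similar training example.
    copier = make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=1))
    copier.fit(train_texts, train_labels)

    agreement = np.mean(model.predict(unseen_texts) == copier.predict(unseen_texts))
    print(f"agreement with the copy-paste baseline: {agreement:.0%}")

If the expensive model agrees with the copy-paste baseline nearly everywhere, it's hard to argue it has learned anything the training data didn't already spell out.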