Anecdotally, I've noticed The Onion uses certain phrases over and over again in its articles - "area man" comes to mind.<p>Did you really train it to detect satire, or just The Onion writers' conventions? How does it perform when trained on Onion articles and tested against some non-Onion satire publication?
This has the same input data fidelity issues as the author's previous approach toward identifying fake news, which was flagged to death for being misleading: <a href="https://news.ycombinator.com/item?id=16128295" rel="nofollow">https://news.ycombinator.com/item?id=16128295</a><p>A sample size of 600 for <i>text data</i> is literally nothing for these types of models. (although at least the classes are balanced this time)
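<p>A back-of-envelope way to see how little a few hundred examples pin down: the 95% binomial interval on any accuracy measured over a small held-out set. (Illustrative numbers only; the project's exact train/test split isn't stated here.)

```python
import math

def accuracy_ci95(acc, n):
    """Normal-approximation 95% confidence interval for an accuracy
    measured on n held-out examples."""
    half = 1.96 * math.sqrt(acc * (1 - acc) / n)
    return (acc - half, acc + half)

# e.g. a hypothetical 20% test split of 600 examples = 120 test items
lo, hi = accuracy_ci95(0.90, 120)
print(f"{lo:.3f} .. {hi:.3f}")  # roughly 0.846 .. 0.954
```

So even a "90% accurate" result on a split that small is consistent with anything from mid-80s to mid-90s true accuracy.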
As always, it’s easy to apply this technology to differentiate content from one publisher vs. another, as is the case here. The Onion is satire, but satire is the easier use case: not only is it single-source content (and the larger the author pool and the more varied the writing, the harder accuracy becomes), it also doesn’t have to account for less outrageous articles built on subtler genres like parody and sarcasm. Subtlety crushes machine learning algs, ime.<p>Love the concept, but it’d be great to see a deeper exploration as a demo. Keep going!
Without diminishing the author's efforts, I would say that he has effectively taught his AI to recognize The Onion's articles, rather than satire articles in general.<p>I would be curious to see results with 3-4 news sources for each group.
This is what I loved about the Google Cloud machine learning API (or whatever mixture of those nouns it's called now). I found it during my final project as a coding bootcamp student and got it up and running within a day, telling me whether a sentence was in one of three given languages.<p>Machine learning / AI tools like this are so simple and approachable right now. Just fill a .csv and upload it, boom, trained model.
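<p>The "CSV of labeled sentences in, classifier out" workflow can be sketched with nothing but the standard library (this is not the Google Cloud API, just a classic character-trigram naive Bayes baseline for language ID; the sample rows are made up, not the bootcamp project's data):

```python
import csv, io, math
from collections import Counter, defaultdict

def trigrams(text):
    # pad so word boundaries become trigrams too
    t = f"  {text.lower()}  "
    return [t[i:i + 3] for i in range(len(t) - 2)]

def train(rows):
    """rows: (sentence, language) pairs, e.g. parsed from the uploaded CSV."""
    counts = defaultdict(Counter)
    for sentence, lang in rows:
        counts[lang].update(trigrams(sentence))
    return counts

def classify(counts, sentence):
    grams = trigrams(sentence)
    vocab = len(set().union(*(set(c) for c in counts.values())))
    def score(lang):
        total = sum(counts[lang].values())
        # add-one smoothed log-likelihood of the trigrams under each language
        return sum(math.log((counts[lang][g] + 1) / (total + vocab)) for g in grams)
    return max(counts, key=score)

# Tiny illustrative "CSV" (hypothetical data):
data = io.StringIO(
    "the cat sat on the mat,en\n"
    "where is the train station,en\n"
    "le chat est sur la table,fr\n"
    "ou est la gare s'il vous plait,fr\n"
    "el gato esta en la mesa,es\n"
    "donde esta la estacion de tren,es\n"
)
model = train(csv.reader(data))
print(classify(model, "the dog is here"))  # 'en' on this toy data
```

The hosted version obviously does far more, but the shape of the task - label a column, upload, predict - really is about this simple.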