My observations:

* Instance-based methods can adapt to any kind of pattern if you have enough data. This is a well-known result in machine learning -- the interesting question is simply how much data you need.

* It's quite remarkable how well RBF SVMs do in almost all cases. They even outperformed a 2nd-degree polynomial kernel SVM on a 2nd-degree polynomial!

* Logistic regression is pretty terrible for anything except "easy" data -- simple linear data.

* Random forests are sometimes OK, but tend to prefer axis-aligned data (at least in his formulation).

* Meta-point: all the tests shown are with "clean" data, i.e., there are no "wrong" training examples. This is unrealistic, and in practice makes a HUGE difference for some of these methods. E.g., a lot of the rule-based methods get demolished by even a little bit of wrong data. In contrast, SVMs have a slack variable that can tolerate some amount of noise, and would probably shine even more on such data (see the sketch after this list).
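To make the slack-variable point concrete, here is a minimal sketch (assuming scikit-learn and numpy; the dataset, noise level, and parameters are made up for illustration, not taken from the article): flip a fraction of the training labels and compare a soft-margin RBF SVM against a single decision tree.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X, y = make_moons(n_samples=600, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# Corrupt 10% of the training labels to simulate "wrong" examples.
flip = rng.rand(len(y_train)) < 0.10
y_noisy = np.where(flip, 1 - y_train, y_train)

# C controls the slack penalty: smaller C tolerates more margin violations,
# so a few mislabeled points don't dominate the decision boundary.
svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_noisy)
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_noisy)

print("RBF SVM test accuracy:      ", svm.score(X_test, y_test))
print("Decision tree test accuracy:", tree.score(X_test, y_test))
```

The fully grown tree fits the flipped labels exactly and pays for it at test time, while the soft-margin SVM treats them as margin violations.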
OT:

What do the cognoscenti recommend for automatic text classification/categorization? I have been looking at spam filters, and they're mostly boolean-type predicates that return a Spam/NotSpam result along with a confidence number. I want to be able to do the same for a large number of categories.
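Not an expert recommendation, but one common baseline for "many categories, each with a confidence" is TF-IDF features plus a multinomial Naive Bayes classifier. A minimal sketch assuming scikit-learn; the tiny corpus and category names below are made up for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data: documents and their categories (hypothetical).
docs = [
    "cheap meds buy now limited offer",
    "meeting agenda for the quarterly budget review",
    "new kernel patch improves scheduler latency",
    "win a free vacation click this link",
    "reviewer comments on the draft manuscript",
]
labels = ["spam", "work", "tech", "spam", "work"]

# TF-IDF turns text into term-weight vectors; Naive Bayes gives a
# probability per category instead of a single yes/no answer.
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(docs, labels)

new_doc = ["patch submitted for review, see the meeting notes"]
for category, p in zip(clf.classes_, clf.predict_proba(new_doc)[0]):
    print(f"{category}: {p:.2f}")
```

The same pipeline scales to hundreds of categories; the per-class probabilities play the role of the spam filter's confidence number.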
It bothers me that these data sets are very low-dimensional, noise-free, and pretty-picture-like. That pretty much excludes any interesting data to try to learn from (after all, one could easily hand-code a classifier for most of these "concepts" that performs at 100%; see the sketch below).
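For example, here is a minimal sketch of what "hand-coding a classifier" means for a toy 2D concept like "inside the unit circle" (the concept and data are made up for illustration, not taken from the article):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.uniform(-2, 2, size=(1000, 2))
# Ground-truth concept: points inside the unit circle.
y = (X[:, 0] ** 2 + X[:, 1] ** 2 <= 1.0).astype(int)

def hand_coded_classifier(points):
    # The same rule, written down by hand rather than learned from data.
    return (points[:, 0] ** 2 + points[:, 1] ** 2 <= 1.0).astype(int)

print("accuracy:", (hand_coded_classifier(X) == y).mean())  # 1.0 on noise-free data
```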