Interesting. OP (if you're around): I noticed in the confusion matrix that everything was classified into the middle classes (5, 6, 7). That makes sense: the 3s, 4s, and 8s are rare, and even a "true 8" is still likely to get its highest probability on the 7 class, because there are far more 7s in the data. Did you analyze the approximate correctness of the probabilities, or consider sampling from the computed probabilities rather than always classifying to the highest one, to see where that led?
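
To make the sampling idea concrete, here is a rough sketch in Python with NumPy. Everything in it is an assumption for illustration: the function names are mine, and the probability row is made up, not taken from OP's model. It only shows that argmax never emits a rare class, while sampling emits it about as often as the model says it should.

```python
import numpy as np

# Class labels mentioned in the thread.
classes = np.array([3, 4, 5, 6, 7, 8])

def predict_argmax(probs):
    """Hard classification: pick the single most probable class per row."""
    return classes[np.argmax(probs, axis=1)]

def predict_sampled(probs, rng):
    """Draw one class per row from that row's predicted distribution."""
    cdf = np.cumsum(probs, axis=1)
    cdf = cdf / cdf[:, -1:]               # guard against rows not summing to exactly 1
    u = rng.random((probs.shape[0], 1))   # one uniform draw in [0, 1) per row
    idx = (u > cdf).sum(axis=1)           # index of the first CDF bin that reaches u
    return classes[idx]

# Made-up probabilities for a hypothetical "true 8": most mass sits on 7.
row = np.array([0.00, 0.01, 0.09, 0.30, 0.45, 0.15])
probs = np.tile(row, (10_000, 1))

rng = np.random.default_rng(0)
hard = predict_argmax(probs)
soft = predict_sampled(probs, rng)

print("argmax share of 8s: ", np.mean(hard == 8))
print("sampled share of 8s:", np.mean(soft == 8))
```

With argmax every one of these rows comes out as a 7. With sampling, roughly 15% come out as 8s, so the predicted class frequencies track the probabilities, at the cost of lower per-row accuracy.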