For anyone interested in the computer vision side of this topic, the author here is using a variant of color histograms, which was state of the art around 1990 [1][2]. Since 2003, bag-of-visual-words approaches have usually meant extracting SIFT-like features from a database of images, quantizing the features down to a vocabulary of thousands or millions of "words", and then treating the images like documents containing those "visual words" [3][4]. (Nothing wrong with the approach he's using, which is simple and fast, but the bag-of-words terminology in the article usually suggests a different class of approaches.)<p>[1] <a href="https://staff.fnwi.uva.nl/r.vandenboomgaard/IPCV/_downloads/swainballard.pdf" rel="nofollow">https://staff.fnwi.uva.nl/r.vandenboomgaard/IPCV/_downloads/...</a><p>[2] <a href="https://www.cs.utexas.edu/users/dana/Swain1.pdf" rel="nofollow">https://www.cs.utexas.edu/users/dana/Swain1.pdf</a><p>[3] <a href="http://www.robots.ox.ac.uk/~vgg/publications/papers/sivic03.pdf" rel="nofollow">http://www.robots.ox.ac.uk/~vgg/publications/papers/sivic03....</a><p>[4] <a href="http://www-inst.eecs.berkeley.edu/~cs294-6/fa06/papers/nister_stewenius_cvpr2006.pdf" rel="nofollow">http://www-inst.eecs.berkeley.edu/~cs294-6/fa06/papers/niste...</a>
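<p>To make the distinction concrete, here is a rough NumPy sketch of both ideas. It is not the article's code: the function names, the 8-bins-per-channel choice, and the brute-force nearest-word search are all assumptions made for illustration. The descriptors stand in for SIFT-like features, and the vocabulary would normally come from clustering the descriptors of a large image database rather than being supplied by hand.

```python
import numpy as np


def color_histogram(image, bins=8):
    """Joint RGB histogram of an HxWx3 uint8 image, normalized to sum to 1.

    bins=8 per channel is an arbitrary illustrative choice.
    """
    pixels = image.reshape(-1, 3).astype(np.float64)
    hist, _ = np.histogramdd(
        pixels, bins=(bins, bins, bins), range=((0, 256),) * 3
    )
    return hist.ravel() / hist.sum()


def histogram_intersection(h1, h2):
    """Similarity of two normalized histograms: sum of bin-wise minima.

    Identical histograms score 1.0; histograms with no shared bins score 0.0.
    """
    return float(np.minimum(h1, h2).sum())


def bag_of_visual_words(descriptors, vocabulary):
    """Count how many local descriptors fall on each visual word.

    descriptors: (n, d) array of local features (stand-ins for SIFT-like ones).
    vocabulary:  (k, d) array of word centers (hypothetical; normally learned
                 by clustering descriptors from an image database).
    """
    # Squared Euclidean distance from every descriptor to every word.
    # Brute force: only sensible for small vocabularies.
    diffs = descriptors[:, None, :] - vocabulary[None, :, :]
    nearest = (diffs ** 2).sum(axis=2).argmin(axis=1)
    return np.bincount(nearest, minlength=len(vocabulary))
```

<p>The first two functions compare whole-image color distributions and ignore local structure entirely. The third produces a per-image word-count vector that can then be indexed and weighted like a text document, which is roughly what the "bag of words" name usually implies.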